Skip to main content

SQL computation

Decentriq’s Data Clean Rooms are equipped with a SQLite-based engine, powered by confidential computing to guarantee a high level of security. You can query on provisioned datasets, and only results will be available. Sensitive data is never revealed to anyone. Results of SQL computations can be used as input to other computations such as Python, R and Synthetic Data Generation.

Current version

decentriq.python-ml-worker-32-64:v23

note

The Python worker is used for SQL computations, as it has SQLite natively available, resulting in a better performance.

Documentation

Please refer to the official SQLite documentation for details of the supported syntax and functions.

Querying computation results

In the Data Clean Room, SQL queries can reference the output of other computations in the FROM statement to compose nested queries. You can create independent computations or chains of them, using the output of one computation as the input of the next. This approach also allows for combining SQL and Python, as well as other types like Synthetic Data Generation. To use nested queries within the same block, you can use the WITH..AS() approach, which will run the entire block at once.

# Computation_1 querying table1.
SELECT sum(col_a) AS sum_col_a
FROM table1
# Computation_2 querying Computation_1.
SELECT sum_col_a
FROM Computation_1
WHERE sum_col_a>5

Create a SQL computation in the Decentriq UI

Follow the Decentriq UI tutorial

Create a SQL computation using the Python SDK

In this example, we will use the Python container to run a SQL computation.

Assume we want to create a Data Clean Room that simply calculates the sum of salaries from a SQL table. Using the Python SDK to accomplish this task could look as follows:

First, we set up a connection to an enclave and create a AnalyticsDcrBuilder object:

import decentriq_platform as dq
from decentriq_platform.analytics import AnalyticsDcrBuilder

user_email = "@@ YOUR EMAIL HERE @@"
api_token = "@@ YOUR TOKEN HERE @@"

client = dq.create_client(user_email, api_token)

builder = AnalyticsDcrBuilder(client=client)
builder.\
with_name("SQL Data Clean Room").\
with_owner(user_email)

Then we create a table with the specified schema:

from decentriq_platform.analytics import (
Column,
FormatType,
TableDataNodeDefinition,
)

columns = [
Column(
name="name",
format_type=FormatType.STRING,
is_nullable=False,
),
Column(
name="salary",
format_type=FormatType.FLOAT,
is_nullable=False,
),
]


builder.add_node_definition(
TableDataNodeDefinition(
name="salary_data", columns=columns, is_required=True
)
)

The SQL query to sum all salaries could look like this:

my_query_content = """
SELECT SUM(salary) FROM salary_data
"""

Here we defined the query within a multi-line string. For larger statements, however, defining them in a file would likely be easier.

Now we can add the node that will actually execute our script. To use this script in a Data Clean Room, it first has to be loaded into a SqliteComputeNodeDefinition node, which is then added to the DCR configuration. This makes the SQL query visible to all the participants in the DCR.

# If you wrote your query in a separate file, you can simply open
# the file using `with open`:
#
# with open("my_query.sql", "r") as data:
# my_query_content = data.read()

from decentriq_platform.analytics import SqliteComputeNodeDefinition

builder.add_node_definition(
SqliteComputeNodeDefinition(
name="sum_salary",
query=my_query_content,
dependencies=["salary_data"]
)
)

Finally, we add the permissions for the user to execute the computation and retrieve the results (in this example, the Data Clean Room only has one user).

builder.add_participant(
user_email,
data_owner_of=["salary_data"],
analyst_of=["sum_salary"]
)

dcr_definition = builder.build()
dcr = client.publish_analytics_dcr(dcr_definition)

data_room_id = dcr.id

Once published, it's possible to upload and provision the dataset, and then run the computation. Please follow the steps from the section Provision datasets via SDK in the Datasets guide.

For this example we provisioned the following dataset to the salary_data table node:

Salary,Name
Alice,10.0
Bob,5.0
John,14.0
import io
from decentriq_platform import Key

# Usually you would want to load your data from a file.
# In this case you would do something like:
#
# with open("/my/file.csv", "rb") as data:
# ...
#
# Here, instead, we use io.BytesIO to make a string "act" like
# it was loaded from a file.

data = io.BytesIO("""Alice,10.0
Bob,5.0
John,14.0
""".encode())

data_node = dcr.get_node("salary_data")
data_node.upload_and_publish_dataset(data, key=Key(), name="my_salary_data.csv")

Now we can run the computation and retrieve the results:

sql_node = dcr.get_node("sum_salary")
results_csv = sql_node.run_computation_and_get_results_as_string()
print("results: ", results_csv)

# results: 29.0
note

When the referenced Data Clean Room was created using the Decentriq UI:

The name argument of the get_node() should be the value that you see when hovering your mouse pointer over the name of that computation.