SQL computation

Decentriq’s Data Clean Rooms are equipped with a SQLite-based engine, powered by confidential computing to guarantee a high level of security. You can run queries against provisioned datasets, and only the results are made available; the sensitive data itself is never revealed to anyone. Results of SQL computations can be used as input to other computations, such as Python, R, and Synthetic Data Generation.

Current version

decentriq.python-ml-worker-32-64:v21

note

The Python worker is used for SQL computations, as it has SQLite natively available, resulting in better performance.

Documentation

Please refer to the official SQLite documentation for details of the supported syntax and functions.

Querying computation results

In the Data Clean Room, SQL queries can reference the output of other computations in the FROM clause to compose nested queries. You can create independent computations or chain them, using the output of one computation as the input of the next. This approach also allows combining SQL with Python and other computation types such as Synthetic Data Generation. To nest queries within a single computation instead, use a WITH ... AS clause, which runs the entire block at once (see the sketch after the example below).

-- Computation_1 querying table1.
SELECT sum(col_a) AS sum_col_a
FROM table1

-- Computation_2 querying Computation_1.
SELECT sum_col_a
FROM Computation_1
WHERE sum_col_a > 5
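
For comparison, here is a minimal sketch of the same logic expressed as a single computation using a WITH ... AS clause (table and column names taken from the example above):

-- Single computation combining both steps with WITH ... AS.
WITH Computation_1 AS (
    SELECT sum(col_a) AS sum_col_a
    FROM table1
)
SELECT sum_col_a
FROM Computation_1
WHERE sum_col_a > 5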

Create a SQL computation in the Decentriq UI

Follow the Decentriq UI tutorial

Create a SQL computation using the Python SDK

The decentriq_platform.container module provides functionality to run computations within containers. In this example, we will use the Python container to run a SQL computation.

Assume we want to create a Data Clean Room that simply calculates the sum of salaries from a SQL table. Using the Python SDK to accomplish this task could look as follows:

First, we set up a connection to an enclave and create a DataRoomBuilder object:

import decentriq_platform as dq
import decentriq_platform.container as dqc
import decentriq_platform.sql as dqsql
import decentriq_platform.proto.compute_sql_pb2 as sqlpb
import decentriq_platform.proto as dqpb

user_email = "test_user@company.com"
api_token = "@@ YOUR TOKEN HERE @@"

client = dq.create_client(user_email, api_token)
enclave_specs = dq.enclave_specifications.versions([
    "decentriq.driver:v20",
    "decentriq.sql-worker:v12",
    "decentriq.python-ml-worker-32-64:v21",
])
auth, _ = client.create_auth_using_decentriq_pki(enclave_specs)
session = client.create_session(auth, enclave_specs)

builder = dq.DataRoomBuilder(
    "SQL Data Clean Room",
    enclave_specs=enclave_specs,
)
note

In addition to the driver, we request the enclave specification named decentriq.sql-worker to create SQL tables and the decentriq.python-ml-worker-32-64 to run SQL computations. The Python worker is more performant because it uses SQLite natively to run computations.

note

It is recommended to always use the most up-to-date enclave driver and workers. Please check the Release Notes for the latest versions.

Then we create a table with the specified schema:

data_node_builder = dqsql.TabularDataNodeBuilder(
    "salary_data",
    schema=[
        # The name of each column, together with its type and a flag
        # for whether values of that column are nullable.
        ("name", dqsql.PrimitiveType.STRING, False),
        ("salary", dqsql.PrimitiveType.FLOAT64, False),
    ]
)

Next, we add the permissions for uploading and validating the data. The add_to_builder method returns the IDs of the data node and the validated data node, which we will need later to reference the data node in the computation.

# Add all the nodes, as well as the permissions for uploading data
# and validating it, in one call.
data_node_id, validated_data_node_id = data_node_builder.add_to_builder(
    builder,
    authentication=client.decentriq_pki_authentication,
    users=[user_email],
)

Next, we create a configuration that holds the table name, the SQL statement, and the dependency to the validated data node. This configuration is then serialized and added to the builder.

sql_statement = "SELECT SUM(salary) FROM salary_data"

sql_worker_config = sqlpb.SqlWorkerConfiguration(
    computation=sqlpb.ComputationConfiguration(
        sqlStatement=sql_statement,
        tableDependencyMappings=[
            sqlpb.TableDependencyMapping(
                table="salary_data",
                dependency=validated_data_node_id,
            ),
        ],
    )
)

sql_worker_config_node = dq.StaticContent(
    "sql_worker_config",
    dqpb.serialize_length_delimited(sql_worker_config),
)
sql_worker_config_node_id = builder.add_compute_node(sql_worker_config_node)

Now we can add the computation to the Data Clean Room, taking the configuration and dependencies from above.

sql_worker = dqc.StaticContainerCompute(
    name="sql_container_worker",
    command="run-sql-worker --input salary_data --config config".split(),
    mount_points=[
        dqc.proto.MountPoint(
            path="salary_data",
            dependency=validated_data_node_id,
        ),
        dqc.proto.MountPoint(
            path="config",
            dependency=sql_worker_config_node_id,
        ),
    ],
    output_path="/output",
    enclave_type="decentriq.python-ml-worker-32-64",
)
sql_worker_node_id = builder.add_compute_node(sql_worker)

Finally, we add the permissions for the user to execute the computation and retrieve the results (in this example, the Data Clean Room only has one user).

builder.add_user_permission(
    email=user_email,
    authentication_method=client.decentriq_pki_authentication,
    permissions=[
        dq.Permissions.leaf_crud(data_node_id),
        dq.Permissions.execute_compute(sql_worker_node_id),
        dq.Permissions.retrieve_compute_result(sql_worker_node_id),
    ],
)

With everything in place, the Data Clean Room can be encrypted and published:

data_room = builder.build()
dcr_id = session.publish_data_room(data_room)

Once published, it's possible to upload and provision the dataset, and then run the computation. Please follow the steps from the section Provision datasets via SDK in the Datasets guide.

For this example we provisioned the following dataset to the salary_data table node:

Name,Salary
Alice,10.0
Bob,5.0
John,14.0
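
For illustration, the following is a minimal sketch of what provisioning this dataset could look like. The client.upload_dataset and session.publish_dataset calls are assumptions based on the Datasets guide; their exact names and signatures may differ between SDK versions.

import io

# The CSV data from above, encoded as bytes for upload.
csv_data = b"Name,Salary\nAlice,10.0\nBob,5.0\nJohn,14.0\n"

# Generate a fresh encryption key for the dataset (assumed API).
encryption_key = dq.Key()

# Upload the encrypted dataset, then provision it to the raw data node
# of the published Data Clean Room (assumed APIs; see the Datasets guide).
dataset_id = client.upload_dataset(
    io.BytesIO(csv_data), encryption_key, "salary_data.csv"
)
session.publish_dataset(
    dcr_id, dataset_id, leaf_id=data_node_id, key=encryption_key
)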

Now we can run the computation and retrieve the results:

results = session.run_computation_and_get_results(
    dcr_id, sql_worker_node_id, interval=1
)
results_csv, _ = dqsql.helpers.read_sql_query_result_as_list(results)
print("results_csv: ", results_csv)

# results_csv: [['29.0']]