Getting started with the Python SDK (legacy version)
Starting with version 0.26.0 of the Decentriq Python SDK, a new API for creating and interacting with Data Rooms has been released. This tutorial assumes you want to use the old API. The same tutorial for the new API can be found under Getting started.
In this tutorial we will:
- establish a connection to an enclave running on the Decentriq platform.
- create a Data Clean Room (DCR) instance given a specific configuration.
- upload and publish data to the DCR.
- run a computation for the given DCR and fetch the results.
- inspect the tamper-proof audit log.
To follow this tutorial, first install our Python library (see the Installation section), then execute the following commands either interactively (e.g. in a Jupyter notebook) or concatenated as a Python script.
Establish connection to an enclave
First, you need to authenticate with the platform. Please specify your user email as well as a valid API token generated for that account.
user_email = "test_user@company.com"
api_token = "@@ YOUR TOKEN HERE @@"
Create and manage your API tokens via the Decentriq UI - API tokens page.
Next, we import the necessary dependencies into our program.
The Python SDK consists of the main package decentriq_platform, which provides most of the tools needed to interact with the Decentriq platform. Extra functionality, such as the ability to define SQL-based computations to be run in an enclave, is provided by submodules (also called compute modules). In this example we will define SQL queries, so we need to import the sql module:
import decentriq_platform as dq
import decentriq_platform.legacy.sql as dqsql
We create a Client object, which serves as the starting point for interacting with the platform.
We will also create the list of enclave specifications we want to use. These are objects that control which types of enclaves the SDK trusts, and they are used in the remote attestation procedure, i.e. the procedure by which the client code verifies that the enclave it connects to is the one you expect. These specifications are provided as part of the SDK and are versioned by their release. Refer to the reference documentation of the main package and compute modules to find the latest version available. Since each enclave type is identified by a different specification object, we also need to include the enclave specifications for the worker enclaves that perform the specific computations we want (SQL in this case). Regardless of what types of computation you want to perform within your DCR, you will need to include a specification for the driver enclave. This is the enclave with which the SDK communicates and which splits your computation into executable tasks.
client = dq.create_client(user_email, api_token)
enclave_specs = dq.enclave_specifications.versions([
"decentriq.driver:v20",
"decentriq.sql-worker:v12"
])
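If you are unsure which specification versions ship with your installed SDK, you can inspect them programmatically. The snippet below is a minimal sketch assuming the enclave_specifications object exposes a list() method; check the reference documentation for the exact API of your SDK version.
# Print the names of all enclave specifications bundled with the SDK
# (assumes a `list()` method exists; see the reference documentation).
print(dq.enclave_specifications.list())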
Next, we will create an Auth object that defines how the current user authenticates when communicating with the enclave. Enclaves authenticate users using a user-defined root certificate that is part of the DCR configuration. Only users that can provide a certificate signed by the corresponding private key can connect to this particular DCR. With the following function, we can quickly create such an object using Decentriq as the CA that issues user certificates. With this Auth object we will finally be able to create a Session, the object that takes care of all communication to and from the driver enclave.
auth, _ = client.create_auth_using_decentriq_pki(enclave_specs)
session = client.create_session(auth, enclave_specs)
To create or interact with an existing password-protected DCR, use Auth in combination with Endorser:
auth, endorser = client.create_auth_using_decentriq_pki(enclave_specs)
dcr_secret_endorsement, dcr_secret_id = endorser.dcr_secret_endorsement("your_dcr_password_here")
auth.attach_endorsement(dcr_secret=dcr_secret_endorsement)
The Session is created in the same way as shown above.
Creation of a Data Clean Room (DCR)
A Data Clean Room running on the platform can be seen as an instantiation of a DCR configuration. This configuration strictly defines the schemas of all datasets associated with a DCR. In the Decentriq platform, computations and the data they depend on are arranged in a compute graph with nodes being either data nodes (also called leaves) or compute nodes. Eventually, users will upload their datasets to the data nodes that were defined in the DCR configuration. Similarly, users will be able to run the computations (defined by the compute nodes) by making the appropriate method calls. Which user is able to upload data to which data node and trigger which computation is controlled using our permission system (see below).
We can define a DCR configuration using the DataRoomBuilder class. We supply it with the name of the Data Clean Room we want to build and the enclave specifications to use for the worker enclaves that will eventually execute our computations.
builder = dq.legacy.DataRoomBuilder("My DCR", enclave_specs=enclave_specs)
Data and compute nodes need to be added to the Data Clean Room builder by calling the appropriate methods. For tabular datasets with a pre-defined schema, a special helper class exists that will (besides adding a data node) also add a compute node for verifying the schema of the tabular data. The builder object is given a name that serves as the node identifier of this verification computation. When adding additional computations that process your data (such as an SQL query), these computations will take as input the output of the verification computation rather than the dataset directly. This way, the SQL engine understands the schema of your data and knows that it is structured as expected.
data_node_builder = dqsql.TabularDataNodeBuilder(
"salary_data",
schema=[
# The name of each column, together with its type, and a flag
# indicating whether values of that column are nullable.
("name", dqsql.PrimitiveType.STRING, False),
("salary", dqsql.PrimitiveType.FLOAT64, False)
]
)
# Add all the nodes, as well as the permissions for uploading data and validating
# it in one call
data_node_builder.add_to_builder(
builder,
authentication=client.decentriq_pki_authentication,
users=[user_email]
)
Next, we define the actual SQL-based computation to be run on our data, as well as the permissions that need to be checked by the enclave. In this example we only add permissions for a single user. We can, however, add permissions for as many users as we like. Some basic permissions are given to each user automatically (this behavior can be turned off by setting the appropriate flag when constructing the Data Clean Room builder object). These are:
- Permission to retrieve the DCR configuration and inspect its contents.
- Permission to retrieve the audit log that contains a history of what operations have been performed in the DCR and by whom.
- Permission to inspect the status of a DCR (whether it is active or it has been stopped).
- Permission to retrieve the list of datasets provisioned to a Data Clean Room.
- Permission to run development computations.
The permission to upload files to the tabular data node was already granted when adding the data node to the builder. When not using the TabularDataNodeBuilder class, this permission would need to be granted explicitly.
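For illustration, this is roughly what granting such an upload permission explicitly could look like. Both the leaf_crud permission constructor and the raw_data_node_id variable are assumptions made for this sketch; check the Permissions reference documentation for the exact names.
# Hypothetical sketch: grant a user permission to upload data to a plain
# data node added directly to the builder (without TabularDataNodeBuilder).
# `Permissions.leaf_crud` and `raw_data_node_id` are assumed for illustration.
builder.add_user_permission(
    email=user_email,
    authentication_method=client.decentriq_pki_authentication,
    permissions=[
        dq.legacy.Permissions.leaf_crud(raw_data_node_id)
    ]
)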
Compute nodes for SQL-based computations are provided by the SqlCompute class. Its constructor accepts a name argument that simply serves as a human-readable identifier; within the enclave, each compute node is identified by an automatically generated ID. We also pass the query to be executed as a string. In this query we can refer to tables (or rather the compute nodes representing them) that are part of the same Data Clean Room. Because the enclave uses identifiers rather than human-readable names to address compute nodes, we also have to tell the computation which upstream compute node provides the data for which table name. The TabularDataNodeBuilder class sets up the Data Clean Room in such a way that the verification computation has its own node ID. The dependencies argument is therefore a mapping from the table name to the ID of the verification computation node. Putting this together, the compute node can be instantiated as follows:
query_node = dqsql.SqlCompute(
# A human-readable name for the computation
name="salary_sum",
# The query to be executed
sql_statement=f"""
SELECT SUM(salary)
FROM salary_data
""",
# A list of tuples, each containing the following two values:
# 1. The table name as it appears in the query string.
# 2. The ID of the verification node that provides data for this table
dependencies=[
("salary_data", data_node_builder.output_node_id)
]
)
When adding a new data or compute node, the builder will assign the newly added node an identifier and return it. This identifier is needed when interacting with the node (for example when running a particular computation or when defining permissions affecting this node).
query_node_id = builder.add_compute_node(query_node)
# Adding the permissions
builder.add_user_permission(
email=user_email,
# We are again using the Decentriq PKI as the DCR authentication method
authentication_method=client.decentriq_pki_authentication,
permissions=[
# Permission to execute the actual computation
dq.legacy.Permissions.execute_compute(query_node_id),
# Permission to retrieve the result of the computations
dq.legacy.Permissions.retrieve_compute_result(query_node_id)
]
)
The Data Clean Room can now be built. Note that the result is not yet a Data Clean Room with which you can interact, but only the initial configuration of a DCR (technically it's a list of modifications that will be applied to an empty DCR configuration by a secure enclave). Only after publishing the Data Clean Room configuration will the enclaves be able to perform the computations defined in it.
data_room = builder.build()
data_room_id = session.publish_data_room(data_room)
The ID of the published DCR will be returned from the publishing method, and it will be needed for all future interactions with the DCR. You can fetch a list of descriptions of your existing Data Clean Rooms like this:
client.get_data_room_descriptions()
In this output you will find for each DCR its id, which is the id to use when referring to the Data Clean Room.
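If you want to look up a particular DCR programmatically, you can simply iterate over the returned descriptions and print them; the exact structure of each entry may differ between SDK versions, so inspect the output to find the id field.
# Print each description to find the id of the Data Clean Room you need.
for description in client.get_data_room_descriptions():
    print(description)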
Upload and publish data to a DCR
Let's create some example data which we want to ingest. Given our table schema from above, we define some names and salaries that we want to sum.
| name  | salary |
| ----- | ------ |
| Bob   | 10.0   |
| Alice | 5.0    |
| Jack  | 14.0   |
We can define this table as a CSV string in Python and read it with one of the helper functions provided by the sql compute module (a similar function exists for reading directly from CSV files; refer to the reference docs to learn more):
my_csv_string = """
Name,Salary
Alice,10.0
Bob,5.0
John,14.0
"""
data = dqsql.read_input_csv_string(my_csv_string, has_header=True, delimiter=",")
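If your table already lives in a CSV file on disk, you can use the file-based helper instead. The call below is a sketch assuming a read_input_csv_file function with the same flags as its string-based counterpart; refer to the sql module reference for the exact signature.
# Read the same table directly from a CSV file on disk (function name and
# parameters are assumed to mirror read_input_csv_string; see the docs).
data = dqsql.read_input_csv_file("salary_data.csv", has_header=True, delimiter=",")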
The data can be encrypted and uploaded to the enclave as follows:
encryption_key = dq.Key()
dataset_id = dqsql.upload_and_publish_tabular_dataset(
data, encryption_key, data_room_id,
table="salary_data",
session=session,
description="salary",
validate=True
)
This is a convenience method that takes care of encrypting the data, uploading it, connecting it to a DCR (called "publishing"), and validating its schema. Normally, uploading datasets and publishing them are two separate steps (see client.upload_dataset and session.publish_dataset), with uploading not requiring an active enclave session. Use whichever makes more sense for your use case.
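For reference, the two-step flow could look roughly like this. The parameter names and order shown here are assumptions made for the sketch; consult the reference entries for client.upload_dataset and session.publish_dataset for the authoritative signatures.
# Hypothetical sketch of the two-step flow (signatures assumed, see docs).
encryption_key = dq.Key()
# Step 1: encrypt and upload the dataset; no enclave session is required.
manifest_hash = client.upload_dataset(data, encryption_key, "salary_data")
# Step 2: provision ("publish") the uploaded dataset to a data node of the
# DCR. `leaf_id` stands for the id of the target data node (for tabular
# nodes this is the id set up by TabularDataNodeBuilder).
session.publish_dataset(data_room_id, manifest_hash, leaf_id, encryption_key)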
When the referenced Data Clean Room was created using the Decentriq UI:
- the table argument will have the format <NODE_ID>, where <NODE_ID> corresponds to the value that you see when hovering your mouse pointer over the name of the data node.
Run the query and retrieve results
After ingesting the data, we can now run the pre-defined query on the DCR by calling the run_computation method and providing it with the id of the compute node we added earlier.
# Trigger the computation
job_id = session.run_computation(data_room_id, query_node_id)
# Start polling the platform every 5 seconds and fetch the results as
# soon as the computation has finished.
results = session.get_computation_result(job_id)
Computation results will always be binary strings, and we will need to interpret them according to the type of computation that produced them. To help with this, each compute module provides helper functions for exactly this purpose:
csv = dqsql.read_sql_query_result_as_string(results)
print(csv)
#> V1
#> 29.0
# Write the output to a CSV file:
with open('output.csv', 'w') as f:
f.write(csv)
Inspect audit log
At any time we can also obtain a tamper-proof audit log of all events that happened with respect to the DCR:
audit_log = session.retrieve_audit_log(data_room_id)
print(audit_log.log.decode())