
Getting started with the Python SDK (legacy version)

note

Starting with version 0.26.0 of the Decentriq Python SDK, a new API for creating and interacting with Data Rooms has been released. This tutorial assumes you want to use the old API. The same tutorial for the new API can be found under Getting started.

In this tutorial we will:

  • establish a connection to an enclave running on the Decentriq platform.
  • create a Data Clean Room (DCR) instance given a specific configuration.
  • upload and publish data to the DCR.
  • run a computation for the given DCR and fetch the results.
  • inspect the tamper-proof audit log.

To follow this tutorial, first install our Python library (see the Installation section), then execute the following commands either interactively (e.g. in a Jupyter notebook) or concatenated as a Python script.

Establish connection to an enclave

First, you need to authenticate with the platform. Please specify your user email as well as a valid API token generated for that account.

user_email = "test_user@company.com"
api_token = "@@ YOUR TOKEN HERE @@"
note

Create and manage your API tokens via the Decentriq UI - API tokens page.

Next, we import the necessary dependencies into our program. The Python SDK consists of the main package decentriq_platform that provides most of the tools needed to interact with the Decentriq platform. Extra functionality, such as the ability to define SQL-based computations to be run in an enclave, is provided by submodules (also called compute modules). In this example, we will define SQL queries and thus need to import the sql module:

import decentriq_platform.legacy as dq
import decentriq_platform.legacy.sql as dqsql

We create a Client object which we use as a starting point for interacting with the platform.

We will also create the list of enclave specifications we want to use. These objects control which types of enclaves the SDK trusts and are used in the remote attestation procedure, i.e. the procedure by which the client code verifies that the enclave it connects to is the one you expect. These specifications are provided as part of the SDK and are numbered by their release version. Refer to the reference documentation of the main package and compute modules to find the latest version available.

Since each enclave type is identified by a different specification object, we also need to get the enclave specifications for the worker enclaves that perform the specific computations we want (SQL in this case). Regardless of what types of computation you want to perform within your DCR, you will need to include a specification for the driver enclave. This is the enclave with which the SDK communicates and that splits your computation into executable tasks.

client = dq.create_client(user_email, api_token)
enclave_specs = dq.enclave_specifications.versions([
    "decentriq.driver:v20",
    "decentriq.sql-worker:v12"
])

Next, we will create an Auth object that defines how the current user will authenticate itself when communicating with the enclave. Authentication of users by enclaves is performed using a user-defined root certificate that is part of the DCR configuration. Only users that can provide a certificate signed by the corresponding private key can connect to this particular DCR. With the following function, we can quickly create such an object using Decentriq as the CA to issue user certificates.

With this Auth object we will be able to finally create a Session, the object that takes care of all communication from and to the driver enclave.

auth, _ = client.create_auth_using_decentriq_pki(enclave_specs)
session = client.create_session(auth, enclave_specs)
note

To create or interact with an existing password-protected DCR, use Auth in combination with Endorser:
auth, endorser = client.create_auth_using_decentriq_pki(enclave_specs)
dcr_secret_endorsement, dcr_secret_id = endorser.dcr_secret_endorsement("your_dcr_password_here")
auth.attach_endorsement(dcr_secret=dcr_secret_endorsement)
The Session creation remains the same as shown above.

Creation of a Data Clean Room (DCR)

A Data Clean Room running on the platform can be seen as an instantiation of a DCR configuration. This configuration strictly defines the schemas of all datasets associated with a DCR. In the Decentriq platform, computations and the data they depend on are arranged in a compute graph with nodes being either data nodes (also called leaves) or compute nodes. Eventually, users will upload their datasets to the data nodes that were defined in the DCR configuration. Similarly, users will be able to run the computations (defined by the compute nodes) by making the appropriate method calls. Which user is able to upload data to which data node and trigger which computation is controlled using our permission system (see below).

We can define a DCR configuration using the DataRoomBuilder class. We supply it with the name of the Data Clean Room we want to build and the enclave specifications to use for the worker enclaves that will eventually execute our computations.

builder = dq.DataRoomBuilder("My DCR", enclave_specs=enclave_specs)

Data and compute nodes need to be added to the Data Clean Room builder by calling the appropriate method. For tabular datasets that have a pre-defined schema, a special helper class exists that will (besides adding a data node) also add a compute node for verifying the schema of the tabular data. The builder object is given a name that serves as the node identifier of this verification computation. When adding additional computations that process your data (such as an SQL query), these computations should take as input the output of the verification computation rather than the dataset directly. This way, the SQL engine understands the schema of your data and knows that it is structured as expected.

data_node_builder = dqsql.TabularDataNodeBuilder(
    "salary_data",
    schema=[
        # The name of each column, together with its type, and a flag
        # for whether values of that column are nullable.
        ("name", dqsql.PrimitiveType.STRING, False),
        ("salary", dqsql.PrimitiveType.FLOAT64, False)
    ]
)

# Add all the nodes, as well as the permissions for uploading data
# and validating it, in one call.
data_node_builder.add_to_builder(
    builder,
    authentication=client.decentriq_pki_authentication,
    users=[user_email]
)

Next, we define the actual SQL-based computation to be run on our data, as well as the permissions that need to be checked by the enclave. In this example we only add permissions for a single user; we can, however, add permissions for as many users as we like. Some basic permissions are given to each user automatically (this behavior can be turned off by setting the appropriate flag when constructing the Data Clean Room builder object, as shown in the sketch after the following list). These are:

  1. Permission to retrieve the DCR configuration and inspect its contents.
  2. Permission to retrieve the audit log that contains a history of what operations have been performed in the DCR and by whom.
  3. Permission to inspect the status of a DCR (whether it is active or it has been stopped).
  4. Permission to retrieve the list of datasets provisioned to a Data Clean Room.
  5. Permission to run development computations.
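
If you prefer to manage all permissions yourself, this automatic granting can be disabled when constructing the builder. A minimal sketch, assuming the flag is called add_basic_user_permissions (check the DataRoomBuilder reference documentation for the exact parameter name):

# Sketch: construct a builder without the automatically granted basic
# permissions. The flag name `add_basic_user_permissions` is an
# assumption; verify it in the reference documentation.
builder = dq.DataRoomBuilder(
    "My DCR",
    enclave_specs=enclave_specs,
    add_basic_user_permissions=False
)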

The permission to upload files to the tabular data node has also already been granted when adding the data node to the builder. When not using the TabularDataNodeBuilder class, this permission would need to be granted explicitly.
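
For illustration, granting this permission by hand could look like the following sketch, where data_node_id is a placeholder for the identifier of the data node in question and the constructor name leaf_crud is an assumption (see the Permissions reference documentation):

# Sketch: explicitly grant upload (and related) permissions on a data
# node. `data_node_id` is a placeholder and `leaf_crud` is assumed to
# be the relevant Permissions constructor; check the reference docs.
builder.add_user_permission(
    email=user_email,
    authentication_method=client.decentriq_pki_authentication,
    permissions=[dq.Permissions.leaf_crud(data_node_id)]
)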

Compute nodes for SQL-based computations are provided by the SqlCompute class.

The SqlCompute class constructor accepts a name argument that simply serves as a human-readable identifier. Within the enclave, each compute node is identified by an automatically generated ID. We also pass the query to be executed as a string. In this query, we can refer to tables (or rather the compute nodes representing them) that are part of the same Data Clean Room. Because the enclave uses identifiers rather than human-readable names to address compute nodes, we also have to tell the computation which upstream compute node provides the data for which table name. The TabularDataNodeBuilder class sets up the Data Clean Room in such a way that the verification computation has its own ID. The dependencies argument is therefore a mapping from the table name to the ID of the verification computation node.

query_node = dqsql.SqlCompute(
    # A human-readable name for the computation
    name="salary_sum",
    # The query to be executed
    sql_statement="""
        SELECT SUM(salary)
        FROM salary_data
    """,
    # A list of tuples, each containing the following two values:
    # 1. The table name as it appears in the query string.
    # 2. The ID of the verification node that provides data for this table.
    dependencies=[
        ("salary_data", data_node_builder.output_node_id)
    ]
)

When adding a new data or compute node, the builder will assign the newly added node an identifier and return it. This identifier is needed when interacting with the node (for example when running a particular computation or when defining permissions affecting this node).

query_node_id = builder.add_compute_node(query_node)

# Adding the permissions
builder.add_user_permission(
    email=user_email,
    # We are again using the Decentriq PKI as the DCR authentication method
    authentication_method=client.decentriq_pki_authentication,
    permissions=[
        # Permission to execute the actual computation
        dq.Permissions.execute_compute(query_node_id),
        # Permission to retrieve the result of the computation
        dq.Permissions.retrieve_compute_result(query_node_id)
    ]
)

The Data Clean Room can now be built. Note that this is not yet a Data Clean Room you can interact with, but only the initial configuration of a DCR (technically, it is a list of modifications that will be applied to an empty DCR configuration by a secure enclave). Only after the Data Clean Room configuration has been published will the enclaves be able to perform the computations defined in it.

data_room = builder.build()
data_room_id = session.publish_data_room(data_room)

The ID of the published DCR will be returned from the publishing method, and it will be needed for all future interactions with the DCR. You can fetch a list of descriptions of your existing Data Clean Rooms like this:

client.get_data_room_descriptions()

In this output you will find for each DCR its id; this is the ID to use when referring to the Data Clean Room.
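
For example, assuming each returned description exposes its id as a dictionary entry (the exact structure of the returned objects is documented in the reference docs), you could list all IDs like this:

# Sketch: print the ID of every Data Clean Room visible to this client.
# Assumes each description contains an "id" entry; the exact structure
# may differ between SDK versions.
for description in client.get_data_room_descriptions():
    print(description["id"])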

Upload and publish data to a DCR

Let's create some example data which we want to ingest. Given our table schema from above, we define some names and salaries that we want to sum.

name    salary
Bob     10.0
Alice   5.0
Jack    14.0

We can define this table as a CSV string in Python and read it with one of the helper functions provided by the sql compute module (a similar function exists for reading directly from CSV files, refer to the reference docs to learn more):

my_csv_string = """
name,salary
Bob,10.0
Alice,5.0
Jack,14.0
"""

data = dqsql.read_input_csv_string(my_csv_string, has_header=True, delimiter=",")
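
If your data lives in a file instead, the file-based helper mentioned above can be used. A sketch, assuming it is called read_input_csv_file and accepts the same keyword arguments (see the reference docs for the exact name and signature):

# Sketch: read the table directly from a CSV file.
# The helper name `read_input_csv_file` is an assumption.
data = dqsql.read_input_csv_file(
    "salary_data.csv", has_header=True, delimiter=","
)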

The data can be encrypted and uploaded to the enclave as follows:

encryption_key = dq.Key()

dataset_id = dqsql.upload_and_publish_tabular_dataset(
data, encryption_key, data_room_id,
table="salary_data",
session=session,
description="salary",
validate=True
)

This is a convenience method that takes care of encrypting the data, uploading it, connecting it to a DCR (called "publishing"), as well as validating its schema. Normally, uploading datasets and publishing them are two separate steps (see client.upload_dataset and session.publish_dataset), with uploading not requiring an active enclave session. Use whatever makes more sense for your use case.
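
As a rough sketch, the two-step variant could look as follows; the exact signatures of client.upload_dataset and session.publish_dataset are assumptions here, so consult the reference documentation before relying on them:

# Sketch of the two-step flow (argument names and order are assumptions).
# 1. Encrypt and upload the dataset; no enclave session is required.
manifest_hash = client.upload_dataset(data, encryption_key, "salary")
# 2. Publish (connect) the uploaded dataset to the DCR's data node.
session.publish_dataset(
    data_room_id, manifest_hash, "salary_data", encryption_key
)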

note

When the referenced Data Clean Room was created using the Decentriq UI:

  • the table argument will have the format <NODE_ID>, where <NODE_ID> corresponds to the value that you see when hovering your mouse pointer over the name of the data node.

Run the query and retrieve results

After ingesting the data, we can now run the pre-defined query on the DCR by calling the run_computation method and providing it with the id of the compute node we added earlier.

# Trigger the computation
job_id = session.run_computation(data_room_id, query_node_id)

# Poll the platform every 5 seconds and fetch the results as
# soon as the computation has finished.
results = session.get_computation_result(job_id)

Computation results are always binary strings, and we need to interpret them according to the type of computation that produced them. To help with this, each compute module provides helper functions for exactly this purpose:

csv = dqsql.read_sql_query_result_as_string(results)

print(csv)
#> V1
#> 29.0

# Write the output to a CSV file:
with open('output.csv', 'w') as f:
    f.write(csv)

Inspect audit log

At any time we can also obtain a tamper-proof audit log of all events that happened with respect to the DCR:

audit_log = session.retrieve_audit_log(data_room_id)
print(audit_log.log.decode())