Python SDK: Quick Start

This page provides tutorials on how to use the Python SDK. In these tutorials we will:

  • establish a connection to an enclave running on Decentriq's platform,

  • create a data clean room (DCR) instance given a specific definition,

  • upload and publish data to the DCR,

  • run a computation for the given DCR and fetch the results, and

  • inspect the tamper-proof audit log.

To follow these tutorials, first install the SDK (see the reference documentation below) and then run the code from a Python file or from an interactive Python session (such as ipython or a Jupyter Notebook).
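The SDK can be installed via pip. This is a sketch that assumes the package is published under the same name used for imports below; consult the reference documentation for the exact command:

pip install decentriq_platform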


Example 1 - Run SQL queries within a data clean room

Establish connection to an enclave

First, you need to authenticate with the platform. Please specify your user email as well as a valid API token generated for that account. You can create and manage your API tokens on your API token page in the platform UI.

user_email = "test_user@company.com"
api_token = "@@ YOUR TOKEN HERE @@"

Next, we import the necessary dependencies into our program. The Python SDK consists of the main package `decentriq_platform`, which provides most of the tools needed to interact with the Decentriq platform. Extra functionality, such as the ability to define SQL-based computations to be run in an enclave, is part of submodules (also called compute modules). In this example, we will define SQL queries and thus need to import the `sql` module:

import decentriq_platform as dq
import decentriq_platform.sql as dqsql

We create a client object which we use as a starting point for interacting with the platform. To be cross-compatible with the Decentriq user interface, we set `integrate_with_platform=True`. This means that DCRs created with the SDK will also be visible on our web platform. As a consequence, some metadata about the data room (such as who participates in which data room) needs to be stored outside of the enclave environment in a database that is accessible by Decentriq. Please note that the data and results are always encrypted, and Decentriq is not able to access them even when this parameter is set to `True`.

We will also create the list of enclave specifications we want to use. These objects control which types of enclaves the SDK trusts, and they are used in the remote attestation procedure, i.e. the procedure in which the client code verifies that the enclave it connects to is indeed the one you expect. These specifications are provided as part of the SDK and are numbered by their release version. Refer to the reference documentation of the main package and the compute modules to find the latest available version. Since each enclave type is identified by a different specification object, we also need to get the enclave specifications for the worker enclaves that perform the specific computations we want (SQL in this case). Regardless of what types of computation you want to perform within your DCR, you will need to include a specification for the driver enclave. This is the enclave with which the SDK communicates and which splits your computation into executable tasks.

client = dq.create_client(user_email, api_token, integrate_with_platform=True)
enclave_specs = dq.enclave_specifications.versions(["decentriq.driver:v2", "decentriq.sql-worker:v2"])

Next, we will create an `auth` object that defines how the current user will authenticate itself when communicating with the enclave. Authentication of users by enclaves is performed using a user-defined root certificate that is part of the DCR definition. Only users that can provide a certificate signed by the corresponding private key can connect to this particular DCR. With the following function, we can quickly create such an object using Decentriq as the CA to issue user certificates.

With this `auth` object we can finally create a `Session`, the object that takes care of all communication to and from the driver enclave.

auth = client.platform.create_auth_using_decentriq_pki()
session = client.create_session(auth, enclave_specs)

Creation of a data clean room (DCR)

A data clean room running on the platform can be seen as an instantiation of a DCR definition. This definition strictly defines the schemas of all datasets associated with a DCR. In the Decentriq platform, computations and the data they depend on are arranged in a compute graph with nodes being either data nodes or compute nodes. Eventually, users will upload their datasets to the data nodes that were defined in the DCR definition. Similarly, users will be able to run the computations (defined by the compute nodes) by making the appropriate method calls. Which user is able to upload data to which data node and trigger which computation is controlled using our permission system (see below).

We can create a DCR definition using the `DataRoomBuilder` class. We supply it with the name of the data clean room we want to build and the enclave specifications to use for the worker enclaves that will eventually execute our computations.

builder = dq.DataRoomBuilder("My DCR", enclave_specs=enclave_specs)

Data and compute nodes need to be added to the data room builder by calling the appropriate method. For tabular datasets that have a pre-defined schema, a special helper class exists that will (besides adding a data node) also add a compute node for verifying the schema of our tabular data.

data_node_builder = dqsql.TabularDataNodeBuilder(
    "salary_data",
    schema=[
        ("name", dqsql.PrimitiveType.STRING, False),
        ("salary", dqsql.PrimitiveType.FLOAT64, False)
    ]
)

# Add all the nodes, as well as the permissions for uploading data and validating
# it in one call
data_node_builder.add_to_builder(
    builder,
    authentication=client.platform.decentriq_pki_authentication,
    users=[user_email]
)

Next, we define the actual SQL-based computation to be run on our data, as well as the permissions that need to be checked by the enclave. In this example we only add permissions for a single user; we could, however, add permissions for as many users as we like. Some basic permissions are granted to each user automatically (this behavior can be turned off by setting the appropriate flag when constructing the data room builder object). These are:

  1. the permission to retrieve the DCR definition and inspect its contents,

  2. the permission to retrieve the audit log that contains a history of what operations have been performed in the DCR and by whom,

  3. the permission to inspect the status of a DCR (whether it is active or it has been stopped), and

  4. the permission to retrieve the list of datasets connected to a data room.

The permission to upload files to the tabular data node has also already been granted when adding the data node to the builder. When not using the `TabularDataNodeBuilder` class, this permission would need to be granted explicitly.
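For illustration, here is a sketch of how such an explicit grant could look, reusing the `add_user_permission` and `Permissions.leaf_crud` calls that appear later in this tutorial; whether the node created by the builder is addressed as "salary_data" is an assumption.

# Hypothetical sketch: explicitly granting upload (CRUD) permission on a
# data node instead of relying on TabularDataNodeBuilder to do so.
builder.add_user_permission(
    email=user_email,
    authentication_method=client.platform.decentriq_pki_authentication,
    permissions=[
        # Allows the user to upload and manage the dataset behind this node
        dq.Permissions.leaf_crud("salary_data")
    ]
)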

# Note how the name of the table we read from matches the name of the data node builder
# we used earlier.
query_node = dqsql.SqlCompute(
    "salary_sum",
    """
    SELECT SUM(salary)
    FROM salary_data
    """
)

builder.add_compute_node(query_node)

# Adding the permissions
builder.add_user_permission(
    email=user_email,
    # We are again using the Decentriq PKI as the DCR authentication method
    authentication_method=client.platform.decentriq_pki_authentication,
    permissions=[
        # Permission to execute the actual computation
        dq.Permissions.execute_compute("salary_sum")
    ]
)

The data room can now be built. Note that at this point it is not yet a data room with which you can interact, but only the definition of a DCR. Only after publishing the DCR will enclaves be able to perform the computations defined in it.

data_room = builder.build()
data_room_id = session.publish_data_room(data_room)

The ID of the published DCR will be returned from the publishing method, and it will be needed for all further interactions with the DCR. You can fetch a list of descriptions of your existing data rooms like this:

client.platform.get_data_room_descriptions()

# With the output looking similar to the following list:
#
#> [{'status': 'Active',
#>   'name': 'My DCR',
#>   'description': '',
#>   'mrenclave': 'b79eaa8e51f2f9b9144c458fb7519818169cb9200f9560af048e02e846e125f8',
#>   'ownerEmail': 'user_1@decentriq.ch',
#>   'dataRoomId': '9df100d0cc65a9b71b226b2643945f0728ade020e2d817b85ce076c617a6ee0c'}]
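If you ever lose track of a data room's ID, you can recover it from these descriptions. A small sketch based on the fields shown above:

# Recover the ID of a data room by matching on its name.
descriptions = client.platform.get_data_room_descriptions()
my_dcr_id = next(
    d["dataRoomId"] for d in descriptions if d["name"] == "My DCR"
)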

Upload and publish data to a DCR

Let's create some example data which we want to ingest. Given our table schema from above, we define some names and salaries that we want to sum.

| name  | salary |
|-------|--------|
| Bob   | 10.0   |
| Alice | 5.0    |
| Jack  | 14.0   |

We can define this table as a CSV string in Python and read it with one of the helper functions provided by the `sql` compute module (a similar function exists for reading directly from CSV files; refer to the reference docs to learn more):

my_csv_string = """
name,salary
Bob,10.0
Alice,5.0
Jack,14.0
"""

data = dqsql.read_input_csv_string(my_csv_string, has_header=True, delimiter=",")
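If your data lives in a CSV file instead, one option is to read the file contents into a string and pass them to the same helper (the file name below is hypothetical; the dedicated file-reading function is described in the reference docs):

# Sketch: read the CSV contents from a file on disk and parse them with
# the same string-based helper used above.
with open("salaries.csv", "r") as f:
    data = dqsql.read_input_csv_string(f.read(), has_header=True, delimiter=",")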

The data can be encrypted and uploaded to the enclave as follows:

encryption_key = dq.Key()

dataset_id = dqsql.upload_and_publish_tabular_dataset(
    data, encryption_key, data_room_id,
    table="salary_data",
    session=session,
    description="salary",
    validate=True
)

This is a convenience method that takes care of encrypting the data, uploading it, connecting it to a DCR (called "publishing"), as well as validating its schema. Normally, uploading datasets and publishing them are two separate steps (see `client.upload_dataset` and `session.publish_dataset`), with uploading not requiring an active enclave session. Use whichever approach makes more sense for your use case.
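As a sketch, the two-step flow could look as follows. The signatures mirror those used in Example 2 below; the dataset description and the name of the data node created by `TabularDataNodeBuilder` are assumptions, and the schema validation triggered above via `validate=True` would have to be handled separately.

# Sketch of the two-step flow (node name "salary_data" is assumed).
# Step 1: encrypt and upload the dataset; no enclave session required.
dataset_id = client.upload_dataset(data, encryption_key, "salary data")

# Step 2: connect ("publish") the uploaded dataset to the data room.
session.publish_dataset(data_room_id, dataset_id, "salary_data", encryption_key)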

Run the query and retrieve results

After ingesting the data, we can now run the pre-defined query on the DCR:

# Trigger the computation
job_id = session.run_computation(data_room_id, "salary_sum")

# Poll the platform every 5 seconds and fetch the results as
# soon as the computation has finished.
results = session.get_computation_result(job_id)

Computation results are always binary strings and need to be interpreted according to the type of computation that produced them. Each compute module provides helper functions exactly for this purpose:

csv = dqsql.read_sql_query_result_as_string(results)

print(csv)
#> V1
#> 29.0

# Write the output to a CSV file:
with open('output.csv', 'w') as f:
    f.write(csv)

Inspect audit log

At any time we can also obtain a tamper-proof audit log of all events that happened with respect to the DCR:

audit_log = session.retrieve_audit_log(data_room_id)
print(audit_log.log.decode())

Example 2 - Run Python code in a data clean room

The `decentriq_platform.container` module provides functionality to run computations within containers. It enables, for example, the execution of Python scripts that process both structured and unstructured input data within the trusted execution environment.

How to use this functionality is illustrated in this example.

Assume we want to create a data room that simply converts some text in an input file to uppercase. Using the Python SDK to accomplish this task could look as follows:

First, we set up a connection to an enclave and create a `DataRoomBuilder` object:

import decentriq_platform as dq
import decentriq_platform.container as dqc

user_email = "test_user@company.com"
api_token = "@@ YOUR TOKEN HERE @@"

client = dq.create_client(user_email, api_token, integrate_with_platform=True)
enclave_specs = dq.enclave_specifications.versions(["decentriq.driver:v2", "decentriq.python-ml-worker:v1"])
auth = client.platform.create_auth_using_decentriq_pki()
session = client.create_session(auth, enclave_specs)
builder = dq.DataRoomBuilder(
    "Secure Uppercase",
    enclave_specs=enclave_specs,
    owner_email=user_email
)

Note how, in contrast to the previous example, we request the enclave specification named `"decentriq.python-ml-worker:v1"` in addition to the driver enclave. This enclave allows us to run Python code and provides common machine learning libraries such as pandas and scikit-learn as part of its Python environment.

The script to uppercase text contained in an input file could look something like this:

my_script_content = b"""
with open("/input/lowercase.txt", "r") as input_file:
    input_data = input_file.read()
with open("/output/uppercase.txt", "w") as output_file:
    output_file.write(input_data.upper())
"""

Here we defined the script within a multi-line string. For larger scripts, however, defining them in a file would likely be easier.

To use this script in a data clean room, it first has to be loaded into a `StaticContent` node, which is then added to the DCR definition. This makes the Python script visible to all the participants in the DCR.

# If you wrote your script in a separate file, you can simply open
# the file using `with open` (note that we specify the "b" flag to read
# the file as a binary string), like so:
#
# with open("my_script.py", "rb") as data:
#     my_script_content = data.read()

script_node = dq.StaticContent("script_node", my_script_content)

builder.add_compute_node(script_node)

The `StaticContent` node will not be tasked with running the computation; it simply provides the script to be executed. Before worrying about execution, however, we need to add a data node to which we can upload the input file whose content should be converted to uppercase:

builder.add_data_node("input_data_node", is_required=True)

Now we can add the node that will actually execute our script. The compute node class capable of executing such scripts is called `StaticContainerCompute`.

When creating this node, we need to specify the path at which the script should be made available (called "mounting") so that we can refer to it from within the enclave. The same holds true for any input data that we provide, either with additional `StaticContent` nodes or with data nodes, as we do here. This is achieved using `MountPoint` objects, which live in the `proto` namespace (these are low-level objects used in client-enclave communication). We also specify the output path, i.e. the directory in which we store all our output files. The Decentriq platform will automatically zip all the files in this location and provide them as the result of this computation.

Finally, we tell the platform what particular enclave to use for executing our script (remember that we requested the corresponding enclave specification earlier when creating the data room builder object).

from decentriq_platform.container.proto import MountPoint

uppercase_text_node = dqc.StaticContainerCompute(
    name="uppercase_text_node",
    command=["python", "/input/my_script.py"],
    mount_points=[
        MountPoint(path="/input/my_script.py", dependency="script_node"),
        MountPoint(path="/input/lowercase.txt", dependency="input_data_node")
    ],
    output_path="/output",
    enclave_type="decentriq.python-ml-worker"
)

builder.add_compute_node(uppercase_text_node)

builder.add_user_permission(
    email=user_email,
    authentication_method=client.platform.decentriq_pki_authentication,
    permissions=[
        dq.Permissions.leaf_crud("input_data_node"),
        dq.Permissions.execute_compute("uppercase_text_node")
    ]
)

data_room = builder.build()
data_room_id = session.publish_data_room(data_room)

After building and publishing the DCR, we can upload data and connect it to our input node.

key = dq.Key()

# Here again you can use `with open(path, "rb") as data` to read
# the data in the right format from a file.
import io
data = io.BytesIO(b"hello world")

dataset_id = client.upload_dataset(data, key, "myfile")
session.publish_dataset(data_room_id, dataset_id, "input_data_node", key)

When retrieving results for the computation you will get a binary file that represents a `zipfile.ZipFile` object containing all the files you wrote to the specified `output_path`:

raw_result = session.run_computation_and_get_results(data_room_id, "uppercase_text_node")
zip_result = dqc.read_result_as_zipfile(raw_result)

result = zip_result.read("uppercase.txt").decode()

assert result == "HELLO WORLD"