Getting started with the Python SDK

note

Starting with version 0.26.0 of the Decentriq Python SDK, a new API for creating and interacting with Analytics Data Rooms (formerly just called "Data Rooms") has been released. This tutorial assumes you want to use the new API. The same tutorial for the old API can be found under Getting started (legacy version).

This tutorial will show the steps required to build and run an Analytics Data Clean Room (DCR) from scratch.

An Analytics DCR can be thought of as a graph-like structure that defines computations and their input data in terms of "nodes". A node is either a Data Node to which a dataset can be provisioned or a Compute Node that is able to run a computation on a set of input nodes. These input nodes can be either Data Nodes or other Compute Nodes.

As part of this tutorial you will learn how to

  • create a Client for interacting with the Decentriq Platform,
  • create an AnalyticsDcrBuilder object for constructing an Analytics DCR,
  • add a Data Node to the DCR,
  • add a Compute Node to the DCR,
  • make data accessible to the DCR,
  • run a computation in the DCR, and
  • interact with an existing DCR.

Creating a Client

Before a DCR can be built, a Client object must be constructed with the necessary user credentials. The Client object is responsible for communicating with the Decentriq platform and can be used to retrieve information about existing DCRs or available datasets etc.

import decentriq_platform as dq

user_email = "@@ YOUR EMAIL HERE @@"
api_token = "@@ YOUR TOKEN HERE @@"

client = dq.create_client(user_email, api_token)
note

Create and manage your API tokens via the Decentriq UI - API tokens page.

Creating an AnalyticsDcrBuilder

The AnalyticsDcrBuilder provides a convenient way of constructing an Analytics DCR. It provides a number of builder functions which can be used to parameterise the DCR. Examples of such parameterisation include:

  • Setting the DCR name
  • Setting the DCR description
  • Specifying the participants of the DCR

An example of building an Analytics DCR is shown below:

from decentriq_platform.analytics import AnalyticsDcrBuilder

builder = AnalyticsDcrBuilder(client=client)
builder.\
    with_name("My DCR").\
    with_owner(user_email).\
    with_description("My test DCR")

# The code could also be written on a single line, or, instead of
# using backslashes to escape the newlines, you could write it
# using parentheses:
#
# builder = (
#     AnalyticsDcrBuilder(client=client)
#     .with_name("My DCR")
#     .with_owner(user_email)
#     .with_description("My test DCR")
# )

As you may have noticed, the Client object we constructed earlier is passed to the AnalyticsDcrBuilder. This is necessary because the builder needs information about the enclaves currently available on the platform.

Adding Data Nodes to a DCR

A Data Node provides access to a dataset from within a DCR.

A Data Node can be added to a DCR via the add_node_definition function provided by the AnalyticsDcrBuilder. The following types of Data Nodes are supported:

  1. RawDataNode:
    • Makes provisioned datasets available as raw files to any downstream Compute Nodes.
    • Useful for unstructured data such as images or binary data.
  2. TableDataNode:
    • Verifies that the provisioned dataset conforms to a certain structure.
    • Required when processing data with SQL Compute Nodes.

Below is an example of adding a RawDataNode to a DCR.

from decentriq_platform.analytics import RawDataNodeDefinition

# Create a `RawDataNodeDefinition` and add it to the DCR right away:
builder.add_node_definition(
    RawDataNodeDefinition(name="my-raw-data-node", is_required=True)
)

Below is an example of adding a TableDataNode to a DCR.

from decentriq_platform.analytics import (
    Column,
    FormatType,
    TableDataNodeDefinition,
)

columns = [
    Column(
        name="value",
        format_type=FormatType.FLOAT,
        is_nullable=False,
    ),
    Column(
        name="name",
        format_type=FormatType.STRING,
        is_nullable=False,
    ),
]

builder.add_node_definition(
    TableDataNodeDefinition(
        name="tabular_data", columns=columns, is_required=False
    )
)
note

When adding a node to a DCR, the name of the class to instantiate always ends in Definition. For example, to add a TableDataNode, the class to construct is called TableDataNodeDefinition. The Definition class serves as the blueprint from which nodes of that type are constructed.

The is_required=True flag tells the DCR that any downstream computations (i.e. computations that directly or indirectly read data from this node) can only be run once a dataset has been provisioned to this node.

Adding a Computation Node to a DCR

A Computation Node represents a computation that can be run within a DCR. It can be added to a DCR in the same way as a Data Node. The following types of Computation Nodes are supported:

  1. PythonComputeNode
    • Used for running Python-based computations within an enclave
    • Can make use of a wide variety of data processing libraries such as pandas and scikit-learn
  2. RComputeNode
    • Used for running R-based computations within an enclave
    • Wide selection of R libraries available (including texlive)
  3. SqliteComputeNode
    • Used for running SQL-based queries
    • Based on SQLite
  4. SqlComputeNode
    • Also used for running SQL-based queries
    • Uses a custom SQL engine that runs on Intel SGX
    • Unless this engine is specifically required, we recommend using SqliteComputeNode for SQL workloads (see the sketch after the PythonComputeNode example below)
  5. SyntheticDataComputeNode
    • Output synthetic data based on structured input data
    • Can mask columns containing sensitive information
    • Useful for testing downstream computations on real-looking data
  6. S3SinkComputeNode
    • Store the output of a computation in an S3 bucket
  7. MatchingComputeNode
    • Match two structured input datasets on a given key
  8. PreviewComputeNode
    • Restrict how much data can be read by another party from a particular Compute Node

Below is an example of adding a PythonComputeNode to a DCR.

from decentriq_platform.analytics import PythonComputeNodeDefinition

builder.add_node_definition(
    PythonComputeNodeDefinition(
        name="python-node",
        script="""
import decentriq_util
import shutil

shutil.copyfile("/input/my-raw-data-node", "/output/result.txt")

df_table = decentriq_util.read_tabular_data("/input/tabular_data")
df_table.to_csv("/output/result.csv", index=False, header=True)
""",
        dependencies=["my-raw-data-node", "tabular_data"],
    )
)

Note how we are again adding a PythonComputeNodeDefinition to the builder in order to construct a PythonComputeNode.

The node is configured to depend on the Data Nodes we added earlier. The contents of these nodes will be made available as files in the /input directory (the name of each file matches the name of its node).

In this case, the script simply echoes the contents of the input nodes, but it could be much more complex and make use of any of the libraries that exist in its environment. Any file written to the /output directory is considered part of the result of the node. There is no limit on the number of files that can be written to /output.
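
As referenced in the list of Compute Node types above, here is a sketch of how a SqliteComputeNode could be added to query the table data instead. The keyword argument sql_statement and the convention of referencing the dependency by its node name are assumptions on our part; consult the API reference of SqliteComputeNodeDefinition for the exact signature.

from decentriq_platform.analytics import SqliteComputeNodeDefinition

# A sketch only: the name of the query argument (`sql_statement`) and the
# table name used in the query are assumptions and may differ in your SDK version.
builder.add_node_definition(
    SqliteComputeNodeDefinition(
        name="sql-node",
        sql_statement="""
            SELECT name, SUM(value) AS total
            FROM tabular_data
            GROUP BY name
        """,
        dependencies=["tabular_data"],
    )
)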

note

Please check the Computations section for examples of how to process data using Python, R, SQL, and more.

Adding permissions to the DCR

Next we need to define the list of participants in the DCR and specify what permissions each participant has.

A participant can be a Data Owner of a data node (which will give the user the right to provision datasets). A participant can also be an Analyst of a Compute Node (this makes it possible for the user to run the node and retrieve its results). Finally, a participant can also have no permissions configured. This makes the participant an Auditor of the DCR (they can see the DCR and inspect the computations, but they cannot interact with it).

builder.add_participant(
    user_email,
    data_owner_of=["my-raw-data-node", "tabular_data"],
    analyst_of=["python-node"],
)
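
Based on the permission model described above, adding an Auditor should amount to adding a participant without any permission keyword arguments. The snippet below is a sketch under that assumption, and auditor@example.com is a placeholder address.

# Assumption: omitting both `data_owner_of` and `analyst_of` adds the
# participant without permissions, i.e. as an Auditor of the DCR.
builder.add_participant("auditor@example.com")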

Publishing the DCR

A Data Clean Room needs to be built and published before it can be used. This will encrypt the DCR and send it to the enclave where it is stored.

By building the DCR, we create its definition (analogous to the node definitions encountered earlier).

dcr_definition = builder.build()

This definition can then be published using the client which will return the final AnalyticsDcr object that can be used to interact with the live DCR.

dcr = client.publish_analytics_dcr(dcr_definition)

The id of the DCR can be obtained via its id field (the same id you also see in the address bar of the Decentriq UI).

dcr_id = dcr.id

Making data accessible to the DCR

A dataset is available for use by a DCR once it has been uploaded to the Decentriq Platform and published (or "provisioned") to a Data Node within a DCR. Using the AnalyticsDcr object we just received, we can obtain a handle on the data node and upload data to it as follows:

import io
from decentriq_platform import Key

key = Key() # generate an encryption key with which to encrypt the dataset
raw_data_node = dcr.get_node("my-raw-data-node")
data = io.BytesIO(b"my-dataset")
raw_data_node.upload_and_publish_dataset(data, key, "my-data.txt")

# For demo purposes we used a BytesIO wrapper around a string.
# In a real-world use case, however, you would probably want to read a local file instead.
# In that case, use the following syntax (note the "rb" when reading the file):
#
# with open("local-file.txt", "rb") as data:
#     raw_data_node.upload_and_publish_dataset(data, key, "my-data.txt")

Often it is useful to upload a dataset in a separate step. It can then simply be published to the Data Node using its publish_dataset method:

data = io.BytesIO(b"some-new-data-dataset")
key = Key()

# Upload the dataset to the Decentriq Platform in a separate step
manifest_hash = client.upload_dataset(data, key, "my-new-data.txt")

# Make the dataset available within a DCR.
raw_data_node.publish_dataset(manifest_hash, key)

Below is an example of uploading and publishing a tabular dataset to a TableDataNode.

key = Key()
tabular_dataset = io.BytesIO(b"""10.0,Alice
5.0,Bob
14.0,John
""")
# Or read the dataset from a file:
# with open("/path/to/dataset.csv", "rb") as tabular_dataset:
dcr.get_node("tabular_data").upload_and_publish_dataset(
    tabular_dataset,
    name="My Tabular Dataset",
    key=key,
)
note

Please check the Datasets Cookbook section for more information on how to upload and provision datasets.

Running a computation in a DCR

A Compute Node represents a computation within a DCR. To run a computation and retrieve the results, call run_computation_and_get_results_as_zip on the Compute Node. This is a blocking call which waits for the results to become available before returning.

python_node = dcr.get_node("python-node")
results = python_node.run_computation_and_get_results_as_zip()
result_txt = results.read("result.txt").decode()
assert result_txt == "some-new-data-dataset"
note

Different Compute Nodes return different types of results. Each Compute Node, however, has a method to return the results as a simple blob of bytes that can then be interpreted in an appropriate way. Please check the Run computations Cookbook section for more information.
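
As a sketch of this, fetching the raw result bytes might look as follows. The method name run_computation_and_get_results_as_bytes is our assumption and may be named differently; check the API reference of your Compute Node class.

# Hypothetical method name (assumption); for a PythonComputeNode the raw
# bytes would contain the same zip archive as in the example above.
raw_result = python_node.run_computation_and_get_results_as_bytes()
print(f"Received {len(raw_result)} bytes")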

A computation can also be run in a non-blocking way by calling run_computation. This starts the computation but does not wait for the results. Results can be retrieved later by calling get_results_as_zip and passing it the obtained JobId:

job_id = python_node.run_computation()
results_without_blocking = python_node.get_results_as_zip(job_id)

Interacting with an existing DCR

A DCR might have been created in the Decentriq UI or as part of another script. In this case it can easily be retrieved using the Client object.

Once retrieved, the AnalyticsDcr can be used to get the various Data/Compute Nodes that exist in the DCR. These nodes can be interacted with in the same way as they are when creating a new DCR.

dcr = client.retrieve_analytics_dcr(dcr_id)

data_node = dcr.get_node("my-raw-data-node")
data_node.upload_and_publish_dataset(
    io.BytesIO(b"new dataset"), key=key, name="my-new-dataset.txt"
)

python_node = dcr.get_node("python-node")

result_zip = python_node.run_computation_and_get_results_as_zip()
result_txt = result_zip.read("result.txt").decode()

assert result_txt == "new dataset"