Getting started with the Python SDK
Starting with version 0.26.0 of the Decentriq Python SDK, a new API for creating and interacting with Analytics Data Rooms (formerly just called "Data Rooms") has been released. This tutorial assumes you want to use the new API. The same tutorial for the old API can be found under Getting started (legacy version).
This tutorial will show the steps required to build and run an Analytics Data Clean Room (DCR) from scratch.
An Analytics DCR can be thought of as a graph-like structure that defines computations and their input data in terms of "nodes". A node is either a Data Node to which a dataset can be provisioned or a Compute Node that is able to run a computation on a set of input nodes. These input nodes can be either Data Nodes or other Compute Nodes.
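To make the graph idea concrete, the structure can be sketched in plain Python (illustrative only; the node names and dict layout are invented for this sketch and are not SDK code):

```python
# Illustrative sketch only, not Decentriq SDK code: a DCR modelled as a
# graph of named nodes, where Compute Nodes list their input nodes.
dcr_graph = {
    "sales_data": {"kind": "data", "inputs": []},
    "aggregate": {"kind": "compute", "inputs": ["sales_data"]},
    "report": {"kind": "compute", "inputs": ["aggregate"]},
}

# Data Nodes have no inputs; Compute Nodes read from upstream nodes,
# which may themselves be Data Nodes or other Compute Nodes.
upstream_of_report = dcr_graph["report"]["inputs"]
print(upstream_of_report)  # ['aggregate']
```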
As part of this tutorial you will learn how to:

- create a `Client` for interacting with the Decentriq Platform,
- create an `AnalyticsDcrBuilder` object for constructing an Analytics DCR,
- add a Data Node to the DCR,
- add a Compute Node to the DCR,
- make data accessible to the DCR,
- run a computation in the DCR, and
- interact with an existing DCR.
Creating a Client
Before a DCR can be built, a `Client` object must be constructed with the necessary user credentials. The `Client` object is responsible for communicating with the Decentriq Platform and can be used to retrieve information about existing DCRs, available datasets, and more.
```python
import decentriq_platform as dq

user_email = "@@ YOUR EMAIL HERE @@"
api_token = "@@ YOUR TOKEN HERE @@"

client = dq.create_client(user_email, api_token)
```
Create and manage your API tokens via the Decentriq UI - API tokens page.
Creating an AnalyticsDcrBuilder
The `AnalyticsDcrBuilder` provides a convenient way of constructing an Analytics DCR. It provides a number of builder functions which can be used to parameterise the DCR. Examples of such parameterisation include:
- Setting the DCR name
- Setting the DCR description
- Specifying the participants of the DCR
An example of building an Analytics DCR is shown below:
```python
import decentriq_platform as dq
from decentriq_platform.analytics import AnalyticsDcrBuilder

builder = AnalyticsDcrBuilder(client=client)

builder.\
    with_name("My DCR").\
    with_owner(user_email).\
    with_description("My test DCR")

# The code could also be written on a single line, or, instead of
# using backslashes to escape the newlines, you could write it
# using parentheses:
#
# builder = (
#     AnalyticsDcrBuilder(client=client)
#     .with_name("My DCR")
#     .with_owner(user_email)
#     .with_description("My test DCR")
# )
```
As you may have noticed, the `Client` object we constructed earlier is passed to the `AnalyticsDcrBuilder`. This is required because the builder needs information about the available enclaves running in the platform.
Adding Data Nodes to a DCR
A Data Node provides access to a dataset from within a DCR. A Data Node can be added to a DCR via the `add_node_definition` method provided by the `AnalyticsDcrBuilder`.
The following types of Data Nodes are supported:

`RawDataNode`

- Makes provisioned datasets available as raw files to any downstream Compute Nodes.
- Useful for unstructured data such as images or binary data.

`TableDataNode`

- Verifies that the provisioned dataset conforms to a certain structure.
- Required Data Node type when processing data using SQL Compute Nodes.
Below is an example of adding a `RawDataNode` to a DCR.
```python
from decentriq_platform.analytics import RawDataNodeDefinition

# Create a `RawDataNodeDefinition` and add it to the DCR right away:
builder.add_node_definition(
    RawDataNodeDefinition(name="my-raw-data-node", is_required=True)
)
```
Below is an example of adding a `TableDataNode` to a DCR.
```python
from decentriq_platform.analytics import (
    Column,
    FormatType,
    TableDataNodeDefinition,
)

columns = [
    Column(
        name="value",
        format_type=FormatType.FLOAT,
        is_nullable=False,
    ),
    Column(
        name="name",
        format_type=FormatType.STRING,
        is_nullable=False,
    ),
]

builder.add_node_definition(
    TableDataNodeDefinition(
        name="tabular_data", columns=columns, is_required=False
    )
)
```
When adding a node to a DCR, the name of the class to be added always ends in "Definition". For a `TableDataNode`, for example, the class to be constructed is called `TableDataNodeDefinition`. The "Definition" class serves as the blueprint for constructing nodes of that type.
The `is_required=True` flag tells the DCR that any downstream computations (i.e. computations that directly or indirectly read data from this node) can only be run if a dataset has been provisioned to that node.
Adding a Computation Node to a DCR
A Computation Node represents a computation that can be run within a DCR. It can be added to a DCR in the same way as a Data Node. The following types of Computation Nodes are supported:
`PythonComputeNode`

- Used for running Python-based computations within an enclave
- Can make use of a wide variety of data processing libraries such as pandas and scikit-learn

`RComputeNode`

- Used for running R-based computations within an enclave
- Wide selection of R libraries available (including texlive)

`SqliteComputeNode`

- Used for running SQL-based queries
- Based on sqlite

`SqlComputeNode`

- Also used for running SQL-based queries
- Uses a custom SQL engine that runs on Intel SGX
- If not otherwise required, we recommend using `SqliteComputeNodeDefinition` for running SQL workloads

`SyntheticDataComputeNode`

- Output synthetic data based on structured input data
- Can mask columns containing sensitive information
- Useful for testing downstream computations on real-looking data

`S3SinkComputeNode`

- Store the output of a computation in an S3 bucket

`MatchingComputeNode`

- Match two structured input datasets on a given key

`PreviewComputeNode`

- Restrict how much data can be read by another party from a particular Compute Node
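For a feel of what a SQL-based node computes, the query below runs the same kind of aggregation locally using Python's built-in `sqlite3` module (the `SqliteComputeNode` executes sqlite inside an enclave; this local sketch only illustrates the query, not the SDK API):

```python
import sqlite3

# Local stand-in for a table that would arrive via a TableDataNode.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tabular_data (value REAL, name TEXT)")
conn.executemany(
    "INSERT INTO tabular_data VALUES (?, ?)",
    [(10.0, "Alice"), (5.0, "Bob"), (14.0, "John")],
)

# The kind of aggregation a SQL-based Compute Node might perform.
total = conn.execute("SELECT SUM(value) FROM tabular_data").fetchone()[0]
print(total)  # 29.0
```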
Below is an example of adding a `PythonComputeNode` to a DCR.
```python
from decentriq_platform.analytics import PythonComputeNodeDefinition

builder.add_node_definition(
    PythonComputeNodeDefinition(
        name="python-node",
        script="""
import decentriq_util
import shutil

shutil.copyfile("/input/my-raw-data-node", "/output/result.txt")

df_table = decentriq_util.read_tabular_data("/input/tabular_data")
df_table.to_csv("/output/result.csv", index=False, header=True)
""",
        dependencies=["my-raw-data-node", "tabular_data"]
    )
)
```
Note how we are again adding a `PythonComputeNodeDefinition` to the builder in order to construct a `PythonComputeNode`. The node is configured to depend on the Data Nodes we added earlier. The contents of the data nodes will be made available as files in the `/input` directory (the name of each file matches the name of the corresponding node). In this case, the script simply echoes the contents of the input nodes, but it could be much more complex and make use of any of the libraries that exist in its environment. Any file written to the `/output` directory is considered part of the result of the node. There is no limit on the number of files that can be written to `/output`.
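As a slightly more realistic script body, the tabular input could be aggregated with pandas before writing to `/output` (inside the enclave the DataFrame would come from `decentriq_util.read_tabular_data`; here we simulate the input with an in-memory CSV matching the `tabular_data` schema):

```python
import io

import pandas as pd

# Simulated stand-in for decentriq_util.read_tabular_data("/input/tabular_data"):
# an in-memory CSV with the (value: FLOAT, name: STRING) schema from above.
csv_bytes = io.BytesIO(b"10.0,Alice\n5.0,Bob\n14.0,John\n")
df = pd.read_csv(csv_bytes, names=["value", "name"])

# Aggregate the total value per name; in a real script this would be
# written to a file under /output, e.g. summary.to_csv("/output/summary.csv").
summary = df.groupby("name", as_index=False)["value"].sum()
print(summary.to_dict("records"))
```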
Please check the Computations section for examples of how to process data using Python, R, SQL, and more.
Adding permissions to the DCR
Next we need to define the list of participants in the DCR and specify what permissions each participant has.
A participant can be a Data Owner of a Data Node (which gives the user the right to provision datasets). A participant can also be an Analyst of a Compute Node (which makes it possible for the user to run the node and retrieve its results). Finally, a participant can also have no permissions configured, which makes them an Auditor of the DCR (they can see the DCR and inspect the computations, but they cannot interact with it).
```python
builder.add_participant(
    user_email,
    data_owner_of=["my-raw-data-node", "tabular_data"],
    analyst_of=["python-node"]
)
```
Publishing the DCR
A Data Clean Room needs to be built and published before it can be used. This will encrypt the DCR and send it to the enclave where it is stored.
By building the DCR, we create its definition (analogous to the node definitions encountered earlier).
```python
dcr_definition = builder.build()
```
This definition can then be published using the client, which will return the final `AnalyticsDcr` object that can be used to interact with the live DCR.
```python
dcr = client.publish_analytics_dcr(dcr_definition)
```
The id of the DCR can be obtained via its `id` field (the same id you also see in the address bar of the Decentriq UI).
```python
dcr_id = dcr.id
```
Making data accessible to the DCR
A dataset is available for use by a DCR once it has been uploaded to the Decentriq Platform and published (or "provisioned") to a Data Node within a DCR.
Using the `AnalyticsDcr` object we just received, we can obtain a handle on the Data Node and upload data to it as follows:
```python
import io

from decentriq_platform import Key

key = Key()  # generate an encryption key with which to encrypt the dataset

raw_data_node = dcr.get_node("my-raw-data-node")

data = io.BytesIO(b"my-dataset")
raw_data_node.upload_and_publish_dataset(data, key, "my-data.txt")

# For demo purposes we used a BytesIO wrapper around a string.
# In a real-world use case, however, you would probably want to read a local file instead.
# In that case, use the following syntax (note the "rb" when reading the file):
#
# with open("local-file.txt", "rb") as data:
#     raw_data_node.upload_and_publish_dataset(data, key, "my-data.txt")
```
Often it is useful to upload a dataset in a separate step. It can then be published to the Data Node using its `publish_dataset` method:
```python
data = io.BytesIO(b"some-new-data-dataset")
key = Key()

# Upload the dataset to the Decentriq Platform in a separate step.
manifest_hash = client.upload_dataset(data, key, "my-new-data.txt")

# Make the dataset available within a DCR.
raw_data_node.publish_dataset(manifest_hash, key)
```
Below is an example of uploading and publishing a tabular dataset to a `TableDataNode`.
```python
key = Key()

tabular_dataset = io.BytesIO(b"""10.0,Alice
5.0,Bob
14.0,John
""")

# Or read the dataset from a file:
# with open("/path/to/dataset.csv", "rb") as tabular_dataset:

dcr.get_node("tabular_data").upload_and_publish_dataset(
    tabular_dataset,
    name="My Tabular Dataset",
    key=key,
)
```
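If the rows originate in memory rather than in a file, the CSV bytes can be assembled with the standard `csv` module before uploading (plain Python; only the final `BytesIO` would be handed to `upload_and_publish_dataset`):

```python
import csv
import io

# Rows matching the TableDataNode schema defined earlier:
# (value: FLOAT, name: STRING).
rows = [(10.0, "Alice"), (5.0, "Bob"), (14.0, "John")]

text = io.StringIO()
writer = csv.writer(text, lineterminator="\n")
writer.writerows(rows)

# Encode to bytes, since the upload methods expect a binary stream.
tabular_dataset = io.BytesIO(text.getvalue().encode())
print(tabular_dataset.getvalue())  # b'10.0,Alice\n5.0,Bob\n14.0,John\n'
```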
Please check the Datasets Cookbook section for more information on how to upload and provision datasets.
Running a computation in a DCR
A Compute Node represents a computation within a DCR. To run a computation and retrieve the results, call `run_computation_and_get_results_as_zip` on the Compute Node. This is a blocking call which waits for the results to become available before returning.
```python
python_node = dcr.get_node("python-node")

results = python_node.run_computation_and_get_results_as_zip()
result_txt = results.read("result.txt").decode()
assert result_txt == "some-new-data-dataset"
```
Different Compute Nodes return different types of results. Each Compute Node, however, has a method to return its results as a simple blob of bytes, which can then be interpreted in appropriate ways. Please check the Run computations Cookbook section for more information.
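The `read(...)` call used above follows the standard zip-archive reading pattern; purely as an illustration (using the standard library's `zipfile` module, not the Decentriq API), the pattern looks like this:

```python
import io
import zipfile

# Build an in-memory zip that mimics a computation result archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("result.txt", "some-new-data-dataset")

# Reading a named member mirrors results.read("result.txt") above.
results = zipfile.ZipFile(io.BytesIO(buf.getvalue()))
result_txt = results.read("result.txt").decode()
print(result_txt)  # some-new-data-dataset
```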
A computation can also be run in a non-blocking way by calling `run_computation`. This runs the computation but does not wait for the results. Results can be retrieved later by calling `get_results_as_zip` and passing it the obtained `JobId`:
```python
job_id = python_node.run_computation()
results_without_blocking = python_node.get_results_as_zip(job_id)
```
Interacting with an existing DCR
A DCR might have been created in the Decentriq UI or as part of another script. In this case it can easily be retrieved using the `Client` object. Once retrieved, the `AnalyticsDcr` object can be used to get the various Data/Compute Nodes that exist in the DCR. These nodes can be interacted with in the same way as when creating a new DCR.
```python
dcr = client.retrieve_analytics_dcr(dcr_id)

data_node = dcr.get_node("my-raw-data-node")
data_node.upload_and_publish_dataset(
    io.BytesIO(b"new dataset"), key=key, name="my-new-dataset.txt"
)

python_node = dcr.get_node("python-node")
result_zip = python_node.run_computation_and_get_results_as_zip()
result_txt = result_zip.read("result.txt").decode()
assert result_txt == "new dataset"
```