Getting started with the Python SDK
This is a step-by-step guide to build and run an Analytics Data Clean Room (DCR) from scratch.
An Analytics DCR is a graph of relationships between computations and input data. Both computations and data are "nodes". A data node may have a dataset provisioned to it. A compute node runs a computation based on its inputs. These inputs can be either data nodes or other compute nodes.
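The graph structure described above can be modelled with plain Python, independently of the SDK. The following is only a conceptual sketch: the dictionary layout, the helper functions, and the node "summary-node" are assumptions made for illustration and are not part of the Decentriq API.

```python
from graphlib import TopologicalSorter

# Hypothetical, SDK-independent model of a DCR: every node has a kind and a
# list of the nodes it reads from. "summary-node" is invented for this sketch.
nodes = {
    "my-raw-data-node": {"kind": "data", "inputs": []},
    "python-node": {"kind": "compute", "inputs": ["my-raw-data-node"]},
    "summary-node": {"kind": "compute", "inputs": ["python-node"]},
}


def execution_order(nodes):
    """Return node names so that every node comes after all of its inputs."""
    graph = {name: spec["inputs"] for name, spec in nodes.items()}
    return list(TopologicalSorter(graph).static_order())


def compute_nodes_in_order(nodes):
    """Return only the compute nodes, in an order that respects their inputs."""
    return [name for name in execution_order(nodes) if nodes[name]["kind"] == "compute"]


order = execution_order(nodes)
compute_order = compute_nodes_in_order(nodes)
print(order)
print(compute_order)
```

The sketch shows that a compute node can depend on a data node or on another compute node, and that a valid evaluation order follows from those dependencies.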
As part of this tutorial you will learn how to
- create a Client for interacting with the Decentriq Platform,
- create an AnalyticsDcrBuilder object for constructing an Analytics DCR,
- add a data node to the DCR,
- add a compute node to the DCR,
- make data accessible to the DCR,
- run a computation in the DCR, and
- interact with an existing DCR.
Creating a Client
A Client object must be constructed with user credentials to create a DCR. The Client object handles communicating with the Decentriq platform. It can retrieve information about existing DCRs, provision data, and run computations.
Create and manage your API tokens via the Decentriq UI - API tokens page.
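One way to keep an API token out of source code is to read the credentials from environment variables. The sketch below is not part of the SDK: the function load_credentials and the variable names DECENTRIQ_USER_EMAIL and DECENTRIQ_API_TOKEN are hypothetical and chosen only for illustration.

```python
import os


def load_credentials(env=None):
    """Read the user email and API token from environment variables.

    The variable names are hypothetical; use whatever your setup defines.
    """
    env = os.environ if env is None else env
    try:
        return env["DECENTRIQ_USER_EMAIL"], env["DECENTRIQ_API_TOKEN"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing.args[0]}") from None


# An explicit mapping is passed here so the sketch runs without real secrets.
user_email, api_token = load_credentials(
    {"DECENTRIQ_USER_EMAIL": "user@example.com", "DECENTRIQ_API_TOKEN": "dummy-token"}
)
```

The returned pair can be passed to dq.create_client in place of the placeholder strings.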
import decentriq_platform as dq
from decentriq_platform.analytics import AnalyticsDcrBuilder
from decentriq_platform.analytics import RawDataNodeDefinition
from decentriq_platform.analytics import PythonComputeNodeDefinition
import io

user_email = "@@ YOUR EMAIL HERE @@"
api_token = "@@ YOUR TOKEN HERE @@"

# Create the client that communicates with the Decentriq platform.
client = dq.create_client(user_email, api_token)
enclave_specs = dq.enclave_specifications.latest()

# Create the builder and set the basic properties of the DCR.
builder = AnalyticsDcrBuilder(client=client)
builder.\
    with_name("My DCR").\
    with_owner(user_email).\
    with_description("My test DCR")

# Create a `RawDataNodeDefinition` and add it to the DCR right away:
builder.add_node_definition(
    RawDataNodeDefinition(name="my-raw-data-node", is_required=True)
)

# Add a Python compute node that copies the data node's content to its output.
builder.add_node_definition(
    PythonComputeNodeDefinition(
        name="python-node",
        script="""
import shutil
shutil.copyfile("/input/my-raw-data-node", "/output/result.txt")
""",
        dependencies=["my-raw-data-node"]
    )
)

# Make the user both the data owner of the data node and an analyst of the compute node.
builder.add_participant(
    user_email,
    data_owner_of=["my-raw-data-node"],
    analyst_of=["python-node"]
)

# Build the DCR definition and publish it.
dcr_definition = builder.build()
dcr = client.publish_analytics_dcr(dcr_definition)
dcr_id = dcr.id
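Before publishing a DCR, it can be useful to try a compute script locally. The sketch below runs the script from the example against temporary directories. It assumes, as the script itself suggests, that inputs appear under /input/<node name> and that results are written to /output. The helper dry_run is hypothetical and does not reproduce the actual enclave environment.

```python
import tempfile
from pathlib import Path

SCRIPT = '''
import shutil
shutil.copyfile("/input/my-raw-data-node", "/output/result.txt")
'''


def dry_run(script, inputs):
    """Run `script` against local stand-ins for /input and /output.

    `inputs` maps input node names to bytes. Returns {output file name: bytes}.
    Only double-quoted paths starting with /input/ or /output/ are rewritten.
    """
    with tempfile.TemporaryDirectory() as tmp:
        input_dir = Path(tmp) / "input"
        output_dir = Path(tmp) / "output"
        input_dir.mkdir()
        output_dir.mkdir()
        for name, content in inputs.items():
            (input_dir / name).write_bytes(content)
        # Point the hard-coded paths at the temporary directories.
        local_script = script.replace('"/input/', f'"{input_dir.as_posix()}/')
        local_script = local_script.replace('"/output/', f'"{output_dir.as_posix()}/')
        exec(local_script, {"__name__": "__main__"})
        # Read the outputs before the temporary directory is removed.
        return {path.name: path.read_bytes() for path in output_dir.iterdir()}


outputs = dry_run(SCRIPT, {"my-raw-data-node": b"some-new-data-dataset"})
print(outputs)
```

A dry run like this only checks the script's logic. Available packages and file layouts inside the platform may differ.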
Running a computation in a DCR
A compute node represents a computation within a DCR. Call run_computation_and_get_results_as_zip on a compute node to run the computation and retrieve its results.
# Key used when uploading the dataset.
key = dq.Key()

# Upload a dataset and publish it to the data node.
data_node = dcr.get_node("my-raw-data-node")
data_node.upload_and_publish_dataset(
    io.BytesIO(b"some-new-data-dataset"), key=key, name="some-new-dataset.txt"
)

# Run the computation and read the result file from the returned archive.
python_node = dcr.get_node("python-node")
results = python_node.run_computation_and_get_results_as_zip()
result_txt = results.read("result.txt").decode()
assert result_txt == "some-new-data-dataset"
Different compute nodes return different types of results. Each compute node, however, has a method that returns the results as a plain blob of bytes, which can then be interpreted as appropriate.
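As an illustration of interpreting result bytes, the sketch below unpacks a zip archive into a dictionary and decodes individual files. It assumes the results object behaves like a standard zipfile.ZipFile, which the read call in the example suggests. The helper zip_to_dict and the file "stats/count.txt" are hypothetical, and the archive is built in memory as a stand-in for real results.

```python
import io
import zipfile


def zip_to_dict(archive):
    """Return {file name: raw bytes} for every file in a zipfile.ZipFile."""
    return {
        info.filename: archive.read(info)
        for info in archive.infolist()
        if not info.is_dir()
    }


# Build a small in-memory archive as a stand-in for real results.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("result.txt", "some-new-data-dataset")
    zf.writestr("stats/count.txt", "21")

files = zip_to_dict(zipfile.ZipFile(buffer))
text = files["result.txt"].decode()       # interpret as text
count = int(files["stats/count.txt"])     # interpret as a number
```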
Call run_computation to run a computation without blocking. This starts the computation but does not wait for the results. Retrieve the results later by calling get_results_as_zip:
job_id = python_node.run_computation()
results_without_blocking = python_node.get_results_as_zip(job_id)
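The start-now, fetch-later pattern can be tried offline with a stand-in object. In the sketch below, StubComputeNode is a hypothetical class that imitates only the two calls shown above. Its string job ids and canned results are assumptions and say nothing about how the real SDK represents jobs or schedules them.

```python
import io
import zipfile


class StubComputeNode:
    """Offline stand-in that mimics run_computation / get_results_as_zip."""

    def __init__(self, payload):
        self._payload = payload
        self._jobs = {}

    def run_computation(self):
        # Return a handle immediately instead of waiting for a result.
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = self._payload
        return job_id

    def get_results_as_zip(self, job_id):
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w") as zf:
            zf.writestr("result.txt", self._jobs[job_id])
        return zipfile.ZipFile(buffer)


def run_all_then_collect(nodes):
    """Start every computation first, then fetch each result by its job id."""
    job_ids = {name: node.run_computation() for name, node in nodes.items()}
    return {
        name: nodes[name].get_results_as_zip(job_id).read("result.txt")
        for name, job_id in job_ids.items()
    }


collected = run_all_then_collect(
    {"node-a": StubComputeNode(b"alpha"), "node-b": StubComputeNode(b"beta")}
)
```

Keeping the job id around is what allows other work to happen between starting a computation and fetching its results.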
Interacting with an existing DCR
You may wish to interact with a DCR that was created in the Decentriq UI or by another script. Use a Client to retrieve an AnalyticsDcr by its id.
You can interact with the nodes of a DCR retrieved this way just as you would with a newly created DCR.
dcr = client.retrieve_analytics_dcr(dcr_id)
data_node = dcr.get_node("my-raw-data-node")
data_node.upload_and_publish_dataset(
    io.BytesIO(b"new dataset"), key=key, name="my-new-dataset.txt"
)
python_node = dcr.get_node("python-node")
result_zip = python_node.run_computation_and_get_results_as_zip()
result_txt = result_zip.read("result.txt").decode()
assert result_txt == "new dataset"
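When the DCR is created in one script and used in another, the id has to be passed between them somehow. One simple option is a small JSON file, sketched below. The helpers save_dcr_id and load_dcr_id, the file name, and the example id are all hypothetical, and the sketch assumes the id is a plain string.

```python
import json
import tempfile
from pathlib import Path


def save_dcr_id(path, dcr_id):
    """Write the DCR id to a small JSON file."""
    Path(path).write_text(json.dumps({"dcr_id": dcr_id}))


def load_dcr_id(path):
    """Read the DCR id back from the JSON file."""
    return json.loads(Path(path).read_text())["dcr_id"]


# A temporary directory keeps the sketch free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "dcr_state.json"
    save_dcr_id(state_file, "example-dcr-id")
    restored = load_dcr_id(state_file)
```

The restored value can then be passed to client.retrieve_analytics_dcr as in the example above.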