Creating a DCR

Overview

This guide covers all the steps to create a new DCR using the DCR Builder.

Creating a Client

A Client object must be constructed with user credentials before a DCR can be created. The Client handles communication with the Decentriq platform: it can retrieve information about existing DCRs, provision data, and run computations.

import decentriq_platform as dq

user_email = "@@ YOUR EMAIL HERE @@"
api_token = "@@ YOUR TOKEN HERE @@"

client = dq.create_client(user_email, api_token)
note

Create and manage your API tokens via the Decentriq UI - API tokens page.
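
Hard-coding credentials in a script makes them easy to leak. One alternative, sketched below, reads them from environment variables. The variable names DQ_USER_EMAIL and DQ_API_TOKEN and the helper load_credentials are hypothetical and not part of the Decentriq SDK; only dq.create_client comes from the example above.

```python
import os


def load_credentials(env=None):
    """Read the user email and API token from environment variables.

    The variable names are an assumption made for this sketch, not
    something the Decentriq SDK defines.
    """
    env = os.environ if env is None else env
    names = ("DQ_USER_EMAIL", "DQ_API_TOKEN")
    # Collect every missing or empty variable so the error lists them all.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["DQ_USER_EMAIL"], env["DQ_API_TOKEN"]


# Usage with the SDK (left commented out so the helper can be tried offline):
# import decentriq_platform as dq
# user_email, api_token = load_credentials()
# client = dq.create_client(user_email, api_token)
```

Failing early with a clear message is a design choice here: a missing token is reported before any call is made to the platform.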

Creating an AnalyticsDcrBuilder

The AnalyticsDcrBuilder provides a convenient way of constructing an Analytics DCR. It has functions you can use to configure the new DCR, for example:

  • Setting the DCR name
  • Setting the DCR description
  • Specifying the participants of the DCR

An example of building an Analytics DCR is shown below:

import decentriq_platform as dq
from decentriq_platform.analytics import AnalyticsDcrBuilder

enclave_specs = dq.enclave_specifications.latest()

builder = AnalyticsDcrBuilder(client=client)
builder.\
    with_name("My DCR").\
    with_owner(user_email).\
    with_description("My test DCR")

# The code could also be written on a single line, or, instead of
# using backslashes to escape the newlines, you could write it
# using parentheses:
#
# builder = (
#     AnalyticsDcrBuilder(client=client)
#     .with_name("My DCR")
#     .with_owner(user_email)
#     .with_description("My test DCR")
# )

The AnalyticsDcrBuilder depends on a Client to know which version of the enclave software to use.

Adding a data node to a DCR

A data node provides access to a dataset from within a DCR.

Use the add_node_definition function of an AnalyticsDcrBuilder to add data nodes. The following types of data nodes are supported:

  1. RawDataNode
    • Makes connected datasets available as raw files to any downstream compute nodes.
    • Useful for unstructured data such as images or binary data.
  2. TableDataNode
    • Verifies that the connected dataset conforms to a certain structure.
    • Required when processing data with SQL compute nodes.

Below is an example of adding a RawDataNode to a DCR.

from decentriq_platform.analytics import RawDataNodeDefinition

# Create a `RawDataNodeDefinition` and add it to the DCR right away:
builder.add_node_definition(
    RawDataNodeDefinition(name="my-raw-data-node", is_required=True)
)
note

When adding a node to a DCR, the class always ends in Definition. For instance, add a RawDataNodeDefinition to the builder to create a new RawDataNode. The "Definition" class serves as the blueprint for constructing nodes of that type.

There is one key flag, is_required=True. This tells the DCR that any downstream computations can only run once data has been provisioned to that node.
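
The is_required rule can be pictured with a small toy model. This is not SDK code and does not reflect how the enclave enforces the rule. The function can_run and its arguments are hypothetical, the sketch only looks at direct dependencies, and it assumes that nodes not marked as required never block a computation.

```python
def can_run(dependencies, required_nodes, provisioned_nodes):
    """Return True if every required direct dependency has data.

    dependencies:      names of the nodes a computation reads from
    required_nodes:    names of data nodes created with is_required=True
    provisioned_nodes: names of data nodes that currently have data
    """
    return all(
        dep in provisioned_nodes
        for dep in dependencies
        if dep in required_nodes
    )


required = {"my-raw-data-node"}

# No data provisioned yet: the downstream computation is blocked.
blocked = can_run(["my-raw-data-node"], required, provisioned_nodes=set())

# After provisioning, the same computation is allowed to run.
allowed = can_run(["my-raw-data-node"], required, provisioned_nodes={"my-raw-data-node"})
```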

Adding a compute node to a DCR

A compute node represents a computation that can be run within a DCR. It can be added to a DCR in the same way as a data node. The following types of compute nodes are supported:

  1. PythonComputeNode
    • Used for running Python-based computations within an enclave
    • Can make use of a wide variety of data processing libraries such as pandas and scikit-learn
  2. RComputeNode
    • Used for running R-based computations within an enclave
    • Wide selection of R libraries available (including texlive)
  3. SqliteComputeNode
    • Used for running SQL-based queries
    • Based on sqlite
  4. SqlComputeNode
    • Also used for running SQL-based queries
    • Uses a custom SQL engine that runs on Intel SGX
    • If not otherwise required, we recommend using SqliteComputeNodeDefinition for running SQL workloads
  5. SyntheticDataComputeNode
    • Output synthetic data based on structured input data
    • Can mask columns containing sensitive information
    • Useful for testing downstream computations on real-looking data
  6. S3SinkComputeNode
    • Store the output of a computation in an S3 bucket
  7. MatchingComputeNode
    • Match two structured input datasets on a given key
  8. PreviewComputeNode
    • Restrict how much data another party can read from a particular compute node

Below is an example of adding a PythonComputeNode to a DCR.

from decentriq_platform.analytics import PythonComputeNodeDefinition

builder.add_node_definition(
    PythonComputeNodeDefinition(
        name="python-node",
        script="""
import shutil
shutil.copyfile("/input/my-raw-data-node", "/output/result.txt")
""",
        dependencies=["my-raw-data-node"]
    )
)
note

Note that we used PythonComputeNodeDefinition to add a PythonComputeNode.

This new node depends on the data node we added earlier. The PythonComputeNode can access the contents of the upstream data node at the path /input/[upstream_node_name].

In this case, the script just copies the contents of the input node to the output. But you could pass it a script of any complexity, limited only by the libraries available in that node. Any file written to the /output directory will be part of the result of the node. There is no limit on the number of files that you can write to /output.
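
Before publishing, it can help to try the script logic locally. The sketch below is one way to do that, assuming only what is described above: an input file named after the upstream node and an output directory whose files form the result. The helpers run_copy_script and dry_run are hypothetical, and temporary directories stand in for /input and /output; nothing here runs inside an enclave.

```python
import shutil
import tempfile
from pathlib import Path


def run_copy_script(input_root: Path, output_root: Path) -> None:
    """Same copy logic as the node's script, with configurable roots."""
    shutil.copyfile(input_root / "my-raw-data-node", output_root / "result.txt")


def dry_run(payload: bytes) -> dict:
    """Run the copy logic against temporary stand-ins for /input and /output.

    Returns a mapping of output file name to file contents.
    """
    with tempfile.TemporaryDirectory() as tmp:
        input_root = Path(tmp) / "input"
        output_root = Path(tmp) / "output"
        input_root.mkdir()
        output_root.mkdir()
        # Pretend this payload was provisioned to the raw data node.
        (input_root / "my-raw-data-node").write_bytes(payload)
        run_copy_script(input_root, output_root)
        return {p.name: p.read_bytes() for p in output_root.iterdir()}


result = dry_run(b"hello")
```

Keeping the roots as parameters is the only change relative to the script passed to the node, so the logic that was exercised locally is the logic that gets published.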

Adding permissions to the DCR

Next we need to define the list of participants in the DCR and specify what permissions each participant has.

A participant can be a data owner of a data node. This gives the user permission to provision data to that node. A participant can also be an analyst of a compute node. This gives the user permission to run the node and retrieve its results. Finally, a participant can also have no permissions configured. This makes the participant an auditor of the DCR. Auditors may neither provision data nor see results. All participants (including auditors) may see the DCR, inspect the computations, and read the audit log.

builder.add_participant(
    user_email,
    data_owner_of=["my-raw-data-node"],
    analyst_of=["python-node"]
)
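
The permission rules described above can be summarised in a few lines of plain Python. This is an illustrative model only, not part of the SDK: the Participant class and its methods are hypothetical, and the field names simply mirror the data_owner_of and analyst_of arguments of add_participant.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Toy model of a DCR participant and the permissions it holds."""

    email: str
    data_owner_of: list = field(default_factory=list)
    analyst_of: list = field(default_factory=list)

    def is_auditor(self) -> bool:
        # A participant with no permissions configured is an auditor.
        return not self.data_owner_of and not self.analyst_of

    def can_provision(self, node_name: str) -> bool:
        # Only data owners of a data node may provision data to it.
        return node_name in self.data_owner_of

    def can_retrieve_results(self, node_name: str) -> bool:
        # Only analysts of a compute node may run it and read its results.
        return node_name in self.analyst_of


owner_and_analyst = Participant(
    "user@example.com",
    data_owner_of=["my-raw-data-node"],
    analyst_of=["python-node"],
)
auditor = Participant("auditor@example.com")
```

In this model the auditor holds no permissions on any node, which matches the description above: auditors can inspect the DCR but can neither provision data nor see results.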

Publishing the DCR

A Data Clean Room needs to be built and published before it can be used. Publishing encrypts the DCR and sends it to the enclave, where it is stored.

By building the DCR, we create its definition (analogous to the node definitions encountered earlier).

dcr_definition = builder.build()

You can publish this definition using a client. This returns an AnalyticsDcr object that can be used to interact with the live DCR.

dcr = client.publish_analytics_dcr(dcr_definition)

The id field of the DCR uniquely identifies it and is consistent across the UI and SDK. It is the same id that appears in the address bar of the Decentriq UI.

dcr_id = dcr.id