Creating a DCR
Overview
This guide covers all the steps needed to create a new Data Clean Room (DCR) using the DCR builder.
Creating a Client
A Client object must be constructed with user credentials to create a DCR. The Client object handles communication with the Decentriq platform. It can retrieve information about existing DCRs, provision data, and run computations.
import decentriq_platform as dq
user_email = "@@ YOUR EMAIL HERE @@"
api_token = "@@ YOUR TOKEN HERE @@"
client = dq.create_client(user_email, api_token)
Create and manage your API tokens via the Decentriq UI - API tokens page.
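Hardcoding credentials in a script makes them easy to leak. Below is a minimal sketch of reading them from environment variables instead. The variable names DECENTRIQ_USER_EMAIL and DECENTRIQ_API_TOKEN and the helper load_credentials are assumptions chosen for illustration; they are not names the SDK defines or reads on its own.

```python
import os


def load_credentials(env=os.environ):
    """Return (user_email, api_token) read from environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    try:
        return env["DECENTRIQ_USER_EMAIL"], env["DECENTRIQ_API_TOKEN"]
    except KeyError as missing:
        # Report which variable is missing without echoing any secret.
        raise RuntimeError(
            f"Missing environment variable: {missing.args[0]}"
        ) from None


# Usage (requires both variables to be set and the SDK to be installed):
# user_email, api_token = load_credentials()
# client = dq.create_client(user_email, api_token)
```

The helper accepts any mapping, so it can be exercised with a plain dictionary without touching the real environment.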
Creating an AnalyticsDcrBuilder
The AnalyticsDcrBuilder provides a convenient way of constructing an Analytics DCR. It has functions you can use to configure a new DCR. Example settings include:
- Setting the DCR name
- Setting the DCR description
- Specifying the participants of the DCR
An example of building an Analytics DCR is shown below:
import decentriq_platform as dq
from decentriq_platform.analytics import AnalyticsDcrBuilder
# Latest enclave specifications (not passed to the builder in this example)
enclave_specs = dq.enclave_specifications.latest()
builder = AnalyticsDcrBuilder(client=client)
builder.\
    with_name("My DCR").\
    with_owner(user_email).\
    with_description("My test DCR")
# The code could also be written on a single line, or, instead of
# using backslashes to escape the newlines, you could write it
# using parentheses:
#
# builder = (
#     AnalyticsDcrBuilder(client=client)
#     .with_name("My DCR")
#     .with_owner(user_email)
#     .with_description("My test DCR")
# )
The AnalyticsDcrBuilder depends on a Client to know which version of the enclave software to use.
Adding a data node to a DCR
A data node provides access to a dataset from within a DCR.
Use the add_node_definition function of an AnalyticsDcrBuilder to add data nodes.
The following types of data nodes are supported:
RawDataNode
- Makes connected datasets available as raw files to any downstream compute nodes.
- Useful for unstructured data such as images or binary data.
TableDataNode
- Verifies that the connected dataset conforms to a certain structure.
- Required data node type when processing data using SQL compute nodes.
Below is an example of adding a RawDataNode to a DCR.
from decentriq_platform.analytics import RawDataNodeDefinition
# Create a `RawDataNodeDefinition` and add it to the DCR right away:
builder.add_node_definition(
    RawDataNodeDefinition(name="my-raw-data-node", is_required=True)
)
When adding a node to a DCR, the class name always ends in Definition. For instance, add a RawDataNodeDefinition to the builder to create a new RawDataNode. The "Definition" class serves as the blueprint for constructing nodes of that type.
There is one key flag, is_required=True. This tells the DCR that any downstream computations can only run if data has been provisioned to that node.
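The effect of is_required can be pictured with a small toy model. This is only an illustration of the rule stated above, not how the platform implements it; the function can_run and the node name optional-node are hypothetical.

```python
def can_run(upstream_nodes, provisioned):
    """Toy model of the is_required rule (not part of the SDK).

    upstream_nodes maps a data node name to its is_required flag;
    provisioned is the set of node names that already have data.
    """
    return all(
        name in provisioned
        for name, is_required in upstream_nodes.items()
        if is_required
    )


upstream = {"my-raw-data-node": True, "optional-node": False}

# Nothing provisioned yet: the required node blocks the computation.
print(can_run(upstream, provisioned=set()))  # False

# The required node has data; the optional one may stay empty.
print(can_run(upstream, provisioned={"my-raw-data-node"}))  # True
```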
Adding a compute node to a DCR
A compute node represents a computation that can be run within a DCR. It can be added to a DCR in the same way as a data node. The following types of compute nodes are supported:
PythonComputeNode
- Used for running Python-based computations within an enclave
- Can make use of a wide variety of data processing libraries such as pandas and scikit-learn
RComputeNode
- Used for running R-based computations within an enclave
- Wide selection of R libraries available (including texlive)
SqliteComputeNode
- Used for running SQL-based queries
- Based on sqlite
SqlComputeNode
- Also used for running SQL-based queries
- Uses a custom SQL engine that runs on Intel SGX
- If not otherwise required, we recommend using SqliteComputeNodeDefinition for running SQL workloads
SyntheticDataComputeNode
- Output synthetic data based on structured input data
- Can mask columns containing sensitive information
- Useful for testing downstream computations on real-looking data
S3SinkComputeNode
- Store the output of a computation in an S3 bucket
MatchingComputeNode
- Match two structured input datasets on a given key
PreviewComputeNode
- Restrict how much data can be read by another party from a particular compute node
Below is an example of adding a PythonComputeNode to a DCR.
from decentriq_platform.analytics import PythonComputeNodeDefinition
builder.add_node_definition(
    PythonComputeNodeDefinition(
        name="python-node",
        script="""
import shutil
shutil.copyfile("/input/my-raw-data-node", "/output/result.txt")
""",
        dependencies=["my-raw-data-node"]
    )
)
Note that we used PythonComputeNodeDefinition to add a PythonComputeNode.
This new node depends on the data node we added earlier. The PythonComputeNode can access the contents of the upstream data node at the path /input/[upstream_node_name].
In this case, the script just copies the contents of the input node to the output. But you could pass it a script of any complexity, limited only by the libraries available in that node.
Any file written to the /output directory will be part of the result of the node.
There is no limit on the number of files that you can write to /output.
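As an illustration of a script that writes more than one file, here is a minimal sketch that copies the input and adds a small JSON summary. It assumes, as in the script above, that the raw data node is readable as a single file at its /input path. The helper process and the file name summary.json are hypothetical. The paths are parameters so the logic can be tried locally before it is pasted into a node's script.

```python
import json
import shutil
import tempfile
from pathlib import Path


def process(input_file, output_dir):
    """Copy the input and write a JSON summary next to the copy."""
    input_file, output_dir = Path(input_file), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(input_file, output_dir / "result.txt")
    data = input_file.read_bytes()
    summary = {"size_bytes": len(data), "line_count": len(data.splitlines())}
    (output_dir / "summary.json").write_text(json.dumps(summary))
    return summary


# Local dry run with temporary files. Inside the compute node the call
# would instead be: process("/input/my-raw-data-node", "/output")
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "my-raw-data-node"
    sample.write_bytes(b"a,b\n1,2\n")
    summary = process(sample, Path(tmp) / "output")
    written = sorted(p.name for p in (Path(tmp) / "output").iterdir())
```

Both files end up in the output directory, so both would be part of the node's result.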
Adding permissions to the DCR
Next we need to define the list of participants in the DCR and specify what permissions each participant has.
A participant can be a data owner of a data node. This gives the user permission to provision data to that node. A participant can also be an analyst of a compute node. This gives the user permission to run the node and retrieve its results. Finally, a participant can also have no permissions configured. This makes the participant an auditor of the DCR. Auditors may neither provision data nor see results. All participants (including auditors) may see the DCR, inspect the computations, and read the audit log.
builder.add_participant(
    user_email,
    data_owner_of=["my-raw-data-node"],
    analyst_of=["python-node"]
)
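The mapping from permissions to roles described above can be summarized with a toy function. It only restates the rules in code for clarity; roles_for is a hypothetical helper and not part of the SDK.

```python
def roles_for(data_owner_of=(), analyst_of=()):
    """Toy model of participant roles (not part of the SDK)."""
    roles = []
    if data_owner_of:
        roles.append("data owner")  # may provision data to these nodes
    if analyst_of:
        roles.append("analyst")  # may run these nodes and fetch results
    # A participant without any permissions is an auditor.
    return roles or ["auditor"]


print(roles_for(data_owner_of=["my-raw-data-node"], analyst_of=["python-node"]))
# ['data owner', 'analyst']
print(roles_for())
# ['auditor']
```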
Publishing the DCR
A Data Clean Room needs to be built and published before it can be used. This will encrypt the DCR and send it to the enclave where it is stored.
By building the DCR, we create its definition (analogous to the node definitions encountered earlier).
dcr_definition = builder.build()
You can publish this definition using a client. This returns an AnalyticsDcr object that can be used to interact with the live DCR.
dcr = client.publish_analytics_dcr(dcr_definition)
The id field of the DCR is a unique identifier that is consistent across the UI and SDK. It is the same id you see in the address bar of the Decentriq UI.
dcr_id = dcr.id