Python computation
Decentriq’s Data Clean Rooms support running arbitrary Python scripts in confidential computing environments, guaranteeing a high level of security. You can train and deploy machine learning algorithms on sensitive data that is never revealed to anyone.
Current version
decentriq.python-ml-worker-32-64:v23
Available libraries
Python version: 3.11.6

- numpy 1.26.1
- pandas 2.1.1
- statsmodels 0.14.0
- seaborn 0.13.0
- xgboost 2.0.1
- auto-sklearn 0.15.0
- pyreadstat 1.2.4
- streamlit 1.28.1
- imbalanced-learn 0.11.0
- lifelines 0.27.8
- scikit-survival 0.22.1
- lime 0.2.0.1
- python-gnupg 0.5.1
The list of available libraries is not exhaustive.
How it works
The Python enclave worker is available as a confidential-computing containerized environment inside a Data Clean Room that takes datasets as input, executes an arbitrary script, and generates an output. The container has no open ports, so performing HTTP requests or accessing external resources is not possible. All relevant files must be mounted in advance, as explained below.
At the moment, only CPU processing is supported. Please make sure your script does not require a GPU to execute.
Input
Accessible read-only via the /input directory.
Results of computations available in the Data Clean Room can be mounted to this directory. Once mounted, files are located at a specific path depending on the computation type.
- Python and R: /input/<computation_name>/<filename>
- SQL and Synthetic data: /input/<computation_name>/dataset.csv
Example: the result of a SQL computation named salary_sum can be accessed at /input/salary_sum/dataset.csv.
The dataset.csv file does not include headers. To read it with headers, use decentriq_util.sql.read_sql_data_from_dir("/input/salary_sum/").
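For illustration, here is a minimal sketch of reading such a result with plain pandas, assuming a SQL computation named salary_sum whose result has two columns (the column names below are hypothetical):

```python
import pandas as pd

# dataset.csv has no header row, so column names must be supplied explicitly.
# "department" and "total_salary" are hypothetical; adjust them to your schema.
df = pd.read_csv(
    "/input/salary_sum/dataset.csv",
    names=["department", "total_salary"],
)
```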
Optionally, you can also mount your own static files to support your script.
Example: /input/code/certificate.crt
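A static file mounted this way can be read like any other file. For example:

```python
# Read a static file that was mounted into the /input/code directory.
with open("/input/code/certificate.crt", "r") as f:
    certificate = f.read()
```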
Learn how to mount input files in the sections below, either via the Decentriq UI or the Python SDK.
Script
Bring your existing Python script.
Example logic to read, process and output data:
- Import libraries:

```python
import pandas as pd
```

- Read content from an input file:

```python
table_headers = ['column name 1', 'column name 2']
table_data = pd.read_csv("/input/table_name/dataset.csv", names=table_headers)
```
The input files are only available after the Data Clean Room is published and a dataset is provisioned. Therefore, when validating this script (before publishing) the input files will be empty.
- Process the data:

```python
results = table_data.groupby(['column name 1']).mean()
```
To avoid errors during validation due to the empty dataset, it is recommended to wrap the data-processing logic in a try/except statement to handle expected issues, or to mount a test dataset to the container and add an if clause to fall back to it when the main dataset is empty (see the sketch after this list).
- Write the resulting files to the output folder:

```python
results.to_csv('/output/result.csv', index=False, header=False)
```
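As a minimal sketch of the try/except approach, reusing the names from the steps above and catching pandas.errors.EmptyDataError (the error raised when the input file is still empty during validation):

```python
import pandas as pd

table_headers = ['column name 1', 'column name 2']

try:
    table_data = pd.read_csv("/input/table_name/dataset.csv", names=table_headers)
    results = table_data.groupby(['column name 1']).mean()
except pd.errors.EmptyDataError:
    # Before the Data Clean Room is published, the input file is empty;
    # write an empty placeholder so that validation still succeeds.
    results = pd.DataFrame(columns=table_headers)

results.to_csv('/output/result.csv', index=False, header=False)
```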
The /tmp directory is available read/write during script execution to support your logic. It is wiped once execution completes and is not included in the output.
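For example, /tmp can serve as scratch space for intermediate files (a sketch, reusing the table_data frame from above):

```python
# Write an intermediate file to scratch space; it is discarded after execution.
table_data.to_csv("/tmp/intermediate.csv", index=False)

# Read it back in a later step of the same run.
intermediate = pd.read_csv("/tmp/intermediate.csv")
```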
Output
Accessible write-only via the /output directory.
Write all resulting files of your computation to this directory. Sub-directories are also supported.
Once the execution is completed, the output becomes available as <computation_name>.zip to be downloaded by users who have the required permissions.
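For instance, a brief sketch of writing results into a sub-directory of /output (the reports folder name is hypothetical):

```python
import os

# Sub-directories of /output are included in the resulting zip file.
os.makedirs("/output/reports", exist_ok=True)
results.to_csv("/output/reports/summary.csv", index=False)
```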
Create a Python computation in the Decentriq UI
1. Access platform.decentriq.com with your credentials.
2. Create a Data Clean Room.
3. In the Computations tab, add a new computation of type Python and give it a name.
4. In the File browser on the right side, mount the necessary input files (which will become available in the /input directory) by selecting existing computations, tables or files in the Data Clean Room.
5. In the Main script tab, paste your existing script and adapt the file paths based on the selected dependencies in the file browser. When clicking the copy icon in front of each file in the file browser, you will get a snippet that imports it into a dataframe or file; paste it directly into your script. If necessary, add static text files to the container by clicking the + icon next to the Main script tab. These files will be available in the /input/code directory.
6. Press the Test all computations button to check for errors in the script.
7. Once the Data Clean Room is configured with data, computations and permissions, press the Encrypt and publish button.
8. As soon as the Data Clean Room is published, your computation will be available in the Overview tab, where you can press Run and get the results.
Create a Python computation using the Python SDK
This example illustrates the execution of Python scripts within the trusted execution environment and the processing of unstructured input data. It is recommended to first read through the Python SDK tutorial, as it introduces important concepts and terminology used in this example.
Assume we want to create a Data Clean Room that simply converts some text in an input file to uppercase. Using the Python SDK to accomplish this task could look as follows:
First, we set up a connection to an enclave and create an AnalyticsDcrBuilder object:
```python
import decentriq_platform as dq
from decentriq_platform.analytics import AnalyticsDcrBuilder

user_email = "@@ YOUR EMAIL HERE @@"
api_token = "@@ YOUR TOKEN HERE @@"

client = dq.create_client(user_email, api_token)

builder = AnalyticsDcrBuilder(client=client)
builder.\
    with_name("Secure Uppercase").\
    with_owner(user_email)
```
The Python enclave worker allows you to run Python code and provides common machine learning libraries such as pandas and scikit-learn as part of its Python environment.
Before worrying about execution, however, we need to add a data node to which we can upload the input file whose content should be converted to uppercase:
```python
from decentriq_platform.analytics import RawDataNodeDefinition

builder.add_node_definition(
    RawDataNodeDefinition(name="input_data_node", is_required=True)
)
```
Whenever we add a data or compute node, the builder object will assign an identifier to the newly added node. This identifier needs to be provided to the respective method whenever we want to interact with this node.
The script to uppercase text contained in an input file could look like this:
my_script_content = b"""
with open("/input/input_data_node", "r") as input_file:
input_data = input_file.read()
with open("/output/uppercase.txt", "w") as output_file:
output_file.write(input_data.upper())
"""
Here we defined the script within a multi-line string. For larger scripts, however, defining them in a file would likely be easier.
Now we can add the node that will actually execute our script. To use this script in a Data Clean Room, it first has to be loaded into a PythonComputeNodeDefinition node, which is then added to the DCR configuration. This makes the Python script visible to all participants in the DCR.
```python
# If you wrote your script in a separate file, you can simply open
# the file using `with open` (note that we specify the "b" flag to read
# the file as a binary string), like so:
#
# with open("my_script.py", "rb") as data:
#     my_script_content = data.read()

from decentriq_platform.analytics import PythonComputeNodeDefinition

builder.add_node_definition(
    PythonComputeNodeDefinition(
        name="uppercase_text_node",
        script=my_script_content,
        dependencies=["input_data_node"]
    )
)
```
The name given to the compute node has no meaning to the enclave and only serves as a human-readable label. The enclave addresses computations (and data nodes) using identifiers that are automatically generated by the Data Clean Room builder object when the node is added. These IDs are required whenever we interact with the node (e.g. triggering the computation or referring to it when adding user permissions).
```python
builder.add_participant(
    user_email,
    data_owner_of=["input_data_node"],
    analyst_of=["uppercase_text_node"]
)

dcr_definition = builder.build()
dcr = client.publish_analytics_dcr(dcr_definition)
data_room_id = dcr.id
```
After building and publishing the DCR, we can upload data and connect it to our input node.
```python
# Here again, you can use the Python construct `with open(path, "rb") as data`
# to read the data in the right format from a file.
import io
from decentriq_platform import Key

key = Key()  # generate an encryption key with which to encrypt the dataset

raw_data_node = dcr.get_node("input_data_node")
data = io.BytesIO(b"hello world")
raw_data_node.upload_and_publish_dataset(data, key, "my-data.txt")
```
When retrieving the results of the computation, you will get a zipfile.ZipFile object containing all the files your script wrote to the output directory.
```python
python_node = dcr.get_node("uppercase_text_node")
results = python_node.run_computation_and_get_results_as_zip()

result_txt = results.read("uppercase.txt").decode()
assert result_txt == "HELLO WORLD"
```
If the referenced Data Clean Room was created using the Decentriq UI, the name argument of get_node() should be the node name shown in the UI's Overview tab.
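As a sketch, connecting to an already-published Data Clean Room by its ID and fetching a node by name (assuming the retrieve_analytics_dcr method is available in your SDK version; the node name below is the one used earlier in this example):

```python
# Assumption: retrieve an existing DCR by its ID via the client.
existing_dcr = client.retrieve_analytics_dcr(data_room_id)

# Fetch the node using the name shown in the UI "Overview" tab.
node = existing_dcr.get_node("uppercase_text_node")
```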