Python computation

Decentriq’s Data Clean Rooms support running arbitrary Python scripts in confidential computing, guaranteeing a high level of security. You can train and deploy machine learning algorithms on sensitive data that is never revealed to anyone.

Current version

decentriq.python-ml-worker-32-64:v23

Available libraries

Python version: 3.11.6

  • numpy 1.26.1
  • pandas 2.1.1
  • statsmodels 0.14.0
  • seaborn 0.13.0
  • xgboost 2.0.1
  • auto-sklearn 0.15.0
  • pyreadstat 1.2.4
  • streamlit 1.28.1
  • imbalanced-learn 0.11.0
  • lifelines 0.27.8
  • scikit-survival 0.22.1
  • lime 0.2.0.1
  • python-gnupg 0.5.1
note

The list of available libraries is not exhaustive.

How it works

The Python enclave worker is a containerized confidential computing environment inside a Data Clean Room that takes datasets as input, executes an arbitrary script, and generates an output. The container has no open ports, so performing HTTP requests or accessing external resources is not possible. All relevant files must be mounted in advance, as explained below.

At the moment, only CPU processing is supported. Please make sure your script does not require a GPU to execute.

Input

Accessible read-only via the /input directory.

Results of computations available in the Data Clean Room can be mounted to this directory. Once mounted, files are located at a specific path depending on the computation type.

  • Python and R: /input/<computation_name>/<filename>
  • SQL and Synthetic data: /input/<computation_name>/dataset.csv
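
The path conventions above can be sketched as a small helper. This is only an illustration: the function name `input_path`, the type labels `"sql"` and `"synthetic"`, and the computation names used in the calls are hypothetical and not part of any Decentriq API.

```python
from pathlib import PurePosixPath

# Hypothetical labels for computation types whose result is always
# exposed as a single file named dataset.csv.
SINGLE_FILE_TYPES = {"sql", "synthetic"}


def input_path(computation_name, computation_type, filename=None):
    """Return the path where a mounted computation result is expected."""
    base = PurePosixPath("/input") / computation_name
    if computation_type in SINGLE_FILE_TYPES:
        return str(base / "dataset.csv")
    if filename is None:
        raise ValueError("Python and R results need an explicit filename")
    return str(base / filename)


# Hypothetical computation names, for illustration only.
print(input_path("my_sql_query", "sql"))
print(input_path("train_model", "python", "model.pkl"))
```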

Example: the result of a SQL computation named salary_sum can be accessed at /input/salary_sum/dataset.csv

note

The dataset.csv file does not include headers. To read it with headers, please use decentriq_util.sql.read_sql_data_from_dir("/input/salary_sum/")

Optionally, you can also mount your own static files to support your script.

Example: /input/code/certificate.crt

Learn how to mount input files in the sections below, either via the Decentriq UI or the Python SDK.

Script

Bring your existing Python script.

Example logic to read, process and output data:

  1. Import libraries
import pandas as pd
  2. Read content from input file
table_headers = ['column name 1', 'column name 2']
table_data = pd.read_csv("/input/table_name/dataset.csv", names=table_headers)
note

The input files are only available after the Data Clean Room is published and a dataset is provisioned. Therefore, when validating this script (before publishing) the input files will be empty.

  3. Process data
results = table_data.groupby(['column name 1']).mean()

To avoid errors during validation caused by the empty dataset, wrap the data processing logic in a try/except statement that handles the expected issues. Alternatively, mount a test dataset to the container and use an if clause to fall back to it when the main dataset is empty.

  4. Write resulting files to output folder
results.to_csv('/output/result.csv', index=False, header=False)
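
The fallback to a test dataset during validation, mentioned in the processing step above, could look roughly like the sketch below. It assumes that the main dataset is either missing or zero bytes long before publishing. The helper name `pick_dataset` and the test file path `/input/code/test_dataset.csv` are hypothetical.

```python
import os


def pick_dataset(main_path, test_path):
    """Return main_path if it exists and is non-empty, otherwise test_path."""
    if os.path.isfile(main_path) and os.path.getsize(main_path) > 0:
        return main_path
    return test_path


# Hypothetical paths: the provisioned table and a small test file
# mounted alongside the script.
dataset_path = pick_dataset(
    "/input/table_name/dataset.csv",
    "/input/code/test_dataset.csv",
)
print(dataset_path)
```

The returned path can then be passed to pd.read_csv in the same way as in the read step above.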
note

The /tmp directory is made available read/write during the script execution to support your logic.
It will be wiped once the execution is completed, and will not be available in the output.
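
As a rough sketch of how /tmp might be used, the function below stages an intermediate file in scratch space and writes only the final result to the output directory. The function name, file names, and the text processing itself are placeholders. The directories are parameters so that the logic can also be tried outside the enclave.

```python
import os
import tempfile


def process_with_scratch(text, output_dir="/output", scratch_root="/tmp"):
    """Stage an intermediate file in scratch space, then write the final result."""
    # The temporary directory and everything in it is removed on exit.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        intermediate = os.path.join(scratch, "intermediate.txt")
        with open(intermediate, "w") as f:
            f.write(text.strip())
        with open(intermediate) as f:
            cleaned = f.read()

    # Only files written to the output directory are part of the result.
    result_path = os.path.join(output_dir, "result.txt")
    with open(result_path, "w") as f:
        f.write(cleaned.upper())
    return result_path
```

Inside the container, calling process_with_scratch(some_text) with the default arguments would use /tmp for the intermediate file and /output for the result.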

Output

Accessible write-only via the /output directory.

Write all resulting files of your computation to this directory. Sub-directories are also supported.

Once the execution is completed, the output becomes available as <computation_name>.zip to be downloaded by users who have the required permissions.
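
A minimal sketch of writing results into sub-directories of the output folder follows. It assumes the script has to create the sub-directories itself. The helper name `write_output` and the file names in the usage note are hypothetical.

```python
import os


def write_output(relative_path, content, output_dir="/output"):
    """Write text to output_dir/relative_path, creating sub-directories as needed."""
    target = os.path.join(output_dir, relative_path)
    parent = os.path.dirname(target)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(target, "w") as f:
        f.write(content)
    return target
```

For example, write_output("reports/summary.txt", "done") would place the file at /output/reports/summary.txt.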

Create a Python computation in the Decentriq UI

  1. Access platform.decentriq.com with your credentials

  2. Create a Data Clean Room

  3. In the Computations tab, add a new computation of type Python and give it a name:

    [Screenshot: Create Python computation]

  4. In the File browser on the right side, mount the necessary input files (which will become available in the /input directory) by selecting existing computations, tables or files in the Data Clean Room:

    [Screenshot: Mount Python input]

  5. In the Main script tab, paste your existing script and adapt the file paths based on the selected dependencies in the file browser:

    [Screenshot: Python computation draft]

    Clicking the copy icon next to a file in the file browser gives you a snippet that imports that file into a dataframe or file object. Paste it directly into your script.

    note

    If necessary, add static text files to the container by clicking the + icon next to the Main script tab. These files will be available in the /input/code directory.

  6. Press the Test all computations button to check for possible errors in the script.

  7. Once the Data Clean Room is configured with data, computations and permissions, press the Encrypt and publish button.

  8. As soon as the Data Clean Room is published, your computation should be available in the Overview tab, where you can press Run and get the results:

    [Screenshot: Run Python computation]

Create a Python computation using the Python SDK

This example illustrates how to execute Python scripts within the trusted execution environment and how to process unstructured input data. It is also recommended to read through the Python SDK tutorial first, as it introduces important concepts and terminology used here.

Assume we want to create a Data Clean Room that simply converts some text in an input file to uppercase. Using the Python SDK to accomplish this task could look as follows:

First, we set up a connection to an enclave and create an AnalyticsDcrBuilder object:

import decentriq_platform as dq
from decentriq_platform.analytics import AnalyticsDcrBuilder

user_email = "@@ YOUR EMAIL HERE @@"
api_token = "@@ YOUR TOKEN HERE @@"

client = dq.create_client(user_email, api_token)

builder = AnalyticsDcrBuilder(client=client)
builder.\
    with_name("Secure Uppercase").\
    with_owner(user_email)

This enclave allows you to run Python code and provides common machine learning libraries such as pandas and scikit-learn as part of its Python environment.

Before worrying about execution, however, we need to add a data node to which we can upload the input file whose content should be converted to uppercase:

from decentriq_platform.analytics import RawDataNodeDefinition

builder.add_node_definition(
    RawDataNodeDefinition(name="input_data_node", is_required=True)
)

Whenever we add a data or compute node, the builder object will assign an identifier to the newly added node. This identifier needs to be provided to the respective method whenever we want to interact with this node.

The script to uppercase text contained in an input file could look like this:

my_script_content = b"""
with open("/input/input_data_node", "r") as input_file:
    input_data = input_file.read()
with open("/output/uppercase.txt", "w") as output_file:
    output_file.write(input_data.upper())
"""

Here we defined the script within a multi-line string. For larger scripts, however, defining them in a file would likely be easier.

Now we can add the node that will actually execute our script. To use this script in a Data Clean Room, it first has to be loaded into a PythonComputeNodeDefinition node, which is then added to the DCR configuration. This makes the Python script visible to all the participants in the DCR.

# If you wrote your script in a separate file, you can simply open
# the file using `with open` (note that we specify the "b" flag to read
# the file as a binary string), like so:
#
# with open("my_script.py", "rb") as data:
#     my_script_content = data.read()

from decentriq_platform.analytics import PythonComputeNodeDefinition

builder.add_node_definition(
    PythonComputeNodeDefinition(
        name="uppercase_text_node",
        script=my_script_content,
        dependencies=["input_data_node"]
    )
)
note

The name given to the compute node has no meaning to the enclave and only serves as a human-readable name.

The enclave addresses computations (and data nodes) using identifiers that are automatically generated by the Data Clean Room builder object when adding the node to the Data Clean Room. These ids are required when we want to interact with the node (e.g. triggering the computation or referring to the computation when adding user permissions).

builder.add_participant(
    user_email,
    data_owner_of=["input_data_node"],
    analyst_of=["uppercase_text_node"]
)

dcr_definition = builder.build()
dcr = client.publish_analytics_dcr(dcr_definition)

data_room_id = dcr.id

After building and publishing the DCR, we can upload data and connect it to our input node.

# Here again, you can use the Python construct `with open(path, "rb") as data`
# to read the data in the right format from a file.
import io
from decentriq_platform import Key

key = Key() # generate an encryption key with which to encrypt the dataset
raw_data_node = dcr.get_node("input_data_node")
data = io.BytesIO(b"hello world")
raw_data_node.upload_and_publish_dataset(data, key, "my-data.txt")

When retrieving results for the computation, you get a zipfile.ZipFile object containing all the files your script wrote to the output directory.

python_node = dcr.get_node("uppercase_text_node")
results = python_node.run_computation_and_get_results_as_zip()
result_txt = results.read("uppercase.txt").decode()
assert result_txt == "HELLO WORLD"
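
When a computation writes several files, the returned archive can be inspected with the standard zipfile API. The sketch below uses an in-memory archive as a stand-in for the real results. It assumes every file holds UTF-8 text, and the helper name `read_all_text` is hypothetical.

```python
import io
import zipfile


def read_all_text(results):
    """Return {filename: decoded text} for every file in a results archive."""
    return {
        name: results.read(name).decode()
        for name in results.namelist()
        if not name.endswith("/")  # skip directory entries
    }


# Stand-in archive built in memory; in practice `results` would be the
# zipfile.ZipFile object returned by the computation.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("uppercase.txt", "HELLO WORLD")

with zipfile.ZipFile(buffer) as results:
    print(read_all_text(results))
```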
note

When the referenced Data Clean Room was created using the Decentriq UI, the name argument of get_node() should be the node name shown in the UI's "Overview" tab.