Step-by-step guide - Build your first data clean room

Below, you will go through a practical example showing the following use-case:

A bank and an insurance provider want to know how much their customer bases overlap, but neither can share its CRM data with the other party. This is an example of a collaboration workflow that the Decentriq platform makes possible. Using the platform, the parties can securely connect sensitive customer data while keeping it private, and run the overlap computation on it with a straightforward workflow.

In this example, we will define the computations in SQL. You can use the following files to reproduce the steps below and create your first data clean room:

Simple example material ➞

Step 1 - Access the platform
  • Navigate to https://platform.decentriq.com/

  • Log in with your credentials. If you do not have credentials yet, please reach out to your contact person on the Decentriq team.

Step 2 - Create a data clean room

Create a DCR

  • Click the New data clean room button.

  • Give it a name. In this example, the data clean room covers a 'confidential computing overlap' between a bank and an insurance provider.

  • Here you can decide to start from scratch or to import a JSON template like the one in the 'simple example material' folder linked at the beginning of this page.

Step 3 - Define the datasets

Define the datasets to be provisioned by Data Owners.

Datasets

  • By default, datasets must be provisioned before computations that depend on them can be run. This can be toggled via the checkbox.

  • Add a new table when using structured datasets (CSV) and define the expected schema by adding columns with types.

  • Add a new file when working with unstructured datasets (JSON, TXT, ZIP or any other kind).
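
To make the idea of an expected schema concrete, here is a minimal Python sketch that checks CSV content against a list of columns and types before provisioning. The column names (customer_email, age) and the helper check_csv are hypothetical and are not part of the Decentriq platform; the platform performs its own validation during provisioning.

```python
import csv
import io

# Hypothetical expected schema: column name -> Python type used for parsing.
EXPECTED_SCHEMA = {"customer_email": str, "age": int}

def check_csv(text, schema=EXPECTED_SCHEMA, has_header=True):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    if has_header and rows:
        header, rows = rows[0], rows[1:]
        if header != list(schema):
            problems.append(f"header {header} does not match {list(schema)}")
    first_line = 2 if has_header else 1
    for line_no, row in enumerate(rows, start=first_line):
        if len(row) != len(schema):
            problems.append(
                f"line {line_no}: expected {len(schema)} columns, got {len(row)}"
            )
            continue
        for value, (name, typ) in zip(row, schema.items()):
            try:
                typ(value)  # a failed conversion means the type does not match
            except ValueError:
                problems.append(
                    f"line {line_no}: {name}={value!r} is not {typ.__name__}"
                )
    return problems

good = "customer_email,age\nalice@example.com,34\n"
bad = "customer_email,age\nbob@example.com,unknown\n"
print(check_csv(good))
print(check_csv(bad))
```

Running such a check locally before the upload can catch mismatched columns early, but it does not replace the schema defined in the data clean room.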

Step 4 - Define the computations

Computations can be written in SQL, Python or R, or can be Synthetic Data computations. In this example, we will define the overlap computation as SQL queries. A list of supported data types and SQL clauses is available in the Decentriq documentation.

SQL Computation

  • Here, you can also set up the privacy settings. Their purpose is to guarantee that the output does not leak sensitive data. In the example, the privacy filter is activated, which guarantees that a minimum number of rows is aggregated before the output is shown.

  • Type in the query content. Use the Table browser as a quick reference for the available tables and columns.
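
As an illustration of what an overlap query could look like, the following sketch runs one locally with Python's built-in sqlite3 module, so it can be tried without platform access. The table names (bank_customers, insurance_customers), the email join column and the MIN_ROWS threshold are assumptions made for this example; the SQL dialect supported by the platform may differ, and the outer WHERE clause only imitates the idea of a minimum aggregation size rather than reproducing the platform's privacy filter.

```python
import sqlite3

MIN_ROWS = 2  # hypothetical minimum number of rows that must be aggregated

# Count distinct customers present in both tables; return no row at all
# when the count falls below the threshold.
OVERLAP_QUERY = """
SELECT overlap FROM (
    SELECT COUNT(DISTINCT b.email) AS overlap
    FROM bank_customers AS b
    JOIN insurance_customers AS i ON b.email = i.email
)
WHERE overlap >= ?
"""

def customer_overlap(bank_emails, insurance_emails, min_rows=MIN_ROWS):
    """Return the overlap count, or None if it is below min_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bank_customers (email TEXT)")
    conn.execute("CREATE TABLE insurance_customers (email TEXT)")
    conn.executemany(
        "INSERT INTO bank_customers VALUES (?)", [(e,) for e in bank_emails]
    )
    conn.executemany(
        "INSERT INTO insurance_customers VALUES (?)",
        [(e,) for e in insurance_emails],
    )
    row = conn.execute(OVERLAP_QUERY, (min_rows,)).fetchone()
    conn.close()
    return row[0] if row else None

bank = ["a@example.com", "b@example.com", "c@example.com"]
insurance = ["b@example.com", "c@example.com", "d@example.com"]
print(customer_overlap(bank, insurance))      # two shared customers
print(customer_overlap(bank, insurance[2:]))  # no shared customers: suppressed
```

Neither party sees the other's rows in the result; only the aggregated count is returned, and only when it meets the threshold.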

As a further example, add a new Synthetic Data computation that takes a sensitive table as its source:

Synthetic data

  • Mask the columns where the value should not appear in the results - these will be replaced with a random value of each type.

  • All other columns will be synthesized using differential privacy while keeping similar statistical properties.
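
The masking behaviour described above can be pictured with a short Python sketch. It covers only the masking part, replacing each value in a masked column with a random value of the same basic type; it does not attempt the differentially private synthesis of the remaining columns. The helpers random_like and mask_columns are hypothetical names for this illustration and do not reflect how the platform implements masking.

```python
import random
import string

def random_like(value, rng):
    """Return a random value of the same basic type as `value`."""
    if isinstance(value, bool):
        return rng.choice([True, False])
    if isinstance(value, int):
        return rng.randint(0, 10**6)
    if isinstance(value, float):
        return rng.random()
    # Fall back to a random lowercase string for text and other types.
    return "".join(rng.choices(string.ascii_lowercase, k=8))

def mask_columns(rows, masked, seed=None):
    """Copy `rows` (a list of dicts), randomizing the columns named in `masked`."""
    rng = random.Random(seed)
    return [
        {k: random_like(v, rng) if k in masked else v for k, v in row.items()}
        for row in rows
    ]

rows = [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 51}]
masked_rows = mask_columns(rows, masked={"name"}, seed=0)
print(masked_rows)
```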

Add as many computations as you wish, combining different languages and having them reference each other's results.

Once completed, press the Test all computations button to make sure they will work once the data clean room is published. Note: this tests the computations with empty datasets and only returns the expected result schema. After publishing, Data Owners can provision datasets and the computations can be run.
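
The idea of testing against empty datasets can be reproduced locally. The sketch below, which uses Python's built-in sqlite3 module, runs a query against empty tables and reads back only the names of the result columns. The table and column names are hypothetical, and the platform's own test may work differently internally.

```python
import sqlite3

def result_schema(create_statements, query):
    """Run `query` against empty tables; return (column names, rows)."""
    conn = sqlite3.connect(":memory:")
    for statement in create_statements:
        conn.execute(statement)
    cursor = conn.execute(query)
    # cursor.description is populated even when the query yields no rows.
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    conn.close()
    return columns, rows

tables = [
    "CREATE TABLE bank_customers (email TEXT, segment TEXT)",
    "CREATE TABLE insurance_customers (email TEXT)",
]
query = """
SELECT b.segment AS segment, COUNT(*) AS customers
FROM bank_customers AS b
JOIN insurance_customers AS i ON b.email = i.email
GROUP BY b.segment
"""
columns, rows = result_schema(tables, query)
print(columns, rows)
```

A query that references a missing table or column fails at this stage, which is the kind of error such a test is meant to surface before publishing.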

Step 5 - Set permissions

Define the participants that will be invited to the collaboration, and assign them permissions to interact with the tables and/or computations.

Permissions

  • Enable data clean room interactivity to allow participants to request new computations (to be approved by the affected Data Owners) after the data clean room is published. Otherwise, the data clean room is immutable once published.

  • Enable the development environment to give participants access to a tab where they can run arbitrary computations on the data and computation results for which they have permissions.

  • Use the dropdown boxes to assign Data Owner and Analyst permissions to each participant on each dataset and computation.

  • Add a new participant by typing in their email - an invitation will be sent as soon as the data clean room is published.
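
The permission model described above can be summarized as a small data structure: Data Owner permissions are attached to datasets, and Analyst permissions are attached to computations. The following Python sketch is only a mental model with hypothetical names (Permissions, can_provision, can_run) and example email addresses; it is not the platform's API.

```python
from dataclasses import dataclass, field

@dataclass
class Permissions:
    # dataset name -> set of participant emails with Data Owner permission
    data_owners: dict = field(default_factory=dict)
    # computation name -> set of participant emails with Analyst permission
    analysts: dict = field(default_factory=dict)

    def can_provision(self, email, dataset):
        return email in self.data_owners.get(dataset, set())

    def can_run(self, email, computation):
        return email in self.analysts.get(computation, set())

perms = Permissions(
    data_owners={
        "bank_customers": {"bank@example.com"},
        "insurance_customers": {"insurer@example.com"},
    },
    analysts={"overlap": {"bank@example.com", "insurer@example.com"}},
)
print(perms.can_provision("bank@example.com", "bank_customers"))
print(perms.can_provision("bank@example.com", "insurance_customers"))
print(perms.can_run("insurer@example.com", "overlap"))
```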

Step 6 - Encrypt and publish the data clean room
  • Click the Encrypt and publish button at the top-right side.

  • The data clean room definition will be enforced in our confidential computing environment once published, and can only be changed after publishing if the interactivity feature is enabled.

  • Note that you can duplicate the DCR, or export its definition in JSON format to save it offline, at any time.

  • Now, participants can start interacting with the published data clean room.

Step 7 - Provision datasets and run computations

The Actions tab contains all datasets of which you are a Data Owner and all computations for which you have Analyst permissions. To see the entire DCR definition, refer to the Overview tab.

Provision and run

  • Data Owners can provision the datasets in CSV format to the tables by following the guided wizard. Analysts can run the computations and get the results back.

  • You can find the CSV files needed to run the example in the ZIP archive provided above. After you have provisioned the datasets, you can run your computation.

  • It is also possible to provision unstructured datasets if a file was defined in the Data tab when drafting the DCR.

  • Once all necessary data is available, click the Run button of each computation and get the results.

Step 8 - Browse provisioned datasets

From the sidebar, access the overview of all your provisioned data in the Datasets page:

Dataset statistics

Here you can see the full list of the datasets you have uploaded and the data clean rooms in which they are used, along with some information about each dataset:

  • Description

  • Size and number of rows

  • The columns of the dataset

  • The summary statistics computed during the upload, if available. Note that the summary statistics can optionally be shared with the other participants of a given data clean room.

Step 9 - Check the tamper-proof audit log

All participants of a DCR created via the UI have auditing permissions, for full transparency and to build trust among participants.

Audit log

This means that all of them can:

  • Inspect the data clean room definition

  • Know who uploaded data

  • Access and download the audit log, which is the record of all activities in the data clean room together with the user who performed each of them

Data clean room interactivity

Step 10 - Develop a new computation

Create an arbitrary Python computation in the Development tab:

Develop new computation

  • In the File browser on the right side, select the Synthetic data computation as Available data and write the Python logic in the editor.

  • Press Run to execute and see results.

  • Hint: for debugging, feel free to make use of print(), for example, and inspect the output console.

  • Once confident with the results, press the Create request button to integrate this computation into the data clean room.
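
As a rough idea of the kind of script that could be written in the editor, the sketch below aggregates a CSV file and uses print() for debugging. How the platform exposes the selected Available data to the script is not described in this guide, so the input path is left as a parameter and the example writes its own temporary file; the function name count_by_column and the column names are hypothetical.

```python
import csv
import os
import tempfile
from collections import Counter

def count_by_column(input_path, column):
    """Count rows per distinct value of `column` in a CSV file with a header."""
    with open(input_path, newline="") as f:
        reader = csv.DictReader(f)
        counts = Counter(row[column] for row in reader)
    # Debugging output, to be inspected in the output console.
    print(f"read {sum(counts.values())} rows from {input_path}")
    return dict(counts)

# Stand-in for the synthetic data: a small temporary CSV file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "synthetic.csv")
    with open(path, "w", newline="") as f:
        f.write("segment,age\nretail,34\nretail,51\nprivate,47\n")
    result = count_by_column(path, "segment")
print(result)
```

Developing against synthetic data first means the logic can be debugged freely before it is requested against the original dataset.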

Step 11 - Request a new computation

In the Requests tab, manage the new computations to be integrated into the data clean room:

Create request

  • Choose which participants will be able to run this computation by assigning Analyst permissions.

  • In the File browser on the right side, select the original dataset for the computation as Available data and adapt the Python code in the editor to reference this dataset.

  • Press Validate to check the script and determine the Data Owners affected by this request.

  • Once ready, Submit for approval - this will ask the affected Data Owners to review and approve the request.

Step 12 - Approve and run a new computation

In the Requests tab, Data Owners can review the computation being requested:

Approve request

  • When only one Data Owner is affected, it is also possible to run and preview results before approving.

  • When multiple Data Owners are affected, all of them are required to approve the computation.

  • Once confident, they can press the Approve button so that the computation gets integrated into the data clean room.
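
The approval rule above (every affected Data Owner must approve) can be stated compactly in code. This Python sketch uses a hypothetical helper, request_status, and example email addresses; it models the rule only and is not part of the platform.

```python
def request_status(affected_owners, approvals):
    """Return 'approved' only when every affected Data Owner has approved."""
    pending = set(affected_owners) - set(approvals)
    if pending:
        return "pending: " + ", ".join(sorted(pending))
    return "approved"

affected = ["bank@example.com", "insurer@example.com"]
print(request_status(affected, ["bank@example.com"]))
print(request_status(affected, ["insurer@example.com", "bank@example.com"]))
```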

The new computation can be run directly via the Actions tab:

Run new computation