Datasets

Overview

Datasets are tabular (.CSV) or unstructured (any type, for example .JSON, .TXT, .ZIP) files that can be uploaded to the Decentriq platform, either from your computer or from external sources. Any file uploaded from your computer is automatically encrypted using a state-of-the-art encryption algorithm before it is transmitted over the internet.

Once in the Decentriq platform, datasets can be provisioned to Data Clean Rooms (DCRs) to be read by computations that output results for analyses. These results can also be stored as datasets.

Datasets can then be exported to external destinations or provisioned to other Data Clean Rooms.

To import and export datasets, check the Data connectors guide.

At any time, it is possible to deprovision a dataset from a DCR, or to completely delete a dataset and all its traces from the Decentriq platform.

To manage all your datasets, access the Datasets page from the sidebar in the Decentriq UI.

Provision datasets

Datasets can be uploaded to the Decentriq platform in encrypted form via the UI or SDK, or imported from an external source.

An encryption key is generated locally at the time of upload, such that the dataset is transmitted already encrypted to the Decentriq platform.

The encryption key can be stored in the Keychain confidentially, enabling you to reuse datasets across Data Clean Rooms without having to re-upload them to the Decentriq platform.

Once in the Decentriq platform, datasets can be provisioned to Table or File nodes in DCRs and read by computations therein for further analyses.

To provision a dataset to a DCR, the encryption key with which the dataset was originally encrypted is required.

Please follow the steps below to provision a dataset via the Decentriq UI or via the SDK.

Deprovision and delete datasets

Deprovisioning a dataset from a DCR means it will no longer be available for computations that depend on it. Previous results derived from computations on that dataset will be erased. The dataset will not be deleted from the Decentriq platform, and will not be deprovisioned from other DCRs. It can be reprovisioned to other DCRs or even to the same one.

Deleting a dataset means it will be deprovisioned from all DCRs and completely deleted from the Decentriq platform. This is irreversible: all traces of your data, including all derived datasets and results, will be deleted from the platform.
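The difference between the two operations can be illustrated with a small in-memory model. This is only a sketch of the semantics described above, not the Decentriq SDK: the ToyPlatform class and all of its method names are hypothetical, and derived datasets are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPlatform:
    """Hypothetical in-memory model of the rules above (not the Decentriq API)."""

    datasets: set = field(default_factory=set)
    provisions: dict = field(default_factory=dict)  # dataset -> set of DCR names
    results: dict = field(default_factory=dict)     # (dataset, dcr) -> list of results

    def upload(self, dataset: str) -> None:
        self.datasets.add(dataset)
        self.provisions.setdefault(dataset, set())

    def provision(self, dataset: str, dcr: str) -> None:
        if dataset not in self.datasets:
            raise KeyError(f"unknown dataset: {dataset}")
        self.provisions[dataset].add(dcr)

    def record_result(self, dataset: str, dcr: str, result: str) -> None:
        if dcr not in self.provisions.get(dataset, set()):
            raise ValueError(f"{dataset} is not provisioned to {dcr}")
        self.results.setdefault((dataset, dcr), []).append(result)

    def deprovision(self, dataset: str, dcr: str) -> None:
        # Only the link to this one DCR and its results are removed;
        # the dataset and its other provisions stay on the platform.
        self.provisions[dataset].discard(dcr)
        self.results.pop((dataset, dcr), None)

    def delete(self, dataset: str) -> None:
        # Irreversible: deprovision everywhere, then drop the dataset itself.
        for dcr in list(self.provisions.get(dataset, ())):
            self.deprovision(dataset, dcr)
        self.provisions.pop(dataset, None)
        self.datasets.discard(dataset)


platform = ToyPlatform()
platform.upload("sales")
platform.provision("sales", "dcr-a")
platform.provision("sales", "dcr-b")
platform.record_result("sales", "dcr-a", "report-1")
platform.deprovision("sales", "dcr-a")  # "sales" remains stored and linked to dcr-b
platform.delete("sales")                # "sales" is gone from every DCR
```

In this model, deprovisioning is scoped to one dataset and one DCR, while deleting is scoped to the dataset across the whole platform.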

Please follow the steps below to deprovision or delete a dataset via the Decentriq UI or via the SDK.

Provision datasets via the Decentriq UI

Access a DCR. The Table or File nodes for which you have Data Owner permissions appear with the option to Provision dataset.

Data node ready to be provisioned

Clicking the Provision dataset button opens the upload wizard, where you can select a dataset from your computer or one already stored in the platform.

Choose the source of the dataset

Select a dataset from your computer

For File nodes, any file type can be selected. In this case, there is no validation and the dataset is provisioned immediately.

For Table nodes, it's only possible to select CSV files. The Decentriq UI lets you adjust your dataset to match the required table schema.

Preview and format dataset

It's possible to:

  • Indicate whether the dataset has the first row as header or not.
  • Select the column separator, i.e. the CSV file delimiter.
  • Select the decimal separator for floating point numbers.
    • Note: the selected separator only applies if Autofix values is turned on.
  • Autofix values
    • Normalize emails to lower case and remove spaces.
    • Normalize phone numbers to match the E.164 format.
    • Apply the selected decimal separator (e.g. 1,23 converted to 1.23).
  • Drop rows containing invalid values.
  • Re-map columns to match the required schema.
  • Hash values with SHA256 (hex-encoded) before they are uploaded (if the schema requires it).
  • Visualize rows that failed validation, which columns have been mapped, and more.
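The kind of preparation listed above can be sketched in plain Python. This is an illustrative approximation, not the Decentriq UI's implementation: the function names, the three-column layout (email, phone, amount), and the simplified validity rules are all assumptions made for the example.

```python
import csv
import hashlib
import io
import re
from typing import List, Optional, Tuple

# E.164: a "+" followed by up to 15 digits, the first of which is not 0.
E164_PATTERN = re.compile(r"^\+[1-9]\d{1,14}$")


def normalize_email(value: str) -> str:
    """Lower-case an email address and remove all whitespace."""
    return "".join(value.split()).lower()


def normalize_phone(value: str) -> Optional[str]:
    """Strip common formatting; return an E.164 string, or None if invalid.

    Simplification: numbers without an international prefix are rejected,
    because the country code cannot be inferred from the number alone.
    """
    compact = re.sub(r"[\s\-().]", "", value)
    if compact.startswith("00"):
        compact = "+" + compact[2:]
    return compact if E164_PATTERN.match(compact) else None


def apply_decimal_separator(value: str, separator: str = ",") -> str:
    """Rewrite a number written with `separator` so that it uses a dot."""
    return value.strip().replace(separator, ".")


def sha256_hex(value: str) -> str:
    """Hex-encoded SHA-256 digest of the UTF-8 encoded value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def prepare_rows(
    csv_text: str,
    delimiter: str = ";",
    has_header: bool = True,
    decimal_separator: str = ",",
) -> List[Tuple[str, str, str]]:
    """Parse, autofix, validate, and hash rows of (email, phone, amount)."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if has_header:
        rows = rows[1:]
    prepared = []
    for row in rows:
        if len(row) != 3:
            continue  # drop rows that do not match the assumed schema
        email = normalize_email(row[0])
        phone = normalize_phone(row[1])
        amount = apply_decimal_separator(row[2], decimal_separator)
        try:
            float(amount)
        except ValueError:
            continue  # drop rows with a non-numeric amount
        if "@" not in email or phone is None:
            continue  # drop rows with an invalid email or phone number
        # Hash after normalization so equal emails produce equal digests.
        prepared.append((sha256_hex(email), phone, amount))
    return prepared


sample = (
    "email;phone;amount\n"
    " Alice@Example.COM ;+41 44 123 45 67;1,23\n"
    "bob@example.com;12345;4,5\n"
)
rows = prepare_rows(sample)  # only the first data row passes validation
```

Normalizing before hashing matters: two parties that hash "Alice@Example.COM" and "alice@example.com" without normalization would get different digests, and the values would not match.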

By default, the generated encryption key for the selected dataset will be stored in the Keychain, such that the same dataset can be reprovisioned to other DCRs afterwards.

Once the dataset formatting is ready, click Encrypt and Provision to transmit the encrypted dataset to the Decentriq platform and provision it to the Table or File node in the DCR.

Select a stored dataset

Select stored dataset

  • All datasets that have an encryption key stored in the Keychain will be listed.
  • Select the desired dataset to immediately provision it to the Table or File node.

Store a computation result as a dataset via the Decentriq UI

In a DCR, once a computation has run and the result is available, it can be downloaded or stored as a dataset. When stored as a dataset, it can then be exported to external destinations or reprovisioned to other DCRs. To do so, locate the computation inside a DCR and click the retrieve result icon.

Store computation result

Then click Store as dataset.

Deprovision and delete datasets via the Decentriq UI

Data node dataset actions

Deprovision

From the DCR, click the down arrow next to the dataset and click Deprovision.

Alternatively, from the Datasets page, locate the dataset and click the “unlink” icon in front of the DCR name.

Delete

From the DCR, click the down arrow next to the dataset and click Delete.

Alternatively, from the Datasets page, locate the dataset, click the icon in the top-right corner of the details panel and click Delete.

Provision datasets via SDK

To perform dataset operations programmatically using the Decentriq Python SDK, please follow the Provision Datasets Cookbook.