Datasets

import decentriq_platform as dq
from decentriq_platform.analytics import *
from decentriq_platform.lookalike_media import LookalikeMediaDcrBuilder

USER_EMAIL = "@@ YOUR EMAIL HERE @@"
OTHER_EMAIL = "@@ OTHER EMAIL HERE @@"
API_TOKEN = "@@ YOUR TOKEN HERE @@"
KEYCHAIN_PASSWORD = "@@ YOUR KEYCHAIN PASSWORD HERE @@"

client = dq.create_client(USER_EMAIL, API_TOKEN)

enclave_specs = dq.enclave_specifications.versions([...])
# Patch the available enclave specs to the ones currently running in the test
# environment.
# From now on, every time code accesses the latest specs it will use the test specs.
dq.enclave_specifications.specifications = {
    f"{name}:v0": spec
    for name, spec in enclave_specs.items()
}

auth, _ = client.create_auth_using_decentriq_pki(enclave_specs)
session = client.create_session(auth, enclave_specs)

builder = AnalyticsDcrBuilder(client=client)

dcr_definition = builder.\
with_name("My DCR").\
with_owner(USER_EMAIL).\
with_description("My test DCR").\
add_node_definitions([
RawDataNodeDefinition(name="my-raw-data-node", is_required=True),
TableDataNodeDefinition(
name="my-table-data-node",
columns=[
Column(
name="name",
format_type=FormatType.STRING,
is_nullable=False,
),
Column(
name="salary",
format_type=FormatType.INTEGER,
is_nullable=False,
),
],
is_required=True,
),
]).\
add_participant(
USER_EMAIL,
data_owner_of=[
"my-raw-data-node",
"my-table-data-node",
]
).\
build()
dcr = client.publish_analytics_dcr(dcr_definition)
DCR_ID = dcr.id


# Build a Lookalike DCR to which we provision the DataLab
builder = LookalikeMediaDcrBuilder(client)
builder.with_name("sdk-lmdcr")
builder.with_matching_id_format(dq.types.MatchingId.STRING)
builder.with_publisher_emails(OTHER_EMAIL)
builder.with_advertiser_emails(USER_EMAIL)
lmdcr = builder.build_and_publish()
LMDCR_HASH = lmdcr.id

List datasets and get details

datasets = client.get_available_datasets()
# Each such dataset is a dictionary:
dataset_name = datasets[0]["name"]
manifest_hash = datasets[0]["manifestHash"]
client.get_dataset(manifest_hash)
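Since each entry is a plain dictionary, you can index the list however suits your workflow. A minimal standalone sketch, assuming only the "name" and "manifestHash" keys shown above (real entries contain additional fields, and the names and hashes below are made up):

```python
# Sample data mimicking the shape returned by get_available_datasets();
# the names and hashes are illustrative only.
datasets = [
    {"name": "audiences.csv", "manifestHash": "abc123"},
    {"name": "salaries.csv", "manifestHash": "def456"},
]

# Build a lookup from dataset name to manifest hash
by_name = {d["name"]: d["manifestHash"] for d in datasets}
assert by_name["salaries.csv"] == "def456"
```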

Provision datasets via SDK

To perform the dataset operations below, please follow the steps in the Get started with Python SDK tutorial to instantiate the client and establish a session with the Decentriq platform.

All examples below make use of the Keychain, enabling you to reuse datasets across DCRs without having to re-upload them to the Decentriq platform.

import decentriq_platform as dq

# Instantiate `client` and `session`
# as described in the Get started with Python SDK tutorial
# ...

# Initiate the Keychain with your Keychain password
keychain = dq.Keychain.get_or_create_unlocked_keychain(
    client,
    # Convert the password string to the required bytes object
    password=KEYCHAIN_PASSWORD.encode()
)
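The encode() call above is plain Python string-to-bytes conversion (UTF-8 by default); nothing Decentriq-specific is involved. A quick standalone illustration with a made-up password:

```python
# Hypothetical password, for illustration only
password = "my-keychain-password"

# str.encode() produces the bytes object the Keychain API expects
encoded = password.encode()

assert isinstance(encoded, bytes)
assert encoded.decode() == password  # UTF-8 round-trip recovers the string
```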

Upload a dataset via the SDK and continue in the Decentriq UI

If your goal is to use the SDK only to encrypt and upload datasets, and then continue your workflow in the Decentriq UI, the flow looks like this:

  • Create a DCR in the Decentriq UI, or participate in one as a Data Owner
  • Upload datasets via the SDK for better control and performance
  • Access the Decentriq UI and provision the dataset to one or more DCRs

# Generate an encryption key
encryption_key = dq.Key()

# Read dataset locally, encrypt, upload and provision it to DCR
with open("/path/to/dataset.csv", "rb") as dataset:
    DATASET_ID = client.upload_dataset(
        dataset,
        encryption_key,
        "dataset_name",
        store_in_keychain=keychain
    )
Note: (Publishers only) use this method to upload a dataset and then provision it to a Data Lab via the Decentriq UI.

Check the examples below to perform not only the upload but also the provisioning of datasets directly from the SDK.

Upload and provision a dataset to an Analytics DCR

Tabular datasets (.CSV files) to table nodes:

# Generate an encryption key
encryption_key = dq.Key()

dcr = client.retrieve_analytics_dcr(DCR_ID)

# Read dataset locally, encrypt, upload and provision it to DCR
with open("/path/to/dataset.csv", "rb") as tabular_dataset:
    DATASET_ID = dcr.get_node("my-table-data-node").upload_and_publish_dataset(
        tabular_dataset,
        name="My Dataset",
        key=encryption_key,
        # Optionally store the encryption key in the keychain.
        # This way the dataset can be re-provisioned from within the
        # Decentriq UI too.
        store_in_keychain=keychain,
    )

Unstructured (or “raw”) datasets (.JSON, .TXT, .ZIP, etc.) to file nodes:

# Generate an encryption key
encryption_key = dq.Key()

dcr = client.retrieve_analytics_dcr(DCR_ID)

# Read dataset locally, encrypt, upload and provision it to DCR
with open("/path/to/file.json", "rb") as raw_dataset:
    dcr.get_node("my-raw-data-node").upload_and_publish_dataset(
        raw_dataset,
        name="My Dataset",
        key=encryption_key,
        # Optionally store the encryption key in the keychain.
        # This way the dataset can be re-provisioned from within the
        # Decentriq UI too.
        store_in_keychain=keychain
    )

Upload and provision a dataset to a Lookalike Clean Room

As an Advertiser, you can upload and provision an audience directly to a Lookalike Clean Room by calling the provision_dataset method with the AUDIENCES dataset type:

# Generate an encryption key
encryption_key = dq.Key()

# Read dataset locally, encrypt, upload and provision it to DCR
with open("/path/to/audiences_data.csv", "rb") as audience_dataset:
    dq.lookalike_media.provision_dataset(
        audience_dataset,
        name="My Dataset",
        session=session,
        key=encryption_key,
        # Data Clean Room ID copied from Decentriq UI
        data_room_id=LMDCR_HASH,
        store_in_keychain=keychain,
        dataset_type=dq.lookalike_media.DatasetType.AUDIENCES
    )

Provision an existing dataset to an Analytics DCR via the SDK

with open("/path/to/dataset.csv", "rb") as dataset:
    DATASET_ID = client.upload_dataset(
        dataset,
        encryption_key,
        "dataset_name",
        store_in_keychain=keychain
    )

Whether you uploaded the dataset via the SDK as shown above or using the Decentriq UI, you can provision it to a DCR as follows.

# Get the DCR via the ID seen in the Decentriq UI
dcr = client.retrieve_analytics_dcr(DCR_ID)

# Then, retrieve the encryption key stored in the Keychain.
# DATASET_ID is an id copied from the Decentriq UI "Datasets" page or
# from the list of datasets retrieved via the SDK.
retrieved_key = keychain.get("dataset_key", DATASET_ID)

# Reprovision the existing dataset to a DCR
dcr.get_node("my-raw-data-node").publish_dataset(
    DATASET_ID,
    dq.Key(retrieved_key.value)
)

Deprovision and delete datasets via SDK

Deprovision

# Deprovision the dataset from the Table or File node
dcr.get_node("my-raw-data-node").remove_published_dataset()

Delete

Before deleting a dataset from the Decentriq platform, make sure it is first deprovisioned from all DCRs, either by calling the method above or via the Decentriq UI.

Once it's been completely deprovisioned, it can be deleted:

# Delete the dataset from the Decentriq platform.
# DATASET_ID is an id copied from the Decentriq UI Datasets page or one
# that was retrieved via the SDK.
client.delete_dataset(DATASET_ID)

The encryption key will remain in the Keychain and must be removed separately. Please check the Keychain guide for more details.

Copy IDs from the Decentriq UI to use in the SDK

To obtain a DCR ID

Access the DCR, click on the … icon in the top-right corner, then Copy ID.


To obtain a Table or File node Name

Use the same node name as you see in the UI.

To obtain a dataset ID

Access the Datasets page, locate the dataset and copy the ID displayed at the bottom of the details panel.