Synthetic data generation
Generate, from any table, a differentially private synthetic copy of the data with statistical properties similar to the original (the two are compared in a dedicated report produced by our platform). This allows you to prototype your scripts locally before running them on the original data in a dedicated Data Clean Room.
Current version
decentriq.python-synth-data-worker-32-64:v12
How it works
The synthetic data generation takes a data source as input, alongside a privacy mode and a masking configuration, and produces artificial data with the same schema and similar statistical properties as the original data source. To assess the similarity, a quality report can optionally be produced.
Differential privacy is applied during the synthetic data generation to ensure privacy in the result.
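To build intuition for how the privacy parameter ε trades accuracy for privacy, here is an illustrative sketch of the classic Laplace mechanism. This is not the platform's actual algorithm; the query and count are hypothetical:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    # Laplace mechanism: noise scale = sensitivity / epsilon, so a
    # smaller epsilon (stricter privacy) means more noise in the answer.
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(0)
ages_over_40 = 23  # hypothetical count query on the original data
print(f"epsilon=3: {dp_count(ages_over_40, 3.0):.1f}")
print(f"epsilon=1: {dp_count(ages_over_40, 1.0):.1f}")
```

The same principle applies to the privacy modes below: a lower ε yields stronger privacy guarantees at the cost of accuracy.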
Data source
A tabular dataset: either a Table or a SQL computation result, with at least 50 rows.
Suggested maximum dataset sizes:
- 50.000.000 cells (e.g. 50 columns, 1.000.000 rows) for privacy mode Balanced
- 20.000 cells (e.g. 5 columns, 4.000 rows) for privacy modes Strict and Strictest
Example:
email | age | attribute1 | attribute2 | attribute3 | attribute4 |
---|---|---|---|---|---|
user1@email.com | 19 | 1 | 1 | 1 | 0 |
user2@email.com | 25 | 0 | 1 | 0 | 0 |
user3@email.com | 56 | 0 | 1 | 1 | 1 |
user4@email.com | 76 | 0 | 0 | 0 | 0 |
user5@email.com | 44 | 1 | 1 | 1 | 0 |
user6@email.com | 55 | 0 | 1 | 0 | 1 |
user7@email.com | 23 | 0 | 0 | 1 | 0 |
user8@email.com | 34 | 1 | 1 | 0 | 0 |
user9@email.com | 32 | 0 | 0 | 1 | 1 |
user10@email.com | 15 | 1 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... |
user50@email.com | 51 | 1 | 0 | 1 | 0 |
To generate synthetic data, a dataset of at least 50 rows is required.
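A minimal sketch for producing a headerless CSV in the shape of the example above (an email, an age, and four binary attributes) with exactly the required 50-row minimum; the file name and value ranges are assumptions for illustration:

```python
import csv
import random

random.seed(7)
with open("original_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(1, 51):  # exactly 50 rows, the required minimum
        writer.writerow([
            f"user{i}@email.com",                       # email
            random.randint(15, 80),                     # age
            *(random.randint(0, 1) for _ in range(4)),  # attribute1..attribute4
        ])
```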
Privacy mode
This setting changes the epsilon (ε) parameter, which affects both processing speed and the level of privacy.
- Balanced uses standard synthetic data techniques for best accuracy and quickest results.
- Strict uses a formal privacy budget (ε = 3) for extremely strong privacy guarantees, with somewhat less accuracy and at the cost of increased processing time.
- Strictest uses a formal privacy budget (ε = 1) for the strongest privacy guarantees, with less accuracy and somewhat more processing time.
When using Strict or Strictest modes, at most 50.000 rows of the original data source are used as input to the synthetic data generation, and the suggested maximum is 20.000 cells.
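The limits above can be summarized in a small helper. This is an illustration only, not part of the Decentriq SDK:

```python
# Illustrative mapping of privacy modes to their parameters, as documented
# above; epsilon=0 stands for "no formal budget" (standard techniques).
PRIVACY_MODES = {
    "Balanced":  {"epsilon": 0, "max_cells": 50_000_000, "max_input_rows": None},
    "Strict":    {"epsilon": 3, "max_cells": 20_000,     "max_input_rows": 50_000},
    "Strictest": {"epsilon": 1, "max_cells": 20_000,     "max_input_rows": 50_000},
}

def within_suggested_maximum(mode: str, rows: int, columns: int) -> bool:
    """Check a dataset against the suggested maximum cell count for a mode."""
    if rows < 50:
        raise ValueError("at least 50 rows are required")
    return rows * columns <= PRIVACY_MODES[mode]["max_cells"]

print(within_suggested_maximum("Balanced", 1_000_000, 50))  # → True (at the limit)
print(within_suggested_maximum("Strict", 10_000, 6))        # → False (60.000 cells)
```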
Masking
By default, the synthetic data generation algorithm applies differential privacy to all columns, making it extremely unlikely that an individual could be identified from the result. Still, to prevent the values of a specific column from appearing in the result at all, masking must be enabled for that column.
Masking replaces original values with random values before generating synthetic data. Random values of the following types are supported:
Mask type | Generated random value example |
---|---|
Generic string | 07752cd861d462d3c082cc432743c9e604679f51 |
Generic number | 87797244 |
Name | Andrew Perez |
Address | 011 Boyd Fields Apt. 421 |
Postcode | 85393 |
Phone number | 033-758-6384 |
Social Security Number | 033-758-6384 |
Email | lmiller@sampson.biz |
Date | 2014-10-07 |
Timestamp | 1475532549 |
IBAN | GB95PMXP51945293498083 |
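As an illustration of what the "Generic string" and "Generic number" replacements might look like, here is a sketch using only the standard library. The platform's actual generator is not exposed; the shapes are inferred from the examples above:

```python
import hashlib
import random

def mask_generic_string(rng: random.Random) -> str:
    # A 40-hex-character token, similar in shape to the example above.
    return hashlib.sha1(rng.randbytes(16)).hexdigest()

def mask_generic_number(rng: random.Random) -> int:
    # An 8-digit random number, similar in shape to the example above.
    return rng.randint(10_000_000, 99_999_999)

rng = random.Random(0)
print(mask_generic_string(rng))  # 40 hex characters
print(mask_generic_number(rng))  # 8 digits
```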
Artificial data
Example output, based on the data source mentioned above, with masking enabled only for the email column and the Strictest privacy mode:
email | age | attribute1 | attribute2 | attribute3 | attribute4 |
---|---|---|---|---|---|
yolandarobertson@powell.com | 55 | 1 | 0 | 0 | 0 |
desiree72@gmail.com | 15 | 1 | 0 | 0 | 0 |
cynthia50@cruz-white.com | 55 | 1 | 1 | 0 | 1 |
devin42@harris-burch.net | 19 | 0 | 1 | 0 | 1 |
bmiller@berry.com | 25 | 1 | 1 | 0 | 0 |
devin42@harris-burch.net | 15 | 0 | 1 | 0 | 0 |
yolandarobertson@powell.com | 19 | 0 | 1 | 0 | 0 |
yolandarobertson@powell.com | 15 | 1 | 0 | 0 | 0 |
bmiller@berry.com | 32 | 1 | 1 | 0 | 1 |
... | ... | ... | ... | ... | ... |
Quality report
This is a density plot comparing the distributions of the age column in the example above, for both original and synthetic data:
The quality report shows density plots for all columns, as well as a pairwise correlation chart.
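The same kind of per-column comparison can be sketched locally in pure Python by comparing the empirical distributions of a column in the original and synthetic tables. This is illustrative only, not the report's actual metric; the two binary columns are the attribute1 values from the example tables above:

```python
from collections import Counter

def distribution(values):
    """Normalized value counts, i.e. an empirical probability distribution."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def total_variation_distance(original, synthetic):
    """0.0 = identical distributions, 1.0 = completely disjoint."""
    p, q = distribution(original), distribution(synthetic)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0) - q.get(v, 0)) for v in support)

original_attr1 = [1, 0, 0, 0, 1, 0, 0, 1, 0, 1]  # first 10 example input rows
synthetic_attr1 = [1, 1, 1, 0, 1, 0, 0, 1, 1]    # 9 example output rows
print(f"TV distance: {total_variation_distance(original_attr1, synthetic_attr1):.2f}")
# → TV distance: 0.27
```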
Create an SDG computation in the Decentriq UI
1. Access platform.decentriq.com with your credentials.
2. Create a Data Clean Room.
3. In the Computations tab, add a new computation of type Synthetic and give it a name.
4. Select the data source (either a Table or a SQL computation) and the desired privacy mode, then enable masking for sensitive columns and choose the type of random value to replace them with.

Note: When the data source changes (e.g. a new column is added to a table, or a SQL statement changes), the schema of the Synthetic computation needs to be updated. The Decentriq UI then displays a button for that.

5. Review participants' permissions - in some cases, an Analyst might only have access to results based on synthetic data.
6. Press the Test all computations button to check for correct data sources and any schema changes.
7. Once the Data Clean Room is configured with data, computations and permissions, press the Encrypt and publish button.
8. As soon as the Data Clean Room is published, your computation is available in the Overview tab, where you can press Run to get the results and access the quality report.
Create an SDG computation via the Python SDK
First, we set up a connection to an enclave and create a DataRoomBuilder object:
import decentriq_platform as dq
import decentriq_platform.container as dqc
import decentriq_platform.sql as dqsql
from decentriq_platform.proto import (
SyntheticDataConf,
Mask,
Column,
serialize_length_delimited
)
user_email = "test_user@company.com"
api_token = "@@ YOUR TOKEN HERE @@"
client = dq.create_client(user_email, api_token)
enclave_specs = dq.enclave_specifications.versions([
"decentriq.driver:v14",
"decentriq.sql-worker:v10",
"decentriq.python-synth-data-worker-32-64:v12"
])
auth, _ = client.create_auth_using_decentriq_pki(enclave_specs)
session = client.create_session(auth, enclave_specs)
builder = dq.DataRoomBuilder(
"SDG Data Clean Room",
enclave_specs=enclave_specs
)
Then we add a table with the specified schema; this will be the data source holding the original data:
original_data_node_name = "original_data"
schema=[
# The name of the columns, together with their type, and a flag
# for whether values of that column are nullable.
("email", dqsql.PrimitiveType.STRING, False),
("age", dqsql.PrimitiveType.INT64, False),
("attribute1", dqsql.PrimitiveType.INT64, False),
("attribute2", dqsql.PrimitiveType.INT64, False),
("attribute3", dqsql.PrimitiveType.INT64, False),
("attribute4", dqsql.PrimitiveType.INT64, False),
]
# add data node
data_node_builder = dqsql.TabularDataNodeBuilder(
original_data_node_name,
schema=schema
)
data_node_id = data_node_builder.add_to_builder(
builder,
authentication=client.decentriq_pki_authentication,
users=[user_email]
)
To create a Synthetic Data Generation computation, we first need to configure the masking, the privacy mode and, optionally, the quality report.
In this example, only the first column (index=0) is masked, with the type EMAIL.
To enable masking on other columns, simply add a Column() configuration to the list, specifying the index that corresponds to the data source schema.
Set the epsilon variable according to the desired privacy budget (e.g. 1 or 3), or 0 (zero) to use the standard synthetic generation techniques.
To see statistics of the original dataset compared with the synthetic dataset in the quality report, set outputOriginalDataStats to True.
conf = SyntheticDataConf(
columns=[
Column(
index=0,
type=dqsql.proto.ColumnType(primitiveType=dqsql.proto.PrimitiveType.STRING, nullable=False),
mask=Mask(format=Mask.MaskFormat.EMAIL)
)
],
epsilon=1,
outputOriginalDataStats=True
)
config_serialized = serialize_length_delimited(conf)
config_node = dq.StaticContent("synth_data_config", config_serialized)
config_node_id = builder.add_compute_node(config_node)
Now we can add the computation to the Data Clean Room, taking the configuration and dependencies from above.
synth_data = dqc.StaticContainerCompute(
name="my_synthetic_data",
command=[
"generate-synth-data"
],
mount_points=[
dqc.proto.MountPoint(path="input", dependency=original_data_node_name),
dqc.proto.MountPoint(path="config", dependency=config_node_id),
],
output_path="/output",
include_container_logs_on_error=True,
enclave_type="decentriq.python-synth-data-worker-32-64"
)
sdg_node_id = builder.add_compute_node(synth_data)
Set the necessary permissions (in this example, the Data Clean Room only has one user).
builder.add_user_permission(
email=user_email,
authentication_method=client.decentriq_pki_authentication,
permissions=[
dq.Permissions.execute_compute(sdg_node_id),
],
)
All set, the Data Clean Room can be encrypted and published.
data_room = builder.build()
dcr_id = session.publish_data_room(data_room)
Once published, it's possible to upload and provision the dataset, and then run the computation.
For this example we will use a dataset in this format:
user1@email.com,19,1,1,1,0
user2@email.com,25,0,1,0,0
user3@email.com,56,0,1,1,1
user4@email.com,76,0,0,0,0
user5@email.com,44,1,1,1,0
user6@email.com,55,0,1,0,1
user7@email.com,23,0,0,1,0
user8@email.com,34,1,1,0,0
user9@email.com,32,0,0,1,1
user10@email.com,15,1,1,0,0
...
user50@email.com,51,1,0,1,0
To generate synthetic data, a dataset of at least 50 rows is required.
Here's how to encrypt the dataset locally and transmit it to the Decentriq platform:
data = dqsql.read_input_csv_file("/path/to/original_data.csv", has_header=False)
encryption_key = dq.Key()
dataset_id = dqsql.upload_and_publish_tabular_dataset(
data, encryption_key, dcr_id,
table=original_data_node_name,
session=session,
description="dataset provisioned via the SDK",
validate=True
)
Once the original dataset is provisioned, you can run the Synthetic Data Generation computation and get the resulting artificial data.
raw_result = session.run_computation_and_get_results(dcr_id, sdg_node_id)
zip_result = dqc.read_result_as_zipfile(raw_result)
result = zip_result.read("dataset.csv").decode()
print(result)
assert len(result.splitlines()) == 50
This is what the example synthetic data result looks like:
hsharp@gmail.com,44,1,1,0,0
mossjessica@wilkinson.com,32,0,1,1,0
sheri73@gmail.com,19,1,0,1,0
paulmack@turner.com,44,0,1,1,1
michael87@richards.com,32,0,0,1,1
wendynelson@pennington-moore.com,56,0,1,1,0
jchan@gmail.com,34,1,0,1,0
dianeclark@rodriguez.com,32,1,1,1,1
deannacrawford@yahoo.com,19,0,1,0,1
dianeclark@rodriguez.com,56,1,1,1,0
...
When the referenced Data Clean Room was created using the Decentriq UI, the compute_node_id argument of run_computation_and_get_results() will have the format <NODE_ID>_container, where <NODE_ID> corresponds to the value shown when hovering your mouse pointer over the name of that computation.