Synthetic data generation

Generate a differentially private synthetic copy of any table. The copy has statistical properties similar to those of the original, and a dedicated report produced by the platform compares the two. This allows you to prototype your scripts locally before running them on the original data in a dedicated Data Clean Room.

Current version

decentriq.python-synth-data-worker-32-64:v18

How it works

Synthetic data generation takes a data source as input, along with a privacy mode and a masking configuration, and produces artificial data with the same schema and with statistical properties similar to those of the original data source. A quality report can optionally be produced to assess the similarity.

Differential privacy is applied during the synthetic data generation to ensure privacy in the result.

Data source

A tabular dataset: either a Table or a SQL computation result, with at least 50 rows.

Suggested maximum dataset sizes:

  • 50,000,000 cells (e.g. 50 columns, 1,000,000 rows) for privacy mode Balanced
  • 20,000 cells (e.g. 5 columns, 4,000 rows) for privacy modes Strict and Strictest

Example:

email             age  attribute1  attribute2  attribute3  attribute4
user1@email.com   19   1           1           1           0
user2@email.com   25   0           1           0           0
user3@email.com   56   0           1           1           1
user4@email.com   76   0           0           0           0
user5@email.com   44   1           1           1           0
user6@email.com   55   0           1           0           1
user7@email.com   23   0           0           1           0
user8@email.com   34   1           1           0           0
user9@email.com   32   0           0           1           1
user10@email.com  15   1           1           0           0
...               ...  ...         ...         ...         ...
user50@email.com  51   1           0           1           0

Note: To generate synthetic data, a dataset of at least 50 rows is required.
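The size guidance above can be checked locally before uploading a dataset. The following is a minimal sketch, assuming the limits are exactly the ones listed in this section; `check_dataset_size` is a hypothetical helper and is not part of the Decentriq SDK.

```python
# Hypothetical pre-upload check based on the limits stated in this section.
MIN_ROWS = 50  # minimum number of rows required for synthetic data generation

# Suggested maximum number of cells (rows x columns) per privacy mode.
MAX_CELLS = {
    "Balanced": 50_000_000,
    "Strict": 20_000,
    "Strictest": 20_000,
}


def check_dataset_size(n_rows: int, n_columns: int, privacy_mode: str) -> list:
    """Return a list of warnings; an empty list means the size looks fine."""
    warnings = []
    if n_rows < MIN_ROWS:
        warnings.append(
            f"dataset has {n_rows} rows; at least {MIN_ROWS} are required"
        )
    cells = n_rows * n_columns
    limit = MAX_CELLS[privacy_mode]
    if cells > limit:
        warnings.append(
            f"dataset has {cells} cells; suggested maximum for "
            f"{privacy_mode} is {limit}"
        )
    return warnings


# 5 columns x 4,000 rows is within the suggested maximum for Strict.
print(check_dataset_size(4_000, 5, "Strict"))
# 6 columns x 10 rows is below the 50-row minimum.
print(check_dataset_size(10, 6, "Strictest"))
```

The maximum sizes are suggestions rather than hard limits, so the helper returns warnings instead of raising an error.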

Privacy mode

The privacy mode sets the epsilon (ε) parameter, which affects both the processing speed and the level of privacy.

Balanced uses standard synthetic data techniques for best accuracy and quickest results.

Strict uses a formal privacy budget (ε = 3) for extremely strong privacy guarantees, with somewhat less accuracy and at the cost of increased processing time.

Strictest uses a formal privacy budget (ε = 1) for strongest privacy guarantees, with less accuracy and somewhat more processing time.

Note: When using the Strict or Strictest mode, at most 50,000 rows of the original data source are used as input to the synthetic data generation, and the suggested maximum is 20,000 cells.
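The relation between privacy modes and epsilon can be written down as a small lookup. This is a sketch only: the values for Strict and Strictest come from this section, while mapping Balanced to 0 is an assumption based on the SDK section of this page, which states that an epsilon of 0 selects the standard synthetic generation techniques. The names `EPSILON_BY_MODE` and `rows_used` are hypothetical.

```python
# Epsilon per privacy mode. Balanced -> 0 is an assumption (see lead-in).
EPSILON_BY_MODE = {
    "Balanced": 0,   # standard synthetic data techniques
    "Strict": 3,     # formal privacy budget, epsilon = 3
    "Strictest": 1,  # formal privacy budget, epsilon = 1
}

# Row cap that applies to the Strict and Strictest modes.
MAX_INPUT_ROWS_FORMAL_MODES = 50_000


def rows_used(mode: str, n_rows: int) -> int:
    """Number of input rows taken into account for the given mode."""
    if EPSILON_BY_MODE[mode] == 0:
        return n_rows
    return min(n_rows, MAX_INPUT_ROWS_FORMAL_MODES)


for mode in EPSILON_BY_MODE:
    print(mode, EPSILON_BY_MODE[mode], rows_used(mode, 120_000))
```

A smaller epsilon means a tighter privacy budget, which is why Strictest (ε = 1) gives stronger guarantees than Strict (ε = 3).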

Masking

By default, the synthetic data generation algorithm applies differential privacy to all columns, so that it is extremely unlikely that an individual could be identified from the result. However, to prevent the values of a specific column from appearing in the result, masking must be enabled for that column.

Masking replaces original values with random values before generating synthetic data. Random values of the following types are supported:

Mask type               Generated random value example
Generic string          07752cd861d462d3c082cc432743c9e604679f51
Generic number          87797244
Name                    Andrew Perez
Address                 011 Boyd Fields Apt. 421
Postcode                85393
Phone number            033-758-6384
Social Security Number  033-758-6384
Email                   lmiller@sampson.biz
Date                    2014-10-07
Timestamp               1475532549
IBAN                    GB95PMXP51945293498083
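To illustrate the idea of masking, the sketch below replaces every value of one column with a random value before any further processing. It is not the platform's implementation: the two generators only imitate the shape of the "Generic string" and "Generic number" examples in the table above, and all function names are hypothetical.

```python
import random
import secrets


def random_generic_string() -> str:
    # 40 hexadecimal characters, the same shape as the "Generic string" example.
    return secrets.token_hex(20)


def random_generic_number() -> int:
    # 8-digit integer, the same shape as the "Generic number" example.
    return random.randint(10_000_000, 99_999_999)


def mask_column(rows, index, make_value):
    """Return a copy of rows with the column at `index` replaced by random values."""
    masked = []
    for row in rows:
        new_row = list(row)          # copy, so the input rows stay untouched
        new_row[index] = make_value()
        masked.append(new_row)
    return masked


original = [["user1@email.com", 19], ["user2@email.com", 25]]
masked = mask_column(original, 0, random_generic_string)
print(masked)
```

Because the original values are replaced before the synthetic data is generated, they cannot appear in the output, whatever the privacy mode.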

Artificial data

Example output, based on the data source above, with masking enabled only for the email column and the Strictest privacy mode:

email                        age  attribute1  attribute2  attribute3  attribute4
yolandarobertson@powell.com  55   1           0           0           0
desiree72@gmail.com          15   1           0           0           0
cynthia50@cruz-white.com     55   1           1           0           1
devin42@harris-burch.net     19   0           1           0           1
bmiller@berry.com            25   1           1           0           0
devin42@harris-burch.net     15   0           1           0           0
yolandarobertson@powell.com  19   0           1           0           0
yolandarobertson@powell.com  15   1           0           0           0
bmiller@berry.com            32   1           1           0           1
...                          ...  ...         ...         ...         ...

Quality report

This is a density plot comparing the distributions of the age column in the example above, for both original and synthetic data:

[Figure: SDG quality report]

The quality report shows density plots for all columns, as well as a pairwise correlation chart.
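A rough local counterpart of such a density comparison can be computed with the standard library. The sketch below is not the platform's report: it approximates a density plot with a normalised histogram, uses only the age values visible in the example tables on this page, and the bin width of 10 and the helper name `relative_frequencies` are arbitrary choices made for illustration.

```python
from collections import Counter


def relative_frequencies(values, bin_width=10):
    """Histogram of `values` in bins of `bin_width`, normalised to sum to 1."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = len(values)
    return {start: counts[start] / total for start in sorted(counts)}


# Age values of the rows shown in the example tables (not the full datasets).
original_ages = [19, 25, 56, 76, 44, 55, 23, 34, 32, 15]
synthetic_ages = [55, 15, 55, 19, 25, 15, 19, 15, 32]

original_density = relative_frequencies(original_ages)
synthetic_density = relative_frequencies(synthetic_ages)

# Print both distributions side by side, one line per age bin.
for start in sorted(set(original_density) | set(synthetic_density)):
    print(
        f"{start}-{start + 9}: "
        f"original {original_density.get(start, 0):.2f}, "
        f"synthetic {synthetic_density.get(start, 0):.2f}"
    )
```

With so few rows the two histograms differ noticeably; the comparison becomes meaningful only on the full datasets.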

Create an SDG computation in the Decentriq UI

  1. Access platform.decentriq.com with your credentials

  2. Create a Data Clean Room

  3. In the Computations tab, add a new computation of type Synthetic and give it a name:

    [Screenshot: Create SDG computation]

  4. Select the data source (either a table or a SQL computation) and the desired privacy mode. Enable masking for sensitive columns and choose the type of random value that replaces the original values.

    [Screenshot: Configure SDG computation]

    Note: When the data source changes (e.g. a new column is added to a table, or the SQL statement is changed), the Synthetic computation schema needs to be updated. The Decentriq UI then displays a button for this.

  5. Review the participants' permissions. In some cases, an Analyst might only have access to results based on synthetic data.

    [Screenshot: SDG permissions]

  6. Press the Test all computations button to check that the data sources are correct and to detect any schema changes.

  7. Once the Data Clean Room is configured with data, computations and permissions, press the Encrypt and publish button.

  8. As soon as the Data Clean Room is published, your computation should be available in the Overview tab, where you can press Run to get the results and access the quality report.

    [Screenshot: Run SDG computation]

Create an SDG computation via the Python SDK

First, we set up a connection to an enclave and create an AnalyticsDcrBuilder object:

import decentriq_platform as dq
from decentriq_platform.analytics import AnalyticsDcrBuilder

user_email = "@@ YOUR EMAIL HERE @@"
api_token = "@@ YOUR TOKEN HERE @@"

# Connect to the Decentriq platform
client = dq.create_client(user_email, api_token)

# Configure the name and the owner of the Data Clean Room
builder = AnalyticsDcrBuilder(client=client)
builder.\
    with_name("SDG Data Clean Room").\
    with_owner(user_email)
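Instead of pasting the credentials into the script, they can be read from the environment. This is a minimal sketch using only the standard library; the variable names `DECENTRIQ_USER_EMAIL` and `DECENTRIQ_API_TOKEN` and the helper `load_credentials` are hypothetical and are not defined by the SDK.

```python
import os


def load_credentials(environ=os.environ):
    """Return (user_email, api_token) read from environment variables.

    The variable names are an assumption made for this example.
    """
    try:
        return environ["DECENTRIQ_USER_EMAIL"], environ["DECENTRIQ_API_TOKEN"]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None


# Demonstration with an explicit mapping instead of the real environment.
email, token = load_credentials(
    {"DECENTRIQ_USER_EMAIL": "me@example.com", "DECENTRIQ_API_TOKEN": "secret"}
)
print(email)
```

The returned values can then be passed to dq.create_client(user_email, api_token) exactly as in the snippet above.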

Then we add a table with the specified schema, which will be the data source holding the original data:

from decentriq_platform.analytics import Column, FormatType, TableDataNodeDefinition

# Schema of the original data: one string column and five integer columns
columns = [
    Column(
        name="email",
        format_type=FormatType.STRING,
        is_nullable=False,
    ),
    Column(
        name="age",
        format_type=FormatType.INTEGER,
        is_nullable=False,
    ),
    Column(
        name="attribute1",
        format_type=FormatType.INTEGER,
        is_nullable=False,
    ),
    Column(
        name="attribute2",
        format_type=FormatType.INTEGER,
        is_nullable=False,
    ),
    Column(
        name="attribute3",
        format_type=FormatType.INTEGER,
        is_nullable=False,
    ),
    Column(
        name="attribute4",
        format_type=FormatType.INTEGER,
        is_nullable=False,
    ),
]

# Table node that will hold the original data
builder.add_node_definition(
    TableDataNodeDefinition(
        name="original_data", columns=columns, is_required=True
    )
)

To create a Synthetic Data Generation computation, we first need to configure the masking, privacy mode and optionally the quality report.

Note: In this example, only the first column (index=0) is masked, with the type EMAIL.

To enable masking on other columns, add a SyntheticNodeColumn() configuration to the list for each of them, specifying the index that corresponds to the column's position in the data source schema.

Set the epsilon argument according to the desired privacy budget (e.g. 1 or 3), or to 0 (zero) to use the standard synthetic generation techniques.

To include statistics of the original dataset in the quality report, compared with those of the synthetic dataset, set output_original_data_statistics to True.

from decentriq_platform.analytics import MaskType, PrimitiveType, SyntheticNodeColumn

# Mask only the first column (index 0) with random email addresses
synthetic_columns = [
    SyntheticNodeColumn(
        data_type=PrimitiveType.STRING,
        index=0,
        mask_type=MaskType.EMAIL,
        should_mask_column=True,
        is_nullable=False,
    ),
]
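Since each SyntheticNodeColumn refers to a column by its position, it can help to derive the index from the column name rather than counting by hand. The helper below is a plain-Python sketch, independent of the SDK; `masked_column_indexes` is a hypothetical name, and the schema list simply repeats the column order of the table defined earlier.

```python
# Column order of the "original_data" table defined above.
SCHEMA = ["email", "age", "attribute1", "attribute2", "attribute3", "attribute4"]


def masked_column_indexes(schema, names_to_mask):
    """Map each column name to its zero-based index in the schema."""
    indexes = {}
    for name in names_to_mask:
        if name not in schema:
            raise ValueError(f"unknown column: {name!r}")
        indexes[name] = schema.index(name)
    return indexes


indexes = masked_column_indexes(SCHEMA, ["email"])
print(indexes)
```

The resulting value, for example indexes["email"], can be passed as the index argument of SyntheticNodeColumn in the configuration above.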

Now we can add the computation to the Data Clean Room, taking the configuration and dependencies from above.

from decentriq_platform.analytics import SyntheticDataComputeNodeDefinition

builder.add_node_definition(
    SyntheticDataComputeNodeDefinition(
        name="my_synthetic_data",
        columns=synthetic_columns,
        dependency="original_data",
        epsilon=1,
        output_original_data_statistics=True
    )
)

Set the necessary permissions (in this example, the Data Clean Room only has one user).

builder.add_participant(
    user_email,
    data_owner_of=["original_data"],
    analyst_of=["my_synthetic_data"]
)

All set, the Data Clean Room can be encrypted and published.

dcr_definition = builder.build()
dcr = client.publish_analytics_dcr(dcr_definition)

data_room_id = dcr.id

Once published, it's possible to upload and provision the dataset, and then run the computation. Please follow the steps from the section Provision datasets via SDK in the MISSING LINK guide.

For this example we will use a dataset in this format:

user1@email.com,19,1,1,1,0
user2@email.com,25,0,1,0,0
user3@email.com,56,0,1,1,1
user4@email.com,76,0,0,0,0
user5@email.com,44,1,1,1,0
user6@email.com,55,0,1,0,1
user7@email.com,23,0,0,1,0
user8@email.com,34,1,1,0,0
user9@email.com,32,0,0,1,1
user10@email.com,15,1,1,0,0
...
user50@email.com,51,1,0,1,0

Note: To generate synthetic data, a dataset of at least 50 rows is required.

Once the original dataset is provisioned, you can run the Synthetic Data Generation computation and get the resulting artificial data.

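from decentriq_platform import Key
import io

# Build an in-memory CSV file with 100 rows (at least 50 rows are required)
data = io.BytesIO("\n".join([f"user{i}@email.com,19,1,1,1,0" for i in range(100)]).encode())

# Upload the dataset and provision it to the table node
data_node = dcr.get_node("original_data")
data_node.upload_and_publish_dataset(data, key=Key(), name="my_salary_data.csv")

# Run the computation and fetch the synthetic data as a string
synth_node = dcr.get_node("my_synthetic_data")
result = synth_node.run_computation_and_get_results_as_string()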

This is what the example synthetic data result looks like:

hsharp@gmail.com,44,1,1,0,0
mossjessica@wilkinson.com,32,0,1,1,0
sheri73@gmail.com,19,1,0,1,0
paulmack@turner.com,44,0,1,1,1
michael87@richards.com,32,0,0,1,1
wendynelson@pennington-moore.com,56,0,1,1,0
jchan@gmail.com,34,1,0,1,0
dianeclark@rodriguez.com,32,1,1,1,1
deannacrawford@yahoo.com,19,0,1,0,1
dianeclark@rodriguez.com,56,1,1,1,0
...
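
The result string can be turned into structured rows for local prototyping. The sketch below assumes that the result is headerless CSV text in the column order of the original table, as in the example output above; `parse_synthetic_csv` is a hypothetical helper, and the sample string reuses two rows from that output.

```python
import csv
import io

# Column order of the original table (the result shown above has no header row).
COLUMN_NAMES = ["email", "age", "attribute1", "attribute2", "attribute3", "attribute4"]


def parse_synthetic_csv(result: str):
    """Parse headerless CSV text into a list of dicts keyed by column name."""
    rows = []
    for record in csv.reader(io.StringIO(result)):
        if not record:
            continue  # skip blank lines
        email, *numbers = record
        row = {"email": email}
        # All remaining columns are integers in this example schema.
        row.update(zip(COLUMN_NAMES[1:], (int(n) for n in numbers)))
        rows.append(row)
    return rows


sample = "hsharp@gmail.com,44,1,1,0,0\nmossjessica@wilkinson.com,32,0,1,1,0\n"
rows = parse_synthetic_csv(sample)
print(rows[0]["email"], rows[0]["age"])
```

In a real script, the result variable returned by run_computation_and_get_results_as_string() would be passed instead of sample.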

Note: When the referenced Data Clean Room was created using the Decentriq UI, the name argument of get_node() should be the node name shown in the UI's Overview tab.