Required datasets
Dataset types
A Datalab can contain up to four datasets, each serving a specific purpose:
| Dataset | Required | Type description | Key columns |
|---|---|---|---|
| Identifiers | ✔ Required | Lists every identifier attached to each user (matching IDs and activation IDs). | Depends on the chosen schema |
| Segments | Conditional | Maps your internal user ID (userId) to segment labels (segment) for insights and targeting. Each user can have multiple rows with different labels. | userId, segment |
| Demographics | Optional | Maps your internal user ID (userId) to age brackets (age) and gender (gender) for richer analysis. | userId, age, gender |
| Embeddings | Conditional | Maps your internal user ID (userId) to numeric embeddings (emb_x) for AI lookalike modeling, optionally grouped by scope (scope). | userId, scope, emb_x |
All datasets must be in CSV format with UTF-8 encoding, without column headers or index columns.
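As a minimal sketch of the format rule above, the following Python snippet writes a file with no header row and no index column. The helper name `write_headerless_csv` and the sample rows are illustrative assumptions, not part of the platform.

```python
import csv
import io


def write_headerless_csv(rows, stream):
    """Write rows as CSV with no header row and no index column."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Illustrative Segments rows: (userId, segment).
segments = [
    ("ab0dc82c-f120-49d1-82d4-0ab994e8410c", "News_Reader"),
    ("7913b04c-5457-4d18-828d-46e6395428ab", "Sports_Fan"),
]

buffer = io.StringIO()
write_headerless_csv(segments, buffer)
csv_text = buffer.getvalue()

# To write to disk instead, open the file as UTF-8 (hypothetical file name):
# with open("segments.csv", "w", encoding="utf-8", newline="") as f:
#     write_headerless_csv(segments, f)
```

The first line of the output is already a data row, which is what "without column headers" means in practice.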
Dataset requirements by collaboration type
Different Media DCR collaboration types require different datasets. The Identifiers dataset is always required; the table below shows which identifier roles and which optional datasets each collaboration type needs.
| Collaboration type | Matching ID | Activation ID | Segments | Demographics | Embeddings |
|---|---|---|---|---|---|
| Overlap analysis | must have | not needed | not needed | not needed | not needed |
| Insights | must have | not needed | must have | nice to have | not needed |
| Remarketing audiences | must have | must have | not needed | not needed | not needed |
| Rule-based audiences | must have | must have | must have | nice to have | not needed |
| AI lookalike audiences | must have | must have | must have one of ★ | nice to have | must have one of ★ |
★ For AI lookalike audiences, you must provide at least one of the Segments dataset or the Embeddings dataset. When an Embeddings dataset is provided, the lookalike model uses embeddings for training instead of segments.
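The requirements table can be encoded as a small lookup to check a planned Datalab before upload. This is a sketch only: the dictionary mirrors the table above, while the function name `missing_inputs` and the way inputs are passed in are assumptions made for illustration.

```python
MUST = "must have"
NICE = "nice to have"
NO = "not needed"
ONE_OF = "must have one of"

# Copied from the table above, column by column.
REQUIREMENTS = {
    "Overlap analysis": {
        "Matching ID": MUST, "Activation ID": NO,
        "Segments": NO, "Demographics": NO, "Embeddings": NO,
    },
    "Insights": {
        "Matching ID": MUST, "Activation ID": NO,
        "Segments": MUST, "Demographics": NICE, "Embeddings": NO,
    },
    "Remarketing audiences": {
        "Matching ID": MUST, "Activation ID": MUST,
        "Segments": NO, "Demographics": NO, "Embeddings": NO,
    },
    "Rule-based audiences": {
        "Matching ID": MUST, "Activation ID": MUST,
        "Segments": MUST, "Demographics": NICE, "Embeddings": NO,
    },
    "AI lookalike audiences": {
        "Matching ID": MUST, "Activation ID": MUST,
        "Segments": ONE_OF, "Demographics": NICE, "Embeddings": ONE_OF,
    },
}


def missing_inputs(collaboration_type, provided):
    """Return the required inputs that are absent from `provided`."""
    reqs = REQUIREMENTS[collaboration_type]
    missing = [name for name, level in reqs.items()
               if level == MUST and name not in provided]
    # "Must have one of" inputs are satisfied by any single member.
    one_of = [name for name, level in reqs.items() if level == ONE_OF]
    if one_of and not any(name in provided for name in one_of):
        missing.append(" or ".join(one_of))
    return missing
```

For example, `missing_inputs("AI lookalike audiences", {"Matching ID", "Activation ID"})` reports that either Segments or Embeddings is still needed.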
Identifiers
The Identifiers dataset is the heart of the Datalab. It lists every identifier attached to each user — for example, a hashed email, a phone number, or a publisher-internal user ID — and records whether each identifier is used for matching, activation, or both.
Identifiers configuration
When you create the Datalab, you configure each identifier with:
- Identifier name: your label for the identifier. Must be unique within the Datalab. Exactly one identifier must be named userId; this identifier is the join key with the Segments, Demographics, and Embeddings datasets.
- Identifier type: the format of the identifier (see supported types).
- Matching ID: whether this identifier is used to match against the seed audience in a Media DCR.
- Activation ID: whether this identifier can be used as the export ID in audiences (remarketing, rule-based, AI lookalike).
The same identifier can serve any combination of roles. For example, a single internal ID can be marked as userId, matching ID, and activation ID all at once.
Constraints:
- At least one identifier must be a matching ID.
- Exactly one identifier must be named userId.
- For collaboration types that activate users (Remarketing, Rule-based audiences, AI lookalike audiences), at least one identifier must be an activation ID.
- Only one identifier of the same identifier type can be set as a matching ID. Each Media DCR matches on one ID type, so the choice has to be unambiguous.
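The constraints above can be checked mechanically before creating the Datalab. The sketch below assumes a hypothetical `Identifier` record and uses placeholder type strings such as `"hashed_email"`; the platform's actual type names are listed under supported types and may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Identifier:
    name: str
    id_type: str          # placeholder type label, not an official value
    matching: bool = False
    activation: bool = False


def config_errors(identifiers, needs_activation):
    """Return a list of violated constraints (empty if the config is valid)."""
    errors = []
    names = [i.name for i in identifiers]
    if len(set(names)) != len(names):
        errors.append("identifier names must be unique")
    if names.count("userId") != 1:
        errors.append("exactly one identifier must be named userId")
    if not any(i.matching for i in identifiers):
        errors.append("at least one identifier must be a matching ID")
    if needs_activation and not any(i.activation for i in identifiers):
        errors.append("at least one identifier must be an activation ID")
    # Only one matching ID is allowed per identifier type.
    per_type = Counter(i.id_type for i in identifiers if i.matching)
    for id_type, count in per_type.items():
        if count > 1:
            errors.append(f"more than one matching ID of type {id_type}")
    return errors
```

Set `needs_activation` to `True` for Remarketing, Rule-based audiences, and AI lookalike audiences.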
Identifiers schema
You choose the schema for the Identifiers dataset during Datalab creation:
- Base schema is always available, supports any number of identifiers per user, and is selected automatically if you bring 3 or more identifiers.
- Simplified schemas are available when you bring 1 or 2 identifiers, to keep simple cases easy.
Base schema
Three columns: userId, idName, id. The userId column carries the type you configured for the userId identifier; idName and id are strings. All values are non-null.
- A user can have multiple rows with the same idName (for example, two emails for one user).
- The id column is unique across the whole table: different users can't share an id, and the same id can't appear under two different idName values.
| userId | idName | id |
|---|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | userId | ab0dc82c-f120-49d1-82d4-0ab994e8410c |
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | hashedEmail | 3402d61a92a47021279b8b0d3625a6e84142f5352d381… |
| 7913b04c-5457-4d18-828d-46e6395428ab | userId | 7913b04c-5457-4d18-828d-46e6395428ab |
| 7913b04c-5457-4d18-828d-46e6395428ab | hashedEmail | 9cf17fbe88caad4715c1d7f2cc44901d28eb15bfa561… |
| 7913b04c-5457-4d18-828d-46e6395428ab | rampid | RH-abc123… |
Every user must have a row where idName is userId, since that identifier is the join key with the other datasets.
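The base schema rules (non-null values, a globally unique id column, and a userId row for every user) lend themselves to a quick pre-upload check. The following is a sketch under the assumption that rows have already been parsed into `(userId, idName, id)` tuples; the function name is hypothetical.

```python
def base_schema_errors(rows):
    """Check (userId, idName, id) rows against the base schema rules."""
    errors = []
    seen_ids = set()
    all_users = set()
    users_with_userid_row = set()
    for line_no, (user_id, id_name, id_value) in enumerate(rows, start=1):
        # All three columns must be non-null.
        if not user_id or not id_name or not id_value:
            errors.append(f"row {line_no}: empty value")
            continue
        # The id column must be unique across the whole table.
        if id_value in seen_ids:
            errors.append(f"row {line_no}: duplicate id {id_value}")
        seen_ids.add(id_value)
        all_users.add(user_id)
        if id_name == "userId":
            users_with_userid_row.add(user_id)
    # Every user needs a row whose idName is userId.
    for user_id in sorted(all_users - users_with_userid_row):
        errors.append(f"user {user_id}: no row with idName userId")
    return errors
```

An empty list means the rows passed these three checks; it does not guarantee that the platform will accept the file.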
Simplified — one identifier
A single-column file with one value per user. The platform uses this column as the userId and as the matching ID. For collaboration types that require an activation ID (Remarketing, Rule-based audiences, AI lookalike audiences), the same column also serves as the activation ID.
| userId |
|---|
| 3402d61a92a47021279b8b0d3625a6e84142f5352d381… |
| 9cf17fbe88caad4715c1d7f2cc44901d28eb15bfa561… |
Values must be non-null and unique.
Simplified — two identifiers (one is userId)
A two-column file. One column is the userId (the join key); the other is your second identifier (typically a matching ID such as hashed email).
| userId | hashedEmail |
|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 3402d61a92a47021279b8b0d3625a6e84142f5352d381… |
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 9cf17fbe88caad4715c1d7f2cc44901d28eb15bfa561… |
| 7913b04c-5457-4d18-828d-46e6395428ab | abc123ef45… |
The userId column is non-unique (a user can have multiple rows). The other column is non-null and unique across the file.
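To see how the two-column layout relates to the base schema, the sketch below expands `(userId, secondId)` pairs into `(userId, idName, id)` rows. It is an illustration of the correspondence only and makes no claim about what the platform does internally; the function name and the `second_id_name` parameter are assumptions.

```python
def to_base_rows(pairs, second_id_name):
    """Expand (userId, otherId) pairs into base-schema style rows."""
    rows = []
    seen_users = set()
    for user_id, other_id in pairs:
        # Emit the userId row once per user, the first time the user appears.
        if user_id not in seen_users:
            seen_users.add(user_id)
            rows.append((user_id, "userId", user_id))
        rows.append((user_id, second_id_name, other_id))
    return rows


pairs = [("u1", "e1"), ("u1", "e2"), ("u2", "e3")]
base_rows = to_base_rows(pairs, "hashedEmail")
```

Here user `u1` has two hashed emails, so the expanded form contains one userId row and two hashedEmail rows for that user.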
Segments
This dataset contains the segments a user belongs to. Segments that appear here also appear in the Audience Insights dashboard, so labels must be human-readable.
Segments configuration
- userId: String | non-null
  - Same ID as the userId identifier in the Identifiers dataset. userId is not unique here: a user can be in more than one segment.
- segment: String | non-null
  - Categorical variable that's expected to be human-understandable and consistently spelled (e.g., "Tech Enthusiasts" and "Technology Enthusiast" are treated as two separate segments).
  - A minimum of 10 unique segments is required to train the lookalike model.
  - A maximum of 2000 segments can be ingested.
  - The maximum size of the Segments dataset is 200 GB.
The combination of userId and segment, (userId, segment), should be unique — no duplicate rows.
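A short check for the Segments rules (no duplicate pairs, at least 10 unique segments for lookalike training, at most 2000 segments) might look like the sketch below. It assumes rows are `(userId, segment)` tuples already loaded into memory, which is only practical for small samples; the function and constant names are hypothetical.

```python
from collections import Counter

MIN_SEGMENTS_FOR_LOOKALIKE = 10   # from the Segments configuration
MAX_SEGMENTS = 2000               # from the Segments configuration


def segment_report(rows):
    """Summarise duplicate rows and segment counts for (userId, segment) rows."""
    pair_counts = Counter(rows)
    duplicates = sorted(pair for pair, n in pair_counts.items() if n > 1)
    segments = {segment for _, segment in pair_counts}
    return {
        "duplicate_rows": duplicates,
        "unique_segments": len(segments),
        "enough_for_lookalike": len(segments) >= MIN_SEGMENTS_FOR_LOOKALIKE,
        "within_ingestion_limit": len(segments) <= MAX_SEGMENTS,
    }
```

Because segment labels are compared as exact strings, "Tech Enthusiasts" and "Technology Enthusiast" count as two segments in this report, just as they would after ingestion.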
Segments schema
| userId | segment |
|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | News_Reader |
| 7913b04c-5457-4d18-828d-46e6395428ab | Sports_Fan |
| … | … |
Demographics
This dataset records the demographic attributes age and gender for each user. These attributes aren't used for lookalike modeling.
Demographics configuration
- userId: String | non-null, unique
  - Same as the userId identifier; matches across all Datalab datasets.
- age: String | nullable
  - Any format can be used to express a user's age bucket, but it's expected to be human-understandable.
  - The most common approach is to use age buckets that already exist in the publisher taxonomy, such as "18-25" and "26-35".
  - Spelling should be consistent: "18-25" and "18 - 25" are treated as separate age brackets.
  - Decentriq recommends using no more than 10 age buckets to keep bucket sizes large enough.
  - May be left null to express missing information.
- gender: String | nullable
  - Any format can be used to express gender, but it's expected to be human-understandable and consistently spelled.
  - As with other categorical values, "M" and "Male" are treated as separate genders.
  - Decentriq recommends using no more than 5 genders to keep bucket sizes large enough.
  - May be left null to express missing information.
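Inconsistent spellings are easiest to spot by listing the distinct values in each column. The sketch below assumes rows are `(userId, age, gender)` tuples in which missing values are `None` or empty strings; the function name and the threshold constants, which restate the recommendations above, are illustrative.

```python
RECOMMENDED_MAX_AGE_BUCKETS = 10
RECOMMENDED_MAX_GENDERS = 5


def demographics_report(rows):
    """List distinct age and gender values and flag basic problems."""
    user_ids = [user_id for user_id, _, _ in rows]
    # Missing values (None or "") are allowed and ignored here.
    ages = {age for _, age, _ in rows if age}
    genders = {gender for _, _, gender in rows if gender}
    return {
        "duplicate_user_ids": len(user_ids) != len(set(user_ids)),
        "age_buckets": sorted(ages),
        "genders": sorted(genders),
        "too_many_age_buckets": len(ages) > RECOMMENDED_MAX_AGE_BUCKETS,
        "too_many_genders": len(genders) > RECOMMENDED_MAX_GENDERS,
    }
```

Reviewing `age_buckets` and `genders` by eye reveals near-duplicates such as "18-25" and "18 - 25", which would otherwise be treated as separate brackets.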
Demographics schema
| userId | age | gender |
|---|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 25-34 | M |
| 7913b04c-5457-4d18-828d-46e6395428ab | 35-44 | F |
| … | … | … |
Embeddings
This dataset contains user attributes, expressed as multidimensional float vectors, that are used to train the AI lookalike model. The values are not human-understandable. When an Embeddings dataset is provided, the AI lookalike model uses these embeddings instead of the Segments dataset for training.
Embeddings configuration
- userId: String | non-null, unique
  - Same as the userId identifier; matches across all Datalab datasets.
  - Unique here: each row corresponds to a single user.
- scope: String | non-null
  - The scope these embeddings have been trained across.
  - Rows that have the same scope use embeddings that are in the same embedding vector space.
  - Each scope is trained and scored separately.
  - If only a single modeling scope is used, you can set this field to a dummy value or leave it empty.
  - The common use for this field is when the data provider has data collected from different brands and the attributes don't have the same meaning. In that example, each brand would be its own scope.
  - It can also be used if models were retrained and the vector space changed; the scope could then be the version number of the model.
- embed_0 to embed_n: Float vector | non-null
  - This is an indeterminate number of columns.
  - Typically used to record an embeddings vector, with one column per dimension.
  - Every user should have the same number of dimensions, and the dimensions should have the same meaning and be in the same space for every user.
  - Nulls can be used and are treated as a value separate from zero; they aren't dropped.
The combination of userId and scope, (userId, scope), should also be unique.
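Two of the Embeddings rules are easy to verify before upload: every row has the same number of dimensions, and each `(userId, scope)` pair appears only once. The sketch below assumes each row is a sequence of the form `[userId, scope, v1, v2, ...]`; the function name is hypothetical and null handling is left out.

```python
def embedding_errors(rows):
    """Check dimension counts and (userId, scope) uniqueness."""
    errors = []
    seen_pairs = set()
    expected_dims = None
    for line_no, row in enumerate(rows, start=1):
        user_id, scope, *vector = row
        if (user_id, scope) in seen_pairs:
            errors.append(f"row {line_no}: duplicate (userId, scope) pair")
        seen_pairs.add((user_id, scope))
        # The first row fixes the expected number of dimensions.
        if expected_dims is None:
            expected_dims = len(vector)
        elif len(vector) != expected_dims:
            errors.append(
                f"row {line_no}: expected {expected_dims} dimensions, "
                f"found {len(vector)}"
            )
    return errors
```

The check only compares vector lengths. Whether the dimensions share the same meaning within a scope depends on how the embeddings were trained and cannot be verified from the file alone.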
Embeddings schema
| userId | scope | emb_1 | emb_2 | … |
|---|---|---|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | nightlynews | 0.12 | 0.83 | … |
| 7913b04c-5457-4d18-828d-46e6395428ab | nightlynews | 0.45 | 0.66 | … |
| … | … | … | … | … |