Required datasets

Dataset types

A Datalab can contain up to four datasets, each serving a specific purpose:

| Dataset | Required | Description | Key columns |
| --- | --- | --- | --- |
| Identifiers | ✔ Required | Lists every identifier attached to each user (matching IDs and activation IDs). | Depends on the chosen schema |
| Segments | Conditional | Maps your internal user ID (userId) to segment labels (segment) for insights and targeting. Each user can have multiple rows with different labels. | userId, segment |
| Demographics | Optional | Maps your internal user ID (userId) to age brackets (age) and gender (gender) for richer analysis. | userId, age, gender |
| Embeddings | Conditional | Maps your internal user ID (userId) to numeric embeddings (emb_x) for AI lookalike modeling, optionally grouped by scope (scope). | userId, scope, emb_x |
Note: All datasets must be in CSV format with UTF-8 encoding, without column headers or index columns.
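
The format rule above can be illustrated with a minimal sketch that writes rows as a headerless, index-free CSV and encodes the result as UTF-8. The helper name `write_headerless_csv` and the sample rows are assumptions made for illustration; they are not part of the platform.

```python
import csv
import io


def write_headerless_csv(rows, stream):
    """Write rows as CSV with no header row and no index column."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Hypothetical Segments rows: (userId, segment).
rows = [
    ("ab0dc82c-f120-49d1-82d4-0ab994e8410c", "News_Reader"),
    ("7913b04c-5457-4d18-828d-46e6395428ab", "Sports_Fan"),
]

buffer = io.StringIO()
write_headerless_csv(rows, buffer)
csv_bytes = buffer.getvalue().encode("utf-8")  # UTF-8 payload, data rows only
```

If the data lives in a pandas DataFrame, `DataFrame.to_csv(path, header=False, index=False, encoding="utf-8")` should produce the same shape of file.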

Dataset requirements by collaboration type

Different Media DCR collaboration types require different datasets. The Identifiers dataset is always required; the table below shows which identifier roles and which optional datasets each collaboration type needs.

| Collaboration type | Matching ID | Activation ID | Segments | Demographics | Embeddings |
| --- | --- | --- | --- | --- | --- |
| Overlap analysis | must have | not needed | not needed | not needed | not needed |
| Insights | must have | not needed | must have | nice to have | not needed |
| Remarketing audiences | must have | must have | not needed | not needed | not needed |
| Rule-based audiences | must have | must have | must have | nice to have | not needed |
| AI lookalike audiences | must have | must have | must have one of ★ | nice to have | must have one of ★ |
Note: ★ For AI lookalike audiences, you must provide the Segments dataset, the Embeddings dataset, or both. When an Embeddings dataset is provided, the lookalike model uses embeddings for training instead of segments.
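
As a rough illustration, the requirements table can be encoded as a lookup and used to check what is still missing for a given collaboration type. The dictionary keys, the level labels ("must", "nice", "none", "one_of"), and the function `missing_inputs` are hypothetical names chosen for this sketch.

```python
# Levels mirror the table: "must" = must have, "nice" = nice to have,
# "none" = not needed, "one_of" = the ★ rule (at least one of the marked inputs).
REQUIREMENTS = {
    "Overlap analysis": {"matching_id": "must", "activation_id": "none",
                         "segments": "none", "demographics": "none", "embeddings": "none"},
    "Insights": {"matching_id": "must", "activation_id": "none",
                 "segments": "must", "demographics": "nice", "embeddings": "none"},
    "Remarketing audiences": {"matching_id": "must", "activation_id": "must",
                              "segments": "none", "demographics": "none", "embeddings": "none"},
    "Rule-based audiences": {"matching_id": "must", "activation_id": "must",
                             "segments": "must", "demographics": "nice", "embeddings": "none"},
    "AI lookalike audiences": {"matching_id": "must", "activation_id": "must",
                               "segments": "one_of", "demographics": "nice", "embeddings": "one_of"},
}


def missing_inputs(collaboration_type, provided):
    """Return the sorted list of required inputs absent from `provided`."""
    rules = REQUIREMENTS[collaboration_type]
    missing = [name for name, level in rules.items()
               if level == "must" and name not in provided]
    one_of = sorted(name for name, level in rules.items() if level == "one_of")
    if one_of and not any(name in provided for name in one_of):
        missing.append(" or ".join(one_of))
    return sorted(missing)
```

For example, `missing_inputs("Insights", {"matching_id"})` returns `["segments"]`.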

Identifiers

The Identifiers dataset is the heart of the Datalab. It lists every identifier attached to each user — for example, a hashed email, a phone number, or a publisher-internal user ID — and records whether each identifier is used for matching, activation, or both.

Identifiers configuration

When you create the Datalab, you configure each identifier with:

  • Identifier name — your label for the identifier. Must be unique within the Datalab. Exactly one identifier must be named userId; this identifier is the join key with the Segments, Demographics, and Embeddings datasets.
  • Identifier type — the format of the identifier (see supported types).
  • Matching ID — whether this identifier is used to match against the seed audience in a Media DCR.
  • Activation ID — whether this identifier can be used as the export ID in audiences (remarketing, rule-based, AI lookalike).

The same identifier can serve any combination of roles. For example, a single internal ID can be named userId and marked as both a matching ID and an activation ID.

Constraints:

  • At least one identifier must be a matching ID.
  • Exactly one identifier must be named userId.
  • For collaboration types that activate users (Remarketing, Rule-based audiences, AI lookalike audiences), at least one identifier must be an activation ID.
  • At most one identifier of each identifier type can be set as a matching ID. Each Media DCR matches on one ID type, so the choice has to be unambiguous.
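
The constraints above can be checked before creating the Datalab. The following is a minimal sketch, not the platform's own validation: the `Identifier` class, the `config_errors` function, and the identifier type strings used in the example are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Collaboration types that activate users, as listed in the constraints.
ACTIVATING_TYPES = {
    "Remarketing audiences",
    "Rule-based audiences",
    "AI lookalike audiences",
}


@dataclass(frozen=True)
class Identifier:
    name: str                 # your label; must be unique within the Datalab
    id_type: str              # hypothetical type label, e.g. "hashedEmail"
    matching: bool = False    # used as a matching ID
    activation: bool = False  # used as an activation ID


def config_errors(identifiers, collaboration_type):
    """Return human-readable violations of the identifier constraints."""
    errors = []
    names = [i.name for i in identifiers]
    if len(set(names)) != len(names):
        errors.append("identifier names must be unique")
    if names.count("userId") != 1:
        errors.append("exactly one identifier must be named userId")
    if not any(i.matching for i in identifiers):
        errors.append("at least one identifier must be a matching ID")
    if collaboration_type in ACTIVATING_TYPES and not any(i.activation for i in identifiers):
        errors.append("at least one identifier must be an activation ID")
    per_type = Counter(i.id_type for i in identifiers if i.matching)
    for id_type, count in per_type.items():
        if count > 1:
            errors.append(f"more than one matching ID of type {id_type}")
    return errors
```

A single internal ID configured as `Identifier("userId", "string", matching=True, activation=True)` passes every check for any collaboration type.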

Identifiers schema

You choose the schema for the Identifiers dataset during Datalab creation:

  • Base schema is always available, supports any number of identifiers per user, and is selected automatically if you bring 3 or more identifiers.
  • Simplified schemas are available when you bring 1 or 2 identifiers, to keep simple cases easy.

Base schema

Three columns: userId, idName, id. The userId column carries the type you configured for the userId identifier; idName and id are strings. All values are non-null.

  • A user can have multiple rows with the same idName (for example, two emails for one user).
  • The id column is unique across the whole table — different users can't share an id, and the same id can't appear under two different idName values.
| userId | idName | id |
| --- | --- | --- |
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | userId | ab0dc82c-f120-49d1-82d4-0ab994e8410c |
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | hashedEmail | 3402d61a92a47021279b8b0d3625a6e84142f5352d381… |
| 7913b04c-5457-4d18-828d-46e6395428ab | userId | 7913b04c-5457-4d18-828d-46e6395428ab |
| 7913b04c-5457-4d18-828d-46e6395428ab | hashedEmail | 9cf17fbe88caad4715c1d7f2cc44901d28eb15bfa561… |
| 7913b04c-5457-4d18-828d-46e6395428ab | rampid | RH-abc123… |

Every user must have a row where idName is userId, since that identifier is the join key with the other datasets.
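
The base schema rules (non-null values, ids unique across the whole table, and a userId row for every user) can be sketched as a pre-upload check. The function name `base_schema_errors` and the short placeholder values in the example are assumptions for illustration only.

```python
def base_schema_errors(rows):
    """Check (userId, idName, id) rows against the base schema rules."""
    errors = []
    seen_ids = set()          # every id value seen so far, across all idNames
    users = set()             # every userId seen
    users_with_own_row = set()  # users that have a row with idName == "userId"
    for line_no, (user_id, id_name, id_value) in enumerate(rows, start=1):
        if not user_id or not id_name or not id_value:
            errors.append(f"row {line_no}: null value")
            continue
        users.add(user_id)
        if id_name == "userId":
            users_with_own_row.add(user_id)
        if id_value in seen_ids:
            errors.append(f"row {line_no}: duplicate id {id_value}")
        seen_ids.add(id_value)
    for user_id in sorted(users - users_with_own_row):
        errors.append(f"user {user_id}: no row with idName userId")
    return errors


# Placeholder values: "u2" shares an id with "u1" and lacks a userId row.
rows = [
    ("u1", "userId", "u1"),
    ("u1", "hashedEmail", "e1"),
    ("u2", "hashedEmail", "e1"),
]
problems = base_schema_errors(rows)
```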

Simplified — one identifier

A single-column file with one value per user. The platform uses this column as the userId and as the matching ID. For collaboration types that require an activation ID (Remarketing, Rule-based audiences, AI lookalike audiences), the same column also serves as the activation ID.

| userId |
| --- |
| 3402d61a92a47021279b8b0d3625a6e84142f5352d381… |
| 9cf17fbe88caad4715c1d7f2cc44901d28eb15bfa561… |

Values must be non-null and unique.

Simplified — two identifiers (one is userId)

A two-column file. One column is the userId (the join key); the other is your second identifier (typically a matching ID such as hashed email).

| userId | hashedEmail |
| --- | --- |
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 3402d61a92a47021279b8b0d3625a6e84142f5352d381… |
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 9cf17fbe88caad4715c1d7f2cc44901d28eb15bfa561… |
| 7913b04c-5457-4d18-828d-46e6395428ab | abc123ef45… |

The userId column is non-unique (a user can have multiple rows). The other column is non-null and unique across the file.
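
A quick way to check the second column of a two-column file is sketched below. It assumes the file content is available as headerless CSV text; the function name `invalid_second_ids` and the placeholder values are hypothetical.

```python
import csv
import io
from collections import Counter


def invalid_second_ids(csv_text):
    """Return second-column values that are empty or appear more than once."""
    reader = csv.reader(io.StringIO(csv_text))
    # A missing second field is treated as an empty (null) value.
    values = [row[1] if len(row) > 1 else "" for row in reader if row]
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if v == "" or n > 1)


# userId may repeat ("u1" twice); the second column must not.
ok_text = "u1,e1\nu1,e2\nu2,e3\n"
bad_text = "u1,e1\nu2,e1\nu3,\n"
```

Here `invalid_second_ids(ok_text)` returns an empty list, while `bad_text` is flagged for the repeated value `e1` and the empty field.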

Segments

This dataset contains the segments a user belongs to. Segments that appear here also appear in the Audience Insights dashboard, so labels must be human-readable.

Segments configuration

  • userId: String | non-null
    • Same ID as the userId identifier in the Identifiers dataset. userId is not unique here: a user can be in more than one segment.
  • segment: String | non-null
    • Categorical variable that's expected to be human-understandable and consistently spelled (e.g., "Tech Enthusiasts" and "Technology Enthusiast" are treated as two separate segments).
    • A minimum of 10 unique segments is required to train the lookalike model.
    • At most 2000 segments can be ingested.
    • The maximum size of the Segments dataset is 200 GB.

The combination of userId and segment, (userId, segment), should be unique — no duplicate rows.
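
The Segments rules above (unique (userId, segment) pairs, at least 10 unique segments for lookalike training, at most 2000 segments ingested) can be summarized in a small report. This is a sketch under the assumption that rows are already loaded as tuples; `segment_report` and the constant names are hypothetical.

```python
from collections import Counter

MIN_SEGMENTS_FOR_LOOKALIKE = 10  # minimum unique segments to train the lookalike model
MAX_SEGMENTS = 2000              # maximum number of segments that can be ingested


def segment_report(rows):
    """Summarize a list of (userId, segment) tuples against the Segments rules."""
    pair_counts = Counter(rows)
    segments = {segment for _, segment in rows}
    return {
        "duplicate_rows": sorted(pair for pair, n in pair_counts.items() if n > 1),
        "segment_count": len(segments),
        "enough_for_lookalike": len(segments) >= MIN_SEGMENTS_FOR_LOOKALIKE,
        "within_ingest_limit": len(segments) <= MAX_SEGMENTS,
    }


rows = [
    ("u1", "News_Reader"),
    ("u1", "Sports_Fan"),
    ("u1", "News_Reader"),  # duplicate (userId, segment) pair
]
report = segment_report(rows)
```

Because labels are compared as exact strings, inconsistent spellings of the same segment are counted as separate segments in `segment_count`.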

Segments schema

| userId | segment |
| --- | --- |
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | News_Reader |
| 7913b04c-5457-4d18-828d-46e6395428ab | Sports_Fan |

Demographics

This dataset expresses the common demographic attributes of age and gender. These attributes aren't used for lookalike modeling.

Demographics configuration

  • userId: String | non-null, unique
    • Same as the userId identifier; matches across all Datalab datasets.
  • age: String | nullable
    • Any format can be used to express a user's age bucket, but it's expected to be human-understandable.
    • The most common approach is to use age buckets that already exist in the publisher taxonomy, such as "18-25" and "26-35".
    • Spelling should be consistent — "18-25" and "18 - 25" are treated as separate age brackets.
    • Decentriq recommends not using more than 10 age buckets to keep bucket sizes large enough.
    • May be left null to express missing information.
  • gender: String | nullable
    • Any format can be used to express gender, but it's expected to be human-understandable and consistently spelled.
    • As with other categorical values, "M" and "Male" are treated as separate genders.
    • Decentriq recommends not using more than 5 genders to keep bucket sizes large enough.
    • May be left null to express missing information.
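
The recommendations above can be checked with a short sketch that lists the distinct age buckets and genders actually present, which also makes inconsistent spellings easy to spot. The function `demographics_report` and its parameter names are assumptions; the default limits of 10 and 5 come from the recommendations in this section.

```python
def demographics_report(rows, max_ages=10, max_genders=5):
    """Summarize (userId, age, gender) rows; age and gender may be None."""
    user_ids = [user_id for user_id, _, _ in rows]
    ages = {age for _, age, _ in rows if age is not None}
    genders = {gender for _, _, gender in rows if gender is not None}
    return {
        "user_ids_unique": len(set(user_ids)) == len(user_ids),
        "age_buckets": sorted(ages),
        "genders": sorted(genders),
        "too_many_age_buckets": len(ages) > max_ages,
        "too_many_genders": len(genders) > max_genders,
    }


rows = [
    ("u1", "18-25", "M"),
    ("u2", "18 - 25", "Male"),  # inconsistent spelling creates extra buckets
    ("u3", None, "M"),          # null age expresses missing information
]
report = demographics_report(rows)
```

In this example the report lists two age buckets and two genders, although the data describes only one of each.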

Demographics schema

| userId | age | gender |
| --- | --- | --- |
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 25-34 | M |
| 7913b04c-5457-4d18-828d-46e6395428ab | 35-44 | F |

Embeddings

This dataset contains user attributes, expressed as multidimensional float vectors, that are used to train the AI lookalike model. The values aren't human-understandable. When an Embeddings dataset is provided, the AI lookalike model uses these embeddings instead of the Segments dataset for training.

Embeddings configuration

  • userId: String | non-null, unique
    • Same as the userId identifier; matches across all Datalab datasets.
    • Unique here: each row corresponds to a single user.
  • scope: String | non-null
    • The scope these embeddings have been trained across.
    • Rows that have the same scope use embeddings that are in the same embedding vector space.
    • Each scope is trained and scored separately.
    • If only a single modeling scope is used, you can fill this field with a dummy value or leave it empty.
    • The common use for this field is when the data provider has data collected from different brands and the attributes don't have the same meaning. In that example, each brand would be its own scope.
    • It can also be used if models were retrained and the vector space changed — the scope could then be the version number of the model.
  • embed_0 to embed_n: Float vector | non-null
    • The number of columns is not fixed; it depends on the dimensionality of your embeddings.
    • Typically used to record an embeddings vector, with one column per dimension.
    • Every user should have the same number of dimensions, and the dimensions should have the same meaning and be in the same space for every user.
    • Nulls can be used and are treated as a value separate from zero — they aren't dropped.

The combination of userId and scope, (userId, scope), should also be unique.
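
Two of the rules above, a consistent number of dimensions and a unique (userId, scope) combination, can be sketched as a simple check. It assumes each row is held in memory as a (userId, scope, vector) tuple with the vector as a list of floats; `embedding_errors` is a hypothetical helper name.

```python
def embedding_errors(rows):
    """Check (userId, scope, vector) rows for duplicates and uneven dimensions."""
    errors = []
    seen = set()   # (userId, scope) pairs seen so far
    dims = None    # dimensionality taken from the first row
    for line_no, (user_id, scope, vector) in enumerate(rows, start=1):
        if (user_id, scope) in seen:
            errors.append(f"row {line_no}: duplicate (userId, scope)")
        seen.add((user_id, scope))
        if dims is None:
            dims = len(vector)
        elif len(vector) != dims:
            errors.append(f"row {line_no}: expected {dims} dimensions, got {len(vector)}")
    return errors


rows = [
    ("u1", "nightlynews", [0.12, 0.83]),
    ("u2", "nightlynews", [0.45, 0.66]),
    ("u2", "nightlynews", [0.45]),  # repeated pair and a short vector
]
problems = embedding_errors(rows)
```

When writing the file, each row would then be flattened to one column per dimension, for example `[user_id, scope, *vector]`.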

Embeddings schema

| userId | scope | emb_1 | emb_2 |
| --- | --- | --- | --- |
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | nightlynews | 0.12 | 0.83 |
| 7913b04c-5457-4d18-828d-46e6395428ab | nightlynews | 0.45 | 0.66 |