Dataset types and formatting guide

Dataset types

A Datalab can contain up to four different datasets, each serving a specific purpose:

Dataset	Required	Type description	Key columns
Matching	✔ Required	Maps your internal user ID (userId) to matching identifiers (matchingId) such as hashed email (see supported types) used to join datasets in the DCR.	userId, matchingId
Segments	✔ Required	Maps your internal user ID (userId) to segment labels (segment) for insights and targeting. Each user can have multiple rows with different labels.	userId, segment
Demographics	Optional	Maps your internal user ID (userId) to age brackets (age) and gender (gender) for richer analysis.	userId, age, gender
Embeddings	Optional	Maps your internal user ID (userId) to numeric embeddings (emb_x) for AI lookalike modeling, optionally grouped by scope (scope).	userId, scope, emb_x

Important: All datasets must be in CSV format with UTF-8 encoding, without column headers or index columns.
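As an illustration of the format rule above, here is a minimal sketch that writes rows as a headerless, index-free CSV using only the Python standard library. The comma delimiter, the helper name `write_headerless_csv`, and the sample IDs are assumptions made for this example; they are not taken from the Decentriq documentation.

```python
import csv
import io


def write_headerless_csv(rows, stream):
    """Write rows as CSV with no header row and no index column."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Hypothetical Matching rows in column order: userId, matchingId.
rows = [
    ("user-001", "hash-aaa"),
    ("user-002", "hash-bbb"),
]

buffer = io.StringIO()
write_headerless_csv(rows, buffer)
csv_text = buffer.getvalue()

# Encode explicitly as UTF-8 before writing the bytes to disk or uploading.
csv_bytes = csv_text.encode("utf-8")
```

When writing straight to a file instead, `open(path, "w", encoding="utf-8", newline="")` gives the same result. If the data is exported from a dataframe library, make sure its header and index output are both switched off.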

Dataset requirements by collaboration type

Different Media DCR collaboration types require different datasets:

Collaboration type	Matching	Segments	Demographics	Embeddings
Overlap analysis	must have	not needed	not needed	not needed
Insights	must have	must have	nice to have	not needed
Remarketing audiences	must have	not needed	not needed	not needed
Rule-based audiences	must have	must have	nice to have	not needed
AI lookalike audiences	must have	must have	not needed	nice to have
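
The requirements table above can also be written as a small lookup, which is handy for checking a Datalab before it is provisioned. This is only a sketch: the dictionary transcribes the "must have" cells of the table, and the function name `missing_datasets` is hypothetical.

```python
# "must have" datasets per collaboration type, transcribed from the table above.
REQUIRED = {
    "Overlap analysis": {"Matching"},
    "Insights": {"Matching", "Segments"},
    "Remarketing audiences": {"Matching"},
    "Rule-based audiences": {"Matching", "Segments"},
    "AI lookalike audiences": {"Matching", "Segments"},
}

# "nice to have" datasets per collaboration type.
OPTIONAL = {
    "Insights": {"Demographics"},
    "Rule-based audiences": {"Demographics"},
    "AI lookalike audiences": {"Embeddings"},
}


def missing_datasets(collaboration_type, provided):
    """Return the required datasets that are absent from `provided`."""
    return sorted(REQUIRED[collaboration_type] - set(provided))


# Example: an Insights collaboration with only a Matching dataset.
missing_for_insights = missing_datasets("Insights", ["Matching"])
```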

Detailed formatting guide and examples

Matching table

The table is used to map a matchingId to a userId.

Fields:

userId — String | non-null, non-unique
- This is an ID assigned by the data owner. For publishers, it is assigned to each user visiting their properties. It is typically generated by a DMP and derived from a 1st party cookie ID.
- It needs to match across all the Datalab tables (e.g., a given user should have the same userId in the Matching table as they do in the Segments table).
- When running only an overlap analysis, the userId can be the same as the matchingId.
matchingId — String / Email / HashedEmail / PhoneNr / HashedPhoneNr | non-null, unique
- This is the ID that will be used for matching. The most common format is hashed email, which is a SHA-256 hash of an email address.
- Decentriq supports any type of matching ID, including IDs from providers such as OneID, Utiq, netID, First-id, and RampID, and even third-party cookie IDs.
- Note that the matchingId must be unique: two different user IDs should not have the same matchingId.
- For details on the differences between ID types and their validation requirements, see matching ID type.

Example:

userId	matchingId (hashed e-mail)
ab0dc82c-f120-49d1-82d4-0ab994e8410c	3402d61a92a47021279b8b0d3625a6e84142f5352d381…
7913b04c-5457-4d18-828d-46e6395428ab	9cf17fbe88caad4715c1d7f2cc44901d28eb15bfa561…
....	....
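
A minimal sketch of how a hashed-email matchingId could be produced, and how the uniqueness rule could be checked before upload. The normalization step (trimming whitespace and lowercasing) is an assumption of this example, so confirm the exact normalization with all collaboration parties; both function names are hypothetical.

```python
import hashlib
from collections import Counter


def hash_email(email):
    """Return the SHA-256 hex digest of an email address."""
    # Assumption: trim and lowercase first so all parties hash the same string.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def repeated_matching_ids(rows):
    """Return matchingIds that occur in more than one (userId, matchingId) row."""
    counts = Counter(matching_id for _, matching_id in rows)
    return sorted(m for m, n in counts.items() if n > 1)


# Hypothetical rows; the second and third share a matchingId, which is not allowed.
rows = [
    ("user-001", hash_email("alice@example.com")),
    ("user-002", hash_email("bob@example.com")),
    ("user-003", hash_email(" Bob@Example.com ")),
]
repeated = repeated_matching_ids(rows)
```

Here `repeated` contains one entry, because the two spellings of the same address normalize to the same hash.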

Segments table

This table contains the segments to which each user belongs. Segments that appear in this table will appear in the Audience Insights dashboard, so it's important that the labels are human-readable.

Fields:

userId — String | non-null
- Same ID as in the Matching table. Note that userId is not unique: a user may be in more than one segment (often the case in practice).
segment — String | non-null
- Categorical variable that is expected to be human-understandable and consistently spelled (e.g., "Tech Enthusiasts" and "Technology Enthusiast" will be considered two separate segments).
- A minimum of 10 unique segments is required to train the lookalike model.
- A maximum of 2000 segments can be ingested.

Note that the combination of userId and segment, (userId, segment), should be unique. In other words, there should be no duplicate rows in this table.

Example:

userId	segment
ab0dc82c-f120-49d1-82d4-0ab994e8410c	News_Reader
7913b04c-5457-4d18-828d-46e6395428ab	Sports_Fan
....	....
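
The Segments rules above (no duplicate rows, at most 2000 segments, at least 10 unique segments for lookalike training) can be checked before upload. The following is a sketch, not an official validator; the function name `check_segments` and the `for_lookalike` flag are hypothetical.

```python
from collections import Counter

MIN_SEGMENTS_FOR_LOOKALIKE = 10  # minimum stated in this guide
MAX_SEGMENTS = 2000              # maximum stated in this guide


def check_segments(rows, for_lookalike=False):
    """Return a list of problems found in (userId, segment) rows."""
    problems = []

    # Each (userId, segment) pair may appear only once.
    duplicates = [row for row, n in Counter(rows).items() if n > 1]
    if duplicates:
        problems.append(f"{len(duplicates)} duplicated (userId, segment) pair(s)")

    # Count distinct labels; labels are compared exactly as spelled.
    segments = {segment for _, segment in rows}
    if len(segments) > MAX_SEGMENTS:
        problems.append(f"{len(segments)} segments exceeds {MAX_SEGMENTS}")
    if for_lookalike and len(segments) < MIN_SEGMENTS_FOR_LOOKALIKE:
        problems.append(
            f"only {len(segments)} unique segment(s); "
            f"{MIN_SEGMENTS_FOR_LOOKALIKE} needed for lookalike training"
        )
    return problems


# Hypothetical rows; the first and third are the same pair.
rows = [
    ("user-001", "News_Reader"),
    ("user-002", "Sports_Fan"),
    ("user-001", "News_Reader"),
]
problems = check_segments(rows)
```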

Demographics table

This table expresses the common demographic attributes of age and gender. These attributes are not used for lookalike modeling.

Fields:

userId — String | non-null, unique
- Same as before, it needs to match across all Datalab input tables (i.e., with Matching.userId, Segments.userId).
age — String | nullable
- Any format can be used to express a user's age bucket, but it is expected to be human-understandable.
- The most common approach is to use age buckets that already exist in the publisher taxonomy, such as "18-25" and "26-35".
- Note that spelling should be consistent: "18-25" and "18 - 25" will be treated as separate age brackets.
- Decentriq recommends not using more than 10 age buckets to keep bucket sizes large enough.
- Note that this may be left null to express missing information.
gender — String | nullable
- Any format can be used to express gender, but it is expected to be human-understandable and consistently spelled.
- As with other categorical values, "M" and "Male" will be treated as separate genders.
- Decentriq recommends not using more than 5 genders to keep bucket sizes large enough.
- Note that this may be left null to express missing information.

Example:

userId	age	gender
ab0dc82c-f120-49d1-82d4-0ab994e8410c	25-34	M
7913b04c-5457-4d18-828d-46e6395428ab	35-44	F
....	....	....
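
Because age and gender values are compared exactly as spelled, it helps to list the distinct values before upload and look for accidental variants. The sketch below does that and also checks the recommended limits of 10 age buckets and 5 genders. It assumes that a null is represented by an empty CSV field, which this guide does not specify; the function name is hypothetical.

```python
MAX_AGE_BUCKETS = 10  # recommendation stated in this guide
MAX_GENDERS = 5       # recommendation stated in this guide


def summarize_demographics(rows):
    """Summarize (userId, age, gender) rows and collect warnings."""
    user_ids = [user_id for user_id, _, _ in rows]
    # Assumption: an empty string stands for a null value and is skipped.
    ages = sorted({age for _, age, _ in rows if age})
    genders = sorted({gender for _, _, gender in rows if gender})

    warnings = []
    if len(set(user_ids)) != len(user_ids):
        warnings.append("userId is not unique")
    if len(ages) > MAX_AGE_BUCKETS:
        warnings.append(f"{len(ages)} age buckets (recommended: {MAX_AGE_BUCKETS} or fewer)")
    if len(genders) > MAX_GENDERS:
        warnings.append(f"{len(genders)} genders (recommended: {MAX_GENDERS} or fewer)")
    return {"ages": ages, "genders": genders, "warnings": warnings}


# Hypothetical rows; "M" and "Male" show up as two distinct gender values.
rows = [
    ("user-001", "25-34", "M"),
    ("user-002", "35-44", "F"),
    ("user-003", "", "Male"),
]
summary = summarize_demographics(rows)
```

Reviewing `summary["genders"]` here reveals both "M" and "Male", which would be treated as separate genders unless the spelling is harmonized.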

Embeddings table

This table contains user attributes, expressed as multidimensional float vectors, that are used to train a lookalike model. The values are not human-understandable.

Fields:

userId — String | non-null, unique
- Same as before, it needs to match across all Datalab input tables (i.e., with Matching.userId, Segments.userId, Demographics.userId).
- It is unique: each row corresponds to a single user.
scope — String | non-null
- The scope that these embeddings have been trained across.
- Rows that have the same scope use embeddings that are in the same embedding vector space.
- Each scope is trained and scored separately.
- Please note: if only a single modeling scope is used, the data provider may use a dummy value or leave this field empty.
- A common use for this field is when the data provider has data collected from different brands whose attributes do not share the same meaning. In that case, each brand would typically be its own scope.
- It can also be used if models were retrained and the vector space changed. In this case the scope could be the version number of the model.
emb_1 to emb_n — Float vector | non-null
- This is an indeterminate number of columns.
- Typically this is used to record an embeddings vector, with one column per dimension of the embedding vector.
- Every user should have the same number of dimensions, and the dimensions should have the same meaning and be in the same space for every user.
- Nulls can be used, and are treated as a value separate from zero and are not dropped.

Note that the combination of userId and scope, (userId, scope), should also be unique.

Example:

userId	scope	emb_1	emb_2	...
ab0dc82c-f120-49d1-82d4-0ab994e8410c	nightlynews	0.12	0.83	...
7913b04c-5457-4d18-828d-46e6395428ab	nightlynews	0.45	0.66	...
...	...	...	...	...
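
The structural rules for the Embeddings table (unique (userId, scope) pairs, the same number of dimensions in every row, float values) can be checked with a short script. This is a sketch under stated assumptions: rows are lists of strings as read from a CSV file, an empty field stands for a null, and the function name `check_embeddings` is hypothetical.

```python
def check_embeddings(rows):
    """Return problems found in rows shaped [userId, scope, emb_1, ..., emb_n]."""
    problems = []
    seen = set()
    width = None  # number of embedding columns, taken from the first valid row

    for line_no, row in enumerate(rows, start=1):
        if len(row) < 3:
            problems.append(f"line {line_no}: expected userId, scope and at least one value")
            continue
        user_id, scope, *values = row

        # The pair (userId, scope) must not repeat.
        if (user_id, scope) in seen:
            problems.append(f"line {line_no}: duplicate (userId, scope)")
        seen.add((user_id, scope))

        # Every row must have the same number of dimensions.
        if width is None:
            width = len(values)
        elif len(values) != width:
            problems.append(f"line {line_no}: {len(values)} dimensions, expected {width}")

        # Each value must parse as a float; empty fields are treated as nulls.
        for value in values:
            if value == "":
                continue
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: {value!r} is not a float")
    return problems


# Hypothetical rows; the third repeats a pair and has one dimension too few.
rows = [
    ["user-001", "nightlynews", "0.12", "0.83"],
    ["user-002", "nightlynews", "0.45", "0.66"],
    ["user-001", "nightlynews", "0.10"],
]
problems = check_embeddings(rows)
```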
