Dataset types and formatting guide
Dataset types
A Datalab can contain up to four different datasets, each serving a specific purpose:
| Dataset | Required | Type description | Key columns |
|---|---|---|---|
| Matching | ✔ Required | Maps your internal user ID (userId) to matching identifiers (matchingId) such as hashed email (see supported types) used to join datasets in the DCR. | userId, matchingId |
| Segments | ✔ Required | Maps your internal user ID (userId) to segment labels (segment) for insights and targeting. Each user can have multiple rows with different labels. | userId, segment |
| Demographics | Optional | Maps your internal user ID (userId) to age brackets (age) and gender (gender) for richer analysis. | userId, age, gender |
| Embeddings | Optional | Maps your internal user ID (userId) to numeric embeddings (emb_x) for AI lookalike modeling, optionally grouped by scope (scope). | userId, scope, emb_x |
All datasets must be in CSV format with UTF-8 encoding, without column headers or index columns.
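As a minimal sketch of the format rule above, the following writes a headerless, index-free CSV using Python's standard library. The sample rows and the helper name `write_headerless_csv` are hypothetical and are not part of any Decentriq tooling.

```python
import csv
import io


def write_headerless_csv(rows, stream):
    """Write rows as CSV with no header row and no index column."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Hypothetical sample rows for a Matching table: (userId, matchingId).
matching_rows = [
    ("user-001", "id-aaa"),
    ("user-002", "id-bbb"),
]

buffer = io.StringIO()
write_headerless_csv(matching_rows, buffer)
csv_text = buffer.getvalue()

# Encode explicitly as UTF-8 before uploading or writing to disk.
csv_bytes = csv_text.encode("utf-8")
```

When writing straight to a file instead, opening it with `open(path, "w", encoding="utf-8", newline="")` gives the same result.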
Dataset requirements by collaboration type
Different Media DCR collaboration types require different datasets:
| Collaboration type | Matching | Segments | Demographics | Embeddings |
|---|---|---|---|---|
| Overlap analysis | must have | not needed | not needed | not needed |
| Insights | must have | must have | nice to have | not needed |
| Remarketing audiences | must have | must have | nice to have | not needed |
| Rule-based audiences | must have | must have | nice to have | not needed |
| AI lookalike audiences | must have | must have | not needed | nice to have |
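The "must have" cells of the table above can be expressed as a small lookup for a pre-upload check. This is only an illustrative sketch: the dictionary `REQUIRED` and the function `missing_datasets` are hypothetical names, and the mapping simply transcribes the table.

```python
# "must have" datasets per collaboration type, transcribed from the table.
REQUIRED = {
    "Overlap analysis": {"Matching"},
    "Insights": {"Matching", "Segments"},
    "Remarketing audiences": {"Matching", "Segments"},
    "Rule-based audiences": {"Matching", "Segments"},
    "AI lookalike audiences": {"Matching", "Segments"},
}


def missing_datasets(collaboration_type, provided):
    """Return the required datasets that have not been provided, sorted by name."""
    return sorted(REQUIRED[collaboration_type] - set(provided))
```

For example, `missing_datasets("Insights", ["Matching"])` reports that the Segments dataset is still needed.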
Detailed formatting guide and examples
Matching table
The table is used to map a matchingId to a userId.
Fields:
userId — String | non-null, non-unique
- This is an ID assigned by the data owner. For publishers, it is assigned to each user visiting their properties. It is typically generated by a DMP and derived from a 1st party cookie ID.
- It needs to match across all the Datalab tables (e.g., a given user should have the same userId in the Matching table as they do in the Segments table).
- When running only an overlap analysis, the userId can be the same as the matchingId.
matchingId — String / Email / HashedEmail / PhoneNr / HashedPhoneNr | non-null, unique
- This is the ID that will be used for matching. The most common format is hashed email, which is a SHA-256 hash of an email address.
- Decentriq supports any type of matching ID, including from ID providers such as OneID, Utiq, netID, First-id, RampID, even 3rd party cookie IDs.
- Note that the matchingId must be unique: two different user IDs should not have the same matchingId.
- For details on the differences between ID types and their validation requirements, see matching ID type.
Example:
| userId | matchingId (hashed e-mail) |
|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 3402d61a92a47021279b8b0d3625a6e84142f5352d381… |
| 7913b04c-5457-4d18-828d-46e6395428ab | 9cf17fbe88caad4715c1d7f2cc44901d28eb15bfa561… |
| .... | .... |
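A rough sketch of how a hashed-email matchingId could be produced and how the uniqueness rule could be checked before upload. The normalization step (trimming whitespace and lowercasing) is an assumption rather than a documented requirement, so confirm the expected normalization for your ID type. Both function names are hypothetical.

```python
import hashlib
from collections import Counter


def hash_email(email):
    """Return the SHA-256 hex digest of an email address."""
    # Assumption: trim and lowercase first so that the same address always
    # hashes to the same value. Check the required normalization separately.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicate_matching_ids(rows):
    """Return every matchingId that appears in more than one row.

    rows is an iterable of (userId, matchingId) pairs.
    """
    counts = Counter(matching_id for _user_id, matching_id in rows)
    return sorted(m for m, count in counts.items() if count > 1)
```

An empty list from `duplicate_matching_ids` means every matchingId appears exactly once.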
Segments table
This table contains the segments to which a user belongs. Segments in this table appear in the Audience Insights dashboard, so it is important that the labels are human-readable.
Fields:
userId — String | non-null
- Same ID as in the Matching table. Note that userId is not unique: a user may be in more than one segment (often the case in practice).
segment — String | non-null
- Categorical variable that is expected to be human-understandable and consistently spelled (e.g., "Tech Enthusiasts" and "Technology Enthusiast" will be considered two separate segments).
- A minimum of 10 unique segments is required to train the lookalike model.
- A maximum of 2000 segments can be ingested.
Note that the combination (userId, segment) should be unique. In other words, there should be no duplicate rows in this table.
Example:
| userId | segment |
|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | News_Reader |
| 7913b04c-5457-4d18-828d-46e6395428ab | Sports_Fan |
| .... | .... |
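The Segments rules above (no duplicate rows, at least 10 unique segments for lookalike training, at most 2000 segments) could be checked locally with something like the following. The function name and the shape of the returned report are hypothetical, and this is a sketch rather than the validation Decentriq itself performs.

```python
from collections import Counter

MIN_SEGMENTS_FOR_LOOKALIKE = 10  # minimum unique segments to train the lookalike model
MAX_SEGMENTS = 2000              # maximum number of segments that can be ingested


def check_segments(rows):
    """Summarize rule violations for a list of (userId, segment) tuples."""
    row_counts = Counter(rows)
    segments = {segment for _user_id, segment in rows}
    return {
        # Rows that occur more than once violate (userId, segment) uniqueness.
        "duplicate_rows": sorted(r for r, count in row_counts.items() if count > 1),
        "segment_count": len(segments),
        "enough_for_lookalike": len(segments) >= MIN_SEGMENTS_FOR_LOOKALIKE,
        "within_limit": len(segments) <= MAX_SEGMENTS,
    }
```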
Demographics table
This table holds the common demographic attributes of age and gender. These attributes are not used for lookalike modeling.
Fields:
userId — String | non-null, unique
- Same as before, it needs to match across all Datalab input tables (i.e., with Matching.userId, Segments.userId).
age — String | nullable
- Any format can be used to express a user's age bucket, but it is expected to be human-understandable.
- The most common approach is to use age buckets that already exist in the publisher taxonomy, such as "18-25" and "26-35".
- Note that spelling should be consistent: "18-25" and "18 - 25" will be treated as separate age brackets.
- Decentriq recommends not using more than 10 age buckets to keep bucket sizes large enough.
- Note that this may be left null to express missing information.
gender — String | nullable
- Any format can be used to express gender, but it is expected to be human-understandable and consistently spelled.
- As with other categorical values, "M" and "Male" will be treated as separate genders.
- Decentriq recommends not using more than 5 genders to keep bucket sizes large enough.
- Note that this may be left null to express missing information.
Example:
| userId | age | gender |
|---|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 25-34 | M |
| 7913b04c-5457-4d18-828d-46e6395428ab | 35-44 | F |
| .... | .... | .... |
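Because inconsistently spelled labels are treated as separate categories, it can help to scan the age and gender columns for near-duplicates before upload. The sketch below groups labels that differ only by whitespace or letter case. The function name is hypothetical, and this heuristic will not catch abbreviations such as "M" versus "Male".

```python
import re
from collections import defaultdict


def near_duplicate_labels(labels):
    """Group labels that differ only by whitespace or letter case."""
    groups = defaultdict(set)
    for label in labels:
        if label is None or label == "":
            continue  # null values express missing information; skip them
        key = re.sub(r"\s+", "", label).lower()
        groups[key].add(label)
    # Keep only the groups that contain more than one distinct spelling.
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]
```

For instance, passing `["18-25", "18 - 25", "26-35"]` flags the first two labels as one group to reconcile.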
Embeddings table
This table contains user attributes, expressed as multidimensional floating-point vectors, that are used to train lookalike audience models. The values are not meant to be human-understandable.
Fields:
userId — String | non-null, unique
- Same as before, it needs to match across all Datalab input tables (i.e., with Matching.userId, Segments.userId, Demographics.userId).
- It is unique: each row corresponds to a single user.
scope — String | non-null
- The scope that these embeddings have been trained across.
- Rows that have the same scope use embeddings that are in the same embedding vector space.
- Each scope is trained and scored separately.
- Please note: if only a single modeling scope is used, the data provider may use a dummy value or leave this field empty.
- A common use for this field is when the data provider has data collected from different brands whose attributes don't have the same meaning. In that case, each brand would typically be its own scope.
- It can also be used if models were retrained and the vector space changed — in this case the scope could be the version number of the model.
embed_0 to embed_n — Float vector | non-null
- The number of these columns is not fixed.
- Typically this is used to record an embeddings vector, with one column per dimension of the embedding vector.
- Every user should have the same number of dimensions, and the dimensions should have the same meaning and be in the same space for every user.
- Nulls are allowed. They are treated as a value distinct from zero and are not dropped.
Note that the combination of userId and scope, (userId, scope), should also be unique.
Example:
| userId | scope | emb_1 | emb_2 | ... |
|---|---|---|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | nightlynews | 0.12 | 0.83 | ... |
| 7913b04c-5457-4d18-828d-46e6395428ab | nightlynews | 0.45 | 0.66 | ... |
| ... | ... | ... | ... | ... |
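A sketch of a local sanity check for the Embeddings table: every row should have the same number of dimensions, and each (userId, scope) pair should appear once. It assumes that a null is encoded as an empty CSV field, which is an assumption rather than something stated above. Both function names are hypothetical.

```python
import csv
import io


def parse_embedding_rows(csv_text):
    """Parse headerless CSV text into (userId, scope, vector) tuples."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        user_id, scope, *values = record
        # Assumption: an empty field stands for null and is kept as None,
        # distinct from 0.0.
        vector = [float(v) if v != "" else None for v in values]
        rows.append((user_id, scope, vector))
    return rows


def check_embeddings(rows):
    """Report the vector lengths found and any repeated (userId, scope) keys."""
    dimensions = {len(vector) for _user_id, _scope, vector in rows}
    seen = set()
    duplicate_keys = []
    for user_id, scope, _vector in rows:
        key = (user_id, scope)
        if key in seen:
            duplicate_keys.append(key)
        seen.add(key)
    return {"dimensions": sorted(dimensions), "duplicate_keys": duplicate_keys}
```

A healthy table yields exactly one entry in `dimensions` and an empty `duplicate_keys` list.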