
Dataset types and formatting guide

Dataset types

A Datalab can contain up to four different datasets, each serving a specific purpose:

| Dataset | Required | Type description | Key columns |
|---|---|---|---|
| Matching | ✔ Required | Maps your internal user ID (userId) to matching identifiers (matchingId), such as hashed email (see supported types), used to join datasets in the DCR. | userId, matchingId |
| Segments | ✔ Required | Maps your internal user ID (userId) to segment labels (segment) for insights and targeting. Each user can have multiple rows with different labels. | userId, segment |
| Demographics | Optional | Maps your internal user ID (userId) to age brackets (age) and gender (gender) for richer analysis. | userId, age, gender |
| Embeddings | Optional | Maps your internal user ID (userId) to numeric embeddings (emb_x) for AI lookalike modeling, optionally grouped by scope (scope). | userId, scope, emb_x |

Important

All datasets must be in CSV format with UTF-8 encoding, without column headers or index columns.
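
As a minimal sketch of this format rule, the snippet below writes rows as UTF-8 CSV with no header line and no index column, using only Python's standard library. The function name `write_headerless_csv`, the sample rows, and the file name in the comment are hypothetical and chosen for illustration only.

```python
import csv
import io


def write_headerless_csv(rows, stream):
    """Write rows as CSV with no header line and no index column."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Hypothetical Matching rows: (userId, matchingId)
rows = [
    ("user-1", "hash-a"),
    ("user-2", "hash-b"),
]

buffer = io.StringIO()
write_headerless_csv(rows, buffer)
csv_text = buffer.getvalue()

# When writing to disk, request UTF-8 explicitly:
# with open("matching.csv", "w", encoding="utf-8", newline="") as f:
#     write_headerless_csv(rows, f)
```

If the data already lives in a pandas DataFrame, `df.to_csv(path, header=False, index=False, encoding="utf-8")` should produce an equivalent file.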

Dataset requirements by collaboration type

Different Media DCR collaboration types require different datasets:

| Collaboration type | Matching | Segments | Demographics | Embeddings |
|---|---|---|---|---|
| Overlap analysis | must have | not needed | not needed | not needed |
| Insights | must have | must have | nice to have | not needed |
| Remarketing audiences | must have | must have | nice to have | not needed |
| Rule-based audiences | must have | must have | nice to have | not needed |
| AI lookalike audiences | must have | must have | not needed | nice to have |
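
The requirements above can be encoded as a small lookup for a pre-upload check. This is only an illustrative sketch: the names `DATASETS`, `REQUIREMENTS`, and `missing_datasets` are hypothetical and not part of any Decentriq API; the values are transcribed from the table.

```python
# Column order used by every tuple in REQUIREMENTS.
DATASETS = ("Matching", "Segments", "Demographics", "Embeddings")

# Requirement levels transcribed from the table above.
REQUIREMENTS = {
    "Overlap analysis": ("must have", "not needed", "not needed", "not needed"),
    "Insights": ("must have", "must have", "nice to have", "not needed"),
    "Remarketing audiences": ("must have", "must have", "nice to have", "not needed"),
    "Rule-based audiences": ("must have", "must have", "nice to have", "not needed"),
    "AI lookalike audiences": ("must have", "must have", "not needed", "nice to have"),
}


def missing_datasets(collaboration_type, provided):
    """Return the 'must have' datasets that are absent from `provided`."""
    levels = REQUIREMENTS[collaboration_type]
    return [
        name
        for name, level in zip(DATASETS, levels)
        if level == "must have" and name not in provided
    ]
```

For example, `missing_datasets("Insights", {"Matching"})` reports that the Segments dataset is still needed.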

Detailed formatting guide and examples

Matching table

The table is used to map a matchingId to a userId.

Fields:

  • userId: String | non-null, non-unique

    • This is an ID assigned by the data owner. For publishers, it is assigned to each user visiting their properties. It is typically generated by a DMP and derived from a 1st party cookie ID.
    • It needs to match across all the Datalab tables (e.g., a given user should have the same userId in the Matching table as they do in the Segments table).
    • When running only an overlap analysis, the userId can be the same as the matchingId.
  • matchingId: String / Email / HashedEmail / PhoneNr / HashedPhoneNr | non-null, unique

    • This is the ID that will be used for matching. The most common format is hashed email, which is a SHA-256 hash of an email address.
    • Decentriq supports any type of matching ID, including IDs from providers such as OneID, Utiq, netID, First-id, and RampID, and even 3rd party cookie IDs.
    • Note that the matchingId must be unique: two different user IDs should not have the same matchingId.
    • For details on the differences between ID types and their validation requirements, see matching ID type.
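
As a rough sketch of preparing this table, the snippet below derives a SHA-256 hashed email and looks for repeated matchingId values. The helper names are hypothetical, and the normalization step (trimming whitespace and lowercasing before hashing) is an assumption made for illustration; the exact validation requirements are described under matching ID type.

```python
import hashlib
from collections import Counter


def hash_email(email):
    """Return the SHA-256 hex digest of an email address."""
    # Assumption: trim and lowercase before hashing so that trivially
    # different spellings of one address give the same digest.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def repeated_matching_ids(rows):
    """Return matchingIds that occur in more than one (userId, matchingId) row."""
    counts = Counter(matching_id for _, matching_id in rows)
    return sorted(m for m, n in counts.items() if n > 1)


# Hypothetical rows; the second matchingId is deliberately repeated.
rows = [
    ("user-1", hash_email("alice@example.com")),
    ("user-2", hash_email("bob@example.com")),
    ("user-3", hash_email(" Bob@Example.com ")),
]
repeated = repeated_matching_ids(rows)
```

Here `repeated` contains the single digest shared by `user-2` and `user-3`, which would violate the uniqueness requirement on matchingId.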

Example:

| userId | matchingId (hashed e-mail) |
|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 3402d61a92a47021279b8b0d3625a6e84142f5352d381… |
| 7913b04c-5457-4d18-828d-46e6395428ab | 9cf17fbe88caad4715c1d7f2cc44901d28eb15bfa561… |
| .... | .... |

Segments table

This table contains the segments to which each user belongs. Segments that appear in this table will appear in the Audience Insights dashboard, so it's important that the labels are human-readable.

Fields:

  • userId: String | non-null

    • Same ID as in the Matching table. Note that userId is not unique: a user may be in more than one segment (often the case in practice).
  • segment: String | non-null

    • Categorical variable that is expected to be human-understandable and consistently spelled (e.g., "Tech Enthusiasts" and "Technology Enthusiast" will be considered two separate segments).
    • A minimum of 10 unique segments is required to train the lookalike model.
    • A maximum of 2000 segments can be ingested.

Note that each (userId, segment) combination should be unique. In other words, there should be no duplicate rows in this table.
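
The constraints on this table can be checked before upload. The following is a minimal sketch, assuming the rows have already been read from the headerless CSV as (userId, segment) pairs; `check_segments` and the constant names are hypothetical, while the thresholds come from the field descriptions above.

```python
from collections import Counter

MIN_SEGMENTS_FOR_LOOKALIKE = 10  # minimum unique segments to train the lookalike model
MAX_SEGMENTS = 2000              # maximum number of segments that can be ingested


def check_segments(rows):
    """Return a list of human-readable problems found in (userId, segment) rows."""
    rows = [tuple(row) for row in rows]
    problems = []

    # Both fields are non-null.
    for row_number, (user_id, segment) in enumerate(rows, start=1):
        if not user_id or not segment:
            problems.append(f"row {row_number}: userId and segment must be non-null")

    # No duplicate (userId, segment) rows.
    duplicates = [pair for pair, count in Counter(rows).items() if count > 1]
    if duplicates:
        problems.append(f"{len(duplicates)} duplicated (userId, segment) pair(s)")

    # Limits on the number of distinct segment labels.
    unique_segments = {segment for _, segment in rows if segment}
    if len(unique_segments) > MAX_SEGMENTS:
        problems.append(f"more than {MAX_SEGMENTS} unique segments")
    if len(unique_segments) < MIN_SEGMENTS_FOR_LOOKALIKE:
        problems.append(
            f"fewer than {MIN_SEGMENTS_FOR_LOOKALIKE} unique segments: "
            "not enough to train the lookalike model"
        )
    return problems
```

An empty list means none of these checks failed. Note that segment labels are compared as exact strings, so inconsistent spellings count as separate segments.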

Example:

| userId | segment |
|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | News_Reader |
| 7913b04c-5457-4d18-828d-46e6395428ab | Sports_Fan |
| .... | .... |

Demographics table

This table holds the common demographic attributes of age and gender. These attributes are not used for lookalike modeling.

Fields:

  • userId: String | non-null, unique

    • As before, it needs to match across all Datalab input tables (i.e., with Matching.userId, Segments.userId).
  • age: String | nullable

    • Any format can be used to express a user's age bucket, but it is expected to be human-understandable.
    • The most common approach is to use age buckets that already exist in the publisher taxonomy, such as "18-25" and "26-35".
    • Note that spelling should be consistent: "18-25" and "18 - 25" will be treated as separate age brackets.
    • Decentriq recommends not using more than 10 age buckets to keep bucket sizes large enough.
    • Note that this may be left null to express missing information.
  • gender: String | nullable

    • Any format can be used to express gender, but it is expected to be human-understandable and consistently spelled.
    • As with other categorical values, "M" and "Male" will be treated as separate genders.
    • Decentriq recommends not using more than 5 genders to keep bucket sizes large enough.
    • Note that this may be left null to express missing information.
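
Because age and gender are free-form strings, it can help to list the distinct values before uploading so that inconsistent spellings stand out. This is a sketch only: `summarize_demographics` is a hypothetical helper, it assumes an empty CSV field stands for null, and the limits of 10 and 5 are the recommendations stated above.

```python
MAX_AGE_BUCKETS = 10  # recommended upper bound on distinct age buckets
MAX_GENDERS = 5       # recommended upper bound on distinct gender values


def summarize_demographics(rows):
    """Summarize (userId, age, gender) rows; '' is treated as null (assumption)."""
    seen_users = set()
    duplicate_users = set()
    ages = set()
    genders = set()
    for user_id, age, gender in rows:
        if user_id in seen_users:
            duplicate_users.add(user_id)  # userId must be unique in this table
        seen_users.add(user_id)
        if age:
            ages.add(age)
        if gender:
            genders.add(gender)
    return {
        "duplicate_user_ids": sorted(duplicate_users),
        "age_values": sorted(ages),
        "gender_values": sorted(genders),
        "too_many_age_buckets": len(ages) > MAX_AGE_BUCKETS,
        "too_many_genders": len(genders) > MAX_GENDERS,
    }


# Hypothetical rows with inconsistent spellings and one row of missing values.
report = summarize_demographics([
    ("u1", "18-25", "M"),
    ("u2", "18 - 25", "Male"),
    ("u3", "", ""),
])
```

In this example the report lists both "18-25" and "18 - 25" as age values, and both "M" and "Male" as gender values, which signals that the labels should be harmonized.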

Example:

| userId | age | gender |
|---|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | 25-34 | M |
| 7913b04c-5457-4d18-828d-46e6395428ab | 35-44 | F |
| .... | .... | .... |

Embeddings table

This table contains user attributes, expressed as multidimensional float vectors, that are used to train lookalike audiences. The values are not human-understandable.

Fields:

  • userId: String | non-null, unique

    • As before, it needs to match across all Datalab input tables (i.e., with Matching.userId, Segments.userId, Demographics.userId).
    • It is unique: each row corresponds to a single user.
  • scope: String | non-null

    • The scope that these embeddings have been trained across.
    • Rows that have the same scope use embeddings that are in the same embedding vector space.
    • Each scope is trained and scored separately.
    • Please note: if only a single modeling scope is used, the data provider may use a dummy value or leave this field empty.
    • A common use for this field is when the data provider has data collected from different brands and the attributes don't have the same meaning across them. In that case, each brand would probably be its own scope.
    • It can also be used if models were retrained and the vector space changed. In this case, the scope could be the version number of the model.
  • embed_0 to embed_n: Float vector | non-null

    • This is an indeterminate number of columns.
    • Typically this is used to record an embeddings vector, with one column per dimension of the embedding vector.
    • Every user should have the same number of dimensions, and the dimensions should have the same meaning and be in the same space for every user.
    • Nulls can be used. They are treated as a value distinct from zero and are not dropped.

Note that the combination of userId and scope, (userId, scope), should also be unique.
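
The structural rules for this table can be sketched as a simple check. The snippet assumes each row has been read from the headerless CSV as a list of strings in the order userId, scope, then one value per embedding dimension, and that an empty field stands for null. Both function names are hypothetical.

```python
def parse_embedding_row(row):
    """Split a headerless row into (userId, scope, vector)."""
    user_id, scope, *values = row
    # Keep nulls as None so they stay distinct from 0.0.
    vector = [float(value) if value != "" else None for value in values]
    return user_id, scope, vector


def check_embeddings(rows):
    """Return problems: duplicate (userId, scope) keys or uneven dimensions."""
    problems = []
    seen_keys = set()
    expected_dims = None
    for row_number, row in enumerate(rows, start=1):
        user_id, scope, vector = parse_embedding_row(row)
        key = (user_id, scope)
        if key in seen_keys:
            problems.append(f"row {row_number}: duplicate (userId, scope) {key}")
        seen_keys.add(key)
        if expected_dims is None:
            expected_dims = len(vector)  # the first row sets the dimension count
        elif len(vector) != expected_dims:
            problems.append(
                f"row {row_number}: expected {expected_dims} dimensions, "
                f"got {len(vector)}"
            )
    return problems
```

The check only covers structure. Whether the dimensions share the same meaning and vector space within a scope cannot be verified from the file alone.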

Example:

| userId | scope | emb_1 | emb_2 | ... |
|---|---|---|---|---|
| ab0dc82c-f120-49d1-82d4-0ab994e8410c | nightlynews | 0.12 | 0.83 | ... |
| 7913b04c-5457-4d18-828d-46e6395428ab | nightlynews | 0.45 | 0.66 | ... |
| ... | ... | ... | ... | ... |