Identifying users (identity stitching)

yali · April 3, 2016, 2:47pm

Identifying users in your Snowplow data

Any attempt at understanding your users has to start with identifying which events, in your Snowplow atomic data set, describe actions that were carried about by particular users. We call this process ‘identity stitching’: it is the process of stitching together the myriad individual events that describe individual user journeys.

Stages in identity stitching

With each event, track as many different user identifiers as possible
Use events where multiple identifiers are present to build a mapping table (graph) of user identifiers for each of your users
Apply that mapping table / graph to your atomic event-level data to identify which event belongs to each of your users

1. Track as many different user identifiers as possible with each event

With each event tracked in Snowplow, we want to capture as many different user-level identifiers as possible. Note that at data capture time we are not interested in definitively deciding which user performed this action. We simply want to capture all the evidence (data points) so that we can make a decision later on in the data modeling process.

Snowplow trackers are build to automatically capture as many of these identifiers as possible automatically:

Collector provided fields

events.user_ipaddress - the IP address that the event occurred on
events.network_userid - third party cookie ID (set by the Clojure Collector and Scala Stream)

All trackers

events.user_id - a user-level identifier that you can set

Javascript tracker

events.domain_userid - first party cookie ID
events.domain_sessionid - third party session cookie
events.user_fingerprint - browser fingerprint

Mobile trackers (Objective-C and Android)

com_snowplowanalytics_snowplow_mobile_context_1.open_idfa - open IDFA (user identifier for advertisers)
com_snowplowanalytics_snowplow_mobile_context_1.apple_idfa - Apple user identifier for advertisers (Apple-only)
com_snowplowanalytics_snowplow_mobile_context_1.apple_idfv - Apple user identifier for vendors i.e. application owners
com_snowplowanalytics_snowplow_mobile_context_1.android_idfa - Android user identifier for advertisers
com_snowplowanalytics_snowplow_client_session_1.user_id - user ID generated client-side by Snowplow Objective-C and Android trackers
com_snowplowanalytics_snowplow_client_session_1.session_id - session ID generated client-side by Snowplow Objective-C and Android trackers

Adding your own identifiers

You can define and pass in as many of your own user-level identifiers, by identifying your own user context e.g.

{
	"$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
	"description": "Schema for an ad click event",
	"self": {
		"vendor": "com.mycompany",
		"name": "user_context",
		"format": "jsonschema",
		"version": "1-0-0"
	},

	"type": "object",
	"properties": {
		"id": {
			"type": "string"
		},
		"email": {
			"type": "string"
		},
		"twitterHandle": {
			"type": "string"
		},
		"facebookId": {
			"type": "string"
		}
	},
	"additionalProperties": false
}

It is then possible to send this context i.e. any of these identifiers with any event recorded into Snowplow.

2. Build your user identifier mapping table (graph)

Now we have an atomic data set with a range of different events, often recorded against different platforms, each with a different set of one or more user identifiers.

To take a very typical example, we might have a webapp where users browse marketing material over several sessions, after which a fraction sign up to the service (creating a login ID), after which events are recorded against the login ID.

In the above example, we’d want to be able to:

Correctly aggregate events from before the user signed up, with events after
Correctly aggregate events recorded on different devices that the user accesses the service on

In this case, we’d start by identifying all the events where the user was logged in. For these events, we should have both a cookie ID (events.domain_userid) and a login ID events.user_id:

create table derived.user_mapping as (
	select
	domain_userid,
	user_id
	from atomic.events
	where domain_userid is not null
	and user_id is not null
	group by 1,2
);

The above table maps cookie IDs to user IDS. Note that if a user logs in on multiple devices, each with its own cookie ID, all those cookie IDs will be correctly mapped to the same user.

3. Apply that mapping table (graph) to your `atomic` event-level data to identify which event belongs to each of your users

We can, as part of the data modeling, use the above table to assign a user ID to a particular event ID, given the cookie ID. So if we run a query like the following:

select
em.user_id,
...
from atomic.events e
left join derived.user_mapping um
on e.domain_userid = um.domain_userid

The value of user_id will be set for every event whether or not our user happened to be logged in when the event occurred. Hence, events tracked when the user was browsing anonymously (prior to signing up with the service) will be correctly identified as belong to that user, with those events that were tracked subsequently.

The above example is pretty simple. You can build more complicated mapping tables, and mapping logic, for example, by:

Adding in additional user identifiers e.g. for events recorded in mobile apps and other platforms
Using more complicated business logic when applying the mapping table to the event level data set (e.g. controlling for particular devices which multiple different users share), or using a probabilistic rather than rule-based / deterministic approach

zivbaram · December 23, 2016, 2:40pm

Hi @yali, I’m looking to extend these with additional 3rd party identifiers for example Google GCLID & CID and similar parameters passed by traffic sources or accessible from their cookies.
Is there a place where you have already documented such functions and parameters?
How can I share what we are doing in this area?

yali · December 24, 2016, 8:50pm

Hi @zivbaram that’s great to hear! Please feel free to share what you’re doing either on this thread if it follows directly or by starting a new thread.

Looking forward to hearing more!

Yali

petervcook · July 22, 2019, 7:19pm

Regarding the sample script in bullet point 3 of this article, should the left join line end in em instead of um?

sonnypolaris · August 10, 2019, 3:43pm

Do you imagine mapping domain_userid to many different types of id? We may have multiple source system that assign a value to the atomic.events.user_id field but that then needs to be mapped into the true identity based on some additional rules. I’m thinking of the following:

create a table that maps domain_userid to a generic id & id_type like below

create table user_mapping 
(domain_userid string,
 user_id string,
 user_id_type);

load that table with multiple passes (see examples)

-- pass 1 - assume the application conforms to a standard 
-- of the (i.e. email hash), so I now how how to match / stitch it
insert into user_mapping (domain_userid, user_id, user_id_type)
  select e.domain_userid, e.user_id, "STANDARD_ID"
    from  atomic.events e
    where e.domain_userid is not null
       and e._user_id is not null
    group by 1, 2;

-- pass 2 : event.user_id may or may not be set  
-- so look at what the site passed
-- in the structured events (SE... columns)
-- Line of Business (LOB-001) use a non-standard id
insert into user_mapping (domain_userid, user_id, user_id_type)
  select domain_userid, se_value::string, "LOB-001"
    from atomic.events e
    where 
      se_category = 'QUOTE_REQUEST'
      and se_value is not null
    group by 1, 2;

-- pass 3 : get an 3rd party id from query string param
insert into user_mapping (domain_userid, user_id, user_id_type)
  select domain_userid, 
    e.page_urlquery, -- to be fair this need to be parsed 
    "LOB-002"
  where e.refr_urlhost = 'third_party_partner.com'
  group by 1, 2;

Finally, write queries to link / match the visitor to my MDM (not a true mdm) according to the rules for specific type "STANDARD_ID", "LOB_001", OR "LOB_002"

STANDARD_ID may be a direct look up based on PK
LOB_001 may be the internal LOB ID and need to go through a couple of joins to identify the user (think Salesforce or a POS)
LOB_002 may be a 3rd party partner that passes a different type of key.

I know that is a lot but I also know this community is doing some great things with Snowplow analytics so any thoughts or ideas will be appreciated.

Sonny

rosner · July 29, 2021, 12:30pm

It might be worth updating this post to point to the latest resources on this topic for anybody who stumbles upon this in the future.

ihor · July 29, 2021, 5:42pm

The “fresher” posts on the subject I located:

Topic		Replies	Views
Tracking anonymous users For engineers	3	2517	July 22, 2019
Unified package - user identifier stitching For data modelers & consumers	5	141	July 11, 2024
Identifying apps users For data modelers & consumers	1	1326	August 9, 2017
User_id in atomic events table vs. userId in client session context For engineers	1	978	September 5, 2022
Cross domain tracking with Snowplow Mini? For data modelers & consumers	2	2684	September 7, 2016