Identifying users in your Snowplow data
Any attempt at understanding your users has to start with identifying which events in your Snowplow atomic data set describe actions that were carried out by particular users. We call this process ‘identity stitching’: stitching together the myriad individual events that make up individual user journeys.
Stages in identity stitching
- With each event, track as many different user identifiers as possible
- Use events where multiple identifiers are present to build a mapping table (graph) of user identifiers for each of your users
- Apply that mapping table / graph to your atomic event-level data to identify which events belong to each of your users
1. Track as many different user identifiers as possible with each event
With each event tracked in Snowplow, we want to capture as many different user-level identifiers as possible. Note that at data capture time we are not interested in definitively deciding which user performed this action. We simply want to capture all the evidence (data points) so that we can make a decision later on in the data modeling process.
Snowplow trackers are built to capture as many of these identifiers as possible automatically:
Collector provided fields
- events.user_ipaddress - the IP address that the event occurred on
- events.network_userid - third party cookie ID (set by the Clojure Collector and the Scala Stream Collector)
All trackers
- events.user_id - a user-level identifier that you can set
JavaScript tracker
- events.domain_userid - first party cookie ID
- events.domain_sessionid - first party session ID
- events.user_fingerprint - browser fingerprint
Mobile trackers (Objective-C and Android)
- com_snowplowanalytics_snowplow_mobile_context_1.open_idfa - open IDFA (user identifier for advertisers)
- com_snowplowanalytics_snowplow_mobile_context_1.apple_idfa - Apple user identifier for advertisers (Apple-only)
- com_snowplowanalytics_snowplow_mobile_context_1.apple_idfv - Apple user identifier for vendors i.e. application owners
- com_snowplowanalytics_snowplow_mobile_context_1.android_idfa - Android user identifier for advertisers
- com_snowplowanalytics_snowplow_client_session_1.user_id - user ID generated client-side by Snowplow Objective-C and Android trackers
- com_snowplowanalytics_snowplow_client_session_1.session_id - session ID generated client-side by Snowplow Objective-C and Android trackers
Adding your own identifiers
You can define and pass in as many of your own user-level identifiers as you like by defining your own user context, e.g.
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a user context",
  "self": {
    "vendor": "com.mycompany",
    "name": "user_context",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "id": {
      "type": "string"
    },
    "email": {
      "type": "string"
    },
    "twitterHandle": {
      "type": "string"
    },
    "facebookId": {
      "type": "string"
    }
  },
  "additionalProperties": false
}
It is then possible to send this context, i.e. any of these identifiers, with any event recorded into Snowplow.
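As a rough illustration, an instance of this context attached to an event might be built as follows. This is only a sketch: it assumes the schema above is addressable as iglu:com.mycompany/user_context/jsonschema/1-0-0 (derived from its "self" block), and the helper build_user_context and the sample values are hypothetical, not part of any Snowplow tracker API.

```python
import json

# Iglu URI derived from the "self" block of the schema above
# (vendor / name / format / version). Assumed for illustration.
USER_CONTEXT_SCHEMA = "iglu:com.mycompany/user_context/jsonschema/1-0-0"

# Properties the schema allows; "additionalProperties": false forbids others.
ALLOWED_FIELDS = {"id", "email", "twitterHandle", "facebookId"}


def build_user_context(**identifiers):
    """Hypothetical helper: wrap known identifiers in a self-describing JSON."""
    unknown = set(identifiers) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"fields not in schema: {sorted(unknown)}")
    non_string = [k for k, v in identifiers.items() if not isinstance(v, str)]
    if non_string:
        raise TypeError(f"fields must be strings: {sorted(non_string)}")
    return {"schema": USER_CONTEXT_SCHEMA, "data": dict(identifiers)}


# Only the identifiers known at tracking time need to be supplied.
context = build_user_context(id="u-123", email="jane@example.com")
print(json.dumps(context, indent=2))
```

Because none of the properties is marked as required in the schema, an event can carry whichever subset of identifiers happens to be available when it is tracked.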
2. Build your user identifier mapping table (graph)
Now we have an atomic data set with a range of different events, often recorded on different platforms, each with a different set of one or more user identifiers.
To take a very typical example, we might have a web app where users browse marketing material over several sessions. A fraction of them then sign up to the service (creating a login ID), after which their events are recorded against that login ID.
In the above example, we’d want to be able to:
- Correctly aggregate events from before the user signed up, with events after
- Correctly aggregate events recorded on different devices that the user accesses the service on
In this case, we’d start by identifying all the events where the user was logged in. For these events, we should have both a cookie ID (events.domain_userid) and a login ID (events.user_id):
-- One row per distinct (cookie ID, login ID) pair seen on the same event
create table derived.user_mapping as (
  select
    domain_userid,
    user_id
  from atomic.events
  where domain_userid is not null
    and user_id is not null
  group by 1,2
);
The above table maps cookie IDs to user IDs. Note that if a user logs in on multiple devices, each with its own cookie ID, all those cookie IDs will be correctly mapped to the same user.
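To see the multi-device behaviour concretely, the query above can be run against a handful of toy events. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the warehouse; the table names drop the atomic and derived schema prefixes, the grouping columns are written out by name, and all IDs are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy stand-in for atomic.events with only the columns needed here.
conn.execute(
    "create table events (event_id text, domain_userid text, user_id text)"
)
conn.executemany(
    "insert into events values (?, ?, ?)",
    [
        ("e1", "cookie-laptop", None),     # anonymous browsing
        ("e2", "cookie-laptop", "alice"),  # logged in on the laptop
        ("e3", "cookie-laptop", "alice"),  # same pair again, collapsed below
        ("e4", "cookie-phone", "alice"),   # same user on a second device
        ("e5", "cookie-other", None),      # visitor who never logs in
    ],
)

# Same logic as the mapping query above.
conn.execute(
    """
    create table user_mapping as
    select domain_userid, user_id
    from events
    where domain_userid is not null
      and user_id is not null
    group by domain_userid, user_id
    """
)

mapping = conn.execute(
    "select domain_userid, user_id from user_mapping order by domain_userid"
).fetchall()
print(mapping)  # [('cookie-laptop', 'alice'), ('cookie-phone', 'alice')]
```

Both of the user's cookie IDs end up mapped to the same login ID, while the cookie that was never seen with a login does not appear in the mapping at all.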
3. Apply that mapping table (graph) to your atomic event-level data to identify which events belong to each of your users
We can, as part of the data modeling, use the above table to assign a user ID to a particular event ID, given the cookie ID. So if we run a query like the following:
select
  um.user_id,
  ...
from atomic.events e
left join derived.user_mapping um
  on e.domain_userid = um.domain_userid
In the result, user_id is set for every event whose cookie ID appears in the mapping table, whether or not the user happened to be logged in when the event occurred. Hence, events tracked while the user was browsing anonymously (prior to signing up for the service) are correctly identified as belonging to that user, along with the events that were tracked subsequently. Events from cookie IDs that have never been seen alongside a login ID are left with a null user_id.
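The effect of the left join can be checked on a small example. As before, this is a sketch using Python's sqlite3 module with made-up data and unprefixed table names; the column alias stitched_user_id is introduced here purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table events (event_id text, domain_userid text, user_id text);
    insert into events values
      ('e1', 'cookie-laptop', null),     -- anonymous, before sign-up
      ('e2', 'cookie-laptop', 'alice'),  -- logged in
      ('e3', 'cookie-other',  null);     -- cookie never seen with a login

    create table user_mapping (domain_userid text, user_id text);
    insert into user_mapping values ('cookie-laptop', 'alice');
    """
)

# Same join as above: the user ID comes from the mapping table,
# not from the event row itself.
rows = conn.execute(
    """
    select
      e.event_id,
      um.user_id as stitched_user_id
    from events e
    left join user_mapping um
      on e.domain_userid = um.domain_userid
    order by e.event_id
    """
).fetchall()

for event_id, stitched_user_id in rows:
    print(event_id, stitched_user_id)
```

The anonymous event e1 picks up the login ID via its cookie, while e3 stays unassigned because its cookie has no entry in the mapping table. Note that a cookie ID mapped to more than one login ID would produce one output row per match, which is one reason shared devices need extra handling.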
The above example is pretty simple. You can build more complicated mapping tables and mapping logic, for example by:
- Adding in additional user identifiers e.g. for events recorded in mobile apps and other platforms
- Using more complicated business logic when applying the mapping table to the event level data set (e.g. controlling for particular devices which multiple different users share), or using a probabilistic rather than rule-based / deterministic approach
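When more than two kinds of identifier are involved, the mapping is naturally a graph: identifiers are nodes, and two identifiers are linked whenever they appear on the same event. One possible deterministic approach, sketched below, is to treat each connected component of that graph as one user, using a union-find structure. The function name stitch_identifiers, the (type, value) tuple representation, and the sample IDs are assumptions made for this example.

```python
from collections import defaultdict


def stitch_identifiers(pairs):
    """Group identifiers into users.

    `pairs` holds identifiers observed together on one event. Identifiers
    that are linked directly or indirectly end up in the same group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    groups = defaultdict(set)
    for identifier in list(parent):
        groups[find(identifier)].add(identifier)
    return sorted(sorted(group) for group in groups.values())


# Each identifier is a (type, value) tuple so that values from different
# identifier types cannot collide.
pairs = [
    (("domain_userid", "cookie-laptop"), ("user_id", "alice")),
    (("domain_userid", "cookie-phone"), ("user_id", "alice")),
    (("apple_idfv", "idfv-1"), ("user_id", "alice")),
    (("domain_userid", "cookie-bob"), ("user_id", "bob")),
]

users = stitch_identifiers(pairs)
for group in users:
    print(group)
```

A purely transitive grouping like this will merge distinct users who log in on a shared device, so in practice it would need to be combined with business rules of the kind described above.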