As part of our drive to make the Snowplow community more collaborative and widen our network of open source contributors, we will be regularly posting our proposals for new features, apps and libraries under the Request for Comments section of our forum. You are welcome to post your own proposals too!
This Request for Comments proposes enabling data sent using the Google Analytics JavaScript tag to be processed by the Snowplow pipeline and made available for analysis as Snowplow events and contexts.
This would enable any Google Analytics and/or Measurement Protocol user to send exactly the same HTTP(S) requests to Snowplow as to Google, so that they can:
- Capture and analyse their event-level data in their own data warehouse
- Process and act on their full event-stream in real-time
1. Why integrate Google Analytics into Snowplow?
Google Analytics is the most widely used digital analytics platform in the world. And for good reason: it’s a great product - and it’s free!
However, as all Snowplow users will be aware, there are significant limitations with Google Analytics - especially with the free product:
- Access to your own data is mediated by Google. You can access your data via the Google Analytics UI and APIs, but there are many restrictions on what data you can fetch, in what volume and at what granularity. In addition, only a subset of data is available in real-time
- Google Analytics applies a fixed set of data processing (modeling) steps that are standard across its enormous user base; this data modeling includes sessionization and marketing attribution. These steps are not necessarily appropriate for all users
- Google Analytics data is sampled. It is easy to understand why Google falls back to sampling: providing unsampled data to such an enormous user base, for free, would be hugely expensive. But sampling is a pain if you want to perform very particular analyses on very particular subsets of users, because the data becomes unreliable as the sample size drops
Many of the above reasons are motivations for Google Analytics users to set up Snowplow alongside Google Analytics. However, there is some overhead to doing this, particularly on the tracking side: for every Google tag that you create, you need to integrate a comparable Snowplow tracking tag.
By adding native support for Google Analytics and the Measurement Protocol to Snowplow, it should be straightforward for any GA user to add a single small snippet of JavaScript to their setup to push their data to Snowplow as well as GA, and thus benefit from all the opportunities that Snowplow opens up for them.
2. Existing Snowplow experience with Google Analytics
2.1 Inspired by the original GA event types
Although the Snowplow Tracker Protocol is bespoke to Snowplow, a large number of our original event types were closely modelled on equivalents found in the Google Analytics JavaScript SDK.
For example, Snowplow’s structured events and ecommerce transaction events were closely modelled on their Google Analytics equivalents; this also means there is a certain level of overlap between properties in the Snowplow Tracker Protocol and the Google Analytics Measurement Protocol.
2.2 Adding native Enhanced Ecommerce support
In 2016 we implemented support for Google Analytics’s Enhanced Ecommerce plug-in in Snowplow.
Because a number of Snowplow users were coming to Snowplow from Google Analytics, having already implemented Enhanced Ecommerce, we added native support for Enhanced Ecommerce tracking to our own Snowplow JavaScript Tracker.
This allowed Google Analytics users to mirror their Enhanced Ecommerce integrations in Snowplow directly, cutting down implementation time.
3. A proposal for integrating Google Analytics events into Snowplow
3.1 On the Google Analytics side
To build this integration we can make use of Google Analytics’ support for third-party plugins.
We will build a simple open-source Google Analytics plugin, which intercepts the Measurement Protocol payloads being sent to Google Analytics, and also sends them to your Snowplow collector. We have started work on this plugin and you can follow our progress in this pull request.
Once it’s deployed, you’ll be able to leverage this plugin simply by adding the following to your existing Google Analytics setup snippet:
<script>
/*
...
Regular GA invocation code
...
*/
ga('create', 'UA-XXXXX-Y', 'auto');
ga('require', 'spGaPlugin', { endpoint: 'events.acme.net' });
ga('send', 'pageview');
</script>
<script async src="https://d1fc8wv8zag5ca.cloudfront.net/sp-ga-plugin/0.1.0/sp-ga-plugin.js"></script>
The endpoint option is your current Snowplow collector’s endpoint.
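For those curious about the mechanics, the sketch below shows one way a plugin file such as sp-ga-plugin.js can work: it wraps the tracker's sendHitTask so that every hit is first sent to Google as normal and then POSTed, unchanged, to your Snowplow collector. The collector path (/com.google.analytics/v1) and the internal details here are illustrative assumptions for this RFC - the actual implementation is the one in the pull request linked above.

// Illustrative sketch of the plugin internals - see the pull request for the real code
function SnowplowGaPlugin(tracker, config) {
  // Assumed collector path for raw Measurement Protocol payloads
  var collectorUrl = 'https://' + config.endpoint + '/com.google.analytics/v1';
  var originalSendHitTask = tracker.get('sendHitTask');

  tracker.set('sendHitTask', function(model) {
    // 1. Send the hit to Google Analytics as normal
    originalSendHitTask(model);

    // 2. Forward the identical Measurement Protocol payload to Snowplow
    var request = new XMLHttpRequest();
    request.open('POST', collectorUrl, true);
    request.setRequestHeader('Content-Type', 'text/plain; charset=UTF-8');
    request.send(model.get('hitPayload'));
  });
}

// Register the plugin under the name referenced in ga('require', 'spGaPlugin', ...)
ga('provide', 'spGaPlugin', SnowplowGaPlugin);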
3.2 On the Snowplow side
Under the hood, Snowplow is in fact broadly protocol-agnostic - alongside the Snowplow Tracker Protocol, Snowplow has integrated support for the protocols of each of its supported third-party webhooks.
To send Google Analytics events into Snowplow, we therefore need to add support for the Google Analytics Measurement Protocol into Snowplow.
Broadly this involves:
- Defining JSON Schemas for the Measurement Protocol and associated Google Analytics entities, and hosting them in Iglu Central
- Writing a custom adapter inside of the Snowplow Common Enrich library which can process the Google Analytics Measurement Protocol
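To make the adapter's job concrete, a simplified, illustrative Measurement Protocol pageview hit arrives at the collector as a flat query-string payload along these lines:

v=1&_v=j68&t=pageview&tid=UA-XXXXX-Y&cid=1234567890.1234567890&dl=https%3A%2F%2Fwww.acme.net%2Fproducts&dt=Products&de=UTF-8

The adapter's job is then to validate these parameters and re-assemble the flat payload into the self-describing event plus contexts structure described in section 4 below.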
3.3 Overall architecture
Putting all of this together, we end up with a technical architecture looking like this:
4. Mapping the Google Analytics payload onto JSON Schemas
4.1 Mapping approach
Google’s Measurement Protocol is an incredibly extensive specification, representing the exhaustive list of all data points that a Google Analytics user (or direct Measurement Protocol user) can send in to the platform.
We considered three approaches to mapping all of these data points into JSON Schemas:
- Smallest viable entities - where we break the GA data down into a large set of tightly-defined entities
- Mega-model - where we create a single huge schema holding all of the data points
- Hybrid - in-between the first two approaches, with a handful of relatively large schemas
We discarded the hybrid approach, because we didn’t want to be responsible for interpreting or curating the Measurement Protocol; for this RFC to be successful, it is important that our Google Analytics mapping is unopinionated and doesn’t involve any “Snowplowisms”.
We then chose the “smallest viable entities” approach over the “mega-model”. Modelling the data as many small, independent (if inter-connected) entities is more in line with our general thinking on instrumentation at Snowplow; it also sets us up nicely to move towards a graph representation of the data over time.
4.2 Comprehensive mapping exercise
Having decided our approach, we then compiled a Google Sheet with a row for:
- Every private or undocumented field that we have observed being sent by Google Analytics (e.g. _v for the SDK version number)
- Every field documented as part of the Measurement Protocol
We then set out to map each of those fields onto a property within a new JSON Schema that we would add to Iglu Central.
You can find this spreadsheet here:
We have configured this spreadsheet so that you can comment directly on it, if you find that more convenient than commenting on this thread.
4.3 Implementation of the smallest viable entities approach
Under our chosen approach, a single Google Analytics payload will be processed by Snowplow into a single Snowplow enriched event. This enriched event will consist of multiple self-describing JSON entities, specifically:
- A single self-describing event in the enriched event’s unstruct_event field. The type of self-describing event will be determined by the Google Analytics hitType
- Zero or more self-describing contexts, added into the array held within the enriched event’s contexts field
Let’s take the pageview hit type as an example. It will result in an event with:
- A self-describing event based on the page_view entity
- A list of additional self-describing contexts conforming to the schemas: user, hit, system_info, etc.
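For illustration, the entities derived from such a pageview hit might look like this - a simplified sketch, where the schema URIs and property names are indicative only and the exact definitions are what the draft schemas in section 4.6 pin down:

{
  "unstruct_event": {
    "schema": "iglu:com.google.analytics.measurement-protocol/page_view/jsonschema/1-0-0",
    "data": {
      "documentLocationUrl": "https://www.acme.net/products",
      "documentTitle": "Products"
    }
  },
  "contexts": [
    {
      "schema": "iglu:com.google.analytics.measurement-protocol/user/jsonschema/1-0-0",
      "data": { "id": "user-123" }
    },
    {
      "schema": "iglu:com.google.analytics.measurement-protocol/system_info/jsonschema/1-0-0",
      "data": { "documentEncoding": "UTF-8" }
    }
  ]
}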
Let’s next look at some particular challenges in the mapping that we had to address.
4.4 Dealing with multi-dimensional fields
Some fields in the Measurement Protocol are “multi-dimensional”, where the field name itself is overloaded with multiple numeric indexes which precisely specify the data point being referenced. Consider the field:
il<listIndex>pi<productIndex>cm<metricIndex>
This “multi-dimensional” field name identifies a single value in the Measurement Protocol, such as:
il2pi4cm12=45
For our mapping, we will break this into four fields within a single entity:
{
"listIndex": 2,
"productIndex": 4,
"customMetricIndex": 12,
"value": 45
}
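A minimal sketch of how such a field name could be decomposed is shown below; the function name and regular expression are ours, for illustration only, and the production logic would live in the enrichment adapter:

// Decompose e.g. "il2pi4cm12" / "45" into its indexes plus value (illustrative sketch)
function parseImpressionCustomMetric(name, value) {
  var match = /^il(\d+)pi(\d+)cm(\d+)$/.exec(name);
  if (!match) return null;
  return {
    listIndex: parseInt(match[1], 10),
    productIndex: parseInt(match[2], 10),
    customMetricIndex: parseInt(match[3], 10),
    value: parseInt(value, 10)
  };
}

parseImpressionCustomMetric('il2pi4cm12', '45');
// => { listIndex: 2, productIndex: 4, customMetricIndex: 12, value: 45 }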
4.5 Dealing with currency
We’ve included the cu (currency code) parameter in all schemas that have a price field. We felt that, as a practical matter, the currency should always be in the same table as the price.
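For example, a product-level entity would then carry the two together (property names here are indicative only):

{
  "sku": "pbz00123",
  "price": 15.00,
  "currencyCode": "GBP"
}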
4.6 Schema definitions
With the mapping completed and the approach decided, the next step was to draft all of the required schemas in a branch within our Iglu Central project. Iglu Central is a central public repository of schemas deemed to be of general use to the Snowplow community (and beyond) - this should be a great home for the Google Analytics JSON Schemas.
The schemas that we drafted are as follows:
* Note that page_view can be an event or a context, depending on the Google Analytics hitType.
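As an indication of the shape these drafts take, a minimal schema along the lines of the user entity might look as follows; the vendor, name and properties shown here are an assumption for illustration, and the authoritative versions are in the pull request below:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a Google Analytics user entity",
  "self": {
    "vendor": "com.google.analytics.measurement-protocol",
    "name": "user",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "id": {
      "type": "string"
    }
  },
  "additionalProperties": false
}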
None of these schemas have been merged into Iglu Central yet - we welcome your feedback on them! Feel free to comment directly on this pull request:
5. Integration into Snowplow
5.1 Integration principles
The next consideration is how to integrate the Google Analytics data points into Snowplow such that we:
- Make use of Snowplow’s own powerful features as much as possible, but also:
- Process the Google Analytics events as close as possible to how Google’s own systems process them
5.2 Populating fields in the Snowplow enriched event as well as the new contexts
One observation was that some parameters in the Measurement Protocol have unambiguous equivalents in the Snowplow enriched event. For example, the de or documentEncoding field in the Measurement Protocol maps directly onto the doc_charset field in Snowplow’s own enriched event.
Where these mappings are straightforward and uncontroversial, we propose populating the Google data point into the corresponding Snowplow enriched event field, as well as into a dedicated context; you can see these “secondary mappings” in the blue Assignment columns on the right-hand side of the Google Sheet.
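To give a flavour of the kind of secondary mapping we have in mind (the Google Sheet remains the authoritative list), a few further candidates are sketched below:

// Candidate secondary mappings - Measurement Protocol parameter to enriched event field
var secondaryMappings = {
  de: 'doc_charset',    // documentEncoding
  dl: 'page_url',       // documentLocationUrl
  dt: 'page_title',     // documentTitle
  dr: 'page_referrer'   // documentReferrer
};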
Please let us know if you disagree with any of these mappings.
5.3 Running Snowplow enrichments on the Google Analytics data
Because we are populating the fields in the Snowplow enriched event, various Snowplow enrichments will work with the Google Analytics data “for free”, including:
- Page URL parsing
- Referer parsing
- MaxMind geo-location lookup
- Both useragent parsers
Fully configurable enrichments, such as the API request enrichment and the SQL query enrichment, can be used with the Google Analytics integration, just by providing Measurement Protocol events and context schemas as part of the input data to the respective enrichments.
The currency conversion enrichment will not work, as it is currently hard-coded to the built-in Snowplow ecommerce events.
5.4 Thoughts on supporting other behaviors
There are some interactions between the Google Analytics data and Snowplow that we are less clear on; these are flagged in orange cells in the Snowplow Notes column.
In particular:
- Whether we should enforce Google Analytics’ IP address anonymization on a per-request basis, given that this is a feature Snowplow does not support yet (although we are planning to add it, per Snowplow JavaScript Tracker issue #586)
- Whether we should map the Google Analytics uid onto Snowplow’s own user_id field
- Whether we should enforce Google Analytics’ IP address and useragent overrides - we are leaning towards enforcing these, given that we have equivalent override functionality in Snowplow
We would appreciate your input on these aspects, and all others!
5.5 Ongoing Snowplow development work
We would be remiss if we did not flag that a Snowplow data engineer has started exploratory work implementing this RFC, which you can find in this pull request:
While work in this PR is relatively advanced, development on this is paused while we wait for the community’s feedback on this RFC.
6. REQUEST FOR COMMENTS
This RFC represents a significant new step for Snowplow as we expand the scope of what can be tracked with the platform. We are excited about the opportunities for opening up Snowplow to existing Google Analytics users, and are interested in the impact of fully supporting a second web analytics protocol alongside Snowplow’s own protocol.
We welcome any and all feedback from the community. As always, please do share comments, feedback, criticisms, alternative ideas in the thread below.
In particular, we would appreciate hearing from people with extensive experience working with Google Analytics tagging and the Measurement Protocol. Does our proposed integration match the way you would expect to work with Google Analytics data in Redshift, Kinesis or Elasticsearch?