Comparing Snowplow with Google Analytics 360 BigQuery integration (WIP)

yali · October 11, 2016, 9:39pm

Background

A number of prospective Snowplow users are interested in evaluating Snowplow against Google Analytics 360, and in particular, the BigQuery integration. As part of the Google Analytics 360 proposition, Google will make available event-level (they call it ‘hit-level’) data in your own BigQuery data warehouse. This is an alternative to Snowplow, for users who are interested in owning their own event-level data, and performing their own sophisticated analytics on that data.

We have a surprisingly large number of users who have both Snowplow and Google Analytics 360. I find this surprising, because Google Analytics 360 is a very expensive product. However, I’ll put my surprise to one side and ask those users to please contribute to this thread, to help validate the differences, pros and cons that we identify between the two solutions.

Out-of-the-box data points available

The Google Analytics 360 BigQuery table definition can be found here.

The Snowplow Redshift table definitions can be found here (for the main table definition) and here (for the additional tables that you’d add depending on your setup, which 3rd party integrations you use and what trackers and tracker settings you emply).

There are a large number of out-of-the-box field and event types that are available via Snowplow trackers that are not available in Google Analytics

Web examples

IP addresses of the end user do not appear to be surfaced / available in GA 360 / BigQuery
Link click events i.e. automatic recording of any hyperlinks clicked
Page ping i.e. heartbeat events. These include data points on how far down and across a web page a user has scrolled, so you can compute how much content is read and how long users spend on specific content items. This can be used to generate heat maps, for example.
Performance timing data i.e. data on how fast web pages take to load.

Mobile examples

When tracking events from e.g. Android or iOS SDKs Snowplow will automatically grab the following device identifiers (where available), none of which appear to be available in GA 360 BigQuery:
- IDFA (Open, Apple and Android)
- IDFV (Apple)

Google has a small number of very high value integrations with their own ad-related products

Google has very nice out-of-the-box integrations between Analytics and it’s different ad-buying products.

AdWords
Adsense
DFP

Whilst our partners at Poplin Data are working on plug-and-play Adwords and DFP integrations with Snowplow, it is doubtful they will be as simple to setup and use as the Google supported integrations between their own products.

Snowplow has a large number of integrations with third party marketing and data providers, that Google lacks

Snowplow has a large number of integrations with the following third party providers, for example:

Adjust for mobile app attribution
Augur for machine-fingerprinting
AWS Cloudfront for log ingestion
CallRail for call tracking
Clearbit for enriching user information
Mailchimp for email marketing data
Mandrill for transactional email data
Optimizely for A/B testing
OpenWeatherMap for weather data
PagerDuty for devops alerts
Pingdom for uptime alerts
SendGrid for transactional email tracking
Urban Airship for mobile marketing

Approach to defining your own event and entity types

Both Snowplow and Google believe in ‘general event tracking’ i.e. tracking digital events across platforms and channels, not just your standard web and mobile events. In Google’s case this drives a lot of their messaging around “Universal Analytics”.

If you are going to track different types of events from different platforms or devices, you need the ability to schema them. That’s because the structure of an “play video” event is going to look totally different (with totally different fields and associated types) to a temperature reading event, flight check in event or an event in a massive multiplayer online game.

Google’s approach to flexible schemaing: custom events + a long list of custom dimensions + custom measures

Google Analytics has a fixed schema: every company running Google Analytics, be it a travel company, bank, massive multiplayer online game, social network etc. will have fundamentally the same single table with the same columns. This is not surprising because every GA user shares the same data pipeline.

To give users some flexibility, Google provides a large number (200) of custom dimension and (200) custom measures that an end user can assign to different data points. In addition, they support a generic custom event with four generic fields (category, action, label, and value).

Snowplow’s approach: define your own events and entity (context) types

Snowplow users can define their own event types. For each event, they can define as many fields as they’d like.

Further, they can define their own entity / context types. Entities / contexts are things that are tracked across different events (e.g. users, products, videos, articles, bank accounts, reviews etc.)

Because events and entity schemas are versioned, and the data itself is labelled with the appropriate version, Snowplow users can evolve their data schemas over time. In contrast, it is very hard in GA to change the assignation of different custom dimensions and measures without causing a huge amount of downstream analytics pain.

Modelled vs not-modelled (atomic) data

Snowplow data is atomic (i.e. not modelled). It is up to each Snowplow user to model their own data according to their own business logic.

In contrast, the data that GA delivers into BigQuery is already modeled i.e. Google has already decided which user performed each action and what session each action belonged to.

There are pros and cons of both approaches. The GA approach delivers a data set that’s simpler to get started with, because no modeling is required. However, most Snowplow users find that an enormous amount of value is created by being able to define their own modelling process (e.g. their own logic for identity stitching, and aggregating over sequences of events).

Whilst this should be possible with GA data in BigQuery, in practice I’ve not met anyone who has e.g. recomputed on their entire data set with a new identity stitching or sessionization algorithm. So it may be practically very difficult. I’d love to hear from people who have done this.

Data latency

My understanding (correct me if I’m wrong) is it takes 24 hours + for GA data to hit BigQuery. (A subset of aggregate data is available via a separate real-time reporting API.

In contrast, Snowplow users can run their pipelines hourly, loading data into Redshift with a 2 hour latency. Users of the Snowplow real-time pipeline have the full event stream available to consume in Kinesis and Elastic in seconds.

BigQuery vs Redshift

Snowplow does not currently support loading data directly into BigQuery. (Although this is on our roadmap.) As a result, most Snowplow users do most of their analysis on Snowplow data in Redshift. (Although Spark on the data in S3, and Spark Streaming / AWS Lambda / Kinesis Analytics on the Kinesis streams are also increasingly popular ways to consume Snowplow data.)

BigQuery has a lot of great features that Redshift lacks - in particular elastic compute.

Both technologies struggle when it comes to big-table to big-table joins. The approach for working around this issue is different with both:

With BigQuery users are encouraged to nest data in single tables. The GA data structure reflects this
With Redshift big table to big table joins can be made performant by intelligent use of DISTKEYS and SORTKEYS

I suspect this is what makes remodelled event data in BigQuery quite difficult but don’t have enough experience to date to know one way or the other.

What have we missed? What have we got wrong?

I’d love to have our users who are use both GA 360 and Snowplow in anger help us build out this working document. Feedback / comments etc. very welcome!

digitaltouch · October 11, 2016, 11:08pm

@yali,

Thanks for a thoughtful write up! For the sake of disclosure, we are not users of GA 360, but have spent many hours evaluating the pros and cons of the published schemas and features. I do have a few general comments:

It appears that you left off the SQL / API Enrichments that the Snowplow pipeline offers. As far as I am aware, the ability to connect first party data with internal databases will require the use of a separate ETL process that loads the data in batches to BigQuery.

We have spent months integrating Google AdWords into Snowplow through daily batches on the dot-net collector. While is works flawlessly, you are right in mentioning that the setup is not nearly as simple as the out of the box support with Google. I will say the combination of the Keywords Performance Report, joined to the atomic.events table through the gclid has been a powerful report for our team. As far as I know, this level of detail would be impossible to accomplish in GA 360 due to the modeled data that GA delivers.

I don’t believe that enough emphasis was given to the fact that Snowplow was built with third party integrations in mind. We support multiple clients from one Snowplow pipeline with seventeen active third party data integrations. This is not possible inside of GA 360.

Finally, It may seem simple, but Snowplows support of app_id makes it possible to support a multi-tenant database. This would be against best practices inside of GA.

yali · October 12, 2016, 7:17am

That’s awesome additional insight @digitaltouch - thanks so much!

bernardosrulzon · October 13, 2016, 1:15am

Hey @digitaltouch! Good points. What kind of insights you get from a click-based analysis (gclid) vs. a campaign and keyword-based analysis? We’re currently passing that information to Snowplow via AdWords tracking template: e.g. utm_campaign={campaignid}

digitaltouch · October 13, 2016, 4:18am

@bernardosrulzon,

The available parameters in ValueTrack are limited. You can see the available parameters here.

Below are the data points that are available in the Keywords Performance Report that we find very interesting at the keyword level and generally not available through ValueTrack:

Cpc (We run daily, so this provides the daily average of that keywords cost per click in our reporting)
CreativeQualityScore
EstimatedAddClicksAtFirstPositionCpc
EstimatedAddCostAtFirstPositionCpc
KeywordMatchType
QualityScore
AveragePosition
BiddingStrategyName

This allows our team to run a simple join and start pointing out trends to the buyers at a keyword level. We also avoid the additional burden of having the buyers tag each with the proper URL string.

bernardosrulzon · October 13, 2016, 1:25pm

Great, thanks! I was hoping AdWords would give a cost-per-each-click metric using gclid, but apparently daily average is as good as it gets.

yali · November 7, 2016, 5:41pm

Further reading: nice post from Eric Driessen about reasons you’d want to setup Snowplow alongside Google 360

zivbaram · February 28, 2017, 7:48pm

Hi @yali I encountered an intersting case today related to this topic. it shows some of the limitations of the Google Setup / proposition and a potential for snowplow.
I visited a pretty large internet company in a certain country today. they have over 20 sites in different verticals. many of them are the largest in that category. in addition they have a very large number of sites they host for B2B clients.
For some they are using 360 on others they have universal and on others the old analytics…
problem 1 - they are not using 360 on all sites due to high costs…
problem 2 - their setup is per domain. now when they want to track global behavior on a group level including a global use of the customer they are stuck. they are not using cross domain as they claim it is not working properly and they try to save data… they pay a premium price but still can’t accomplish a pretty basic requirement.
Problem 3 - they want to use the analytical data for other uses beyond reporting, for example for the development of algorithms to improve targeting.
a. (relatively) complete data is only available for the sites where they are using the 360.
b. they are also using a DMP. they would like to stitch the data from the DMP with the analytics to create a holistic user profile. they struggle to do that and encounter huge discrepancies
c. they have been trying to use pass keys between the systems at the data capture but none of the methods provided by Google works well for them
All of this highlights 2 things for me where SP should have a real advantage:

ability to integrate well with other systems
ability to centrally manage a large network of sites, maintaining a holistic, network level view of the customers and their interactions alongside a per domain and / or per vertical view.
Both are really hard to accomplish with 360
Would love to hear comments about how people use SP to address these 2 requirements

fwahlqvist · March 1, 2017, 12:07am

Hey @bernardosrulzon, quick question to you, how do you tie the gclid to the keywords repport? (not sure if I missed anything obvious?)

joaocorreia · June 15, 2017, 4:02am

This page has a simple side-by-side comparison of Google Analytics 360 vs Snowplow Analytics.

Topic		Replies	Views
Snowplow vs Google Analytics and BigQuery For data modelers & consumers	3	3002	April 10, 2017
Streaming to BigQuery with google enhanced ecommerce events (unstructured events) Storage targets	11	3093	August 12, 2016
Sending Google Analytics events into Snowplow RFCs	12	7964	July 19, 2017
Real-time data processing with Google Analytics using Snowplow – Snowplow From the blog	0	1154	March 23, 2018
Try Snowplow and Google Tag Manager (GA4) Try Snowplow	5	1642	May 17, 2021