Differences in some metrics between Snowplow and GA

Recently I did a comparison of the same metrics calculated from Snowplow data and GA’s aggregated data and got this result:

I do expect that the numbers between the two platforms will be somewhat different. However, I’m curious about the pattern that I observe in my data:

  • Looking at the monthly level, the difference in users and pageviews is quite low. However, Snowplow’s session count is consistently 20% - 22% higher than GA’s. What is the possible reason? (Again, I’m not interested in reconciling these numbers - just curious.)

  • Looking at the daily level, there is a seasonal pattern in the differences - what does this mean?

Hi @hoanghapham,

In the general case, the question of differences between GA and Snowplow is fairly difficult to answer. At a high level it usually just comes down to ‘there are differences in logic between the two products’.

Fundamentally (and still at a very high level), the difference in logic generally boils down to a difference in philosophy. Leaving aside the sampling problem, the GA approach is to remove the need for the user to reason about the ‘raw’ data (at least to some extent), and to carry out at least some aggregation or logic ‘under the hood’. The advantage is that it makes the data more accessible, but the downside is that the user has no access to the logic/decisions made about the data under the hood - so if they want to handle it differently they either can’t, or they need to be pretty creative.

With Snowplow, our approach is to focus on collection, and leave all of the business logic to the user. A colleague recently phrased this idea as ‘we just throw it over the wall and let them worry about the rest’. No decisions are made for you.

You’ve presented some interesting findings here, so hopefully I can prod you towards a better understanding of these differences. Quite likely this difference in approach is at play - where we throw stuff over the wall ‘as is’, GA may be amending, aggregating, or otherwise handling the data before it’s presented to you (obviously it’s not possible to determine exactly what GA is doing).

We’re looking at a difference in sessionisation here. A crucial specific difference between the two products is that the Snowplow JavaScript tracker carries out sessionisation client-side; I believe, but am not certain, that sessionisation is done server-side with GA.

Additionally, our sessionisation is configurable - you can specify in the tracker what the session timeout is.
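
For illustration, setting the timeout when initialising the JavaScript tracker looks roughly like the snippet below. The tracker namespace, collector URL, and appId are placeholders, and the exact option name may differ between tracker versions, so treat this as a sketch rather than copy-paste configuration:

```javascript
// Sketch: initialising the Snowplow JavaScript tracker with a custom
// session timeout. 'sp', the collector URL, and appId are placeholders.
window.snowplow('newTracker', 'sp', '{{COLLECTOR_URL}}', {
  appId: 'my-site',
  // Session timeout in seconds: a new session id is generated client-side
  // after this much inactivity (1800 s = 30 min).
  sessionCookieTimeout: 1800
});
```

If this value differs from the session timeout configured in GA, a session-count discrepancy follows almost by definition.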

So, I would explore a few avenues to learn more about this:

  • Are there Snowplow sessions with only one event? Perhaps there are users who return to a tab much later, trigger an event by opening the tab, then close it and leave. If GA disregards these, or attributes them to existing sessions, then (depending on how the tracker is instrumented) you may see some single-event sessions in Snowplow. Our philosophy is that it’s best to let you decide whether to filter those out.

  • Is there a difference in the tracking configuration? If sessions in the Snowplow tracker are configured differently to GA’s sessionisation then obviously we can see a difference (this is fairly likely), but also if GA is configured to trigger events at times that Snowplow isn’t, then it could be the case that GA attributes one session where Snowplow attributes two. Think of someone watching a 1hr video - if GA is firing events during that hour but Snowplow isn’t, GA will consider it a single session whereas Snowplow will consider it two.

  • Is there any indication that the extra sessions may be attributed to bot traffic, or some strange interactions with browser plugins? Browser JavaScript environments are inherently unreliable because we have no control over what other JavaScript runs - some plugins do silly things like overwriting standard methods. If there’s some such interaction, it could be the case that GA filters these sessions out but Snowplow leaves that to you. The best way to check this is to count sessions per user per day. If some JS interaction or bot causes these things, I would expect to see a small number of users with a ludicrously high number of sessions per day (say, for example, if some external influence is causing a new session id to be generated per event).
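
That last check can be sketched in a few lines. Assuming you’ve exported events as flat records with a user id, session id, and date (the field names and sample data here are made up for illustration), counting distinct sessions per user per day looks something like:

```javascript
// Hypothetical flat export of Snowplow events: one record per event.
const events = [
  { userId: 'u1', sessionId: 's1', day: '2019-03-01' },
  { userId: 'u1', sessionId: 's2', day: '2019-03-01' },
  { userId: 'u2', sessionId: 's3', day: '2019-03-01' },
  { userId: 'u2', sessionId: 's4', day: '2019-03-01' },
  { userId: 'u2', sessionId: 's5', day: '2019-03-01' },
];

// Count distinct session ids per (user, day) bucket.
function sessionsPerUserPerDay(events) {
  const buckets = new Map();
  for (const e of events) {
    const key = `${e.userId}|${e.day}`;
    if (!buckets.has(key)) buckets.set(key, new Set());
    buckets.get(key).add(e.sessionId);
  }
  const counts = {};
  for (const [key, sessions] of buckets) counts[key] = sessions.size;
  return counts;
}

console.log(sessionsPerUserPerDay(events));
// Users with an implausibly high count (dozens per day) are candidates
// for bot traffic or plugin interference.
```

In a real warehouse you’d run the equivalent aggregation in SQL, but the shape of the check is the same.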

I hope that’s helpful - and I’m glad to hear you don’t want to reconcile them; that kind of exercise is normally a massive time sink for no real value. You’re right to aim towards understanding the difference and reasoning about it when drawing conclusions.

Best,

Shameless plug - this topic was briefly touched on in the recent Digital Analytics Power Hour podcast featuring our co-founder Yali.

I’m not recommending it as ‘listen to this hour of discussion about the differences between platforms’, but the high-level discussion of approaches to this problem space is quite interesting for any fellow data nerds who need to fill an hour with some interesting broad discussion. :slight_smile:

It is really hard to tell you why there is such a difference without knowing how the two systems are configured. To add to what Colm was saying, I would also consider any GA-specific reporting configurations in addition to differences in how tracking is implemented (for example, what the session expiry is for the two systems, reporting filters in GA, etc.).

You have to remember that sessionization in GA is done on the server side. So if your session duration in GA is set to 30 minutes, all events from a user that happen less than 30 minutes apart will be considered part of the same session (unless you forcefully terminate it via a special event). In Snowplow, sessionization is handled on the client side, so if something on the client (or your site) does something to session cookies/local storage values, or modifies some of the functions, then your session count will be different.
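
As a concrete illustration of that timeout logic (this is just a sketch of the general technique, not either product’s actual implementation), counting sessions from a list of a user’s event timestamps works like this:

```javascript
// Sketch of timeout-based sessionisation: events less than `timeoutMs`
// apart belong to the same session; a larger gap starts a new one.
function sessionise(timestamps, timeoutMs) {
  const sorted = [...timestamps].sort((a, b) => a - b);
  let sessions = 0;
  let lastTs = -Infinity;
  for (const ts of sorted) {
    if (ts - lastTs > timeoutMs) sessions += 1; // gap exceeded: new session
    lastTs = ts;
  }
  return sessions;
}

const MIN = 60 * 1000;
// Events at 0, 10, and 70 minutes with a 30-minute timeout: the 60-minute
// gap between the second and third events splits them into 2 sessions.
console.log(sessionise([0, 10 * MIN, 70 * MIN], 30 * MIN)); // 2
```

This also makes Colm’s earlier point concrete: a tracker that fires events during a long video watch keeps the gap under the timeout and yields one session, while a tracker that stays silent yields two.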

Great pointers for further exploration, thanks a lot Colm!
I’ve just done a quick check, and it turns out about 20% of the sessions have only one event - so probably that’s the reason?
Not really sure about the tracking configuration - I will need to check with our engineering team.
As for bot traffic, when I made the comparison I did try to exclude bot traffic, but the difference was still just as large, so I don’t think it’s due to bots.

I’ve just done a quick check, and it turns out about 20% of the sessions have only one event - so probably that’s the reason?

That does seem like the most reasonable explanation alright - you can choose to implement some logic to handle these cases, or just exclude them from the relevant analyses.

If you’re still curious, I’d take a look at what those events are - my bet is that they’re page pings for the most part (i.e. a dormant tab is reactivated just before closing), or page views (a browser is reopened on the tab, triggering a page view before the tab is closed or the user navigates away). If they’re some other event, such as a transaction or a custom event which is only triggered in specific places, that would be evidence that there’s something wrong in your implementation.
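
If you want to run that check, a quick tally of event types across the single-event sessions (field names and sample data assumed here, loosely following Snowplow’s event naming) might look like:

```javascript
// Tally event types across sessions containing exactly one event, to see
// whether they are mostly page pings / page views (benign) or something
// unexpected like transactions (an implementation smell).
const singleEventSessions = [
  { sessionId: 's1', eventType: 'page_ping' },
  { sessionId: 's2', eventType: 'page_view' },
  { sessionId: 's3', eventType: 'page_ping' },
];

function tallyEventTypes(sessions) {
  const tally = {};
  for (const s of sessions) {
    tally[s.eventType] = (tally[s.eventType] || 0) + 1;
  }
  return tally;
}

console.log(tallyEventTypes(singleEventSessions));
// e.g. { page_ping: 2, page_view: 1 }
```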

Having said that, that exercise would really only be to satisfy curiosity - for most analyses, just understanding the cause is enough to handle it appropriately and start to dig into some more interesting stuff!

Thanks Colm. In that 20%, about 12% are sessions with one page view, and 8% are sessions with one page ping.

Anyway, now I clearly understand why in a blog post you guys mentioned that counting sessions in this day and age (of multiple-tab browsing) may not make sense.