Proposing the Snowplow Relay initiative

We are excited to propose the Snowplow Relay initiative.

Snowplow Relay is an initiative for feeding Snowplow enriched events into third-party tools or destinations. Example destinations include SaaS marketing platforms, open-source machine learning engines or fraud detection services. We call an individual app that feeds Snowplow events into a specific destination a relay.

These relays will be open-source, cloud native and designed with the consent of data subjects at the forefront. They will operate in near real-time, running on AWS and GCP.

Depending on your background, you may be wondering how Snowplow Relay compares to the various tag management solutions widely used in our industry. Let’s take a look back at the tag management ecosystem before diving into what makes Snowplow Relay different.

Tag management originated as a tool for web analytics, so let’s start there.

1. Tag management for the web

Working in the web environment, you may well have used an in-browser tag manager, such as Google Tag Manager or Tealium, to route customer behavioral data to third-party SaaS tools.

Let’s call the service that you want to send data to Acmetrics. You would typically configure your tag manager to:

  • Initialize Acmetrics’ JavaScript library (or “SDK”) on your web pages
  • Observe the end user’s behavior
  • Send relevant data about the end user’s behavior to Acmetrics via its JavaScript library
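To make this concrete, here is a rough sketch of the kind of custom tag a tag manager might inject into your pages. Remember that Acmetrics is our made-up vendor, so the `acmetrics` object and its `init`/`track` functions are purely illustrative:

```typescript
// Sketch of a custom tag a tag manager might inject into the page.
// "Acmetrics" is a made-up vendor; init/track are illustrative names.

declare const acmetrics: {
  init(writeKey: string): void;
  track(eventName: string, properties: Record<string, unknown>): void;
};

// 1. Initialize the vendor's JavaScript library when the tag fires
acmetrics.init("ACME-WRITE-KEY");

// 2. Observe the end user's behavior, e.g. "add to basket" clicks
document.addEventListener("click", (e) => {
  const button = (e.target as HTMLElement).closest("[data-add-to-basket]");
  if (button) {
    // 3. Send the relevant behavioral data to Acmetrics via its library
    acmetrics.track("add_to_basket", {
      sku: button.getAttribute("data-sku"),
      page: window.location.pathname,
    });
  }
});
```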

This data flow is shown below:

In-browser tag managers represent a powerful abstraction layer between your website and your business analytics requirements. Marketing teams have often used tag managers to prevent their tagging needs from being blocked or delayed by their peers in IT or Software Engineering.

The Snowplow JavaScript Tracker is very often called from a tag manager - for example, here is our guide to setting up the JS Tracker with Google Tag Manager.

2. Equivalents to tag management for mobile apps

In the mobile app environment, things evolved quite differently to the web. If you want to route in-app behavior to a third-party tool, then you typically have three distinct options:

  1. An in-app analytics manager
  2. An in-app JavaScript tag manager
  3. A software-as-a-service vendor who will route your events server-side

Let’s look at these options in turn.

2.1 In-app analytics managers

An in-app analytics manager is a client-side approach, somewhat equivalent to a browser tag manager: you add Acmetrics’ mobile SDK and any other tracking SDKs into your mobile app, and the in-app analytics manager then presents a unified, abstracted interface over those SDKs. You instrument your analytics tracking once, and those events are sent to Acmetrics and the rest of your SaaS tools:

The primary example of an in-app analytics manager is ARAnalytics, which is for iOS/Mac only.
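The core idea - instrument once, fan out to many vendor SDKs - can be sketched as follows. This is not ARAnalytics’ actual API; the interface and adapter below are purely illustrative:

```typescript
// Illustrative sketch of the fan-out pattern behind an in-app
// analytics manager; this is not ARAnalytics' actual API.

interface AnalyticsProvider {
  track(eventName: string, properties: Record<string, unknown>): void;
}

// Each adapter wraps one vendor's native SDK behind the shared interface
const acmetricsAdapter: AnalyticsProvider = {
  track: (eventName, properties) => {
    // ...call Acmetrics' mobile SDK here
  },
};

class AnalyticsManager implements AnalyticsProvider {
  constructor(private readonly providers: AnalyticsProvider[]) {}

  // Instrument once; the manager fans the call out to every vendor SDK
  track(eventName: string, properties: Record<string, unknown>): void {
    for (const provider of this.providers) {
      provider.track(eventName, properties);
    }
  }
}

const analytics = new AnalyticsManager([acmetricsAdapter]);
analytics.track("checkout_started", { basketValue: 42.5 });
```

The appeal of the pattern is that swapping a vendor in or out only touches the adapter list, not your instrumentation.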

2.2 In-app JavaScript tag managers

Vendors such as Tealium and Google Tag Manager (GTM) offer a “hybrid” JavaScript-powered approach for mobile apps, where:

  1. You embed an SDK into your app (the Firebase SDK in the case of GTM)
  2. You instrument your app by making calls to the tag manager library to record user behavior
  3. The tag manager library regularly fetches your latest routing rules from the tag manager’s own servers
  4. The rules are typically expressed as JavaScript and invoked in a hidden browser frame inside your app
  5. The in-app events are thus sent to whatever destinations you have configured, directly from the client device
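A heavily simplified sketch of steps 3 and 5 might look like the following; the rule format and endpoint are invented for illustration (step 4, evaluating the rules as JavaScript in a hidden frame, is elided here):

```typescript
// Heavily simplified sketch of steps 3 and 5; the rule format and
// endpoint are invented for illustration.

interface RoutingRule {
  eventName: string;      // which in-app event this rule matches
  destinationUrl: string; // where matching events should be sent
}

let rules: RoutingRule[] = [];

// 3. Periodically fetch the latest routing rules from the tag
//    manager's own servers
async function refreshRules(): Promise<void> {
  const response = await fetch("https://tags.example.com/container/rules.json");
  rules = (await response.json()) as RoutingRule[];
}

// 5. Dispatch each in-app event to the configured destinations,
//    directly from the client device
async function dispatch(eventName: string, payload: unknown): Promise<void> {
  for (const rule of rules.filter((r) => r.eventName === eventName)) {
    await fetch(rule.destinationUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    });
  }
}
```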

This is a fairly complex workflow - for more details, check out these links:

Note that Tealium can also operate as a SaaS analytics router; see below.

2.3 SaaS analytics routers

The more common approach in mobile has been to use a SaaS vendor such as Segment or mParticle to route your behavioral data to your third-party destinations from their own servers.

A tool such as Segment works like this:

  1. You add the Segment library into your mobile app
  2. You instrument your app by making calls to the Segment library to record user behavior
  3. The Segment library sends all of these in-app events to Segment’s servers
  4. From there, Segment routes the in-app events to whatever destinations you have configured
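To give a flavor of step 2, here is how an event might be recorded with Segment’s Node SDK (the mobile SDKs expose an equivalent track call; the write key, user ID and event below are placeholders):

```typescript
import { Analytics } from "@segment/analytics-node";

// Sketch only: the write key, user id and event are placeholders.
const analytics = new Analytics({ writeKey: "YOUR_WRITE_KEY" });

// Step 2: record user behavior with a single call. Segment's servers
// (steps 3-4) then fan the event out to your configured destinations.
analytics.track({
  userId: "user-123",
  event: "Order Completed",
  properties: { orderId: "order-456", revenue: 42.5 },
});
```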

A simplified data flow for a SaaS analytics router is shown here:

3. Challenges with client-side approaches

While in-browser tag managers and in-app analytics managers have been hugely empowering tools for data and marketing teams, their limitations have become manifest over time. The two major issues for client-side approaches are:

  1. Web page or mobile app bloat and slowdown
  2. Data leakage

Let’s cover both of these briefly.

3.1 Web page or mobile app bloat and slowdown

In a browser context, pulling in multiple third-party tracking libraries has often led to significant slowdowns in initial page loads and degraded subsequent page performance. Tracking down a “misbehaving tag” is a common task for developers and marketers working with tag managers.

In a mobile app context, adding multiple analytics libraries or “SDKs” into a mobile app has inevitably led to increases in the app’s install size; post-install, we then see significant increases in network traffic as each of the analytics libraries transmits its own event stream to its own servers.

3.2 Data leakage

By their very nature, client-side tag and analytics managers bring third-party code, much of it proprietary and obfuscated, into the host environment of the website, web app or mobile app.

As a site or app owner, you will find it very difficult to limit what that code can do - after all, it is code executing in your end user’s environment, just the same as your own code. Instead, you have to scrutinize the terms and conditions of your various vendors to understand how their code should behave.

One of the worst forms of misbehavior for third-party code is “data leakage”. Data leakage is where third-party code collects identity or behavioral data from the client which goes above and beyond its reasonably-expected remit; a common end-game for data leakage is building some kind of centralized data asset which the offending third party then monetizes.

These client-side problems have tilted the balance more recently towards server-side approaches - even major in-browser tag managers like Tealium have introduced server-side capabilities.

4. Data governance and server-side data control

Although server-side tag managers avoid the problems of client bloat and data leakage, another challenge is rapidly emerging in that field: that of data governance.

GDPR and the wider data privacy movement reinforce the importance of keeping tight control over how and when behavioral signals from individual data subjects are utilized. Simply put, the idea of multiplexing all user event data to all destinations for arbitrary further analysis and processing seems increasingly problematic in a GDPR world.

The alternative is fine-grained control of behavioral data routing, managed from the server-side. This is a complex area, and could include:

  • Performing identity resolution or stitching to map events to an underlying data subject
  • Capturing consent from data subjects for certain aspects of their digital behavior to be routed to certain third-parties, in support of specific use cases
  • Routing that data to third-party systems
  • Auditing and logging that data routing, to ensure compliance with regulations and in support of specific data subject rights, such as the Right to be Forgotten
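As a sketch of how such fine-grained, consent-aware routing might look in code - all of the types and names below are illustrative, not part of any Snowplow API:

```typescript
// Illustrative sketch only: none of these types or functions are part
// of an actual Snowplow API.

interface EnrichedEvent {
  eventId: string;
  userId: string | null;
}

interface ConsentRecord {
  dataSubjectId: string;
  permittedDestinations: Set<string>; // granted per destination/use case
}

function routeWithConsent(
  event: EnrichedEvent,
  consent: Map<string, ConsentRecord>,
  destination: string,
  send: (e: EnrichedEvent) => void,
  audit: (line: string) => void
): void {
  // Identity resolution: map the event to an underlying data subject
  // (real identity stitching is far richer than a single user id)
  const subjectId = event.userId;
  const record = subjectId ? consent.get(subjectId) : undefined;

  if (record?.permittedDestinations.has(destination)) {
    // Route only the events this data subject has consented to share
    send(event);
    // Log the routing, e.g. in support of the Right to be Forgotten
    audit(`routed ${event.eventId} for ${subjectId} to ${destination}`);
  } else {
    audit(`suppressed ${event.eventId} for ${destination}: no consent`);
  }
}
```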

The current crop of server-side tag managers largely pre-dates the data governance challenge; at Snowplow we believe that there is a need to take a fresh approach to routing behavioral data to third parties, designing in data governance from the start.

5. Introducing our Relay initiative

Snowplow Relay, then, is a new initiative for feeding Snowplow enriched events into third-party tools or destinations, from SaaS marketing platforms to open-source machine learning engines to fraud detection services.

Each individual relay app will run server-side - at this point it is clear that server-side analytics routing is the way forward, for the reasons explained above. Each relay will take the Snowplow enriched event stream as its starting point, transform it into a format which is compatible with the destination and then feed that transformed event into the destination.

Individual Snowplow relays will be open-source, cloud native and designed with the consent of data subjects at the forefront. Let’s cover these values in turn.

5.1 Open source

Open source is hugely important to Snowplow in general and to the Snowplow Relay initiative specifically. We believe building this in the open will:

  • Maximize contributions - we expect that the majority of relays will be authored by others - perhaps Snowplow community members, or the third-party destinations themselves
  • Improve accountability and auditability - in a world where data privacy and governance are increasingly important, Snowplow relays must be auditable by security and data officers. “Black boxes” are untenable here

5.2 Cloud native

Snowplow runs natively on AWS (batch and real-time pipelines) and Google Cloud Platform (real-time pipeline). It’s important that it’s possible to run Snowplow relays on AWS and GCP with a minimum of fuss.

5.3 Data subject consent-oriented

This is the most challenging design goal.

It is relatively easy to create a Relay which simply forwards events into a third-party system with some light structural transformation. It is much more challenging to create a Relay which deeply understands which data subject each individual event relates to, and what that data subject has permitted to be done with that event, for example in terms of routing that event.

We have some valuable building blocks for integrating data subject consent into the Relay initiative - for example, the consent tracking we recently added into our major trackers. However, there are still a lot of unanswered questions here.

6. Anatomy of a Relay

This RFC represents the “draft specification” for building a Snowplow Relay.

The conceptual architecture of a Relay looks like this:

6.1 Key constraints of a Relay

A Relay has the following constraints:

  • It should run in near-real-time
  • It should be stateless - it cannot preserve or retrieve state across multiple events
  • It will work in an at-least-once fashion - we cannot guarantee exactly-once processing in a Relay

6.2 Core components of a Relay

The core components of a Relay are:

  1. Read stage, from a stream of Snowplow enriched events
  2. Transform stage, where we apply a mapping of the Snowplow enriched event properties to the data structure expected by the destination
  3. Write stage, where we feed the transformed data into the destination
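Composed together, the three stages form a simple pipeline. Here is a minimal sketch of that shape; the names are illustrative, and a real relay would also need stream-consumer plumbing for Kinesis or Pub/Sub:

```typescript
// Minimal sketch of the three-stage shape; all names are illustrative.
// A real relay also needs Kinesis or Pub/Sub consumer plumbing.

type EnrichedEvent = Record<string, unknown>;
type DestinationPayload = Record<string, unknown>;

interface Relay {
  // Read: subscribe to the Snowplow enriched event stream
  read(onEvent: (e: EnrichedEvent) => Promise<void>): void;
  // Transform: map the enriched event to the destination's format
  transform(e: EnrichedEvent): DestinationPayload;
  // Write: feed the transformed event into the destination
  write(payload: DestinationPayload): Promise<void>;
}

function run(relay: Relay): void {
  relay.read(async (event) => {
    await relay.write(relay.transform(event));
  });
}
```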

Let’s cover each of these in turn.

6.3 Relay: Read stage

In the Read stage, the Relay will read the event from the Snowplow enriched event stream - for example, the Amazon Kinesis stream or Google Cloud Pub/Sub topic containing the events.

To add flexibility, we would like to support filters in the Read stage: filters would let you configure the Relay to silently discard certain Snowplow event types, so that they are not relayed into the destination. The initial filters would likely be an optional whitelist or, alternatively, a blacklist of event types.
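As a sketch, such a filter might be configured and applied like this (the configuration shape is invented for illustration):

```typescript
// Illustrative sketch of read-stage event-type filtering; the
// configuration shape is invented for this example.

interface FilterConfig {
  whitelist?: Set<string>; // if present, only these event types pass
  blacklist?: Set<string>; // if present, these event types are discarded
}

function passesFilter(eventType: string, config: FilterConfig): boolean {
  if (config.whitelist) return config.whitelist.has(eventType);
  if (config.blacklist) return !config.blacklist.has(eventType);
  return true; // no filter configured: relay everything
}

// Example: only relay page views and transactions
const config: FilterConfig = {
  whitelist: new Set(["page_view", "transaction"]),
};
passesFilter("page_view", config);  // true
passesFilter("link_click", config); // false -> silently discarded
```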

6.4 Relay: Transform stage

In the Transform stage, the Relay will apply a mapping of the Snowplow enriched event properties to the data structure expected by the destination. This is the most complex step, involving a deep familiarity with the data structure that the destination is expecting.

We envisage three types of mapping rule:

  • Static, where there is a fixed, universally correct mapping between a specific Snowplow event datapoint and an equivalent datapoint expected by the destination. This static mapping would be hardcoded into the Relay
  • Dynamic, where each Snowplow user would want to set up a custom mapping
  • Hybrid, where a dynamic mapping falls back to a static default

Our current assumption is that mapping rules will need to be relatively fixed; Turing-complete mappings (e.g. using a scripting language like JavaScript) will be out of scope.
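To illustrate what a fixed, declarative mapping might look like - the field names and rule shape below are invented for this post, not a specification:

```typescript
// Illustrative sketch of declarative, non-Turing-complete mapping
// rules; all field names and shapes are invented for this example.

type EnrichedEvent = Record<string, unknown>;

// A rule copies one Snowplow property to one destination property,
// optionally falling back to a static default (the "hybrid" case)
interface MappingRule {
  from: string;       // Snowplow enriched event property
  to: string;         // destination property
  fallback?: unknown; // static fallback if the source field is absent
}

// Static rules: hardcoded into the relay, universally correct
const staticRules: MappingRule[] = [
  { from: "event_id", to: "insert_id" },
  { from: "user_id", to: "user_id" },
];

// Dynamic rules: each Snowplow user supplies their own custom mapping
const userRules: MappingRule[] = [
  { from: "se_category", to: "event_type", fallback: "unknown" }, // hybrid
];

function applyMapping(
  event: EnrichedEvent,
  rules: MappingRule[]
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const rule of rules) {
    out[rule.to] = event[rule.from] ?? rule.fallback;
  }
  return out;
}

applyMapping({ event_id: "e-1", user_id: "u-1" }, [...staticRules, ...userRules]);
```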

6.5 Relay: Write stage

In the Write stage, the Relay will feed the transformed event into the destination.

This process will not be immune to a major outage in the destination or the destination’s APIs - a relay may support some minimal retry-on-failure, but it will not fully guarantee that events are written to the destination.
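A minimal sketch of what retry-on-failure could look like, to make the trade-off concrete; the policy shown (three attempts with linear backoff, then drop) is illustrative only:

```typescript
// Minimal retry-on-failure sketch; the policy (three attempts, linear
// backoff, then drop) is illustrative, not a specification.

async function writeWithRetry(
  send: () => Promise<void>,
  maxAttempts = 3
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await send();
      return true; // delivered (at-least-once: retries may duplicate)
    } catch {
      if (attempt === maxAttempts) break;
      // brief linear backoff before retrying the destination's API
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
  return false; // sustained outage: the event is dropped, not guaranteed
}
```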

7. Certification

We are considering implementing a lightweight certification process to help Snowplow users know which community-contributed relays they can feel comfortable adopting.

The main concerns of a certification program would be:

  • Does the Relay support the current Snowplow enriched event format?
  • Does the Relay support all of the mandatory features that make a relay a relay?
  • Does the Relay support - or, worse, encourage - any bad behaviors, for example around data privacy?

We could provide Snowplow relays which pass certification with a live GitHub badge to make their status clear.

8. Released and upcoming relays

8.1 Released relays

This RFC is a little “late”: we have already been experimenting with the concepts set out above, releasing two initial relays:

  1. Snowplow Piinguin Relay (release post) - a relay which takes PII transformation events from the Snowplow pipeline and feeds them into our Piinguin service
  2. Snowplow Indicative Relay (release post) - a relay which sends Snowplow enriched events into Indicative (currently AWS-only)

We are mindful that these two relays pre-date this RFC, so please don’t treat the design decisions implicit in those two relays as being set in stone; those relays can and will be revised following community feedback from this RFC.

8.2 Upcoming relays

We are currently building a prototype Relay for Amplitude, the product analytics service for mobile apps.

Other relays that our customers and community have expressed an interest in include:

  • Braze
  • Google Analytics
  • Facebook
  • Intercom
  • Vero

If you are interested in contributing to one of the above relays, please create a new thread in our Discourse.

9. Out of scope

We have no plans to support the Relay initiative for users of the Snowplow AWS batch pipeline at this time.

We have no plans to support “historic replay” of an existing Snowplow event archive through a relay at this time - although this would be achievable with some additional components.

As discussed above:

  • We have no current plans to support Turing-complete data mappings in relays
  • We have no current plans to add bulletproof back-off-and-retry to relays, for the case where a destination suffers a sustained outage. This is something we could revisit in the future

10. Request for Comments

This RFC represents a hugely exciting new initiative for Snowplow, and so we welcome any and all feedback from the community. As always, please do share comments, feedback, criticisms and alternative ideas in the thread below.

In particular, we would appreciate any experience or guidance you have from working with existing tag managers in general, or ideally server-side routers and multiplexers like Segment and mParticle.

Finally, feel free to explore the Snowplow Indicative Relay and use that to provide feedback. We look forward to your thoughts!


@alex so is this mostly applicable when tracking with mobile SDKs, or would there be any benefit while still using a tag manager (maybe consolidating multiple tags)? Given the majority of websites are still tracked using JS tags, do you see that being superseded somehow?

It’s a good question, @evaldas!

First off - I don’t see JavaScript tag managers going extinct anytime soon. Client-side tag management is a mature and well-established technology, with plenty of corporate adoption. And remember that the tag vendors themselves like having their own JavaScript SDKs running inside users’ browsers; it gives them more control of their incoming data, and more metadata from the browser; some of them will want to slow down the move to all-server-side.

Having said this, I believe that the page bloat and data leakage points I made above are strong reasons why websites will follow mobile apps in adopting Snowplow Relay and similar server-side routing options over time.

Ultimately, the website owner will want to have as much control over the relaying of business-critical customer data to third parties as possible, and this is far easier to achieve in a sophisticated and secure data processing environment like AWS Lambda than it is in each user’s web browser.


I see, thanks for sharing the thinking!

Hi @alex, what happened with this initiative?

best

It’s still front of mind, @spatialy, but the relays listed are still in the same place. We don’t feel two and a half relays gives enough context for us to standardise the way they’re built yet either. When we do have that confidence we’ll publish those guidelines.

We’re hoping to drive the relay initiative forward next year so watch this space! Are there any relays in particular you’re interested in seeing?

Hi @stevecs, thanks for the answer.
We are interested in GA/FB/Matomo (formerly Piwik) for sure. Maybe it is too much to ask, but could you share an early guide or some advice if we decide to go with some experimental development on our own?
Best

Sorry for the delay, @spatialy, I missed your reply. I’d recommend taking a look at the Indicative Relay for inspiration.

Hey Guys,

We’re considering implementing Snowplow - but the relay functionality is definitely a feature we need.

What’s the current status of this?

Relays we’d be interested in:

  • Google Analytics
  • Amplitude
  • Iterable

Rudderstack is starting to look like a compelling alternative, though they don’t offer the event replay, enrichment, and some other nice features that Snowplow does in its open-source offering.

Rounding out the relays would really add a lot of value, I think.

Hi @pjatx,

It’s still an idea we’re keen on, but at the moment this RFC proposal hasn’t moved to development; so far, the Indicative Relay is still the most relevant release. (Correction as per Alex’s comment below.)

We have a GA plugin that might be of interest. Aside from that, the best approach is to create an application (e.g. an AWS Lambda) to send your data from the enriched stream out to the other destinations. I know that some of our existing user base send Snowplow data to Amplitude, for example.

The Singer project seems like a cool thing to look into as well, but I don’t know enough about it to know if it meets your needs.

If getting your data to multiple destinations like this is your main priority, then sure I can see why you might want to go with a tool like Rudderstack.

Snowplow’s main value proposition is reliable, scalable data collection, and an emphasis on data quality. If that’s what one is looking for in a data collection tool, then my personal (biased, obviously) view is that I would need a lot of questions answered before I could consider that tool as a viable alternative to Snowplow.

I hope that info is helpful in figuring out what you need! 🙂


At Omio, we have written one to send data to Amplitude. I think it’s not complicated to build a lightweight relay that reads from Kinesis/PubSub and transmits events downstream.

Most of the complexity we have observed is in the translation layer, for the following reasons:

  • Historically, we have used structured events with contexts. Mapping these to a flat event model in Amplitude requires effort.
  • We don’t exercise much control over the taxonomy of events sent to Snowplow (although this may change in future). But we use the relay to add more filtering/white-listing/constraints etc. before pushing the events to Amplitude.
  • Amplitude separates properties into user and event properties. So there needs to be some logic to split event properties into these two buckets.
  • Mapping Snowplow session-ids to Amplitude was not very clearly documented (Amplitude documentation states that the session-id should be a monotonically increasing timestamp. However, we later discovered that Snowplow session ids work just fine).
  • Additional boilerplate around fail-overs, retries, API throttles etc.

Thanks @Colm, thanks @rahulj51 for sharing.

Just a minor correction on this:

Actually, we already have a relay built and in production for forwarding Snowplow events to Indicative, a product analytics platform. You’ll find the codebase here:


Thanks @Colm!

The main motivation for sending these off to the platforms mentioned is that parts of the business are heavily reliant on them. We have a homegrown solution that is analogous to the collect, transform, and relay portions, but we are missing the rest.

Singer is awesome, but it primarily solves the Extract problem. In the Snowplow model, I think you would collect and store everything, then send events on to third-party platforms. Reingesting them from third parties back into the warehouse might also be of interest, but it is secondary in importance for us.

I think this project is awesome - building out more relays seems like a natural next step, so that you can compete a bit more directly with Segment in that regard.

Thanks for the reply and insight!

@rahulj51

Yeah - what you’re describing also sounds like what we would do. I’ve had friends who’ve done just this at other companies and it seems to work pretty well. Thanks for the color!

@alex

Thanks for sharing that!