Snowplow Serverless

acgray · March 31, 2018, 9:03pm

Hi Snowplowers,

I’m excited to introduce you to a project I’ve been working on recently, which I am tentatively naming Snowplow Serverless: an implementation of (a minimal subset of features of) the Snowplow Collector and Enrich components entirely as functions for AWS Lambda, using the Serverless framework.

To give a bit of background, most of my posts on here are based on my work leading the data architecture at Property Finder Group, where we are heavy users of the Snowplow streaming stack.

However, I’ve worked in the charity sector in the past and continue to do occasional pro-bono advisory work with small charities and social enterprises. For these types of organisations, even the most basic Snowplow infrastructure is prohibitively expensive; the cost of a minimal real-time Snowplow deployment with a relational DB is in the order of hundreds of dollars a month, which immediately places it out of reach.

(Snowplow Mini goes part of the way but serves the distinct use case of experimentation for new users, rather than production cost-saving.)

In contrast, a Lambda-based deployment such as this makes it possible to process several million events per month for just a few dollars.
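As a rough sanity check on that figure, here is a back-of-the-envelope sketch. The two price constants are assumed illustrative rates, not numbers from this post, and the function name and parameters are hypothetical. It also assumes one invocation per event and ignores the free tier and the cost of API Gateway, Kinesis and storage, so treat the result as a rough lower bound rather than a quote.

```javascript
// Rough monthly Lambda cost for an event pipeline.
// ASSUMPTION: the two default rates are illustrative only; check the
// current AWS pricing page before relying on them.
function estimateMonthlyLambdaCost({
  eventsPerMonth,
  avgDurationMs,
  memoryMb,
  pricePerMillionRequests = 0.20,  // assumed USD per 1M invocations
  pricePerGbSecond = 0.0000166667, // assumed USD per GB-second
}) {
  // Request charge: billed per invocation (one invocation per event assumed).
  const requestCost = (eventsPerMonth / 1e6) * pricePerMillionRequests;
  // Compute charge: duration in seconds times memory in GB, per invocation.
  const gbSeconds = eventsPerMonth * (avgDurationMs / 1000) * (memoryMb / 1024);
  const computeCost = gbSeconds * pricePerGbSecond;
  return { requestCost, computeCost, total: requestCost + computeCost };
}

// 5 million events a month, 100 ms each, 512 MB of memory.
const estimate = estimateMonthlyLambdaCost({
  eventsPerMonth: 5e6,
  avgDurationMs: 100,
  memoryMb: 512,
});
console.log(estimate.total.toFixed(2)); // 5.17
```

Under these assumptions the compute charge, not the request charge, dominates, so duration and memory size are the settings worth tuning first.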

I’m a long way from being a Serverless crusader, and Lambda certainly isn’t for everyone. Nonetheless, cost-saving aside, there are undeniably other benefits of this approach, such as:

One-click deployment
Seamless scaling
Reduced sysadmin overhead
Reduced code complexity – much of the functionality of the current Snowplow code, such as concurrency, retries, and so on, is delegated to the Lambda execution engine

This code is currently extremely experimental, implements a very basic set of Snowplow functionality, and is almost definitely not for production use. In particular, the following Snowplow features are not yet supported:

Custom Iglu schemas (only Iglu central events are supported)
Custom enrichments
GeoIP enrichment
Webhooks
Graceful handling of bad collector requests
Graceful handling of Kinesis failures
Snowplow monitoring
3rd party cookies (network_userid)
Redirects
Any sinks other than Kinesis

Nonetheless, I’m pretty happy with it, and think this approach has huge potential, particularly when paired with other serverless AWS features. For example, by forwarding the enriched events stream to Kinesis Firehose, events could be stored in S3 and queried using Amazon Athena for a fraction of a cent per query.
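To make the Firehose idea a little more concrete, here is a minimal sketch of the forwarding step, assuming a Lambda function triggered by the enriched Kinesis stream. The function names and the injected `sendBatch` callback are hypothetical stand-ins (in practice `sendBatch` would wrap a Firehose `PutRecordBatch` call), and none of this is taken from the project's code. It decodes each Kinesis record, newline-delimits it so each event lands as one row in S3, and groups records into batches of at most 500, the per-call limit for `PutRecordBatch`.

```javascript
// PutRecordBatch accepts at most 500 records per call.
const MAX_FIREHOSE_BATCH = 500;

// Convert a Kinesis-triggered Lambda event into Firehose-ready batches.
// Kinesis delivers each record payload base64-encoded in `kinesis.data`.
function toFirehoseBatches(kinesisEvent, batchSize = MAX_FIREHOSE_BATCH) {
  const records = kinesisEvent.Records.map((record) => {
    const line = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
    // Firehose concatenates records as-is, so add a newline per event.
    // ASSUMPTION: Data is passed as a string; convert it if your SDK
    // version expects bytes.
    return { Data: line.endsWith('\n') ? line : line + '\n' };
  });

  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push(records.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical handler body: `sendBatch` is an injected async function
// that would call Firehose; it is not a real AWS SDK function.
async function forwardToFirehose(kinesisEvent, sendBatch) {
  const batches = toFirehoseBatches(kinesisEvent);
  for (const batch of batches) {
    await sendBatch(batch);
  }
  return batches.length;
}
```

Keeping the batching logic separate from the AWS call makes it easy to test locally without credentials. Retry handling for partially failed batches is deliberately left out.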

There are, no doubt, other angles on this I haven’t considered, and I’d love to get feedback and thoughts from others in the Snowplow community. This is EXTREMELY experimental at the moment (see the README for details) but I’m happy to take it forward if there is an appetite for it.

Happy Easter!

Adam

antman · April 2, 2018, 3:17pm

This is really cool, Adam, I’m looking forward to hearing more about it!

arikfr · April 2, 2018, 4:15pm

I recently had similar thoughts about having a simple Lambda endpoint that will receive Snowplow events for small scale deployments.

Very interested to see how this will develop.

Personally I wish you had picked a different language for the implementation, but I guess Scala makes it easier to reuse the existing Snowplow codebase?

tjh34 · April 2, 2018, 5:19pm

This is very interesting. I think it could be useful for much more than just charities. Many companies have cost and complexity concerns as well. If you can get these other features working, it would be revolutionary!

acgray · April 3, 2018, 1:36pm

@arikfr correct - I’m personally not a big Scala fan either, but the Snowplow shared libraries are written in Scala and make heavy use of Scalaz and functional paradigms which don’t convert well to Java at all (I tried, it wasn’t pretty)

mike · April 3, 2018, 11:14pm

This is really cool. I had a crack at refactoring the stream collector in Node.js a while ago (so it could run on Azure Functions, GCP Cloud Functions and AWS Lambda) and got most, though not all, of the way there.

I think you’ve hit the nail on the head regarding the utility of Lambda/serverless - the main things I noticed when building the cloud function (at least for the collector) were:

Reasonably high latencies on responses (often > 100ms)
Works quite well for smaller-scale sites, but at higher volumes you hit concurrency limits quickly (most services limit to 1000 invocations/second).
Reasonable cost for smaller volumes, but it gets expensive otherwise
Some security limitations - the maximum concurrency for Lambda functions means that the collector is open to very simple denial of service attacks, and because of the way API Gateway performs throttling, high load on one API can impact the latency of other unrelated APIs.
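The interaction between latency and concurrency limits in the list above can be sketched with Little's law (average concurrency = arrival rate × average time per request). This is a generic estimate, not a measurement from my collector. The function names are hypothetical, and it assumes the platform limit is expressed as concurrent executions rather than invocations per second.

```javascript
// Little's law: concurrent executions needed to sustain a request rate,
// given the average time each invocation takes.
function requiredConcurrency(requestsPerSecond, avgDurationMs) {
  return Math.ceil((requestsPerSecond * avgDurationMs) / 1000);
}

// The inverse: the highest request rate a concurrency limit can sustain.
// ASSUMPTION: `concurrencyLimit` counts simultaneous executions.
function maxSustainedRate(concurrencyLimit, avgDurationMs) {
  return (concurrencyLimit * 1000) / avgDurationMs;
}

console.log(requiredConcurrency(2000, 100)); // 200
console.log(maxSustainedRate(1000, 100));    // 10000
```

The same arithmetic shows why slow responses matter: doubling the average duration halves the sustainable rate for a given limit, which also makes it cheaper for an attacker to exhaust.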

Once serverless has dealt with a few of these growing pains I think the collector could be well suited to eventually becoming serverless. I suspect the enricher will also be serverless but is likely to move towards running on something like Apache Beam/Dataflow where having a warm cache for running a variety of enrichments will be a requirement for lower latencies.

li0nel · April 4, 2018, 12:05pm

Hi Adam,

Thanks for sharing!

I’m building a serverless Snowplow stack too, but based on the CloudFront collector, S3, Firehose and Lambda, deployable with Terraform, which hopefully I’ll be able to share as well.

The Redshift stack could also be made serverless with the release of Redshift Spectrum, so I’m hoping to have the full stack serverless and deployable with Terraform.

One thing I’m stuck with: how do you extract the pageViewId when the webPage context is enabled? I can’t find which URI parameter is used to send the pageViewId with each event.

Best,

Lionel
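On the pageViewId question: as far as I understand the tracker protocol, there is no dedicated parameter for it. The page view ID travels inside the contexts payload, which is sent as `cx` (URL-safe base64) or `co` (plain JSON), in the entry whose schema is `iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/...`, under `data.id`. Here is a minimal sketch of pulling it out. The function name is hypothetical, and it assumes `params` is an object of already URL-decoded querystring values.

```javascript
// Schema prefix of the web_page context (any version).
const WEB_PAGE_PREFIX =
  'iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/';

// Return the page view ID from the event's contexts, or null if absent.
// ASSUMPTION: `params` holds URL-decoded querystring values.
function extractPageViewId(params) {
  let json;
  if (params.cx) {
    // cx is URL-safe base64; map it back to the standard alphabet first.
    const b64 = params.cx.replace(/-/g, '+').replace(/_/g, '/');
    json = Buffer.from(b64, 'base64').toString('utf8');
  } else if (params.co) {
    json = params.co; // co carries the same JSON without base64 encoding
  } else {
    return null;
  }

  let contexts;
  try {
    contexts = JSON.parse(json);
  } catch (err) {
    return null; // malformed payload
  }

  const entries = contexts && Array.isArray(contexts.data) ? contexts.data : [];
  const webPage = entries.find(
    (c) => c && typeof c.schema === 'string' && c.schema.startsWith(WEB_PAGE_PREFIX)
  );
  return webPage && webPage.data && webPage.data.id ? webPage.data.id : null;
}
```

With a CloudFront collector the values come from the access log querystring, so they need URL-decoding before being passed in.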

fwahlqvist · March 14, 2019, 4:38pm

Hey @Mike, is this something you can share?

mike · March 14, 2019, 10:02pm

Sure!

As a disclaimer, I haven’t touched this for 18 months or so and my Node has never been good, but hopefully there’s something useful in there. I’d love it if someone got some value out of it.

https://gist.github.com/miike/cbe99c2d8c220b548f062dca23cdc6e0

// this collector code isn't finished

const fs = require('fs');
const path = require('path');
const bufrw = require('bufrw');
const Thrift = require('thriftrw').Thrift;
const thriftrw = require('thriftrw');
const express = require('express'); // version 4.16.2; require() ignores a second argument, so pin the version in package.json
const cookieParser = require('cookie-parser');
const uuidv4 = require('uuid/v4');

This file has been truncated.

jakethomas · March 15, 2019, 2:34am

I did something very similar to @li0nel (as an experiment) and wrote about it here:

The cost is crazy low, considering the uptime/data availability/scalability of the system, and I haven’t had to touch it once since deploying.

fwahlqvist · March 15, 2019, 10:53am

Thanks @mike, will have a look, and if I make any good progress I’ll share back here.
Best
Fred

fwahlqvist · March 15, 2019, 11:02am

Hey @jakethomas, any chance you can share the Lambda function code?

Mike7L · March 15, 2019, 1:23pm

We are very interested in your progress too.

alevashov · August 10, 2019, 9:39am

Hi @fwahlqvist, @Mike7L and all,

We followed the experiment @jakethomas did, but had to re-invent the Lambda part.

Those who are interested can check the notes at https://www.ownyourbusinessdata.net/enrich-snowplow-data-with-aws-lambda-function/
We shared the Lambda function (Python) in the Git repo linked from the post.

alevashov · September 25, 2019, 3:01am

An update: we added a Terraform script to our repository (linked in my previous post above) to make deployment of the solution quick and easy.

jakethomas · September 25, 2019, 12:58pm

This looks familiar! I’m glad you found the (first half of the) pipeline easy to set up and useful. Nice work!

stanley_sewall · March 29, 2021, 6:05pm

Hi @alevashov

Did you get the Snowplow Serverless Implementation into production?
Thanks in advance.

alevashov · March 29, 2021, 9:35pm

Hi Stan

We ran it for our business website for several months and also helped set it up for one other business.

stanley_sewall · March 29, 2021, 10:13pm

Hi @alevashov
Thanks for responding. I assume this was deployed on AWS. Is that right?

How many transactions did your serverless implementation scale up to? That’s really the bonus question. Thanks for the Q&A.

alevashov · March 29, 2021, 10:51pm

Yes, it is on AWS.

We have open-sourced the deployment script; check our GitHub repo.

I can’t share the exact number of transactions, but the limits are very high; Lambda functions can process a lot.

The best approach for a specific case is to build it and test.
