Hi Snowplowers,
I’m excited to introduce you to a project I’ve been working on recently, which I am tentatively naming Snowplow Serverless: an implementation of a minimal subset of the features of the Snowplow Collector and Enrich components, built entirely as AWS Lambda functions using the Serverless framework.
To give a bit of background, most of my posts on here are based on my work leading the data architecture at Property Finder Group, where we are heavy users of the Snowplow streaming stack.
However, I’ve worked in the charity sector in the past and continue to do occasional pro-bono advisory work with small charities and social enterprises. For these types of organisations, even the most basic Snowplow infrastructure is prohibitively expensive; the cost of a minimal real-time Snowplow deployment with a relational DB is in the order of hundreds of dollars a month, which immediately places it out of reach.
(Snowplow Mini goes part of the way but serves the distinct use case of experimentation for new users, rather than production cost-saving.)
In contrast, a Lambda-based deployment such as this makes it possible to process several million events per month for just a few dollars.
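As a rough back-of-the-envelope check on that figure, here is a small Python sketch of the Lambda portion of the bill. The prices, memory size and average duration are all assumptions for illustration (the prices are list prices as I understand them, so check the current AWS pricing page), and the function name is made up for this example. It ignores the free tier and covers Lambda charges only, not Kinesis, API Gateway or data transfer.

```python
# Back-of-the-envelope estimate of monthly AWS Lambda charges.
# All defaults below are ASSUMPTIONS for illustration, not measured values.

def lambda_monthly_cost(
    invocations: int,
    avg_duration_s: float = 0.1,              # assumed average run time per invocation
    memory_gb: float = 0.125,                 # assumed 128 MB function
    price_per_million_requests: float = 0.20,  # assumed list price, USD
    price_per_gb_second: float = 0.0000166667,  # assumed list price, USD
) -> float:
    """Return the estimated Lambda cost in USD, ignoring the free tier."""
    request_cost = invocations / 1_000_000 * price_per_million_requests
    compute_cost = invocations * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost


if __name__ == "__main__":
    # Five million invocations in a month under the assumptions above.
    print(f"${lambda_monthly_cost(5_000_000):.2f}")
```

Under these assumptions the request charge and the compute charge each come to roughly a dollar; the result scales linearly with event volume, duration and memory.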
I’m a long way from being a Serverless crusader, and Lambda certainly isn’t for everyone. Nonetheless, cost-saving aside, there are undeniably other benefits of this approach, such as:
- One-click deployment
- Seamless scaling
- Reduced sysadmin overhead
- Reduced code complexity – much of the functionality of the current Snowplow code, such as concurrency, retries, and so on, is delegated to the Lambda execution engine
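To illustrate the last point, here is a minimal sketch of what a Kinesis-triggered Lambda handler can look like. It is not the project's actual code: the `enrich` function is a hypothetical stand-in, and it assumes JSON payloads purely to keep the example short (real Snowplow collector payloads are serialised differently). The `Records[].kinesis.data` structure with base64-encoded data is the standard shape of a Kinesis event delivered to Lambda.

```python
import base64
import json


def enrich(raw_event: dict) -> dict:
    """Hypothetical enrichment step: a real one would validate and enrich the event."""
    return {**raw_event, "enriched": True}


def handler(event, context):
    """Entry point invoked by Lambda with a batch of Kinesis records."""
    enriched = []
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        enriched.append(enrich(json.loads(payload)))
    # No retry loop, thread pool or checkpointing here: if this function
    # raises, Lambda's Kinesis integration retries the batch, and it runs
    # invocations in parallel across shards.
    return {"processed": len(enriched)}
```

The handler contains only the per-record logic; batching, retrying failed batches and scaling across shards are left to the Lambda event source mapping.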
This code is currently extremely experimental, implements a very basic set of Snowplow functionality, and is almost certainly not ready for production use. In particular, the following Snowplow features are not yet supported:
- Custom Iglu schemas (only Iglu Central events are supported)
- Custom enrichments
- GeoIP enrichment
- Webhooks
- Graceful handling of bad collector requests
- Graceful handling of Kinesis failures
- Snowplow monitoring
- Third-party cookies (network_userid)
- Redirects
- Any sinks other than Kinesis
Nonetheless, I’m pretty happy with it, and think this approach has huge potential, particularly when paired with other serverless AWS features. For example, by forwarding the enriched events stream to Kinesis Firehose, events could be stored in S3 and queried using Amazon Athena for a fraction of a cent per query.
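To put a rough number on the Athena claim, here is a small sketch of the per-query cost. It assumes Athena bills by the amount of data scanned, at an assumed list price of $5 per terabyte with an assumed 10 MB minimum per query, and it treats a terabyte as 10^12 bytes for simplicity. The function name is made up for this example, so verify the figures against the current AWS pricing page.

```python
# Rough per-query cost estimate for Amazon Athena.
# Price and minimum below are ASSUMPTIONS for illustration.

BYTES_PER_TB = 10**12          # simplification: decimal terabyte
MIN_BILLED_BYTES = 10 * 10**6  # assumed 10 MB minimum billed per query


def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Return the estimated cost in USD of a query scanning `bytes_scanned` bytes."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    # A query scanning 500 MB of enriched events.
    print(f"${athena_query_cost(500 * 10**6):.4f}")
```

Under these assumptions, a query scanning 500 MB costs about a quarter of a cent, and compressing or partitioning the data in S3 would reduce the bytes scanned further.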
There are, no doubt, other angles on this I haven’t considered, and I’d love to get feedback and thoughts from others in the Snowplow community. As I said, this is EXTREMELY experimental at the moment (see the README for details), but I’m happy to take it forward if there is an appetite for it.
Happy Easter!
Adam