I’ve been looking to spin up a new real-time Snowplow streaming analytics pipeline, but the documentation is a bit confusing. As I understand it, a typical Snowplow streaming setup would run multiple S3 Loader applications on separate server instances (or clusters), one for each of the following:
Loading from the collector-fed raw “good” stream to S3 [Thrift to LZO]
Loading from the collector-fed raw “bad” stream to S3 [Thrift to LZO]
Loading from the enriched “good” stream to S3 [TSV to GZIP]
Loading from the enriched “bad” stream to S3 [TSV to GZIP]
Complicating matters, the S3 Loader config has both an input stream and an output stream.
Here are my questions:
Is this correct as far as the number of separate S3 Loader applications that should be run (four)?
What stream should be used for the S3 Loader output stream? What would even consume such a stream? It seems to me it would be fed events that failed to make it through enrichment or RDB loading and then also failed to load to an S3 bucket… aren’t these unrecoverable?
If I’m understanding correctly, this seems like a LOT of required resources just for the S3 logging alone. Wouldn’t it be easier AND cheaper to use Kinesis Firehose? I know the Snowplow team recommends the S3 Loader application, but I’ve seen mentions that Firehose can be used without too much difficulty (it does require a Lambda function to convert the enriched stream to the proper loading format).
We generally run only three S3 Loaders per pipeline, which follow a flow a little like this:
S3 Loader Raw: Pulls from “collector-fed” stream (Thrift + LZO) and outputs to “bad” stream for any failures
S3 Loader Enriched: Pulls from “enriched-fed” stream (TSV + GZIP) and outputs to “bad” stream for any failures
S3 Loader Bad: Pulls from central “bad” stream (JSON + GZIP) and outputs to “bad-2” stream for any failures
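To make the input/output split concrete, here is a sketch of what the config for the “S3 Loader Enriched” instance might look like. The key names are based on the `config.hocon.sample` shipped in the `snowplow/snowplow-s3-loader` repo; the stream names, bucket, region, and buffer values are illustrative assumptions, so check the sample config for the authoritative set of keys for your loader version.

```hocon
# Sketch only - stream/bucket names are placeholders, not real resources.
source = "kinesis"
sink   = "kinesis"            # records that fail to load go back to a Kinesis "bad" stream

aws {
  accessKey = "iam"           # resolve credentials from the instance role
  secretKey = "iam"
}

kinesis {
  initialPosition = "LATEST"
  maxRecords = 500
  region = "us-east-1"
  appName = "s3-loader-enriched"   # also names the KCL checkpoint table in DynamoDB
}

streams {
  inStreamName  = "enriched-good"  # input: enriched TSV events
  outStreamName = "s3-loader-bad"  # output: only records that failed to sink to S3

  buffer {
    byteLimit   = 1048576     # flush after ~1 MB...
    recordLimit = 500         # ...or 500 records...
    timeLimit   = 60000       # ...or 60 seconds, whichever comes first
  }
}

s3 {
  region = "us-east-1"
  bucket = "my-snowplow-archive/enriched"
  format = "gzip"             # use "lzo" for the raw Thrift loader instead
  maxTimeout = 300000
}
```

The point to notice is that `outStreamName` is purely a failure channel: it receives only records the loader could not write to S3, which is why the “bad” loader’s own output can point at a near-empty `/dev/null`-style stream.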
We need a /dev/null stream to prevent any recursion in the pipeline - this stream is monitored, but we generally do not do anything with the data that lands there (in practice it should never contain anything).
In this flow you then have options for saving money: if you never use the “raw” data, that loader can be turned off (really the only data of interest is the enriched and bad data), leaving you with only two S3 Loaders per pipeline.
The “good”-path applications all share the same “bad” stream - the format of bad data is consistent across all of our microservices, so it can be joined with data from multiple sources.
As for using Firehose, it should be possible with a custom Lambda attached to it that inserts newlines into the enriched data (we have little experience with it, as we use our own loader in production for all of our managed clients).
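For reference, the newline-inserting Lambda is just a standard Firehose data-transformation function: Firehose hands it a batch of base64-encoded records and expects each back with a `recordId`, a `result` of `"Ok"`, and the transformed `data`. A minimal sketch (the event/return shape follows the documented Firehose transformation contract; everything else here is illustrative):

```python
import base64


def handler(event, context):
    """Firehose transformation Lambda sketch: append a newline to each
    enriched Snowplow event so that, when Firehose concatenates records
    into an S3 object, the file stays one TSV event per line."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        # Firehose concatenates record payloads with no delimiter, so a
        # trailing newline is what keeps events line-separated in S3.
        transformed = payload + b"\n"
        output.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",                  # "Dropped"/"ProcessingFailed" otherwise
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```

Attach it to the Firehose delivery stream as its transformation Lambda; no state or external calls are needed, so it stays cheap even at high event volume.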