I’ve been looking to spin up a new real-time Snowplow streaming analytics pipeline, but the documentation is a bit confusing. As I understand it, a typical Snowplow streaming setup would run multiple S3 Loader applications on separate server instances (or clusters), one for each of the following:
Loading from the collector-fed raw “good” stream to S3 [Thrift to LZO]
Loading from the collector-fed raw “bad” stream to S3 [Thrift to LZO]
Loading from the enriched “good” stream to S3 [TSV to GZIP]
Loading from the enriched “bad” stream to S3 [TSV to GZIP]
Complicating matters, the S3 Loader config has both an input stream and an output stream.
Here are my questions:
Is this correct as far as the number of separate S3 Loader applications that should be run (four)?
What stream should be used for the S3 Loader output stream? What would even consume such a stream? It seems to me it would be fed events that failed to make it through enrichment or RDB loading and then also failed to load to an S3 bucket… aren’t these unrecoverable?
If I’m understanding correctly, this seems like a LOT of required resources just for the S3 logging alone. Wouldn’t it be easier AND cheaper to use Kinesis Firehose? I know the Snowplow team recommends the S3 Loader application, but I’ve seen mentions that Firehose can be used without too much difficulty (it does require a Lambda function to convert the enriched stream to the proper loading format).
We generally run only three S3 Loaders per pipeline, which follow a flow a little like this:
S3 Loader Raw: Pulls from “collector-fed” stream (Thrift + LZO) and outputs to “bad” stream for any failures
S3 Loader Enriched: Pulls from “enriched-fed” stream (TSV + GZIP) and outputs to “bad” stream for any failures
S3 Loader Bad: Pulls from central “bad” stream (JSON + GZIP) and outputs to “bad-2” stream for any failures
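To make the input/output split concrete, here is a sketch of what the config for the “S3 Loader Enriched” instance might look like. The key names are based on the `config.hocon.sample` shipped in the `snowplow/snowplow-s3-loader` repo; the stream names, bucket, region, and buffer values are illustrative assumptions, so check the sample config for the authoritative set of keys for your loader version.

```hocon
# Sketch only - stream/bucket names are placeholders, not real resources.
source = "kinesis"
sink   = "kinesis"            # records that fail to load go back to a Kinesis "bad" stream

aws {
  accessKey = "iam"           # resolve credentials from the instance role
  secretKey = "iam"
}

kinesis {
  initialPosition = "LATEST"
  maxRecords = 500
  region = "us-east-1"
  appName = "s3-loader-enriched"   # also names the KCL checkpoint table in DynamoDB
}

streams {
  inStreamName  = "enriched-good"  # input: enriched TSV events
  outStreamName = "s3-loader-bad"  # output: only records that failed to sink to S3

  buffer {
    byteLimit   = 1048576     # flush after ~1 MB...
    recordLimit = 500         # ...or 500 records...
    timeLimit   = 60000       # ...or 60 seconds, whichever comes first
  }
}

s3 {
  region = "us-east-1"
  bucket = "my-snowplow-archive/enriched"
  format = "gzip"             # use "lzo" for the raw Thrift loader instead
  maxTimeout = 300000
}
```

The point to notice is that `outStreamName` is purely a failure channel: it receives only records the loader could not write to S3, which is why the “bad” loader’s own output can point at a near-empty `/dev/null`-style stream.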
We need a /dev/null stream to prevent any recursion in the pipeline - this stream is monitored, but we generally do not do anything with the data that lands there (in practice it should never contain anything).
In this flow you then have options for saving money: if you never use the “raw” data, that loader can be turned off (really the only data of interest is the enriched and bad data), leaving you with only two S3 Loaders per pipeline.
The “good”-path applications all share the same “bad” stream - the format of bad data is consistent across all of our microservices, so it can be joined with data from multiple sources.
As for using Firehose, it should be possible with a custom Lambda attached to it that inserts newlines into the enriched data (we have little experience with it, as we use our own loader in production for all of our managed clients).
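For reference, the newline-inserting Lambda is just a standard Firehose data-transformation function: Firehose hands it a batch of base64-encoded records and expects each back with a `recordId`, a `result` of `"Ok"`, and the transformed `data`. A minimal sketch (the event/return shape follows the documented Firehose transformation contract; everything else here is illustrative):

```python
import base64


def handler(event, context):
    """Firehose transformation Lambda sketch: append a newline to each
    enriched Snowplow event so that, when Firehose concatenates records
    into an S3 object, the file stays one TSV event per line."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        # Firehose concatenates record payloads with no delimiter, so a
        # trailing newline is what keeps events line-separated in S3.
        transformed = payload + b"\n"
        output.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",                  # "Dropped"/"ProcessingFailed" otherwise
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```

Attach it to the Firehose delivery stream as its transformation Lambda; no state or external calls are needed, so it stays cheap even at high event volume.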