We’re in the process of setting up Snowplow on our own infrastructure (AWS) using the available documentation. During this process, I’m stuck on fully understanding the transformation/loader step (or at the very least, on making a choice). My understanding is that the transformation step takes the output of the enrichment process and prepares it for the loader. The loader then takes the transformed output and delivers it wherever it’s told to (e.g., Redshift).
Looking at the documentation, I can only see two ways of dealing with the transformation step:
the EMR Spark S3 copy / transformation step, or
having the transformer read directly from the enriched Kinesis stream.
My understanding is that it’s good to keep the enriched events in an S3 data lake and then from there load them into your various destinations. This ensures that you can always do a full load to existing or new destinations.
We’re leaning towards using EMR as it keeps a snapshot of the enriched data in an S3 archive, but we’re facing challenges with deploying the EMR cluster as it seems rather expensive (depending on the schedule) and complicates our stack quite a lot. The collector and enrichment processes are all running using EKS and Kinesis already. Adding EMR to the stack just feels like adding more complexity that we would rather be without. Attached is a little diagram of our current solution:
Because of this, we tried looking at the transformer-kinesis solution. However, this leaves us with no S3 copy and prevents us from running partial or full replays of past events in the future.
EMR = batching. Batching can be done more frequently, but at a higher cost. Kinesis stream read allows for near real-time data in Redshift at the expense of not having S3 archive storage. Neither feels like a great option.
Am I completely misunderstanding this last step? Are there deployment options we haven’t considered?
The best-case scenario is that we get archived storage of events from the enrichment process and close-to-real-time processing of those events by the transformer/loader. A few minutes’ delay is acceptable, but we want to avoid delays of 30+ minutes caused by scheduled batch runs.
It is possible to use transformer-kinesis and simultaneously create an S3 archive of the TSV-format events. This will be easier to explain if I summarise some of the processes that can run downstream of Enrich:
Process 1: Streaming transformer - reads from Kinesis and writes micro-batches of transformed data to S3
Process 2: Loading events directly to S3, using either our custom S3 loader or Firehose.
Process 3: Spark/EMR transformer - reads the output of Process 2 in batches and rewrites it to S3 in transformed format.
Normally we recommend running either just Process 1, or alternatively Processes 2 and 3. Process 1 (streaming transformer) is the cheapest and most direct way to get events transformed and ready for loading, whereas Process 2 + 3 (S3 loader + EMR transformer) is more mature Snowplow tech and is proven to work on very high-volume pipelines.
However, you could choose to run Process 1 and Process 2 in parallel. Kinesis lets you have multiple consumers on the same stream, each of which sees every single event. So your streaming transformer can transform all events and prepare them for the warehouse, while at the same time the S3 loader (or Firehose) reads the same events from Kinesis and writes them to S3 in TSV format.
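To make the multi-consumer point concrete, here is a minimal boto3 sketch. The stream name and region are placeholders, and it only inspects the first shard; each caller gets its own shard iterator, so both read every record independently:

```python
import boto3

# Placeholder names - substitute your own enriched stream and region.
STREAM = "enriched-good"
kinesis = boto3.client("kinesis", region_name="eu-west-1")


def read_from_start(label):
    # Each call gets its own shard iterator, so independent consumers
    # each see every record on the shard.
    shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    records = kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]
    print(f"{label} read {len(records)} records")


read_from_start("consumer-A")  # e.g. the streaming transformer
read_from_start("consumer-B")  # e.g. the S3 loader / Firehose
```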
By running Process 1 + Process 2, you get the benefits of cheap, fast warehouse loading via the streaming transformer, but you also get your TSV archive, which means that in future you have the option to do a full load to new destinations if you ever need to.
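If you go the Firehose route for the archive leg, the delivery stream just needs the enriched Kinesis stream as its source and your archive bucket as its destination. A minimal boto3 sketch, assuming hypothetical stream, bucket, and IAM role names (buffering, prefix, and compression settings are up to you):

```python
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

# All ARNs below are placeholders - point them at your own enriched
# stream, archive bucket, and IAM roles.
firehose.create_delivery_stream(
    DeliveryStreamName="snowplow-enriched-archive",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/enriched-good",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-kinesis",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        "BucketARN": "arn:aws:s3:::my-snowplow-archive",
        "Prefix": "enriched/archive/",
        "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 64},
        "CompressionFormat": "GZIP",
    },
)
```

The streaming transformer keeps reading the same stream untouched; Firehose simply buffers the raw TSV rows into S3 so you can replay them later.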
Thank you so much for the detailed reply. I didn’t know you could have two consumers on the same Kinesis stream receiving the same events. That is very interesting and fits our use case perfectly. Our current needs are low-volume, but we can always upgrade the tech if volume grows to a point where our pipeline can no longer keep up.
I will try this and reply back with my findings. Much appreciated!
It looks to be working. I was able to get transformer-kinesis running in EKS. The events are seen by both Firehose and transformer-kinesis, which is perfect.