Understanding my options for Transforming/Loading

Hi @playabledanniehansen,

It is possible to use transformer-kinesis and simultaneously create an S3 archive of the TSV-format events. This will be easier to explain if I summarise some of the processes that can run downstream of Enrich:

  • Process 1: Streaming transformer - reads from Kinesis and writes micro-batches of transformed data to S3
  • Process 2: Loading events directly to S3 - either our custom S3 loader or Firehose reads from Kinesis and writes the events to S3
  • Process 3: Spark/EMR transformer - reads the output of Process 2 in batches, and re-writes it to S3 in transformed format

Normally we recommend running either Process 1 on its own, or Process 2 and Process 3 together. Process 1 (streaming transformer) is the cheapest and most direct way to get events transformed and ready for loading, whereas Process 2 + 3 (S3 loader + EMR transformer) is more mature Snowplow tech and is proven to work on very high-volume pipelines.

However, you could choose to run Process 1 and Process 2 in parallel. Kinesis lets you have multiple consumers of the same stream, and each consumer sees every single event. So your streaming transformer can transform all events and prepare them for the warehouse, while at the same time the S3 loader (or Firehose) reads the same events from Kinesis and writes them to S3 in TSV format.
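If it helps to picture the "multiple consumers" point, here is a toy sketch in Python. It is not Snowplow or AWS code: `ToyStream` and `ToyConsumer` are hypothetical in-memory stand-ins, and the only idea it illustrates is that each consumer keeps its own read position, so one consumer reading an event does not take it away from the other.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    """Hypothetical in-memory stand-in for the enriched Kinesis stream."""
    records: list = field(default_factory=list)

    def put(self, record: str) -> None:
        self.records.append(record)


@dataclass
class ToyConsumer:
    """Hypothetical consumer that tracks its own position in the stream."""
    name: str
    position: int = 0
    seen: list = field(default_factory=list)

    def poll(self, stream: ToyStream, limit: int) -> None:
        # Read up to `limit` records from this consumer's own position.
        batch = stream.records[self.position:self.position + limit]
        self.seen.extend(batch)
        self.position += len(batch)


stream = ToyStream()
for i in range(5):
    stream.put(f"event-{i}\tpage_view")  # made-up TSV-style lines

transformer = ToyConsumer("streaming-transformer")  # plays the role of Process 1
s3_loader = ToyConsumer("s3-loader")                # plays the role of Process 2

# The two consumers read at different batch sizes, independently of each other.
while transformer.position < len(stream.records):
    transformer.poll(stream, limit=2)
while s3_loader.position < len(stream.records):
    s3_loader.poll(stream, limit=3)

print(transformer.seen == stream.records)
print(s3_loader.seen == stream.records)
```

Both consumers end up having seen all five events, which is the property that makes running Process 1 and Process 2 side by side workable.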

By running Process 1 + Process 2, you get the benefits of cheap, fast warehouse loading via the streaming transformer, and you also get a TSV archive, which gives you the option of doing a full load to new destinations in future if you ever need to.
