Shredding & loading enriched events in near-real-time

alex · August 24, 2017, 2:26pm

Hi @rgabo - it sounds like we are thinking about all this in the same way.

Note that Sluice is no more - it was removed in Snowplow R91.

Yes, that’s correct. Although we still use EmrEtlRunner internally for all core Snowplow pipelines, we use Dataflow Runner for our R&D and non-standard/non-Snowplow jobs on EMR.

Dataflow Runner is built around the Unix philosophy - all it does is run jobflows, currently on EMR only. You can schedule it any way you like. And it’s fully declarative - it’s just Avro, so you can generate, lint or visualise a dataflow any way you like. (We are also planning a native integration between Factotum and Dataflow Runner in the future, so you get that “view-through” that you described between Airflow and EmrEtlRunner.)
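To make the "generate, lint" point concrete, here is a minimal Python sketch of treating a jobflow as plain data. It is an illustration only: the field names (`type`, `name`, `actionOnFailure`, `jar`, `arguments`), the placeholder schema URI, the bucket paths and the helper functions are all assumptions made up for this example, not the confirmed Dataflow Runner playbook schema, so check the Dataflow Runner docs before relying on any of them.

```python
import json

# Hypothetical step fields, assumed for illustration only;
# the real playbook schema may differ.
REQUIRED_STEP_KEYS = {"type", "name", "actionOnFailure", "jar", "arguments"}


def make_step(name, jar, arguments):
    """Build one jobflow step as plain data."""
    return {
        "type": "CUSTOM_JAR",
        "name": name,
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": jar,
        "arguments": list(arguments),
    }


def make_playbook(region, steps):
    """Wrap the steps in a self-describing envelope (placeholder schema URI)."""
    return {
        "schema": "iglu:com.example/PlaybookConfig/avro/1-0-0",
        "data": {"region": region, "steps": list(steps)},
    }


def lint_playbook(playbook):
    """Return a list of problems; an empty list means the playbook looks OK."""
    problems = []
    steps = playbook.get("data", {}).get("steps", [])
    if not steps:
        problems.append("playbook has no steps")
    seen = set()
    for i, step in enumerate(steps):
        missing = REQUIRED_STEP_KEYS - step.keys()
        if missing:
            problems.append(f"step {i}: missing {sorted(missing)}")
        name = step.get("name")
        if name in seen:
            problems.append(f"step {i}: duplicate name {name!r}")
        seen.add(name)
    return problems


# Example bucket and jar paths are placeholders, not real artifacts.
playbook = make_playbook(
    "us-east-1",
    [
        make_step("shred", "s3://example-bucket/shred.jar",
                  ["--input", "s3://example-bucket/enriched/"]),
        make_step("load", "s3://example-bucket/load.jar",
                  ["--target", "redshift"]),
    ],
)

problems = lint_playbook(playbook)
playbook_json = json.dumps(playbook, indent=2)
```

Because the dataflow is just data, the same structure could be written to a file by whatever scheduler you use, checked in CI, or walked to draw a graph of the steps.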

Note that a future release of EmrEtlRunner will generate Dataflow Runner playbooks, and a later release will then remove the EMR orchestration functionality from EmrEtlRunner altogether.
