After some research on this forum and on GitHub, I know there are plans to create a loader able to load stream-enriched events into Redshift, but currently I would like to achieve roughly the same thing with the Plain of Jars release (R89).
This is our current pipeline (we currently don't use the output of Stream Enrich).
Basically, since we are enriching in near real time, I would just like to shred and load the stream-enriched data into Redshift.
I managed to do it by using the GZIP sink on the Stream Enrich good stream. Nevertheless, it required some tweaks, because by default EmrEtlRunner expects enriched event files to follow a pattern containing `part-`, which is not the case if you use one of the S3 Kinesis sinks.
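For anyone hitting the same mismatch, the tweak can be sketched as a small renaming step that gives the sink's output the `part-` prefix EmrEtlRunner expects. This is just an illustration of the idea, not our exact fix; the key layout is hypothetical, and a real version would apply the rename with boto3 or `aws s3 mv` against your enriched bucket.

```python
# Sketch: map Kinesis S3 sink file names to the part- prefixed
# names that EmrEtlRunner's file pattern expects.
# The key layout here is hypothetical; adapt it to your sink's naming.

def to_part_key(key: str) -> str:
    """Prefix the file-name component of an S3 key with 'part-'."""
    prefix, _, name = key.rpartition("/")
    if name.startswith("part-"):
        return key  # already matches the expected pattern
    renamed = "part-" + name
    return f"{prefix}/{renamed}" if prefix else renamed

# In a real pipeline you would list the sink's keys with boto3 and
# copy/delete each object under its new name, roughly:
#   s3.copy_object(..., Key=to_part_key(key)); s3.delete_object(...)
```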
Is there any plan to support this flow, or will we see a stream shredder that could then have its own sink to load the data into Redshift afterwards?
Hey @AcidFlow - that’s great that you got this working. We have managed to set up something similar internally - but we used Dataflow Runner rather than EmrEtlRunner, which sidestepped the EmrEtlRunner incompatibilities you found:
Other changes we made:
Reading/writing direct to/from S3 (i.e. skipping HDFS and S3DistCp)
Running the job on a persistent cluster (saving on the cluster bootstrap times)
We’re continuing to tweak and improve this setup, with the goal of getting Redshift load frequency down to every 10 minutes or so. We will keep you posted!
I found it invaluable to use Kinesis Firehose and Lambda to read data straight from Kinesis when creating the same type of real-time system.
I remember seeing a Snowplow Lambda event shredder somewhere, but I can't find it now. We created our own in-house solutions to forward events to third-party systems and to AWS Firehose, which puts things into S3 and then Redshift (but can do Elasticsearch as well).
The only downside is that Firehose will only write to Redshift at a minimum of every 1 MB or 60 seconds (whichever comes first); however, this problem disappears with volume.
I have also been trying to set up this kind of configuration (enriched stream to Redshift), but the StorageLoader is obviously not reading the files from the Kinesis S3 sink (from the enriched stream). As @alex suggests here, the way forward is to use Dataflow Runner, but I get stuck on how to configure it. Could you give me some more clues on configuring this?
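To give a rough idea of the shape: Dataflow Runner takes two Avro-backed JSON files, a cluster config (the EMR cluster definition) and a playbook (the list of EMR steps to run). Below is a hypothetical playbook fragment for a shred step. The schema URI follows the published PlaybookConfig schema, but the jar path, main class, bucket names, and arguments are placeholders you would need to verify against the assets you actually deploy.

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "Shred enriched events",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "s3://my-assets-bucket/snowplow-hadoop-shred.jar",
        "arguments": [
          "--input_folder", "s3://my-bucket/enriched/good/",
          "--output_folder", "s3://my-bucket/shredded/good/",
          "--bad_rows_folder", "s3://my-bucket/shredded/bad/"
        ]
      }
    ]
  }
}
```

You then point the CLI at both files, along the lines of `dataflow-runner run --emr-config cluster.json --emr-playbook playbook.json` - check the Dataflow Runner docs for the exact flags for your version.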
Your Dataflow Runner example made me think about how EmrEtlRunner and now Dataflow Runner relate to workflow/orchestration systems like Luigi or Airflow.
We have been orchestrating our Snowplow pipelines by invoking EmrEtlRunner for each atomic step from Airflow, achieved by `--skip`-ping every other step. This gives us visibility into each step in the pipeline and also allows us to just use `aws s3 mv` and `aws s3 rm --recursive` where Sluice fails us.
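That per-step invocation pattern can be sketched as a small helper that builds the EmrEtlRunner command line for one step by skipping all the others. This is an illustration rather than our exact DAG; the step names below are plausible for the EmrEtlRunner releases of this era, but check them against the `--skip` options of your version.

```python
# Sketch: build one EmrEtlRunner invocation per pipeline step by
# --skip-ping every other step, as described above.
# Step names are illustrative; verify them against your version.

ALL_STEPS = ["staging", "emr", "enrich", "shred", "elasticsearch",
             "archive_raw", "rdb_load", "archive_enriched"]

def emr_etl_runner_cmd(step: str, config: str = "config.yml") -> list:
    """Command line that runs only `step`, skipping all the others."""
    skipped = [s for s in ALL_STEPS if s != step]
    return ["emr-etl-runner", "run",
            "--config", config,
            "--skip", ",".join(skipped)]

# Each Airflow task (e.g. a BashOperator) would then execute one of
# these, giving one DAG node per pipeline step:
#   " ".join(emr_etl_runner_cmd("shred"))
```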
Dataflow Runner steps look an awful lot like our Airflow tasks in our Snowplow DAGs. Airflow also has rudimentary EMR support to create ephemeral clusters and add steps to them.
Is it fair to say that Dataflow Runner is a lightweight orchestrator ("runner") for (not just) Snowplow pipelines? Does Dataflow Runner try to solve anything else that I might have missed?
Hi @rgabo - it sounds like we are thinking about all this in the same way.
Note that Sluice is no more - it was removed in Snowplow R91.
Yes, that’s correct. Although we still use EmrEtlRunner internally for all core Snowplow pipelines, we use Dataflow Runner for our R&D and non-standard/non-Snowplow jobs on EMR.
Dataflow Runner is built around the Unix philosophy - all it does is run jobflows, currently on EMR only. You can schedule it any way you like. And it's fully declarative - it's just Avro, so you can generate, lint or visualise a dataflow any way you like. (We are also planning a native integration between Factotum and Dataflow Runner in the future, so you get that "view-through" that you described between Airflow and EmrEtlRunner.)
Note that a future release of EmrEtlRunner will generate Dataflow Runner playbooks, and a later release will then remove the EMR orchestration functionality from EmrEtlRunner altogether:
Hi @alex - we do think alike; that's why we are happy, long-time Snowplow users.
Sluice is out, but our production pipeline is R87. We are putting together a new pipeline that has both streaming and batch components: Scala Stream Collector and Stream Enrich, with the S3 sink for the enriched events and the Elasticsearch sink for real-time feedback on events and bad rows. We would still like to shred and load enriched events into Redshift for data modeling and analysis, and we will most likely limit the data in Elasticsearch to a couple of weeks to keep storage requirements low.
Figuring out how exactly to orchestrate the batch steps of such a pipeline is why I am looking at Dataflow Runner and the new components. I really like that Sluice is no more and that both the staging and loading steps were moved to EMR.
I would imagine that it makes sense to stick with EmrEtlRunner for the time being, as it already delivers benefits such as staging logs and loading into Redshift from EMR. Dataflow Runner will just become part of that story, correct?