Following the announcement of our intentions back in January, Snowplow’s EmrEtlRunner is now finally and officially a deprecated application.
While we will continue to evaluate and address any reported security vulnerabilities for a further 6 months, we will no longer add new features or fix new bugs. If you encounter issues with EmrEtlRunner you should move to the new RDB Loader estate as detailed below.
What was EmrEtlRunner?
EmrEtlRunner has been around since the very early days of Snowplow. It was used in older versions of the pipeline to coordinate an AWS EMR batch job that copied events around in S3, enriched the events, shredded events, and loaded events into Redshift.
The enrichment functionality was deprecated long ago in favour of the streaming versions of Enrich. The shredder/loader functionality became redundant when we released RDB loader version R35.
How should I run the RDB shredder/loader?
The new RDB shredder runs in EMR as a very simple 2-stage EMR job that copies data in S3 and shreds it. We recommend using Dataflow Runner to coordinate the EMR job, and we have an example playbook on our docs site. The new RDB loader runs completely outside of EMR as a standalone application.
We now have complete confidence that the new architecture of shredder/loader is production-ready, and better than anything we had before. Shredding and loading now run in parallel, and shredding can continue even when the warehouse is unavailable. Furthermore, we added loads of helpful new features to the standalone loader, such as folder monitoring and runtime metrics.
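To make the decoupling concrete, here is a minimal Python sketch of the idea, not Snowplow's actual implementation. The function names, the folder strings, and the in-memory queue are all hypothetical stand-ins; the sketch only assumes that the shredder announces finished folders somewhere the loader can pick them up later.

```python
import queue


def shred(batches, outbox):
    """Hypothetical shredder: announces each finished folder on the queue."""
    for batch in batches:
        outbox.put(f"shredded/{batch}")


def load(outbox, warehouse_up):
    """Hypothetical loader: drains announced folders only while the warehouse is reachable."""
    loaded = []
    while warehouse_up and not outbox.empty():
        loaded.append(outbox.get())
    return loaded


# The queue stands in for whatever messaging mechanism sits between the two apps.
outbox = queue.Queue()

# Warehouse is down: shredding still proceeds and its output waits in the queue.
shred(["run=1", "run=2"], outbox)
first_attempt = load(outbox, warehouse_up=False)

# Warehouse is back: the backlog and the newest folder load in arrival order.
shred(["run=3"], outbox)
second_attempt = load(outbox, warehouse_up=True)

print(first_attempt)
print(second_attempt)
```

Because the shredder never waits on the loader, a warehouse outage only delays loading; no shredded folder is lost, and the loader catches up once the warehouse returns.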
What does this mean if I still run EmrEtlRunner?
All previous versions of EmrEtlRunner will still be available on the GitHub releases page, so your pipeline will continue to work.
We recommend the upgrade guides on the Snowplow docs site to help you migrate to the newer architecture.