We are pleased to announce the release of Dataflow Runner, a new open-source system for creating and running AWS EMR jobflow clusters and steps.
This release signals the first step in our journey to deconstruct EmrEtlRunner into two separate applications, Dataflow Runner and snowplowctl, per our RFC on Discourse.
This is really cool stuff, Josh.
Does this mean that, with the correct playbooks/API calls, you could in theory have one persistent EMR cluster responsible for multiple runs? It might look something like:
- Bootstrap EMR cluster
- Run complete enrichment process
- Go into idle mode (optionally remove task/core nodes)
- Run step 2
@mike that’s correct! At the moment there would be no way to shut down nodes between runs, although EMR auto-scaling rules could provide the answer there…
The only caveat would be that your playbook would need to handle any cleanup required to get the cluster into a clean state ready for another enrichment process.
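For anyone who wants to picture the "one persistent cluster, many runs" idea, here is a rough Python sketch. It is not Dataflow Runner's playbook format; it only illustrates the underlying EMR calls, assuming `emr` is a `boto3.client("emr")` and the cluster was started so that it stays alive when it has no steps. The helper names, the placeholder `echo` enrichment command, and the `/tmp/run` cleanup path are all hypothetical.

```python
from typing import Any, Dict, List


def make_step(name: str, args: List[str]) -> Dict[str, Any]:
    """Build one EMR step definition that runs a command via command-runner.jar."""
    return {
        "Name": name,
        # CANCEL_AND_WAIT leaves the cluster running if the step fails,
        # which is what a persistent cluster needs.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def run_on_cluster(emr: Any, cluster_id: str, steps: List[Dict[str, Any]]) -> List[str]:
    """Submit steps to an already-running cluster and block until each one completes."""
    step_ids = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)["StepIds"]
    waiter = emr.get_waiter("step_complete")
    for step_id in step_ids:
        waiter.wait(ClusterId=cluster_id, StepId=step_id)
    return step_ids


def enrichment_run(emr: Any, cluster_id: str, run_id: str) -> List[str]:
    """One run on the shared cluster: the job itself, then a cleanup step."""
    steps = [
        # Placeholder command; a real run would launch the enrichment job here.
        make_step(f"enrich-{run_id}", ["echo", "enrich"]),
        # Hypothetical cleanup so the next run starts from a clean state.
        make_step(f"cleanup-{run_id}", ["hdfs", "dfs", "-rm", "-r", "-f", "/tmp/run"]),
    ]
    return run_on_cluster(emr, cluster_id, steps)
```

Calling `enrichment_run(emr, cluster_id, run_id)` repeatedly against the same cluster ID would give the multi-run behaviour discussed above, with the cleanup step standing in for whatever your playbook needs to reset between runs.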