How to use persistent job flow option with EmrEtlRunner?

I am trying to create a more real-time data loading from S3 to Redshift. I read here that snowplow EmrEtlRunner might be able to do this using persistent job flow option.

May I know how to use this option? Do I need to start an EMR cluster myself for EmrEtlRunner to look for the persistent cluster? Is the EmrEtlRunner binaries constantly runs when using this option? Do all the steps in the Batch Pipeline Steps are constantly being run every x minutes?

Kindly let me know if there is a guide to use this since I can’t seem to find more details about this.

@aditya, you can read about persistent cluster support in the release blog.

This is what I am reading as I have mentioned above but I am still not sure how it works. Does it create and start a persistent cluster when I use --use-persistent-jobflow option? What happens in the event of steps failure? How do I do a recovery? Does the clusters shut off when there’s failure? and can I re-run the emr etl runner with ---use-persistent-jobflow again without creating new persistent cluster?

@aditya, EmrEtlRunner will check if the EMR cluster with the name specified in your config.yml with job_name exists. If it is the tasks will be submitted for execution on that cluster. If it is not available in WAITING state the EMR cluster will be spun and tasks submitted for execution.

Whether EMR cluster terminates or not depends on the failure - sometimes it does terminate as well but is likely not. If the fix is expected to take a long time you might want to terminate the cluster manually to save on running AWS cost.

To resume, follow the guide you already mentioned,

1 Like

@ihor What happened when the EMR cluster is in the waiting state when the job is done?

Let’s say I have 10 events in my S3 bucket ready to be ETL-ed and loaded into Redshift. Once the job done and the EMR cluster in the WAITING state, does the cluster check for another events in S3 bucket?

Or, I would have to run the EmrEtlRunner every x minutes to ensure that EMR cluster re-run the batch pipeline steps?

Also I have a question about the job-flow-duration. This option specifies how long a cluster should be running. Let’s say I run an emr-etl-runner with persistent-job-flow and 12 hours job-flow-duration at 7AM. When I re-run it at 6PM with the same option ie. persistent-job-flow and 12 hours job-flow-duration, does it extend the cluster for another 12 hours or it respects the original 12 hours duration and terminates at 7PM?

@aditya, you would need to schedule your EmrEtlRunner to run regularly. While EMR cluster persists EmrEtlRunner process terminates. When there is a need in the persistent EMR cluster that normally implies datastore (Redshift, for example) drip-feeding. That is you would schedule EmrEtlRunner to kick off every 5 minutes or so. If there is an ongoing data processing on the cluster then no new tasks will be submitted for execution.

The “job-flow-duration” describes how long EMR cluster was up and running (regardless of its state). That is when EmrEtlRunner starts it checks when EMR cluster was bootstrapped. If it exceeds the duration submitted with --persistent-jobflow-duration EmrEtlRunner will terminate it and spins up the new EMR cluster instead. If EmrEtlRunner is not scheduled to run frequently then there is no much sense in the persistent cluster.

Thanks a lot for answering my inquiries @ihor . I will need to try this out first before coming up with more questions. Sorry that I can’t just try this out right away and have to ask questions first since I don’t have development environment for my Snowplow setup (too costly for the AWS resources).