I’m not really familiar with cron jobs, but I’d like to schedule the following:
run the snowplow-emr-etl-runner
run the Snowflake loader. Today I run these data flow tasks from the command line
run a series of SQL scripts against Snowflake.
I’m wondering if I should create a shell script that gets started by the cron job. The shell script would ensure the sequence of events. Any thoughts? Any scripts that you may have are greatly appreciated.
@sonnypolaris, that is exactly how we manage the pipelines for our clients at the moment. To help schedule and organize the steps to be executed, we also use our in-house, open-source tools: Factotum (wrapping EmrEtlRunner), Dataflow Runner (wrapping the Snowflake transformer and loader and/or EmrEtlRunner), and SQL Runner (to run data models on data in Redshift, Snowflake, or BigQuery).
How do you kick off Dataflow Runner after EmrEtlRunner completes? Does EmrEtlRunner run synchronously with the EMR cluster, i.e. does it block until the cluster has finished?
My current cron job looks something like: /home/ec2-user/snowplow/snowplow-emr-etl-runner run -c /home/ec2-user/snowplow/config.yml -r /home/ec2-user/snowplow/iglu_resolver.json
If I just append && dataflow-runner run-transient --emr-config cluster.json --emr-playbook playbook.json is that going to work, or is it going to launch the second EMR cluster too soon, before the first has finished?
@davehowell, the EMR cluster is terminated by EmrEtlRunner (if the cluster is used in transient mode). Dataflow Runner should then spin up a new cluster. I do not expect any conflicts between the clusters, even if Dataflow Runner requests a new EMR cluster while the one used by EmrEtlRunner has not fully terminated yet.
Thanks for your reply, that gives me more confidence. I wasn’t thinking about conflicts, but about making sure it waits until all the enriched files are finished before the Snowflake transformer and Snowflake loader begin the next stage.
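For anyone landing here later, the sequencing discussed above could be sketched as a small wrapper script that cron starts. This is only a sketch and rests on an assumption worth verifying for your versions: that each tool returns only once its work has finished and exits non-zero on failure, so that `&&` stops the chain. The names `run_step` and `run_pipeline`, the `run` argument, and the `run_sql_scripts.sh` path are hypothetical; the EmrEtlRunner and Dataflow Runner commands are copied from the posts above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper script (e.g. run_pipeline.sh) to be started by cron.

# Run one named step, log its start and outcome, and return its exit status.
run_step() {
  local name="$1"
  shift
  echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') starting: ${name}"
  "$@"
  local status=$?
  if [ "$status" -ne 0 ]; then
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') failed: ${name} (exit ${status})" >&2
  else
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') finished: ${name}"
  fi
  return "$status"
}

# Chain the steps with &&: each step starts only after the previous command
# has exited with status 0, and the chain stops at the first failure.
run_pipeline() {
  run_step "EmrEtlRunner" \
    /home/ec2-user/snowplow/snowplow-emr-etl-runner run \
      -c /home/ec2-user/snowplow/config.yml \
      -r /home/ec2-user/snowplow/iglu_resolver.json &&
  # Relative paths such as cluster.json resolve against the working
  # directory cron uses, so absolute paths may be safer here.
  run_step "Dataflow Runner (Snowflake transformer and loader)" \
    dataflow-runner run-transient \
      --emr-config cluster.json \
      --emr-playbook playbook.json &&
  # Hypothetical placeholder for whatever runs the SQL scripts against Snowflake.
  run_step "SQL scripts" /home/ec2-user/snowplow/run_sql_scripts.sh
}

# Start the pipeline only when called with the (hypothetical) "run" argument,
# so the functions can also be sourced and tried out on their own.
if [ "${1:-}" = "run" ]; then
  run_pipeline
fi
```

A crontab entry could then point at the script instead of at EmrEtlRunner directly, for example `0 3 * * * /home/ec2-user/snowplow/run_pipeline.sh run >> /home/ec2-user/snowplow/pipeline.log 2>&1` (the schedule and log path are made up for illustration). Whether a later step really waits for all enriched files depends on the earlier command blocking until its EMR job is done, so it is worth confirming that with a manual run first.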