Cron Job for emr-etl and snowflake data

sonnypolaris · August 12, 2019, 3:02pm

I’m not really familiar with cron jobs, but I’d like to schedule the following:

1. run the snowplow-emr-etl-runner
2. run the Snowflake loader (today I run this as a command-line data flow task)
3. run a series of SQL scripts against Snowflake

I’m wondering if I should create a shell script that gets started by the cron job. The shell script would ensure the sequence of events. Any thoughts? Any scripts you can share would be greatly appreciated.
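The wrapper-script idea above can be sketched roughly as follows. This is only an illustration, not a tested Snowplow setup: `run_in_sequence` is a hypothetical helper, and the three `echo` steps are stand-ins for the real EmrEtlRunner, Snowflake loader, and SQL commands.

```shell
#!/usr/bin/env bash
# Sketch of a cron-driven wrapper: run steps in order, stop at the first failure.

# run_in_sequence CMD...  (hypothetical helper; each argument is one command line)
run_in_sequence() {
  local step
  for step in "$@"; do
    echo "starting: $step"
    # bash -c runs the step as its own process and waits for it to exit.
    if ! bash -c "$step"; then
      echo "failed: $step" >&2
      return 1
    fi
  done
  echo "all steps finished"
}

# Stand-in steps so the sketch runs anywhere; replace with the real commands.
run_in_sequence \
  "echo 'step 1: EmrEtlRunner'" \
  "echo 'step 2: Snowflake loader'" \
  "echo 'step 3: SQL scripts'"
```

Cron would then call the one script instead of the individual tools, for example with a crontab line such as `0 3 * * * /path/to/run_pipeline.sh >> /path/to/pipeline.log 2>&1` (path and schedule are placeholders). A later step starts only once the earlier command's process has exited successfully.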

ihor · August 12, 2019, 7:27pm

@sonnypolaris, that is exactly how we manage the pipelines for our clients at the moment. To schedule and organize the steps to be executed, we also use our own open-source tools: Factotum (wrapping EmrEtlRunner), Dataflow Runner (wrapping the Snowflake transformer and loader and/or EmrEtlRunner), and SQL Runner (to run data models on Redshift, Snowflake, or BigQuery).

davehowell · March 18, 2020, 12:19am

@ihor

How do you kick off Dataflow Runner after EmrEtlRunner completes? Is EmrEtlRunner synchronous with the EMR cluster, i.e. does it wait until the cluster has finished?

My current cron job looks something like:

    /home/ec2-user/snowplow/snowplow-emr-etl-runner run -c /home/ec2-user/snowplow/config.yml -r /home/ec2-user/snowplow/iglu_resolver.json

If I just append && dataflow-runner run-transient --emr-config cluster.json --emr-playbook playbook.json, is that going to work, or is it going to launch the second EMR cluster too soon, before the first has finished?
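On the shell side, `&&` starts the right-hand command only after the left-hand process has exited with status 0. So the chain waits for EMR only if EmrEtlRunner itself stays in the foreground until the cluster's work is done, which is the question being asked here. A small sketch with stand-in functions (`fake_etl` and `fake_loader` are hypothetical, not Snowplow commands):

```shell
#!/usr/bin/env bash
# Stand-in for a long-running first stage; exits with the status passed in (default 0).
fake_etl() { sleep 1; echo "etl done"; return "${1:-0}"; }

# Stand-in for the second stage.
fake_loader() { echo "loader started"; }

# First stage succeeds: the loader starts only after fake_etl has returned.
fake_etl 0 && fake_loader

# First stage fails: the loader is skipped entirely.
fake_etl 1 && fake_loader || echo "loader skipped"
```

If the first tool returned immediately after submitting work to the cluster, `&&` alone would not be enough, and the script would need its own wait or polling step.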

ihor · March 18, 2020, 5:22pm

@davehowell, the EMR cluster is terminated by EmrEtlRunner (if using the cluster in transient mode). Dataflow Runner should spin up a new cluster. I do not expect any conflicts between the clusters, even if Dataflow Runner requests a new EMR cluster while the cluster used by EmrEtlRunner has not yet fully terminated.

davehowell · March 19, 2020, 3:43am

Thanks for your reply, that gives me more confidence. I wasn’t thinking about conflicts, but about making sure it waits until all the enriched files are finished before the Snowflake transformer and Snowflake loader begin the next stage.
