Don’t apologise! We’re happy to see you’re engaged and asking for help!
The EMR cluster should shut down if it’s created as a transient cluster. Otherwise, it’ll be persistent and will just wait for the next job to run.
At volume, it can be more efficient to run the load jobs on a persistent cluster and load in a micro-batch style (i.e. kicking off a new job on the same cluster as soon as the last job finishes), since there’s a cost to the time that the cluster takes to spin up and down again.
If you don’t need that, I believe you’ll just need to make a change in the config which creates the EMR cluster (if memory serves it’s part of your dataflow-runner configuration).
Yep - you should be able to use the run-transient command in Dataflow runner for this.
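If the runs are going to be scheduled, a minimal sketch of wrapping that command in Python could look like this. It assumes the dataflow-runner binary is on your PATH and that the cluster and playbook are passed with --emr-config and --emr-playbook; the file names cluster.json and playbook.json are placeholders, not taken from this thread.

```python
import subprocess


def run_transient_command(emr_config, emr_playbook, binary="dataflow-runner"):
    """Build the argument list for a transient run (start cluster, run steps, stop cluster)."""
    return [
        binary, "run-transient",
        "--emr-config", emr_config,
        "--emr-playbook", emr_playbook,
    ]


def run_transient(emr_config, emr_playbook):
    """Run the command and return its exit code (non-zero means the run failed)."""
    return subprocess.run(run_transient_command(emr_config, emr_playbook)).returncode


# Usage from a scheduler (placeholder paths, substitute your own files):
#   exit_code = run_transient("cluster.json", "playbook.json")
```

Returning the exit code lets the scheduler flag a failed run, which is also a good moment to check whether the cluster was left running.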
As Colm has mentioned, there’s a small cost associated with spinning up (bootstrapping) an EMR cluster, so micro-batching on a persistent cluster is quite often cheaper and makes it easier to load more frequent batches into Snowflake. I’m not sure about your data volume, but it looks like the loader is only taking 2 minutes while the transformer is taking 30 minutes, so you could well see some performance improvements by changing the node types.
Depending on your volume of data, you could likely use a smaller master node and upgrade both nodes from m2 to m4 or m5 to see some performance and cost improvements. An m5.xlarge will give you twice as many vCPUs (4) as the equivalent m2.xlarge for approximately the same cost, as well as faster networking, which will speed up copy operations between S3 and the EMR cluster.
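To put rough numbers on the spin-up trade-off, here is a small back-of-the-envelope sketch. The 30 and 2 minute job times are the ones quoted above; the 10 minute spin-up/spin-down overhead is purely an assumed figure, and the model assumes both setups use the same instance types at the same hourly price, so only billable cluster minutes are compared.

```python
MINUTES_PER_DAY = 24 * 60


def transient_minutes_per_day(runs_per_day, job_minutes, overhead_minutes):
    """Billable minutes per day when every run gets its own short-lived cluster."""
    return runs_per_day * (job_minutes + overhead_minutes)


def break_even_runs_per_day(job_minutes, overhead_minutes):
    """Runs per day at which transient clusters bill as much as one always-on cluster."""
    return MINUTES_PER_DAY / (job_minutes + overhead_minutes)


job = 30 + 2    # transformer + loader minutes quoted in this thread
overhead = 10   # ASSUMED spin-up + spin-down minutes, measure your own

print(transient_minutes_per_day(4, job, overhead))        # 168
print(round(break_even_runs_per_day(job, overhead), 1))   # 34.3
```

Under those assumptions, a few runs a day is far cheaper on transient clusters, and a persistent cluster only starts to pay off once you are loading more or less continuously.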
Thanks for the info @Colm and @mike. We’ll assess making the cluster persistent a bit later on. I’ll also test performance when upgrading the nodes to m4/m5 (I think I tried this before and it failed for some reason, but I’ll try again).
For now we’d like to have the enrich and storage process scheduled to run a few times a day, which is why we want to run it in transient mode.
The following command is being used when executing dataflow-runner; however, the EMR cluster is not terminating:
Hey guys, just wondering if you possibly had any other ideas on things I could check to try and get the EMR cluster to terminate after a successful run of dataflow-runner? I couldn’t find anything in any of the config JSON files relating to the process.
Is Dataflow Runner generating any output / logs? (If not, run with --log-level debug.) You should see some logs indicating that it is attempting to terminate the cluster (assuming no intermediate errors).
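Alongside the logs, it can help to look at what EMR itself reports for the cluster. The sketch below is a hypothetical helper that summarises the response dict returned by boto3's EMR describe_cluster call; it assumes you have boto3 and AWS credentials set up, and the sample response here is hand-written for illustration, not real output.

```python
ENDED_STATES = ("TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS")


def summarize_cluster(response):
    """Pull the state and last state-change reason out of a describe_cluster response."""
    status = response["Cluster"]["Status"]
    reason = status.get("StateChangeReason", {})
    return {
        "state": status["State"],
        "reason_code": reason.get("Code"),
        "reason_message": reason.get("Message"),
        "still_running": status["State"] not in ENDED_STATES,
    }


# Hand-written sample; a real one would come from something like
#   boto3.client("emr").describe_cluster(ClusterId="<your cluster id>")
sample = {"Cluster": {"Status": {"State": "WAITING", "StateChangeReason": {}}}}
print(summarize_cluster(sample)["still_running"])  # True
```

A cluster sitting in WAITING after the steps have finished is idle and waiting for more work, which is what you'd expect to see if the terminate call never happened.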
Well this is quite bizarre. I just re-ran the emr-etl-runner and dataflow-runner process about 5 more times, and every single time the dataflow-runner EMR cluster correctly terminated itself!
When I was testing this before and having issues, it was transforming/loading a much larger amount of enriched events (30+ GB). This is the only difference compared to the recent runs that worked (same config files, same execution command).
There are log files that were generated in the S3 bucket for all previous dataflow-runner jobs, including those that did not auto-terminate the EMR cluster. I looked through the logs relating to the jobs that did not auto-terminate the EMR cluster and couldn’t see anything that stood out. Is there a certain part of a certain file I should look at? I’d love to find out why the issue was occurring previously in case it helps others.