I have just run the EmrEtlRunner locally for the first time as a test, and the full run is taking longer than 1 hour to complete all the job steps on EMR. Is this normal?
What kind of average run time can I expect from the runner?
All the best,
The job could run from something like 20 minutes to a few hours. It all depends on
- number of events (log files) to process
- the size of the EMR cluster
What do you mean by “ran the EmrEtlRunner locally”? It’s expected to run on EC2 as it has to access various AWS services.
On rare occasions, the EMR job might get stuck at some task, in which case the cluster would have to be terminated manually. That’s really an extreme case though.
Sure, I will eventually run it on EC2 but I ran it locally with AWS credentials which gave it access to the necessary AWS services.
It ran with no errors but took 1hr 10 minutes to complete with only a couple of raw events in the “in” bucket.
I ran the EMR using an m1.small instance.
I vaguely remember reading a note about the EmrEtlRunner mentioning that some of the files need to be in separate S3 buckets (not just separate ‘folders’) otherwise the runner will have problems. Is this true and if so which raw/enriched/shredded files need to be in a separate bucket?
Around 1h should be fine. Setting up the machine (bootstrapping) alone sometimes takes 5 minutes, and depending on how fast the machines in your cluster are, the rest can take some time.
I also think an hour or so should be fine. You can speed it up by using faster instances, but this will of course increase cost. It all depends on how fast you want to process the data. If you want to run it hourly, I would recommend trying to get EMR under half an hour to allow for the data load.
Another thing to consider is that AWS charges by the hour. So a slower instance is cheaper per hour but if it needs e.g. one hour and ten minutes to complete you’re still paying for two hours.
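To make the hourly billing point concrete, here is a minimal Python sketch of the arithmetic. It assumes billing is rounded up to whole hours as described above. The function names and the rate used in the usage note are made up for illustration and are not real AWS prices.

```python
import math


def billed_hours(runtime_minutes):
    """Round a runtime up to whole hours (assumes per-hour billing)."""
    return math.ceil(runtime_minutes / 60)


def job_cost(runtime_minutes, hourly_rate, instance_count=1):
    """Estimated cost of one run; hourly_rate is a hypothetical price per instance."""
    return billed_hours(runtime_minutes) * hourly_rate * instance_count
```

For example, with a made-up rate of 0.5 per hour, `job_cost(70, 0.5)` bills two hours, while a faster instance that finishes in 30 minutes at double the rate, `job_cost(30, 1.0)`, costs the same.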
It requires a bit of trial and error in the beginning, and it’s very difficult to accurately predict the run time from the number of events alone. While a larger number of events will take longer, custom events can also have quite a lot of influence on how long the Shredding step takes.
With regard to your question about the separate buckets, is this what you mean:
do not put your raw:processing inside your raw:in bucket, or your enriched:good inside your raw:processing, or you will create circular references which EmrEtlRunner cannot resolve when moving files.
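One way to check a configuration against this rule is to test whether any configured location sits inside another. The sketch below is only an illustration: the helper names and the example paths are hypothetical and are not part of EmrEtlRunner.

```python
def normalize(uri):
    """Ensure exactly one trailing slash so prefix checks respect folder boundaries."""
    return uri.rstrip("/") + "/"


def is_nested(inner, outer):
    """True if `inner` is a location strictly inside `outer`."""
    inner, outer = normalize(inner), normalize(outer)
    return inner != outer and inner.startswith(outer)


def find_nested(locations):
    """Return (inner_name, outer_name) pairs for every nested combination."""
    problems = []
    for name_a, uri_a in locations.items():
        for name_b, uri_b in locations.items():
            if name_a != name_b and is_nested(uri_a, uri_b):
                problems.append((name_a, name_b))
    return problems


# Hypothetical bucket layout, for illustration only.
example = {
    "raw:in": "s3://my-bucket/raw/in",
    "raw:processing": "s3://my-bucket/raw/in/processing",
    "enriched:good": "s3://my-bucket/enriched/good",
}
```

Here `find_nested(example)` reports `raw:processing` as sitting inside `raw:in`, which is the layout the note warns against. Sibling folders in the same bucket, such as `s3://my-bucket/raw/in` and `s3://my-bucket/raw/processing`, are not flagged by this check.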