I was wondering what the recommended way is, as of 2018, to set up EmrEtlRunner for enriching events from the Clojure collector. There is this thread from 2016, "Should I use different EC2 instance types for EMR besides the default?", but I believe some things might have changed since.
So what is the currently recommended instance type for each of the following?
master node
core instances
task instances
What is the general workflow for figuring out the number of core and task instances, i.e. how do I recognize that my EMR cluster is over- or underpowered?
In my case, the gzipped hourly Tomcat logs on S3 are ~2 MB in size on average (maybe about 15k events/hr?). I think this is quite a small amount.
It depends on how often you want to run the EMR job: the more data there is to crunch through, the longer a run takes. Let’s say you want to run it once a day (15k events/hr × 24h = 360k events, which is a very small amount).
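To make that arithmetic explicit, here is a rough sketch. The function name is made up for illustration, and the inputs are just the approximate numbers from the question, so treat the result as a ballpark figure only.

```python
def batch_size(events_per_hour, mb_per_hour, hours_between_runs):
    """Estimate the events and gzipped log volume a single EMR run has to process."""
    events = events_per_hour * hours_between_runs
    megabytes = mb_per_hour * hours_between_runs
    return events, megabytes


# Numbers from the question: ~15k events and ~2 MB of gzipped logs per hour,
# processed once a day.
events, megabytes = batch_size(15_000, 2, 24)
print(f"daily run: {events:,} events, ~{megabytes} MB gzipped")
```

Changing the last argument (e.g. 1 for hourly runs) shows how the batch shrinks when you run more often.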
master node: The master node only coordinates the cluster, so it can be a pretty small instance. m4.large, which is the smallest size in the m4 family, should be sufficient.
core instances: You always want to use faster instances rather than more instances. I mostly go for 2-3 instances, and m4.large should be enough for your load.
task instances: I wouldn’t use task instances at all. They might make sense if you have terabytes of data to crunch through, but I have never used them.
underpowered/overpowered: I mostly try to keep a run at around 1h. If your runs take longer than that, consider using bigger instances. If a run only takes 20min, consider smaller instances, though it depends on how fast you need the data. Beware that it’s not really possible to get a run under 10min, because setting up the EMR cluster itself already takes 5-10min and there’s not really a way to speed that up, afaik.
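The rule of thumb above could be written down roughly like this. It is only a sketch: the function name, the return strings and the exact thresholds (60, 20 and 10 minutes) are my own encoding of the numbers mentioned here, not anything EMR or EmrEtlRunner provides, and you would read the run duration off the EMR console yourself.

```python
def sizing_advice(run_minutes, target_minutes=60, low_minutes=20, floor_minutes=10):
    """Suggest a direction for the instance size from one run's wall-clock duration.

    All thresholds are assumptions taken from the rule of thumb in this thread.
    """
    if run_minutes > target_minutes:
        # Runs longer than the ~1h window: the cluster is underpowered.
        return "scale up"
    if run_minutes <= floor_minutes:
        # Cluster setup alone takes 5-10min, so there is little left to gain.
        return "near setup floor"
    if run_minutes <= low_minutes:
        # Comfortably fast: smaller instances are probably fine.
        return "scale down"
    return "keep"


for minutes in (90, 45, 20, 8):
    print(minutes, "min ->", sizing_advice(minutes))
```

I would look at a few runs before changing anything, since a single slow or fast run can just be noise.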