Optimum Estimation of resources for EMR

jimy2004king · November 28, 2016, 10:25am

Here we are trying to estimate the resources would be required to handle the load we plan to put in next 3 months.
Before jumping to questions, let me put down some context.
We did a load testing of 20M events within the period of 2 hours. The load test was done with the help of Avalanche tool.
We also processed the events through EMR. EMR took 1 day and 9 hours to complete processing of 20M events. The instance type used was m1.medium. Below was the average CPU utilization of all machines (1 master and 2 core).
Master - between 15 to 20% all the period.
Core I - more than 95% all the period.
Core II - more than 95% for first 3h, below 5% for next 6h, more than 90% for rest of period.

As learned from the above insights, it is cleared that CPU is the bottleneck for core machines. So the questions are ::

Why was the Core II CPU was idle for 6h ?
What should be the approach to reduce the time take by EMR job ? Increasing number of instances or increasing the compute capacities of the machines ?
Above 20M events were in 2 to 3 log files, but in real scenario there will be 24 log files containing this much events. Will that drastically change the CPU Utilization pattern from above ?

We would like EMR to be completed in max 4 hours. So what should be the configuration based on some some real life experience if you have ?

Thanks

sachinsingh10 · November 28, 2016, 12:26pm

@jimy2004king Great, we are trying to evaluate the same premise, I would love to hear views from snowplowers.

mike · November 28, 2016, 10:31pm

The first thing I’d recommend is to switch out the m1.medium (core/task) nodes for beefier machines. In March next year this instance type will be 5 years old so the performance in terms of compute/disk/networking will be significantly poorer than current generation machines (you’ll also pay more for older machines).

For EMR generally the c3 or c4 class instances are a pretty good bet using spot pricing. A great site I use to compare the specs and pricing on these machines directly is EC2 instances. It may take a little bit of tweaking with these instance types but you should be able to comfortably process that event volume in less than 4 hours.

jimy2004king · November 29, 2016, 3:33pm

@sachinsingh10 Sure would keep you posted.

@mike Thanks for your info, will switch out of m1.medium. One thing I noted while comparing the m1 vs c4 is m1 comes with it’s 410 GB of disk space. While c3 comes with 32 GB and c4 is EBS only. So I am quite not sure how much disk space I should go with ? Also how to configure the EBS space in etlemr runner config ?

jimy2004king · December 29, 2016, 7:33am

@sachinsingh10 Sorry couldn’t update you lately was busy settling things into production. So here it is we are using 2 m3.xlarge instance for core and m1.medium (Yes I know it’s old but right now its doing it work so don’t want to disturb it) for master. It’s doing a quite good job of processing 7 million events in around 2 hours. We couldn’t go with c3 or c4 as it comes with very limited disk space. And right now it’s not possible to attach a larger volume of EBS. So till that feature is ready we will stick with m3 option and after that experiment with c3 or c4.

sachinsingh10 · January 3, 2017, 7:49am

Thanks @jimy2004king. Are you enriching the records?

Regards
SS

jimy2004king · January 5, 2017, 11:46am

@sachinsingh10 Yes, Currently we use event_fingerprint_enrichment, ip_lookups, ua_parser and user-agent-utils enrichments.

Topic		Replies	Views
Increasing EMR Speed For engineers	3	931	December 12, 2018
Should I use different EC2 instance types for EMR besides the default? AWS batch pipeline (Legacy)	3	3980	December 22, 2016
Recommended ec2 instances for EMR ETL Runner in 2018 AWS batch pipeline (Legacy)	2	1978	September 4, 2018
EMR ETL perfomance Enrichment	11	2069	January 25, 2017
Learnings from using the new Spark EMR Jobs AWS batch pipeline (Legacy)	8	13552	August 23, 2017

Optimum Estimation of resources for EMR

Related topics