Increasing EMR Speed

joaocorreia · December 12, 2018, 8:37am

Hi Snowplowers,

I don’t have a lot of hits on my batch pipeline. I’m using m1.medium instances for my daily EMR job, and it still takes a good 20 minutes.

  master_instance_type: m1.medium
  core_instance_count: 2
  core_instance_type: m1.medium

Would you suggest moving to m1.large to increase processing speed? I can’t use t2, can I?

Thanks
Joao Correia

aldrinleal · December 12, 2018, 9:31am

Joao,

As a rule of thumb, using the latest suitable mN generation (largest N possible) gets you the best performance/cost ratio. Why not an m4.large, for instance? Given the short run, spot instances are a good option as well.
Can you try launching your cluster with Ganglia? It would show you the usual culprits (bottlenecks).
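
For illustration, here is a rough sketch of what the m4.large and spot suggestion might look like in the jobflow settings quoted above. The `core_instance_bid` and `core_instance_ebs` keys are written as I recall them from the EmrEtlRunner sample config, so treat them as assumptions and check them against the sample config for your release. The bid price and volume size are made-up placeholders, not recommendations.

```yaml
# Sketch only: same three keys as in the original post, switched to m4.large.
  master_instance_type: m4.large
  core_instance_count: 2
  core_instance_type: m4.large
# Assumed key: spot bid in USD (hypothetical value); leave blank for on-demand.
  core_instance_bid: 0.05
# Assumed key: m4 instances are EBS-only, so core nodes need an attached volume.
  core_instance_ebs:
    volume_size: 100      # GB, illustrative
    volume_type: "gp2"
    ebs_optimized: false
```

Bigger instances only shorten the processing steps, so it is worth comparing step durations before and after the change rather than total run time.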

tclass · December 12, 2018, 9:35am

Unfortunately, bootstrapping the cluster already takes 5-10 min most of the time. In my experience you can’t really get the whole pipeline to run in under 10-15 min, even if you choose bigger machines. The reasonable solution seems to be: if you want real-time data, use the real-time pipeline; for the rest, you have to wait at least ~20 min.

mjensen · December 12, 2018, 2:11pm

@tclass Yup, that’s why we have two pipelines set up as well: real-time and batch. Even if you only want to process one event in batch, it’ll take 15-20 minutes to complete all the steps in EMR. We run batch hourly, and on the real-time side we load to Redshift every 60 seconds.
