As a rule of thumb, the latest suitable mN generation (the largest N available) gives you the best performance/cost ratio. Why not an m4.large, for instance? Since the job only runs for a short time, spot instances are a good option as well.
Can you try launching your cluster with Ganglia? It should show you the usual bottlenecks.
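To make the two suggestions above concrete, here is a rough sketch of what such a cluster request could look like if you launched it yourself through boto3's `run_job_flow`. The cluster name, release label, core node count, and default role names are assumptions for illustration, not taken from your setup, and the steps are left out. If you launch through another tool, the equivalent settings would go in that tool's config instead.

```python
def build_cluster_request(instance_type="m4.large", core_count=2):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values here (name, release label, roles, node count)
    are placeholder assumptions; adjust them to your own account.
    """
    def group(role, count):
        # One instance group per role, all requested on the spot market.
        return {
            "Name": f"{role.lower()}-nodes",
            "InstanceRole": role,
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return {
        "Name": "batch-pipeline-test",      # hypothetical cluster name
        "ReleaseLabel": "emr-5.36.0",       # assumed release that still ships Ganglia
        "Applications": [{"Name": "Hadoop"}, {"Name": "Ganglia"}],
        "Instances": {
            "InstanceGroups": [group("MASTER", 1), group("CORE", core_count)],
            # Transient cluster: shut down once the (omitted) steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default EMR roles
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request()

# To actually launch (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Running the master node on spot is a trade-off: it is cheaper, but losing it kills the whole run, so some people keep the master on-demand and only put the core nodes on spot.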
Unfortunately, bootstrapping the cluster alone already takes 5-10 min most of the time. In my experience you can't really get the whole pipeline to run in under 10-15 min, even if you choose bigger machines. It seems reasonable to me that if you want real-time data you have to use the real-time pipeline, and for everything else you have to wait at least ~20 min.
@tclass Yup, that's why we have two pipelines set up, real-time and batch. Even if you only want to process one event in batch, it'll take 15-20 min to complete all the steps in EMR. We run batch hourly, and on the real-time side we load to Redshift every 60 seconds.
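For a rough feel of what those numbers mean for data freshness, here is a back-of-the-envelope sketch. It assumes the worst case for batch is an event that arrives just after a run has started, so it waits one full interval and then the full processing time. The function name is made up for illustration, and the real-time figure only counts the load interval, not any upstream processing.

```python
def worst_case_staleness_min(interval_min, processing_min):
    """Worst-case minutes before an event is queryable.

    Assumes the event just missed a run: it waits one full scheduling
    interval, then the whole run has to complete.
    """
    return interval_min + processing_min


# Hourly batch with the 15-20 min EMR run quoted above.
batch_best = worst_case_staleness_min(60, 15)   # 75
batch_worst = worst_case_staleness_min(60, 20)  # 80

# Real-time side: a Redshift load every 60 seconds, load interval only.
realtime_interval_min = 60 / 60                 # 1.0

print(f"batch: up to {batch_best}-{batch_worst} min behind")
print(f"real-time: about {realtime_interval_min:.0f} min between loads")
```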