Here’s our experience. We’re typically able to process 10 GB of data in an hour with four r3.8xlarge core nodes. We haven’t really tried r4s with EBS, since r3s have instance storage. I haven’t measured the performance impact of instance storage vs. EBS; it could be negligible. And there would be advantages to using a beefier box like an r4.16xlarge.
Our “master” node is an m1.medium. It does nothing but run the resource manager, so there’s no reason to make it any larger than that.
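In case it helps frame the r4.16xlarge question, here’s a rough sketch of how executor sizing might look on our r3.8xlarge core nodes (32 vCPUs, 244 GiB RAM each). The core/memory numbers are rule-of-thumb assumptions, not our measured production config:

    import org.apache.spark.sql.SparkSession

    // Illustrative sizing only. Heuristic: ~5 cores per executor, so
    // 6 executors/node uses 30 of 32 vCPUs, leaving headroom for the
    // OS and YARN daemons. 6 x (34g heap + 4g overhead) ~= 228 GiB,
    // which fits under the node's 244 GiB.
    val spark = SparkSession.builder()
      .appName("sizing-sketch")
      .config("spark.executor.cores", "5")
      .config("spark.executor.memory", "34g")
      .config("spark.yarn.executor.memoryOverhead", "4096") // MB
      .config("spark.executor.instances", "24")             // 6 per node x 4 nodes
      .getOrCreate()

A bigger box like an r4.16xlarge would just mean more executors per node under the same arithmetic.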
A single 28 GB file would definitely be a BAD idea as far as efficiency. For Spark jobs, the sweet spot is between 100 MB and 1 GB per file; we produce ~70 MB files. You want enough files to make full use of the parallelism in Spark.
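As a concrete illustration, here’s a minimal sketch of repartitioning so each output file lands in that sweet spot. The paths, the 28 GB input size, and the 128 MB target are all made-up assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("resize-files").getOrCreate()

    // Hypothetical input: one big 28 GB dataset.
    val df = spark.read.parquet("s3://bucket/big-input/")

    val totalBytes      = 28L * 1024 * 1024 * 1024  // assumed known up front
    val targetFileBytes = 128L * 1024 * 1024        // aim for ~128 MB per file
    val numFiles        = math.max(1, (totalBytes / targetFileBytes).toInt) // ~224

    // Spark writes one file per partition, so repartition to the
    // file count we want before writing.
    df.repartition(numFiles).write.parquet("s3://bucket/resized-output/")

With ~224 files instead of one, every executor core gets work instead of a single task grinding through the whole thing.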
Would be curious to hear others’ data points (particularly with r4.16xlarge instances).
Thanks
Rick