R89 Spark job underutilizing cluster

I don’t think the new Spark pipeline is fully utilizing the EMR cluster. I spun up a fairly large cluster (6 core nodes and 25 task nodes), but when I look at the YARN ResourceManager, the Spark History Server, and even EC2 monitoring, very few of the nodes are being utilized during a step.

Digging in a little more:

  1. The jobs have no EMR configuration (the JSON body is empty). I’m not sure if this is to be expected, or an issue on our side?
  2. Consequently, the maximizeResourceAllocation option is not set. It seems like it ought to be, but I’m not an expert here (see the sketch after this list). It’s documented here: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html.
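
If it helps, this is the kind of configuration entry I was expecting to see attached to the cluster. A minimal sketch only, assuming we’d pass it to boto3’s run_job_flow (or paste it into the console’s configuration field); it’s not our actual job definition:

```python
# Minimal sketch of an EMR "Configurations" entry that turns on
# maximizeResourceAllocation, e.g. run_job_flow(..., Configurations=configurations).
# This is an assumption about what the JSON body ought to contain, not what we ship today.
configurations = [
    {
        "Classification": "spark",
        "Properties": {
            # EMR derives executor memory/cores from the core/task instance type
            "maximizeResourceAllocation": "true",
        },
    },
]
```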

On a per-node level, I’m running r4.2xlarges (61 GB / 8 cores), but when I look at the Spark History Server, I see:

  1. spark.executor.memory is set to 5120M
  2. spark.executor.cores is set to 4.

Both of those values are underutilizing the hardware.

Not sure if this is a misconfiguration on my part or a bug?
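
For reference, this is roughly how I’d expect us to size the executors ourselves if that’s the intent, sketched against the spark-defaults classification. The values are purely illustrative for an r4.2xlarge, not a recommendation:

```python
# Sketch only: explicitly sizing executors instead of relying on whatever
# default is currently being picked up. Illustrative values, not tuned numbers.
spark_defaults = {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.executor.cores": "4",     # e.g. two executors per 8-core node
        "spark.executor.memory": "20g",  # leaves headroom for YARN / OS overhead
    },
}
```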

Thanks,
Rick

My understanding is that this can be the result of Spark’s dynamic allocation on YARN: when there is nothing to do at a particular point in time, resources are scaled down and you end up with a 5 GB / 4-core executor, which looks pale next to the instance specs (61 GB / 8 cores).
This also has the benefit of decoupling the number of executors from the number of EMR nodes.
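
For context, these are the dynamic-allocation knobs I believe are in play. The values below are illustrative, not pulled from our cluster:

```python
# Sketch of the Spark properties involved in dynamic allocation on YARN;
# illustrative values only.
dynamic_allocation = {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "0",
        "spark.dynamicAllocation.maxExecutors": "200",        # an upper bound, not a target
        "spark.dynamicAllocation.executorIdleTimeout": "60s", # idle executors are released
        # The external shuffle service is required for dynamic allocation on YARN
        "spark.shuffle.service.enabled": "true",
    },
}
```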

However, since our EMR cluster is supposed to be single-tenant (nothing but the batch pipeline runs on it), I agree that having a single executor per node gobble up all the resources it has to offer via maximizeResourceAllocation might be the way to go.

I’m curious as to what people think about the two different approaches.


I would have expected the cluster to be fully utilized for each step since, as you mention, it’s a dedicated resource for the job.

It was quite surprising (and would be costly) to spin up 20 nodes and only have 1 to 3 of them doing any work. As it stands, our pipeline keeps running out of resources during the enrichment phase, and without any way to directly influence the amount of resources used, this blocks us from upgrading.

Trying to educate myself more here. I’m not sure maximizeResourceAllocation is a silver bullet for everyone: see http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/. It could be good when sizing out to many small instances, but not a couple of large ones.
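
To make the trade-off concrete, here’s a rough back-of-the-envelope sizing in the spirit of that cheatsheet. The reserved overheads and the cores-per-executor choice are assumptions on my part, not measured values from our cluster:

```python
# Back-of-the-envelope executor sizing for the cluster described above.
# Overheads and executor_cores are assumptions, purely for illustration.
node_cores = 8          # r4.2xlarge vCPUs
node_mem_gb = 61        # r4.2xlarge memory
worker_nodes = 6 + 25   # core + task nodes

cores_reserved = 1      # leave a core per node for the OS / Hadoop daemons
mem_reserved_gb = 8     # leave memory per node for YARN / OS overhead

executor_cores = 3
executors_per_node = (node_cores - cores_reserved) // executor_cores
executor_mem_gb = (node_mem_gb - mem_reserved_gb) // executors_per_node

print(f"spark.executor.instances = {executors_per_node * worker_nodes}")
print(f"spark.executor.cores     = {executor_cores}")
print(f"spark.executor.memory    = {executor_mem_gb}g")
```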