I have faced an issue with running enrich process on AWS. It was working great for almost a year, but for some reason (our traffic is growing + number of events is growing also) I get a failed pipeline. In logs I have found root cause of failed enrich process:
Container killed by YARN for exceeding memory limits. (5.6Gb of 5.5Gb)
I was using 4 core m4.large instances + 1 master node with the same type m4.large.
I was looking for possible solutions of that issue and what I found is enable option spark.yarn.executor.memoryOverhead: “true” and advice to increase number of instances + improve types. I have switch to 6 m4.2xlarge nodes and rerun it. Now I get an error:
am container exited with exitcode 137, looker deeper in logs I managed to find out explain of that issue:
java.lang.OutOfMemoryError: Java heap space.
So every time I’m facing out of memory issue. I have check amount of files that placed in my enrich input directory and they are about 1.8Gb in total. Any advise how can I clarify the rootcause and fix it? Should I try increase number of node?
Update: I have try custom config recomendation mentioned here, also try to use 6 r4.2xlarge instances and still no luck. My monitoring tab on that cluster looks like so:
Still looking for any help on this issue.
@sphinks, it is of paramount importance to configure your Spark and the cluster according to the volume of data to be processed. This roughly correlated to the total size of the files/logs.
Note just bumping the number of nodes is likely not help as the resource usage is controlled by Spark configuration. Spark jobs are memory hungry and the workers need to be optimized based on the nodes composing EMR cluster. There are plenty of examples of how Spark needs to be configured in this forum.
I just give one more here. If say you utilize 1x
r4.8xlarge core node the Spark configuration would be like this
You can use this post as a guide to Spark configuration.
@ihor I have tryed your suggested settings with 1 x r4.8xlarge. It was much longer doing coping data from s3 to hdfs comparing to 6 x r4.2xlarge. According to EMR console it was done 101,601 task out of 635,314 and failed due to error:
ERROR Executor: Exception in task 102185.0 in stage 0.0 (TID 102185) org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout on all executors. Not sure what can I do in such case? Increase timeout? Any suggestions?
However, I met such idea: replace default GC with G1GC to minimize pauses.