Nice work on the Spark release! Our pipeline ran successfully a few times, but as I was experimenting with instance types, the Shred step failed 2 hours into the job. This is probably memory-related, but I wasn't expecting this with 4x c4.4xlarge instances (30 GB of memory each).
Here’s the stderr file from one of the containers:
Update: Running EmrEtlRunner with --process-shred, the Shred step fails 10 minutes into the job. Same error in the logs. Trying 4x r3.2xlarge now.
spark.yarn.executor.memoryOverhead defaults to 10% of the executor memory, which in your case should be a bit less than 3 GB. The 5.5 GB in your logs is a bit surprising to me.
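For reference, here's a minimal sketch of where that 10% figure comes from: Spark 1.x/2.x on YARN uses max(10% of executor memory, 384 MB) when the property isn't set explicitly. The ~27 GB executor heap below is my assumption, sized to a c4.4xlarge, not your actual setting:

```scala
// Default executor memory overhead on YARN (Spark 1.x/2.x):
// max(10% of executor memory, 384 MB) when not set explicitly.
// The 27 GB heap is an assumption, not taken from the job config.
val executorMemoryMb  = 27 * 1024
val defaultOverheadMb = math.max((0.10 * executorMemoryMb).toInt, 384)
// defaultOverheadMb == 2764 MB, i.e. a bit under 3 GB -- nowhere near 5.5 GB
```

If the Shred job genuinely needs more off-heap room, you can also raise the property explicitly rather than relying on the default, e.g. `--conf spark.yarn.executor.memoryOverhead=3072` on spark-submit (the 3072 MB value is just an illustration).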
To minimize this overhead, you can distribute the work across more instances, even if they are smaller: the bigger each executor's memory pool, the bigger its overhead.
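To make that concrete, here's a rough comparison under the same 10% default; both cluster shapes are hypothetical, not taken from the job above:

```scala
// Per-executor overhead scales with executor memory, so smaller executors
// each need less off-heap headroom inside their YARN container.
def defaultOverheadMb(executorMemoryMb: Int): Int =
  math.max((0.10 * executorMemoryMb).toInt, 384)

val fewLargeExecutors  = defaultOverheadMb(27 * 1024) // ~2.7 GB headroom each
val manySmallExecutors = defaultOverheadMb(13 * 1024) // ~1.3 GB headroom each
```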