Nice work on the Spark release! Our pipeline ran successfully a few times, but as I was experimenting with instance types, the Shred step failed 2 hours into the job. This is probably memory-related, but I wasn't expecting this with 4x c4.4xlarge instances (30 GB of memory each).
Here’s the stderr file from one of the containers:
Update: Running EmrEtlRunner with --process-shred, the Shred step fails 10 minutes into the job. Same error in the logs. Trying 4x r3.2xlarge now.
spark.yarn.executor.memoryOverhead defaults to 10% of the executor memory, which in your case should be a bit less than 3 GB. The 5.5 GB in your logs is a bit surprising to me.
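For reference, here's a minimal sketch of where that 10% figure comes from: Spark 1.x/2.x on YARN uses max(10% of executor memory, 384 MB) when the property isn't set explicitly. The ~27 GB executor heap below is my assumption, sized to a c4.4xlarge, not your actual setting:

```scala
// Default executor memory overhead on YARN (Spark 1.x/2.x):
// max(10% of executor memory, 384 MB) when not set explicitly.
// The 27 GB heap is an assumption, not taken from the job config.
val executorMemoryMb  = 27 * 1024
val defaultOverheadMb = math.max((0.10 * executorMemoryMb).toInt, 384)
// defaultOverheadMb == 2764 MB, i.e. a bit under 3 GB -- nowhere near 5.5 GB
```

If the Shred job genuinely needs more off-heap room, you can also raise the property explicitly rather than relying on the default, e.g. `--conf spark.yarn.executor.memoryOverhead=3072` on spark-submit (the 3072 MB value is just an illustration).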
To minimize this overhead, you can distribute the work across more instances, even if they are smaller: the bigger each executor's memory pool, the bigger its overhead.
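To make that concrete, here's a rough comparison under the same 10% default; both cluster shapes are hypothetical, not taken from the job above:

```scala
// Per-executor overhead scales with executor memory, so smaller executors
// each need less off-heap headroom inside their YARN container.
def defaultOverheadMb(executorMemoryMb: Int): Int =
  math.max((0.10 * executorMemoryMb).toInt, 384)

val fewLargeExecutors  = defaultOverheadMb(27 * 1024) // ~2.7 GB headroom each
val manySmallExecutors = defaultOverheadMb(13 * 1024) // ~1.3 GB headroom each
```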