EMR job failing

Hi all,

I have a stream-enrich EMR job that consistently fails at the shredding step. While debugging we see that our enriched/good bucket has ~280K files but 0 data. They look to be all directories.

Thank you for the help!

Hi @matt.miller,

That appears to be the “shredded:good” bucket, not “enriched:good”. I believe what you have are empty directories named like “run=YYYY-MM-DD-hh-mm-ss”. Those are left over by the S3DistCp utility, which moves the files.

It’s a good idea to run some maintenance on both the “shredded:good” and “enriched:good” buckets and delete any empty directories and files. Once they accumulate, they can cause problems because it becomes hard to scan them all.

Could you please delete everything in the s3://pixel-shared-tenant-emr-30q5/comair-stream/shredded/good/ (shredded:good) bucket?

Could you also show a similar screenshot for s3://pixel-shared-tenant-emr-30q5/comair-stream/enriched/good (enriched:good bucket)?

Here is the correct screenshot. I cleaned out the shredded/good bucket.


@matt.miller, that is a lot of data to process in one go in a batch job. I assume this volume is exceptional and accumulated because of the shredding issue you encountered. I also expect you to have lots of empty folders in the “enriched:good” bucket. Therefore, my recommendation would be:

  1. Delete all the empty “run=…” folders. Only one of them, the one with the most recent timestamp, would contain enriched data. Make sure you do not delete that folder.
  2. Split your accumulated enriched data into two. That is, temporarily move half of it to some other location, leaving, say, 5-6 GB in “enriched:good”.
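
To make step 1 less error-prone, here is a minimal Python sketch of how the empty “run=…” folders could be identified from a bucket listing before anything is deleted. It is only an illustration: the function name `empty_run_folders` is hypothetical, the input is assumed to be a list of (key, size) pairs you have already collected from an S3 listing, and it assumes the leftovers are zero-byte objects (either keys ending in "/" or "_$folder$" marker objects). Check the result by hand before deleting anything.

```python
from collections import defaultdict

# Assumption: directory markers are zero-byte objects named "run=..._$folder$".
FOLDER_MARKER = "_$folder$"


def empty_run_folders(objects, prefix):
    """Return the names of run=... folders under `prefix` that hold no data.

    `objects` is an iterable of (key, size_in_bytes) pairs taken from a
    bucket listing. A folder counts as empty when the sizes of all
    objects that belong to it add up to zero.
    """
    totals = defaultdict(int)
    for key, size in objects:
        if not key.startswith(prefix):
            continue  # outside the bucket path we are cleaning
        # First path component below the prefix, e.g. "run=YYYY-MM-DD-hh-mm-ss".
        first = key[len(prefix):].lstrip("/").split("/", 1)[0]
        if first.endswith(FOLDER_MARKER):
            first = first[: -len(FOLDER_MARKER)]
        if first.startswith("run="):
            totals[first] += size
    return sorted(run for run, total in totals.items() if total == 0)
```

A folder that still contains enriched files has a non-zero total and is therefore never returned, which covers the "do not delete the most recent folder" caveat as long as that folder really holds data.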

Still, to process this volume of data, you need to configure your EMR cluster and the Spark job appropriately. The default configuration is not suitable, as your scaled-up EMR cluster would be underutilized. For clarity, here’s the “default” Spark configuration, which is OK for a small EMR cluster:

    yarn.resourcemanager.am.max-attempts: '1'
    maximizeResourceAllocation: 'true'

To process 5-6 GB of data, you need the following in your config.yaml file:

    yarn.nodemanager.vmem-check-enabled: "false"
    yarn.nodemanager.resource.memory-mb: "256000"
    yarn.scheduler.maximum-allocation-mb: "256000"
    maximizeResourceAllocation: "false"
    spark.dynamicAllocation.enabled: "false"
    spark.executor.instances: "49"
    spark.yarn.executor.memoryOverhead: "3072"
    spark.executor.memory: "22G"
    spark.executor.cores: "3"
    spark.yarn.driver.memoryOverhead: "3072"
    spark.driver.memory: "22G"
    spark.driver.cores: "3"
    spark.default.parallelism: "588"

The above is the Spark configuration for an EMR cluster with 5 x r5.8xlarge core nodes. It is sized to process 5-6 GB of enriched (gzipped) files, no more than that.
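
To see how the numbers in that configuration fit together, here is a rough arithmetic check in Python. It is a sketch under assumptions, not an official sizing formula: it assumes an r5.8xlarge node has 32 vCPUs, that the driver runs in a YARN container alongside the executors, and that containers spread evenly across the 5 core nodes. The function name `check_sizing` is made up for illustration.

```python
# Values copied from the configuration above.
EXECUTOR_INSTANCES = 49
EXECUTOR_MEMORY_GB = 22
MEMORY_OVERHEAD_MB = 3072
EXECUTOR_CORES = 3
DEFAULT_PARALLELISM = 588
YARN_NODE_MEMORY_MB = 256000
CORE_NODES = 5

# Assumption: vCPU count of one r5.8xlarge instance.
VCPUS_PER_NODE = 32


def check_sizing():
    """Derive per-node usage from the cluster-wide Spark settings."""
    # 49 executors plus 1 driver, each in its own YARN container.
    containers = EXECUTOR_INSTANCES + 1
    containers_per_node = containers // CORE_NODES
    # Spark's "22G" means 22 GiB; the overhead is added on top per container.
    container_mb = EXECUTOR_MEMORY_GB * 1024 + MEMORY_OVERHEAD_MB
    return {
        "containers_per_node": containers_per_node,
        "memory_mb_per_node": containers_per_node * container_mb,
        "cores_per_node": containers_per_node * EXECUTOR_CORES,
        "tasks_per_core": DEFAULT_PARALLELISM
        // (EXECUTOR_INSTANCES * EXECUTOR_CORES),
    }


sizing = check_sizing()
# Memory requested per node must fit in what YARN is allowed to hand out.
assert sizing["memory_mb_per_node"] <= YARN_NODE_MEMORY_MB
# Cores requested per node must fit in the instance's vCPUs.
assert sizing["cores_per_node"] <= VCPUS_PER_NODE
print(sizing)
```

Under these assumptions, 10 containers of 25600 MB each fill the 256000 MB given to YARN on every node, 30 of the 32 vCPUs are used, and the parallelism works out to 4 tasks per executor core. That is why the same values cannot simply be reused for a cluster of a different size.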

Bear in mind that an EMR cluster of a different size requires a different Spark configuration to process the enriched data at hand efficiently. If you let us know the typical volume of your data (the size of the enriched files you usually process), we can advise what the EMR cluster and the Spark configuration should be.

Give it a go and see if that works. Naturally, you would need to run EmrEtlRunner (EER) from the “shred” step (with --skip staging_stream_enrich). If that works, move the other half of the files back to the “enriched:good” bucket and run EER from the shred step again.

Hopefully, that works.

Thank you @ihor this worked.