EMR job failing

matt.miller · November 13, 2021, 12:43am

Hi all,

I have a stream-enrich EMR job that consistently fails at the shredding step. While debugging, we found that our enriched/good bucket holds ~280K objects but 0 bytes of data. They all appear to be directories.

Thank you for the help!

ihor · November 13, 2021, 3:00am

Hi @matt.miller,

That appears to be the “shredded:good” bucket, not “enriched:good”. I believe what you are seeing are empty directories named like “run=YYYY-MM-DD-hh-mm-ss”. They are left over by the S3DistCp utility that moves the files.

It’s a good idea to run regular maintenance on both the “shredded:good” and “enriched:good” buckets and delete any empty directories and files. Once they accumulate, they can cause problems because scanning them all becomes slow.
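As a rough illustration of that maintenance step, here is a minimal Python sketch that picks out the empty “run=…” folders from a bucket listing. It works on plain (key, size) pairs so it can be tried without AWS access; in practice the pairs would come from whatever S3 listing tool you use. The function name, the sample keys, and the assumption that an empty folder shows up only as zero-byte objects are all illustrative and not taken from this thread.

```python
import re
from collections import defaultdict

# Matches folder names of the form run=YYYY-MM-DD-hh-mm-ss.
RUN_RE = re.compile(r"run=\d{4}(?:-\d{2}){5}")


def empty_run_folders(objects):
    """Return the run= folder names whose objects total zero bytes.

    `objects` is an iterable of (key, size_in_bytes) pairs.
    Keys that do not contain a run= folder name are ignored.
    """
    totals = defaultdict(int)
    for key, size in objects:
        match = RUN_RE.search(key)
        if match:
            totals[match.group(0)] += size
    return sorted(run for run, total in totals.items() if total == 0)


# Hypothetical listing: two placeholder-only folders and one with data.
listing = [
    ("enriched/good/run=2021-11-01-00-00-00_$folder$", 0),
    ("enriched/good/run=2021-11-02-00-00-00/", 0),
    ("enriched/good/run=2021-11-12-23-10-05/part-00000.gz", 1048576),
]

for run in empty_run_folders(listing):
    print("candidate for deletion:", run)
```

Reviewing the printed candidates before deleting anything is the safer route, since a folder that still holds data must be kept.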

Could you please delete everything in the s3://pixel-shared-tenant-emr-30q5/comair-stream/shredded/good/ (shredded:good) bucket?

Could you also show a similar screenshot for s3://pixel-shared-tenant-emr-30q5/comair-stream/enriched/good (enriched:good bucket)?

matt.miller · November 13, 2021, 6:42pm

Here is the correct screenshot. I cleaned out the shredded/good bucket.

Thanks!

ihor · November 14, 2021, 3:53am

@matt.miller, that is a lot of data to process in one go in a batch job. I assume that is exceptional and that it accumulated because of the shredding issue you encountered. I also expect you to have lots of empty folders in the “enriched:good” bucket. Therefore, my recommendation would be:

Delete all the empty “run=…” folders. Only the one with the most recent timestamp should contain enriched data, so make sure you do not delete that folder.
Split your accumulated enriched data in two. That is, temporarily move half of it to some other location, leaving, say, 5-6 GB in “enriched:good”.
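The split in the second recommendation can be planned before any files are moved. The sketch below is one possible way to do it, assuming you already have a list of (key, size) pairs for the enriched files; the function name, the sample keys, and the 6 GB budget are hypothetical.

```python
def split_by_size(objects, keep_bytes):
    """Partition (key, size) pairs into (keep, move_aside) key lists.

    Files are taken in key order and kept while the running total
    stays within keep_bytes; everything else is set aside.
    """
    keep, move_aside, used = [], [], 0
    for key, size in sorted(objects):
        if used + size <= keep_bytes:
            keep.append(key)
            used += size
        else:
            move_aside.append(key)
    return keep, move_aside


GB = 1024 ** 3

# Hypothetical enriched files of 3 GB each.
files = [
    ("enriched/good/run=2021-11-12-23-10-05/part-00000.gz", 3 * GB),
    ("enriched/good/run=2021-11-12-23-10-05/part-00001.gz", 3 * GB),
    ("enriched/good/run=2021-11-12-23-10-05/part-00002.gz", 3 * GB),
    ("enriched/good/run=2021-11-12-23-10-05/part-00003.gz", 3 * GB),
]

keep, move_aside = split_by_size(files, keep_bytes=6 * GB)
print("stay in enriched:good:", keep)
print("move to a temporary location:", move_aside)
```

The function only produces the two lists; the actual move is left to your S3 tooling, so the plan can be checked first.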

Still, to process this volume of data, you need to configure your EMR cluster and the Spark job appropriately. The default configuration is not suitable, because a scaled-up EMR cluster would be underutilized with it. For reference, here is the “default” Spark configuration, which is fine for a small EMR cluster:

configuration:
  yarn-site:
    yarn.resourcemanager.am.max-attempts: '1'
  spark:
    maximizeResourceAllocation: 'true'

To process 5-6 GB of data, you need the following in your config.yaml file:

configuration:
  yarn-site:
    yarn.nodemanager.vmem-check-enabled: "false"
    yarn.nodemanager.resource.memory-mb: "256000"
    yarn.scheduler.maximum-allocation-mb: "256000"
  spark:
    maximizeResourceAllocation: "false"
  spark-defaults:
    spark.dynamicAllocation.enabled: "false"
    spark.executor.instances: "49"
    spark.yarn.executor.memoryOverhead: "3072"
    spark.executor.memory: "22G"
    spark.executor.cores: "3"
    spark.yarn.driver.memoryOverhead: "3072"
    spark.driver.memory: "22G"
    spark.driver.cores: "3"
    spark.default.parallelism: "588"

The above Spark configuration is for an EMR cluster with 5 x r5.8xlarge core nodes. It can process 5-6 GB of enriched (gzipped) files, no more than that.
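The numbers in that configuration fit together arithmetically, and the sketch below reproduces them. The thread does not say how they were derived, so treat this as one plausible reading: it assumes 10 containers per node, one container reserved for the driver, and a parallelism factor of 4 tasks per executor core. The function and parameter names are hypothetical.

```python
def spark_sizing(core_nodes, yarn_memory_mb, containers_per_node,
                 cores_per_container, overhead_mb, parallelism_factor):
    """Derive spark-defaults values from cluster size (illustrative only)."""
    # Memory available to each YARN container on a node.
    container_mb = yarn_memory_mb // containers_per_node
    # Heap memory is what remains after the memory overhead, in whole GB.
    memory_gb = (container_mb - overhead_mb) // 1024
    # One container across the cluster is assumed to host the driver.
    executors = core_nodes * containers_per_node - 1
    parallelism = executors * cores_per_container * parallelism_factor
    return {
        "spark.executor.instances": str(executors),
        "spark.executor.memory": f"{memory_gb}G",
        "spark.executor.cores": str(cores_per_container),
        "spark.default.parallelism": str(parallelism),
    }


# Inputs matching the configuration above: 5 core nodes, 256000 MB for YARN
# per node, 3 cores and 3072 MB overhead per container.
settings = spark_sizing(
    core_nodes=5,
    yarn_memory_mb=256000,
    containers_per_node=10,   # assumption inferred from the numbers
    cores_per_container=3,
    overhead_mb=3072,
    parallelism_factor=4,     # assumption inferred from the numbers
)
for name, value in settings.items():
    print(f"{name}: {value}")
```

With these inputs each container gets 25600 MB (22 GB of heap plus 3072 MB of overhead), and 10 such containers use exactly the 256000 MB set in yarn.nodemanager.resource.memory-mb. A different node type or node count changes every derived value, which is why the configuration cannot be reused as-is on another cluster.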

Bear in mind that a differently sized EMR cluster requires a different Spark configuration to be both efficient and sufficient for the enriched data at hand. If you let us know your typical data volume (the size of the enriched files you usually process), we can advise what the EMR cluster and the Spark configuration should be.

Give it a go and see if that works. You would need to run EmrEtlRunner (EER) from the “shred” step (with --skip staging_stream_enrich). If that works, move the other half of the files back to the “enriched:good” bucket and run EER from shred again.

Hopefully, that works.

matt.miller · November 15, 2021, 5:11am

Thank you @ihor, this worked.
