Shred problems using the batch pipeline

I’ve been struggling with this issue for two days on our dev instance of the Snowplow batch pipeline. I’ve tried all the tricks that have fixed enrichment or shredding issues for me in the past, but nothing is working, and I’ve been using Snowplow for four years. There’s nothing in the EMR logs that tells me what the issue is. I’ve tried the following so far:

  • Cleaned out the good/bad/archive folders for the shredded output in S3. Too many files in those folders has tripped up Snowplow for us before, especially since we are not on s3a yet (roughly the cleanup sketched after this list).
  • We are behind on updates, but production is working fine so far; it’s only dev that fails. The EmrEtlRunner version is shown below, and prod and dev are in sync version-wise.
  • Tried increasing the number of cores in case there is simply a lot of data to shred, but got the same failure.
  • Gone through the troubleshooting doc for step failures and tried everything in it.
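
For reference, the S3 cleanup from the first bullet looked roughly like this; the bucket name and prefixes below are placeholders, not our real paths (take yours from config.yml):

```bash
# Placeholder bucket/prefixes -- substitute the shredded paths from config.yml.
BUCKET=s3://sp-dev-pipeline

# Empty the shredded good/bad folders so the shred step starts from scratch.
aws s3 rm "$BUCKET/shredded/good/" --recursive
aws s3 rm "$BUCKET/shredded/bad/"  --recursive

# Sanity check: large file counts in these folders have bitten us before.
aws s3 ls "$BUCKET/shredded/good/" --recursive | wc -l
```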
./snowplow-emr-etl-runner --version
uri:classloader:/gems/avro-1.8.1/lib/avro/schema.rb:350: warning: constant ::Fixnum is deprecated
snowplow-emr-etl-runner 0.33.1

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-26AORV9NANGGB failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow DEV ETL: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2020-11-18 15:20:20 +0000 - ]
 - 1. Elasticity S3DistCp Step: Enriched S3 -> HDFS: COMPLETED ~ 00:00:50 [2020-11-18 15:20:22 +0000 - 2020-11-18 15:21:13 +0000]
 - 2. Elasticity Spark Step: Shred Enriched Events: FAILED ~ 00:12:10 [2020-11-18 15:21:13 +0000 - 2020-11-18 15:33:23 +0000]
 - 3. Elasticity S3DistCp Step: Shredded S3 -> Shredded Archive S3: CANCELLED ~ elapsed time n/a [ - ]
 - 4. Elasticity S3DistCp Step: Enriched S3 -> S3 Enriched Archive: CANCELLED ~ elapsed time n/a [ - ]
 - 5. Elasticity Custom Jar Step: Load Data Warehouse Storage Target: CANCELLED ~ elapsed time n/a [ - ]
 - 6. Elasticity S3DistCp Step: Raw Staging S3 -> Raw Archive S3: CANCELLED ~ elapsed time n/a [ - ]
 - 7. Elasticity S3DistCp Step: Shredded HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 8. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:691:in `run'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:138:in `run'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'
    org/jruby/RubyKernel.java:994:in `load'
    uri:classloader:/META-INF/main.rb:1:in `<main>'
    org/jruby/RubyKernel.java:970:in `require'
    uri:classloader:/META-INF/main.rb:1:in `(root)'
    uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

20/11/18 15:33:21 INFO Client: Application report for application_1605712699264_0002 (state: RUNNING)
20/11/18 15:33:22 INFO Client: Application report for application_1605712699264_0002 (state: FINISHED)
20/11/18 15:33:22 INFO Client: 
	 client token: N/A
	 diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted.
	 ApplicationMaster host: 172.30.1.222
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1605712888343
	 final status: FAILED
	 tracking URL: http://ip-172-30-15-104.ec2.internal:20888/proxy/application_1605712699264_0002/
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1605712699264_0002 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/11/18 15:33:22 INFO ShutdownHookManager: Shutdown hook called
20/11/18 15:33:22 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-01569491-7d76-4b63-a602-949479a43d51
Command exiting with ret '1'
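
That Spark client output is the failed shred step’s stderr from our EMR log bucket. For completeness, this is roughly how I pulled it (the log bucket and step ID below are placeholders, not our real ones):

```bash
# Find the ID of the failed step on the cluster named in the error message.
aws emr list-steps --cluster-id j-26AORV9NANGGB --step-states FAILED

# EMR writes each step's stderr under <log_uri>/<cluster-id>/steps/<step-id>/.
# Placeholder bucket and step ID -- substitute your own.
aws s3 cp s3://my-emr-logs/j-26AORV9NANGGB/steps/s-XXXXXXXXXXXX/stderr.gz - \
  | zcat | tail -n 60
```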

Hi @mjensen,

Have you managed to solve the problem? One more piece of advice I can give for troubleshooting is to look at the YARN container logs, somewhere in containers/application_1605712699264_0002/ in your EMR logs folder.
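
If it helps, something like this pulls those container logs down so you can grep them locally; the log bucket name is a placeholder, while the jobflow and application IDs come from your output above:

```bash
# Placeholder log bucket -- use the log_uri configured for your EMR cluster.
LOG_BUCKET=s3://my-emr-logs

# Pull down the YARN container logs for the failed Spark application.
aws s3 cp --recursive \
  "$LOG_BUCKET/j-26AORV9NANGGB/containers/application_1605712699264_0002/" \
  ./container-logs/

# EMR gzips the logs; search the stderr files for the underlying exception.
zgrep -iE -A 5 "exception|error" ./container-logs/*/stderr.gz
```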

But as you’ve noticed, your pipeline is quite far behind the latest version, so if nothing useful is found in the container logs, I’d recommend upgrading the pipeline first.