Shred step just started failing (R97)

wleftwich · March 19, 2019, 2:35pm

Hi -

We are running R97 Knossos – haven’t upgraded in over a year because we never had a problem.

However, last week, on a couple of our nightly ETL jobs from a CloudFront collector, we had a failure at the Shred step. Rerunning with ‘-f shred’, the job completed OK.

But then last night, after the same error, we have had no success with 3 recovery attempts.

Maybe I’m not looking at the right log, but stderr for the failed step is not super informative:

19/03/19 11:07:33 INFO Client: 
	 client token: N/A
	 diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted.
	 ApplicationMaster host: 10.0.0.96
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1552992503630
	 final status: FAILED
	 tracking URL: http://ip-10-0-0-82.ec2.internal:20888/proxy/application_1552992078044_0002/
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1552992078044_0002 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/03/19 11:07:33 INFO ShutdownHookManager: Shutdown hook called
19/03/19 11:07:33 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-570e85f9-ed28-48c3-9ba6-e116cb96b606
Command exiting with ret '1'

At this point I would appreciate any advice at all. Thanks in advance!

Wade Leftwich
Ithaca, NY

wleftwich · March 19, 2019, 6:28pm

Responding to my own post.

I disabled cross-batch natural deduplication by removing the DynamoDB config from my targets directory, and the job proceeded to completion.

I actually don’t know if this really made a difference, because the problem had been intermittent. There had been no errors logged in DynamoDB.

But anyway, at least I got yesterday’s data into Redshift.
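
On the uninformative stderr in the first post: when Spark runs on YARN in cluster mode, the step's stderr only carries the client-side summary ("Job aborted"), and the driver's underlying exception normally ends up in the application's container logs instead. On EMR those are typically archived under a containers/ prefix in the cluster's S3 log location. Below is a minimal sketch, not a definitive tool, for scanning such logs once they have been copied to a local directory (for example with aws s3 sync). The function name find_error_lines, the search pattern, and the directory layout are assumptions made for illustration, not part of Snowplow or EMR.

```python
import gzip
import re
from pathlib import Path

# Assumption: exception lines and "Caused by:" chains are what we want to see.
ERROR_PATTERN = re.compile(r"Exception|Caused by:")


def find_error_lines(log_dir, limit_per_file=20):
    """Scan plain and gzipped log files under log_dir for exception lines.

    Returns a dict mapping each file path (as a string) to the list of
    matching lines, keeping at most limit_per_file lines per file.
    """
    hits = {}
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # EMR usually archives container logs as .gz; handle plain text too.
        opener = gzip.open if path.suffix == ".gz" else open
        matches = []
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if ERROR_PATTERN.search(line):
                    matches.append(line.rstrip("\n"))
                    if len(matches) >= limit_per_file:
                        break
        if matches:
            hits[str(path)] = matches
    return hits
```

Calling find_error_lines on the local copy of the logs for application_1552992078044_0002 and printing the result should list the files that mention an exception, which is usually a quicker route to the root cause than the step-level stderr.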
