"Elasticity Scalding Step: Shred Enriched Events" failures

alexisdeman · April 12, 2016, 1:44am

Hi

I’m using Snowplow r77 and, over the last 5 days, have started seeing failures on the Shred Enriched Events step in both my staging and production environments.

Here is the output I get:
Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-2K7LCNCVIN965 failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL Staging: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2016-04-11 08:09:40 +0000 - ]

1. Elasticity S3DistCp Step: Raw S3 -> HDFS: COMPLETED ~ 00:01:08 [2016-04-11 08:09:40 +0000 - 2016-04-11 08:10:48 +0000]
2. Elasticity Scalding Step: Enrich Raw Events: COMPLETED ~ 00:02:20 [2016-04-11 08:10:55 +0000 - 2016-04-11 08:13:15 +0000]
3. Elasticity S3DistCp Step: Enriched HDFS -> S3: COMPLETED ~ 00:00:40 [2016-04-11 08:13:15 +0000 - 2016-04-11 08:13:55 +0000]
4. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: COMPLETED ~ 00:00:40 [2016-04-11 08:13:55 +0000 - 2016-04-11 08:14:36 +0000]
5. Elasticity Scalding Step: Shred Enriched Events: FAILED ~ 00:00:06 [2016-04-11 08:14:36 +0000 - 2016-04-11 08:14:42 +0000]
6. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):

The step fails after 5-6 seconds and there are no logs available at all in EMR which makes it hard to debug.
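Since EMR gives no logs here, the step summary printed by EmrEtlRunner is the main thing to go on. For anyone who wants to pick out the failing step automatically, here is a minimal Python sketch that parses step lines in the format pasted above. The names `STEP_LINE`, `parse_steps` and `first_failed_step` are made up for illustration and are not part of Snowplow or EmrEtlRunner, and the pattern only assumes the line layout seen in this output.

```python
import re

# One step line of the summary, for example:
# "1. Elasticity Scalding Step: Shred Enriched Events: FAILED ~ 00:00:06 [...]"
# The layout is assumed from the output pasted in this thread.
STEP_LINE = re.compile(
    r"^\s*(?:-\s*)?\d+\.\s+(?P<name>.+?):\s+(?P<status>[A-Z_]+)\s+~\s+"
    r"(?P<elapsed>.+?)\s+\["
)


def parse_steps(output):
    """Return a (name, status, elapsed) tuple for every step line found."""
    steps = []
    for line in output.splitlines():
        match = STEP_LINE.match(line)
        if match:
            steps.append((match["name"], match["status"], match["elapsed"]))
    return steps


def first_failed_step(output):
    """Return (name, elapsed) of the first FAILED step, or None if there is none."""
    for name, status, elapsed in parse_steps(output):
        if status == "FAILED":
            return name, elapsed
    return None
```

Passing the captured console output to `first_failed_step` gives the step name and its elapsed time, which makes it easy to spot the "fails after a few seconds" pattern across runs.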

I’m sure that there are events to shred (in other words, this is not the error described here: https://github.com/snowplow/snowplow/wiki/Troubleshooting#shred-fail).

I was using spot instances and disabled them (following the recommendation here: https://groups.google.com/forum/#!topic/snowplow-user/rFw6E4Ysafs), but I still have the same problem.

Any idea what could be wrong, or what I should look for?

Thank you
Alexis

alex · April 12, 2016, 11:18am

Hi Alexis,

Thanks for raising this. We have seen this behavior across our users and our own jobs too: intermittent failures on Hadoop Shred after between 4 and 9 seconds (often 6 seconds).

Re-running the job almost always fixes it - very occasionally we have to restart it a couple of times.
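If you want to automate the "just run it again" workaround, a bounded retry loop is one way to do it. The Python sketch below is only an illustration: `run_with_retries` and `run_once` are hypothetical names, and `run_once` stands for whatever callable launches your job and returns its exit code (how you invoke EmrEtlRunner, and whether you resume from the shred step, is left to your setup).

```python
import time


def run_with_retries(run_once, max_attempts=3, delay_seconds=60, sleep=time.sleep):
    """Call run_once() until it returns exit code 0 or attempts run out.

    run_once is a hypothetical zero-argument callable that launches the job
    and returns its exit code. Returns the number of attempts used on
    success and raises RuntimeError if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        exit_code = run_once()
        if exit_code == 0:
            return attempt
        # Pause before the next attempt, but not after the last one.
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise RuntimeError(f"job still failing after {max_attempts} attempts")
```

Keeping `max_attempts` small matters here: a failure that survives a few reruns is probably not the intermittent one described above and deserves a proper look rather than more retries.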

Occasionally the failure is correlated with bootstrap failures bringing the cluster up (which EmrEtlRunner automatically recovers from).

We have a support ticket open with AWS to find out what is causing this. If it’s something in Hadoop Shred we’ll obviously fix it.

Will keep you posted!

Timmycarbone · April 29, 2016, 9:49am

Hey!

I’m encountering the same issue. Any news/update on the situation?

I’ve tried re-running the process with --process-shred xxxxx but I still get the same issue.
Will try again later today.

Let me know if you found something!

Thank you!
Tim

alex · April 29, 2016, 10:11am

Hi @Timmycarbone - the solution to this is in this thread: "EMR jobflow failing on Hadoop Enrich step after a few seconds"

Timmycarbone · April 29, 2016, 11:36am

Awesome thank you!

Although re-running it another time fixed it, I will apply the solution in the linked thread.

Thanks again!
Tim
