"Elasticity Scalding Step: Shred Enriched Events" failures

Hi

I’m using snowplow r77 and have started having failures on the Shred Enriched Event step (both staging and production environment) in the last 5 days.

Here is the output I get :
Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-2K7LCNCVIN965 failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com
/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL Staging: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2016-04-11 08:09:40 +0000 - ]

    1. Elasticity S3DistCp Step: Raw S3 -> HDFS: COMPLETED ~ 00:01:08 [2016-04-11 08:09:40 +0000 - 2016-04-11 08:10:48 +0000]
    1. Elasticity Scalding Step: Enrich Raw Events: COMPLETED ~ 00:02:20 [2016-04-11 08:10:55 +0000 - 2016-04-11 08:13:15 +0000]
    1. Elasticity S3DistCp Step: Enriched HDFS -> S3: COMPLETED ~ 00:00:40 [2016-04-11 08:13:15 +0000 - 2016-04-11 08:13:55 +0000]
    1. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: COMPLETED ~ 00:00:40 [2016-04-11 08:13:55 +0000 - 2016-04-11 08:14:36 +0000]
    1. Elasticity Scalding Step: Shred Enriched Events: FAILED ~ 00:00:06 [2016-04-11 08:14:36 +0000 - 2016-04-11 08:14:42 +0000]
    1. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):

The step fails after 5-6 seconds and there are no logs available at all in EMR which makes it hard to debug.

I’m sure that there are some events to shred (in other words, I don’t get this error : https://github.com/snowplow/snowplow/wiki/Troubleshooting#shred-fail).

I was using spot instances, and disabled it (following the recommendation here : https://groups.google.com/forum/#!topic/snowplow-user/rFw6E4Ysafs) but still have the same problem.

Any idea of what could be wrong ?
Or what I should look for.

Thank you
Alexis

Hi Alexis,

Thanks for raising this. We have seen this behavior across our users and our own jobs too - intermittent failures on Hadoop Shred after between 4 to 9 seconds (often 6 seconds).

Re-running the job almost always fixes it - very occasionally we have to restart it a couple of times.

Occasionally the failure is correlated with bootstrap failures bringing the cluster up (which EmrEtlRunner automatically recovers from).

We have a support ticket open with AWS to find out what is causing this. If it’s something in Hadoop Shred we’ll obviously fix it.

Will keep you posted!

Hey!

I’m encountering the same issue. Any news/update on the situation?

I’ve tried to re-run the process with --process-shred xxxxx but still the same issue.
Will try again later today.

Let me know if you found something!

Thank you!
Tim

Hi @Timmycarbone - the solution to this is in this thread: EMR jobflow failing on Hadoop Enrich step after a few seconds

Awesome thank you!

Although re-running it another time fixed it, I will apply the solution in the linked thread.

Thanks again!
Tim