Shred step failure, no error message


With snowplow-emr-etl-runner-r117, our ETL job is failing at the “[shred] spark: Shred Enriched Events” step. Lots of *.gz files are left in S3 under enriched/good/run=2021-06-01-08-30-14/stream/.

stderr for the failed step doesn’t offer much of a clue:

21/06/01 16:07:05 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host:
ApplicationMaster RPC port: 0
queue: default
start time: 1622563598944
final status: FAILED
tracking URL: http://ip-10-0-1-129.ec2.internal:20888/proxy/application_1622563416514_0002/
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1622563416514_0002 finished with failed status
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/06/01 16:07:05 INFO ShutdownHookManager: Shutdown hook called

When I try to restart the job with “-f shred” I get the same error.

I am trying to troubleshoot this service, which was installed by Someone Who Is No Longer With The Company, so I am really groping in the dark here.

Is there another place I should be looking for more informative error logs?

Any advice appreciated.
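On the question of where to find more informative logs: the step's stderr only reports the final status, while the actual Spark error usually sits in the YARN container logs for the application ID it names. Below is a minimal sketch that pulls the application ID out of the stderr text and builds the S3 prefix where those logs would typically be archived. It assumes the usual EMR layout of `<log location>/<cluster id>/containers/<application id>/`, which is not confirmed for this particular setup. The cluster ID `j-EXAMPLE` is a hypothetical placeholder, and the helper names are made up for illustration.

```python
import re

# Matches YARN application IDs such as application_1622563416514_0002.
APP_ID_RE = re.compile(r"application_\d+_\d+")


def find_application_ids(stderr_text):
    """Return the distinct YARN application IDs found in the text, in order of appearance."""
    seen = []
    for app_id in APP_ID_RE.findall(stderr_text):
        if app_id not in seen:
            seen.append(app_id)
    return seen


def container_log_prefix(log_uri, cluster_id, app_id):
    """Build the S3 prefix where container logs are assumed to be archived.

    Assumption: logs follow <log_uri>/<cluster_id>/containers/<app_id>/.
    """
    return f"{log_uri.rstrip('/')}/{cluster_id}/containers/{app_id}/"


# Two lines taken from the failed step's stderr.
stderr_text = """\
tracking URL: http://ip-10-0-1-129.ec2.internal:20888/proxy/application_1622563416514_0002/
Exception in thread "main" org.apache.spark.SparkException: Application application_1622563416514_0002 finished with failed status
"""

# The log location is the "log" setting from config.yml;
# "j-EXAMPLE" is a hypothetical cluster ID, replace it with the real one.
for app_id in find_application_ids(stderr_text):
    print(container_log_prefix("s3://rr-snowplow-events-e2-prod/emr_logs", "j-EXAMPLE", app_id))
```

Listing the printed prefix with `aws s3 ls --recursive` should show one folder per container, each with its own stderr and stdout files, if the logs were archived there.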

@wleftwich, where does the final folder “stream” (in enriched/good/run=2021-06-01-08-30-14/stream/) come from? No folders are expected after the “run=” folder.

Also, if it fails again (after some time spent shredding), you may need to scale up the EMR cluster, as the data volume could be too high for its current configuration.

Hi @ihor, thanks for the fast reply.

I don’t know where the /stream comes from – it is not in the emr-etl-runner config.yml:
assets: s3://snowplow-hosted-assets
encrypted: false
enriched: {archive: 's3://rr-snowplow-events-e2-prod/enriched/archive', bad: 's3://rr-snowplow-events-e2-prod/enriched/bad',
errors: null, good: 's3://rr-snowplow-events-e2-prod/enriched/good', stream: 's3://rr-snowplow-enriched-stream-e2-prod'}
jsonpath_assets: s3://rr-snowplow-cloudfront-iglu-central/jsonpaths/
log: s3://rr-snowplow-events-e2-prod/emr_logs
shredded: {archive: 's3://rr-snowplow-events-e2-prod/shredded/archive', bad: 's3://rr-snowplow-events-e2-prod/shredded/bad',
errors: null, good: 's3://rr-snowplow-events-e2-prod/shredded/good'}
consolidate_shredded_output: false
region: <%= ENV['RR_SNOWPLOW_REGION'] %>

Maybe it is coming from the folder?

~ $ aws s3 ls s3://rr-snowplow-enriched-stream-e2-prod/
PRE stream/
2020-09-03 06:51:55 0 stream_$folder$

At any rate I will try just moving all the *.gz files up a level.
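As a rough sketch of that move, the snippet below works out which keys sit in a `stream/` folder directly under a `run=` folder and where each should land one level up. It only prints `aws s3 mv` commands and does not touch S3, so the plan can be reviewed first. The file names and the `plan_moves` helper are hypothetical; the bucket and run folder are the ones from this thread.

```python
import re

# Matches "<anything>/run=<id>/stream/<file>" and captures the parts around "stream/".
NESTED_RE = re.compile(r"^(?P<run>.*/run=[^/]+/)stream/(?P<name>[^/]+)$")


def plan_moves(keys):
    """Return (source, destination) key pairs that lift files out of run=.../stream/."""
    moves = []
    for key in keys:
        match = NESTED_RE.match(key)
        if match:
            moves.append((key, match.group("run") + match.group("name")))
    return moves


keys = [
    # Hypothetical file names, for illustration only.
    "enriched/good/run=2021-06-01-08-30-14/stream/part-0001.gz",
    "enriched/good/run=2021-06-01-08-30-14/part-0002.gz",  # already in place, so skipped
]

bucket = "rr-snowplow-events-e2-prod"
for src, dst in plan_moves(keys):
    # Print the command instead of running it, so nothing is moved by accident.
    print(f"aws s3 mv s3://{bucket}/{src} s3://{bucket}/{dst}")
```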

Thanks again @ihor. I followed both your suggestions and got back in business.

– Wade

Hey @wleftwich, glad to hear that.

“Maybe it is coming from the folder?”

It shouldn’t, unless your S3 Loader is configured to upload the streamed data to that folder. At the staging step the files are simply moved from the enriched:stream location to the enriched:good location, as per this dataflow diagram.
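To illustrate how a folder in the stream bucket could end up under the run folder, here is a small sketch of the key mapping at staging. It assumes the move keeps each file's path relative to the root of the enriched:stream bucket; that is an assumption about the copy behaviour, not something confirmed in this thread. The `staged_key` helper and the file names are hypothetical.

```python
def staged_key(stream_key, good_prefix, run_id):
    """Map a key in the enriched:stream bucket to its key under enriched:good.

    Assumption: the path relative to the stream bucket root is preserved.
    """
    return f"{good_prefix.strip('/')}/run={run_id}/{stream_key.lstrip('/')}"


run_id = "2021-06-01-08-30-14"

# File uploaded by the loader into a "stream/" folder: the folder is carried over.
print(staged_key("stream/part-0001.gz", "enriched/good", run_id))

# File uploaded to the bucket root: it lands directly under the run folder.
print(staged_key("part-0001.gz", "enriched/good", run_id))
```

Under that assumption, the `stream/` prefix seen in the `aws s3 ls` output above would explain the extra folder, and pointing the loader at the bucket root (or pointing enriched:stream at the folder itself) would avoid it.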