Your workflow looks alright and seems to follow the one depicted in
How to setup a Lambda architecture for Snowplow. The name “Stream enrich” sounds odd, though, as the data flowing at that point is still “raw” (not enriched). Also, note that EmrEtlRunner is engaged in enriching your data (it is not just shredding). Thus, the correct workflow would be
JS Tracker -> Scala Stream Collector -> Raw Stream -> Kinesis S3 -> S3 -> EmrEtlRunner -> Redshift
Your error message suggests the files failed to be moved from HDFS (the EMR cluster’s internal storage) to the S3 bucket post-enrichment.
Elasticity S3DistCp Step: Enriched HDFS -> S3: FAILED ~ 00:00:06 [2017-10-13 12:48:48 +0000 - 2017-10-13 12:48:54 +0000]
The logs indicate:
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-11-139.ec2.internal:8020/tmp/286fd0b7-6f45-4d16-bc13-fb69f5a294f9/files
It could be just a “glitch” (the EMR cluster terminated prematurely), or no “good” enriched files were produced at all (say, they all ended up in the “bad” bucket).
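To tell the two apart, you could inspect the enriched “bad” (and “good”) locations on S3 with the AWS CLI before resuming. The bucket paths below are placeholders — use the ones from your own config.yml:

```bash
# Placeholder paths - substitute the enriched bad/good locations from your config.yml.
# A populated "bad" location would point at enrichment failures rather than a transient cluster problem.
aws s3 ls --recursive s3://my-snowplow-data/enriched/bad/
aws s3 ls --recursive s3://my-snowplow-data/enriched/good/
```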
You might want to resume the pipeline with the option --skip staging in case it was a temporary failure. Do ensure the “good” bucket is empty before rerunning. The resume steps (depending on the failure point) can be found here: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps
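As a sketch (the config and resolver file names are illustrative, and the exact invocation depends on your EmrEtlRunner release), the resumed run would look something like:

```bash
# Resume from the processing bucket, skipping the staging step
# (config/resolver file names are illustrative - use your own)
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json \
  --skip staging
```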
I wouldn’t use the same bucket for your raw “in” events and for the files produced during processing/enrichment/shredding.
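For illustration only, the aws:s3:buckets part of config.yml could keep those locations apart along these lines (bucket names are made up and the section is abridged):

```yaml
aws:
  s3:
    buckets:
      raw:
        in:
          - s3://sp-raw-in                       # written by the Kinesis S3 sink
        processing: s3://sp-pipeline/raw/processing
        archive: s3://sp-archive/raw
      enriched:
        good: s3://sp-pipeline/enriched/good
        bad: s3://sp-pipeline/enriched/bad
        archive: s3://sp-archive/enriched/good
      shredded:
        good: s3://sp-pipeline/shredded/good
        bad: s3://sp-pipeline/shredded/bad
        archive: s3://sp-archive/shredded/good
```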
Additionally, I can see you are using m2.4xlarge instances. Those are previous-generation instance types and do not require/support EBS storage. You could use either 1 x c4.8xlarge or 1 x m4.10xlarge instead.
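If you switch to one of those current-generation types, note that c4/m4 are EBS-only, so the EBS volume has to be configured in the jobflow section. A rough sketch, assuming a release that supports the core_instance_ebs block (values are illustrative):

```yaml
emr:
  jobflow:
    core_instance_count: 1
    core_instance_type: c4.8xlarge   # or m4.10xlarge
    core_instance_ebs:               # c4/m4 have no instance store, so attach an EBS volume
      volume_size: 100               # in GB - size to your data volume
      volume_type: gp2
      ebs_optimized: false
```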