EmrEtlRunner::EmrExecutionError while storing the events in redshift database

@sandesh,

Your workflow looks alright and seems to follow the one depicted in How to setup a Lambda architecture for Snowplow. Though the name “Stream enrich” sounds odd, as the data flowing at that point is still “raw” (not enriched). Also, note that EmrEtlRunner will be engaged in enriching your data (it does not just shred). Thus, the correct workflow would be

JS Tracker -> Scala Stream Collector -> Raw Stream -> Kinesis S3 -> S3 -> EmrEtlRunner -> Redshift

Your error message suggests the enriched files failed to be moved from HDFS (the EMR cluster’s internal storage) to the S3 bucket after enrichment.

Elasticity S3DistCp Step: Enriched HDFS -> S3: FAILED ~ 00:00:06 [2017-10-13 12:48:48 +0000 - 2017-10-13 12:48:54 +0000]

The logs indicate:

Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-11-139.ec2.internal:8020/tmp/286fd0b7-6f45-4d16-bc13-fb69f5a294f9/files

It could be just a “glitch” (the EMR cluster terminated prematurely), or no “good” enriched files were produced (say, they all ended up in the “bad” bucket).
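If you want to confirm which of the two it was, a quick listing of the enriched locations with the AWS CLI should tell you. The bucket paths below are placeholders; use the enriched “good”/“bad” locations from your own config.yml:

# Placeholders -- substitute the enriched:good and enriched:bad paths from your config.yml
aws s3 ls s3://my-snowplow-data/enriched/good/ --recursive --summarize
aws s3 ls s3://my-snowplow-data/enriched/bad/ --recursive --summarize

If “good” is empty but “bad” is not, the problem lies with the events themselves rather than with the cluster.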

You might want to resume the pipeline with the option --skip staging in case it was a temporary failure. Do ensure the “good” bucket is empty before rerunning. The resume steps (depending on the failure point) can be found here: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps
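As a rough sketch (file names are placeholders, and the exact --skip value depends on where your run failed, per the wiki page above), the resume could look like:

# Re-run without re-staging the raw files; adjust --skip per the wiki's resume table
./snowplow-emr-etl-runner run --config config.yml --resolver resolver.json --skip staging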

I wouldn’t use the same bucket for your raw “in” events and the files produced during processing/enrichment/shredding.
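For illustration only (bucket names are made up, and the exact fields depend on your EmrEtlRunner release), the buckets section of config.yml could keep the raw “in” location separate from everything the pipeline writes:

aws:
  s3:
    region: us-east-1
    buckets:
      log: s3://my-snowplow-etl/logs
      raw:
        in:
          - s3://my-snowplow-raw            # written only by the Kinesis S3 sink
        processing: s3://my-snowplow-etl/processing
        archive: s3://my-snowplow-archive/raw
      enriched:
        good: s3://my-snowplow-data/enriched/good
        bad: s3://my-snowplow-data/enriched/bad
        errors: s3://my-snowplow-data/enriched/errors
        archive: s3://my-snowplow-data/enriched/archive
      shredded:
        good: s3://my-snowplow-data/shredded/good
        bad: s3://my-snowplow-data/shredded/bad
        errors: s3://my-snowplow-data/shredded/errors
        archive: s3://my-snowplow-data/shredded/archive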

Additionally, I can see you are using m2.4xlarge instances. Those are previous-generation instance types and do not require/support EBS storage. You could use either 1 x c4.8xlarge or 1 x m4.10xlarge instead.
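As a sketch (values are illustrative, and core_instance_ebs is only available in the more recent EmrEtlRunner releases), the emr section of config.yml could then look roughly like:

  emr:
    jobflow:
      master_instance_type: m4.large
      core_instance_count: 1
      core_instance_type: m4.10xlarge
      core_instance_ebs:          # m4/c4 are EBS-only, so give HDFS some room
        volume_size: 100          # GB
        volume_type: gp2
        ebs_optimized: true
      task_instance_count: 0
      task_instance_type: m4.large
      task_instance_bid: 0.015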