Your workflow looks alright and seems to follow the one depicted in
How to setup a Lambda architecture for Snowplow. The name “Stream enrich” sounds odd, though, as the data flowing at that point is still “raw” (not enriched). Also, note that EmrEtlRunner is engaged in enriching your data (it is not just shredding). Thus, the correct workflow would be
JS Tracker -> Scala Stream Collector -> Raw Stream -> Kinesis S3 -> S3 -> EmrEtlRunner -> Redshift
Your error message suggests the files failed to be moved from HDFS (the EMR cluster’s internal storage) to the S3 bucket post-enrichment.
Elasticity S3DistCp Step: Enriched HDFS -> S3: FAILED ~ 00:00:06 [2017-10-13 12:48:48 +0000 - 2017-10-13 12:48:54 +0000]
The logs indicate:
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-11-139.ec2.internal:8020/tmp/286fd0b7-6f45-4d16-bc13-fb69f5a294f9/files
It could be just a “glitch” (the EMR cluster terminated prematurely), or no “good” enriched files were produced at all (say, they all ended up in the “bad” bucket).
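To tell the two apart, you could inspect the enriched “bad” (and “good”) locations on S3 with the AWS CLI before resuming. The bucket paths below are placeholders — use the ones from your own config.yml:

```bash
# Placeholder paths - substitute the enriched bad/good locations from your config.yml.
# A populated "bad" location would point at enrichment failures rather than a transient cluster problem.
aws s3 ls --recursive s3://my-snowplow-data/enriched/bad/
aws s3 ls --recursive s3://my-snowplow-data/enriched/good/
```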
You might want to resume the pipeline with the option --skip staging in case it was a temporary failure. Do ensure the “good” bucket is empty before rerunning. The resume steps (depending on the failure point) can be found here: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps
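As a sketch (the config and resolver file names are illustrative, and the exact invocation depends on your EmrEtlRunner release), the resumed run would look something like:

```bash
# Resume from the processing bucket, skipping the staging step
# (config/resolver file names are illustrative - use your own)
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json \
  --skip staging
```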
I wouldn’t use the same bucket for your raw “in” events and for the files produced during processing/enrichment/shredding.
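For illustration only, the aws:s3:buckets part of config.yml could keep those locations apart along these lines (bucket names are made up and the section is abridged):

```yaml
aws:
  s3:
    buckets:
      raw:
        in:
          - s3://sp-raw-in                       # written by the Kinesis S3 sink
        processing: s3://sp-pipeline/raw/processing
        archive: s3://sp-archive/raw
      enriched:
        good: s3://sp-pipeline/enriched/good
        bad: s3://sp-pipeline/enriched/bad
        archive: s3://sp-archive/enriched/good
      shredded:
        good: s3://sp-pipeline/shredded/good
        bad: s3://sp-pipeline/shredded/bad
        archive: s3://sp-archive/shredded/good
```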
Additionally, I can see you are using m2.4xlarge instances. Those are previous-generation instance types and do not require/support EBS storage. You could use either 1 x c4.8xlarge or 1 x m4.10xlarge instead.
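If you switch to one of those current-generation types, note that c4/m4 are EBS-only, so the EBS volume has to be configured in the jobflow section. A rough sketch, assuming a release that supports the core_instance_ebs block (values are illustrative):

```yaml
emr:
  jobflow:
    core_instance_count: 1
    core_instance_type: c4.8xlarge   # or m4.10xlarge
    core_instance_ebs:               # c4/m4 have no instance store, so attach an EBS volume
      volume_size: 100               # in GB - size to your data volume
      volume_type: gp2
      ebs_optimized: false
```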