Trouble with s3distcp in EMR

We are using Snowplow version 112 with Stream Enrich, and for the last 2 weeks we have been having trouble with S3DistCp.

  1. Sometimes it fails while copying shredded data from HDFS to S3, or while archiving the data.
  2. Most of the time it archives the data but still sends a failure signal to the EMR job.
  3. In another scenario, while copying data from HDFS to S3 with S3DistCp, the job fails at the reduce step, retries 3 or 4 times, and creates multiple versions of the data in S3.
  4. On another day, EMR failed at the loader step because it was unable to locate one of the JSONPath files, but it worked again on retry.

Is anyone else encountering similar issues with S3DistCp? Any solutions or recommendations would be highly appreciated, as this issue is impacting our production environment. Thanks in advance!

Hey @neelam_bagnial, yes, we encounter this issue from time to time as well and I believe our developers are looking into a possible solution. I’m afraid there’s not much that can be done at the moment.

Thanks @ihor for your reply. Any suggestions on how you are recovering a failed EMR job in such cases? Do you have a recovery plan?

@neelam_bagnial, we follow the recovery strategy as per https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps.
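
For failures that turn out to be transient, like the JSONPath lookup in point 4 of the original post that succeeded on retry, a small retry wrapper with exponential backoff can be sketched as below. This is only an illustrative sketch, not part of Snowplow or EmrEtlRunner: `retry` and the `step` callable are hypothetical names, and it assumes the step being retried is safe to run more than once.

```python
import time


def retry(step, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call step() until it succeeds or the attempts run out.

    `step` is a hypothetical zero-argument callable wrapping one pipeline
    action. The delay doubles after each failed attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Note that blindly retrying a copy step that is not idempotent could reproduce the duplicated data described in point 3, so a wrapper like this probably only suits steps that can be repeated safely.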