Trying to copy non-existent file: Elasticity S3DistCp Step: Raw S3 -> Raw HDFS

Hello, I am getting a failure in the Elasticity S3DistCp Step: Raw S3 -> Raw HDFS step.

This doesn’t occur on every run, but it happened during my latest run. The problem appears to be that, when copying files from S3 to HDFS, the step attempts to copy a file that doesn’t exist.

Here is the error:
Error: java.lang.RuntimeException: Reducer task failed to copy 272 files: s3://production-snowplow-processing-data/processing/EWBPWVW3GFOLK.2018-09-13-11.4d9fdf5f.gz etc
at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.cleanup(CopyFilesReducer.java:67)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

I checked with:

aws s3api get-object-acl --bucket production-snowplow-processing-data --key processing/EWBPWVW3GFOLK.2018-09-13-11.4d9fdf5f.gz

and got this response:

An error occurred (NoSuchKey) when calling the GetObjectAcl operation: The specified key does not exist.
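One sanity check before concluding a key is truly gone is to poll for it a few times with a short delay, since a consistency lag can make a present key look missing (and vice versa). A minimal sketch, where head_fn is a hypothetical stand-in for something like a boto3 client's head_object call:

```python
import time

def wait_for_key(head_fn, bucket, key, attempts=5, delay=2.0):
    """Poll until an S3 key becomes visible, riding out a consistency lag.

    head_fn(bucket, key) should raise if the key is not visible, e.g.
    lambda b, k: s3.head_object(Bucket=b, Key=k) with a boto3 client.
    Returns True once the key is seen, False after all attempts fail.
    """
    for i in range(attempts):
        try:
            head_fn(bucket, key)
            return True
        except Exception:
            if i < attempts - 1:
                time.sleep(delay)
    return False
```

If the key never shows up after several attempts, it is most likely genuinely deleted rather than lagging.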

@frankcash, this could be due to the infamous eventual consistency of the AWS S3 service, where a deleted object can still appear to be present in listings. Try resuming the pipeline in a little while.

@ihor I re-ran with --skip staging as suggested in the recovery documentation; if this persists I will respond again. Is there a way to prevent this from happening again?

@frankcash, I’m afraid we do not have a solution for this at the moment. Eventual consistency is inherent to the AWS S3 service; it exists for reliability reasons, with this behaviour as a side effect. We do experience it from time to time, although it normally occurs at the data load step rather than at the staging step. In the future, the solution could be to keep track of all processed files and skip any file that still appears to be present in S3 but has already been recorded as processed (again, this is related to eventual consistency during data load).

@ihor

In the future, the solution could be to keep track of all processed files and skip any file that still appears to be present in S3 but has already been recorded as processed

How would one go about this?

It would be part of EmrEtlRunner, with a manifest table in DynamoDB to keep track of processed files. It’s just a possible scenario, not a promise.
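To make the manifest idea concrete, the filtering logic could look roughly like this sketch; the set here is an in-memory stand-in for the hypothetical DynamoDB table (with real DynamoDB you would do a conditional PutItem per key instead):

```python
def files_to_process(candidates, manifest):
    """Return only files not yet recorded as processed, and record them.

    `candidates` is the list of keys currently visible in S3; `manifest`
    is a set standing in for a DynamoDB manifest table. A key that S3
    still lists (due to eventual consistency) but that is already in the
    manifest is skipped instead of being copied again.
    """
    fresh = [key for key in candidates if key not in manifest]
    manifest.update(fresh)
    return fresh
```

With this in place, a stale listing entry would simply be filtered out rather than causing S3DistCp to attempt copying a file that no longer exists.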

I am facing the same issue. Is there any solution for this that works?