EmrEtlRunner fails during raw staging S3 step


I have Snowplow batch running on AWS (scala-collector > s3-loader > EmrEtlRunner).

It was running fine for the past few weeks but lately I’ve been getting a lot of failures during the raw staging S3 step.

The step fails with the following trace in stderr:

    Error: java.lang.RuntimeException: Reducer task failed to copy 2275 files: s3://snowplow/raw/in/2018-10-24-49589377919602874491714939496115412362808439243580375074-49589377919602874491714939496115412362808439243580375074.lzo.index etc
        at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.cleanup(CopyFilesReducer.java:67)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

To fix it, I have to manually move the files from the raw/processing folder back to raw/in and re-run the job, hoping it won’t fail this time.

If I look at the container logs I can see the following error:

    2018-10-23 18:32:19,725 ERROR [s3distcp-simpler-executor-worker-1] com.amazon.elasticmapreduce.s3distcp.CopyFilesRunnable: Error downloading input files. Not marking as committed

    java.io.FileNotFoundException: No such file or directory 's3://snowplow/raw/in/2018-10-23-49588889455877140086970628804200750496158524777810624562-49588889455877140086970628809616738168032063824091676722.lzo.index'
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:816)
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1194)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
        at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.openInputStream(CopyFilesReducer.java:293)
        at com.amazon.elasticmapreduce.s3distcp.CopyFilesRunnable.mergeAndCopyFiles(CopyFilesRunnable.java:102)
        at com.amazon.elasticmapreduce.s3distcp.CopyFilesRunnable.run(CopyFilesRunnable.java:35)
        at com.amazon.elasticmapreduce.s3distcp.SimpleExecutor$Worker.run(SimpleExecutor.java:49)
        at java.lang.Thread.run(Thread.java:748)

The file 2018-10-23-49588889455877140086970628804200750496158524777810624562-49588889455877140086970628809616738168032063824091676722.lzo.index does actually exist when I check the bucket.

Any idea whether something is wrong with EmrEtlRunner, or whether this is an issue with S3DistCp? And how could it be solved?

ami_version: 5.9.0
rdb_loader: 0.14.0
rdb_shredder: 0.13.1
spark_enrich: 1.16.0
S3 bucket encryption turned on

Thank you!

This is an issue with AWS S3, not EmrEtlRunner config.

Note that the files are moved (copied over) with a native AWS utility, S3DistCp. The error “No such file or directory” for a file that is actually there could be a result of the infamous eventual-consistency behaviour inherent to the S3 service.
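To make the failure mode concrete: with eventual consistency, a key that was just written can briefly fail a listing or HEAD check and then appear a moment later. Below is a minimal, hypothetical sketch of the kind of retry-with-backoff loop a consistency-tolerant copier would use (`wait_until_visible` and `flaky_exists` are illustrative names, not part of S3DistCp or the Snowplow codebase):

```python
import time

def wait_until_visible(exists, key, attempts=5, base_delay=0.01):
    """Poll for a key that may not be visible yet.
    `exists` is any callable key -> bool (e.g. an S3 HEAD request)."""
    for attempt in range(attempts):
        if exists(key):
            return True
        # Exponential backoff before re-checking.
        time.sleep(base_delay * (2 ** attempt))
    return False

# Simulate a key that only becomes visible on the third check.
calls = {"n": 0}
def flaky_exists(key):
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_visible(flaky_exists, "raw/in/file.lzo.index"))  # True
```

S3DistCp in this era did not retry like this on a missing input key, which is why the step fails outright instead of recovering.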

Your logs show “copy 2275 files”. You might wish to run your batch job more often to reduce the number of files per run.

Also, why would you move the files to the processing bucket manually? Let EmrEtlRunner do that for you. This kind of failure normally rectifies itself. If some of the files were nonetheless moved during the failure (i.e. the processing bucket is not empty), just resume the pipeline with the --skip staging option.

Thanks ihor,

The reason I move the files to the processing bucket manually is that it’s the recommended way to deal with this error:

> If the job died during the move-to-processing step, either:
>
> 1. Rerun EmrEtlRunner with the command-line option of --skip staging, or:
> 2. Move any files from the Processing Bucket back to the In Bucket and rerun EmrEtlRunner without any --skip option*
>
> \* We recommend option 2 if only a handful of files were transferred to your Processing Bucket before the S3 error.
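For anyone following along, the two recovery options might look roughly like this on the command line. This is a sketch only: the config/resolver file names and bucket prefixes are illustrative (adapt them to your own setup), not taken from the thread.

```shell
# Option 1: resume, skipping the staging step that already ran
./snowplow-emr-etl-runner run -c config.yml -r resolver.json --skip staging

# Option 2: put the raw files back and rerun from the start
aws s3 mv s3://snowplow/raw/processing/ s3://snowplow/raw/in/ --recursive
./snowplow-emr-etl-runner run -c config.yml -r resolver.json
```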

I was also worried that some .lzo files may have been copied successfully but not the corresponding .lzo.index files (not sure if that would mean we’d be missing some data?).
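If you want to check for that, a quick sanity pass over the bucket listing will flag any .lzo object without its .lzo.index sibling. This is a hypothetical helper (`unpaired_lzo` is my name for it); the key list would come from something like `aws s3 ls --recursive`:

```python
def unpaired_lzo(keys):
    """Return .lzo objects that lack a matching .lzo.index sibling."""
    keyset = set(keys)
    return sorted(k for k in keyset
                  if k.endswith(".lzo") and k + ".index" not in keyset)

keys = [
    "raw/processing/2018-10-23-aaa.lzo",
    "raw/processing/2018-10-23-aaa.lzo.index",
    "raw/processing/2018-10-23-bbb.lzo",   # index missing
]
print(unpaired_lzo(keys))  # ['raw/processing/2018-10-23-bbb.lzo']
```

As far as I understand, the .lzo.index only makes the .lzo file splittable for Hadoop (the events themselves live in the .lzo), so a missing index on its own shouldn’t mean lost data, but the check is cheap before resuming.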

I will try changing the S3 Loader buffer config so that it creates files less frequently and see if that helps.
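For reference, the relevant block in the S3 Loader’s config looks roughly like this. The key names follow the example HOCON config of the 0.6.x loader as I recall it (check the reference config for your version); the values are made-up illustrations, not recommendations:

```hocon
# Buffer limits for the s3-loader sink: the loader flushes a new
# file to S3 when the FIRST of these thresholds is reached.
# Larger limits => fewer, bigger files for S3DistCp to copy.
buffer {
  byteLimit   = 33554432   # flush at ~32 MB of buffered records...
  recordLimit = 200000     # ...or 200,000 records...
  timeLimit   = 600000     # ...or after 10 minutes (ms), whichever comes first
}
```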

@abrenaut, I also think that you rather might need to adjust S3 Loader settings than the config file for EmrEtlRunner. Let us know how you get on.

I stopped getting the error once I changed the buffer config on the s3 loader.

Thanks for the help @ihor