EMR intermittently fails at loading from S3 into Redshift

Hi,

I am getting intermittent failures in the Snowplow EMR job during the “Elasticity Custom Jar Step: Load Redshift Storage Target” step. Is anyone else running into the same problem? I am on the latest release (v92). I think the problem might be a lag in S3, where one step uploads a ton of files to S3 and the next step quickly tries to access those files to load them into Redshift. Below is the stdout error from EMR:

Data loading error [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
Details: 
 -----------------------------------------------
  error:  S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
  code:      8001
  context:   S3 key being read : s3://XXXXX/shredded/good/run=2017-09-27-22-00-18/atomic-events/part-00061.gz
  query:     208888
  location:  table_s3_scanner.cpp:352
  process:   query3_68 [pid=10325]
  -----------------------------------------------;
ERROR: Data loading error [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
Details: 
 -----------------------------------------------
  error:  S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
  code:      8001
  context:   S3 key being read : s3://XXXXX/shredded/good/run=2017-09-27-22-00-18/atomic-events/part-00061.gz
  query:     208888
  location:  table_s3_scanner.cpp:352
  process:   query3_68 [pid=10325]
  -----------------------------------------------;
Following steps completed: [Discover]
INFO: Logs successfully dumped to S3 [s3://XXXXX/log/rdb-loader/2017-09-27-23-00-18/16dc63e6-6720-43d1-bbd9-097c06dffeec]

Hello @neekipatel,

I believe this error happens due to an invalid Role ARN. It must look like arn:aws:iam::719197435995:role/RedshiftLoadRole, and the role must have the AmazonS3ReadOnlyAccess permission attached.

To set this up in the IAM console, choose Amazon Redshift → AmazonS3ReadOnlyAccess and pick a role name, for example RedshiftLoadRole. Once the role is created, copy the Role ARN, as you will need it for the Redshift storage target configuration.
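
For reference, here is a rough AWS CLI sketch of that setup (the trust policy below and the RedshiftLoadRole name are just examples; adjust them to your account):

# Create a role that Redshift is allowed to assume
aws iam create-role \
  --role-name RedshiftLoadRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "redshift.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach the managed AmazonS3ReadOnlyAccess policy so Redshift can read the shredded data
aws iam attach-role-policy \
  --role-name RedshiftLoadRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Print the role to copy its ARN into the storage target config
aws iam get-role --role-name RedshiftLoadRole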

Hi @anton,

Thank you for your help. I double-checked the permissions and they seem to be set properly. If they weren’t, wouldn’t it always fail instead of failing only intermittently? When the error does occur, I re-run snowplow-emr-etl-runner with --resume-from="rdb_load" and everything works out fine.
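
For reference, the full re-run looks roughly like this (the config and resolver file names are just what I use; adjust to your setup):

./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json \
  --resume-from rdb_load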

Hi @neekipatel,

Sorry, you’re totally right; I must have missed that it fails intermittently.

In that case, I believe it happens due to the notorious S3 eventual consistency issue. What’s the typical number of files you’re loading (both in atomic-events and shredded)?

The problem is that when you have too many files, the discover logic can give you a wrong list of files, where some entries are basically ghosts from a previous load. S3 will become consistent, but only “eventually”, not right away. Meanwhile Redshift tries to load these ghost files and (correctly) fails on them.

We added some logic to RDB Loader to check and wait for some time, but in the end, unfortunately, there’s no silver bullet against eventual consistency: we have to wait.
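
Just to illustrate the idea (this is not RDB Loader’s actual code, only a sketch of the same check-and-wait approach using the AWS CLI against the run folder from your log):

BUCKET=XXXXX
PREFIX=shredded/good/run=2017-09-27-22-00-18/atomic-events/

# List the keys the discover step would see...
for key in $(aws s3 ls "s3://$BUCKET/$PREFIX" | awk '{print $4}'); do
  # ...and poll each one until S3 actually serves it; a ghost key keeps returning 404
  until aws s3api head-object --bucket "$BUCKET" --key "$PREFIX$key" > /dev/null 2>&1; do
    echo "Waiting for $PREFIX$key to become readable..."
    sleep 10
  done
done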


Just wanted to report back: after some more investigation we found the issue was related to S3 versioning. Since we turned off S3 versioning, the issue hasn’t occurred in the last 3 days.
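
In case it helps anyone else hitting this, suspending versioning on the bucket is straightforward with the AWS CLI (the bucket name below is a placeholder):

# Check whether versioning is currently enabled on the shredded/enriched bucket
aws s3api get-bucket-versioning --bucket my-snowplow-bucket

# Suspend it; existing object versions are kept, but no new versions are created
aws s3api put-bucket-versioning \
  --bucket my-snowplow-bucket \
  --versioning-configuration Status=Suspended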

Thanks for sharing @neekipatel!

Hey @anton,

This issue recently began cropping up routinely for us. We do not have versioning enabled on our buckets, and we are using RDB Loader R31. It seems to resolve after re-running the rdb_loader step. Any suggestions for reducing the frequency of this error? Also, it looks like the snowplow-processing-manifest could help mitigate this issue; are there any docs kicking around for getting that set up? Thanks!

Hi @evanlamarre,

What version of enrich are you using: Spark or Stream Enrich? I’m asking because one thing that helped us significantly with this problem is more frequent loads (and hence smaller folders) with Stream Enrich, although I don’t expect this issue to crop up too often with any modern RDB Loader (0.13.0+). Technically we could just increase the number of consistency checks, but as I said, this problem generally went away in the pipelines we manage.

As for the snowplow-processing-manifest: we don’t use it anymore and are planning to deprecate it, as its complexity was outweighing the benefits. However, we also plan to start working on near-real-time loading soon, which will attack this problem for our OSS users from one more angle.

Hey @anton,

Thanks for the quick reply! We are using Spark Enrich (1.18.0), and we are running the full pipeline hourly for most hours of the day. Hoping to move to stream in the near future.

Hoping to move to stream in the near future.

That’s a very good idea anyway, as we’re deprecating Spark Enrich. Have a read of Paul’s upgrading guide: AWS batch pipeline to real-time pipeline upgrade guide.

One more thing I forgot to mention is the consolidate_shredded_output setting from R112, which also helped to increase the stability of the loading step.


Great, thanks @anton! Very helpful.

Is there any ‘wait and retry’ check for the RDB Shredder?

I’m having this problem intermittently, but it fails a lot more often than it works. I’m using RDB Shredder 0.16.0 on top of 48 files.
I’m using Stream Enrich and I’ve configured it to dump the enriched files twice per hour, hence 48 files.

Is there any way I can make this shredding process more stable? Are 48 files too many?