EMR intermittently fails at loading from S3 into Redshift

Hi,

I am getting intermittent failures in the Snowplow EMR job during the “Elasticity Custom Jar Step: Load Redshift Storage Target” step. Is anyone else running into the same problem? I am on the latest release (v92). I think the problem might be a lag in S3, where one step uploads a ton of files to S3 and the next step quickly tries to access those files to load them into Redshift. Below is the stdout error from EMR:

Data loading error [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
Details: 
 -----------------------------------------------
  error:  S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
  code:      8001
  context:   S3 key being read : s3://XXXXX/shredded/good/run=2017-09-27-22-00-18/atomic-events/part-00061.gz
  query:     208888
  location:  table_s3_scanner.cpp:352
  process:   query3_68 [pid=10325]
  -----------------------------------------------;
ERROR: Data loading error [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
Details: 
 -----------------------------------------------
  error:  S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
  code:      8001
  context:   S3 key being read : s3://XXXXX/shredded/good/run=2017-09-27-22-00-18/atomic-events/part-00061.gz
  query:     208888
  location:  table_s3_scanner.cpp:352
  process:   query3_68 [pid=10325]
  -----------------------------------------------;
Following steps completed: [Discover]
INFO: Logs successfully dumped to S3 [s3://XXXXX/log/rdb-loader/2017-09-27-23-00-18/16dc63e6-6720-43d1-bbd9-097c06dffeec]

Hello @neekipatel,

I believe this error happens due to an invalid Role ARN. It must look like arn:aws:iam::719197435995:role/RedshiftLoadRole, and the role must have the AmazonS3ReadOnlyAccess permission attached.

To set this up in the IAM console, choose Amazon Redshift → AmazonS3ReadOnlyAccess and pick a role name, for example RedshiftLoadRole. Once the role is created, copy the Role ARN, as you will need it for the Redshift storage target configuration.
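
For reference, here is a rough AWS CLI sketch of that setup (the trust policy below and the RedshiftLoadRole name are just examples; adjust them to your account):

# Create a role that Redshift is allowed to assume
aws iam create-role \
  --role-name RedshiftLoadRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "redshift.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach the managed AmazonS3ReadOnlyAccess policy so Redshift can read the shredded data
aws iam attach-role-policy \
  --role-name RedshiftLoadRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Print the role to copy its ARN into the storage target config
aws iam get-role --role-name RedshiftLoadRole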

Hi @anton,

Thank you for your help. I double-checked the permissions and they seem to be set properly. If they weren’t, wouldn’t it always fail instead of failing only intermittently? When the error does occur, I re-run snowplow-emr-etl-runner with --resume-from="rdb_load" and everything works out fine.
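
For reference, the full re-run looks roughly like this (the config and resolver file names are just what I use; adjust to your setup):

./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json \
  --resume-from rdb_load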

Hi @neekipatel,

Sorry, you’re totally right; I must have missed that it fails intermittently.

In that case, I believe it happens due to the notorious S3 eventual consistency issue. What’s the typical number of files you’re loading (both in atomic-events and shredded)?

The problem is that when you have too many files, the discover logic can give you a wrong list of files, where some entries are basically ghosts from a previous load. S3 will become consistent, but only “eventually”, not right away. Meanwhile Redshift tries to load these ghost files and (correctly) fails on them.

We added some logic to RDB Loader to check and wait for some time, but in the end, unfortunately, there’s no silver bullet against eventual consistency: we have to wait.
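
Just to illustrate the idea (this is not RDB Loader’s actual code, only a sketch of the same check-and-wait approach using the AWS CLI against the run folder from your log):

BUCKET=XXXXX
PREFIX=shredded/good/run=2017-09-27-22-00-18/atomic-events/

# List the keys the discover step would see...
for key in $(aws s3 ls "s3://$BUCKET/$PREFIX" | awk '{print $4}'); do
  # ...and poll each one until S3 actually serves it; a ghost key keeps returning 404
  until aws s3api head-object --bucket "$BUCKET" --key "$PREFIX$key" > /dev/null 2>&1; do
    echo "Waiting for $PREFIX$key to become readable..."
    sleep 10
  done
done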


Just wanted to report back: after some more investigation we found the issue was related to S3 versioning. Since we turned off S3 versioning, the issue hasn’t occurred in the last 3 days.
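
In case it helps anyone else hitting this, suspending versioning on the bucket is straightforward with the AWS CLI (the bucket name below is a placeholder):

# Check whether versioning is currently enabled on the shredded/enriched bucket
aws s3api get-bucket-versioning --bucket my-snowplow-bucket

# Suspend it; existing object versions are kept, but no new versions are created
aws s3api put-bucket-versioning \
  --bucket my-snowplow-bucket \
  --versioning-configuration Status=Suspended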

Thanks for sharing @neekipatel!

Hey @anton,

This issue recently began cropping up routinely for us. We do not have versioning enabled on our buckets, and we are using RDB Loader R31. It seems to resolve after re-running the rdb_loader step. Any suggestions for reducing the frequency of this error? Also, it looks like the snowplow-processing-manifest could help mitigate this issue; are there any docs kicking around for getting that set up? Thanks!

Hi @evanlamarre,

What version of enrich are you using: Spark or Stream Enrich? I’m asking because one thing that helped us significantly with this problem is more frequent loads (and hence smaller folders) with Stream Enrich, although I don’t expect this issue to crop up too often with any modern RDB Loader (0.13.0+). Technically we could just increase the number of consistency checks, but as I said, this problem generally went away in the pipelines we manage.

As for the snowplow-processing-manifest: we don’t use it anymore and are planning to deprecate it, as its complexity was outweighing the benefits. However, we also plan to start working on near-real-time loading soon, which will attack this problem for our OSS users from one more angle.

Hey @anton,

Thanks for the quick reply! We are using Spark Enrich (1.18.0), and we are running the full pipeline hourly for most hours of the day. Hoping to move to stream in the near future.

Hoping to move to stream in the near future.

That’s a very good idea anyway, as we’re deprecating Spark Enrich. Have a read of Paul’s upgrading guide: AWS batch pipeline to real-time pipeline upgrade guide.

One more thing I forgot to mention is the consolidate_shredded_output setting from R112, which also helped to increase the stability of the loading step.


Great, thanks @anton! Very helpful.

Is there any ‘wait and retry’ check for the RDB Shredder?

I’m having this problem intermittently, but it fails a lot more often than it works. I’m using RDB Shredder 0.16.0 on top of 48 files.
I’m using Stream Enrich and I’ve configured it to dump the enriched files twice per hour, hence 48 files.

Is there any way I can make this shredding process more stable? Are 48 files too many?