RDB Loader 0.18.1: "Folder with atomic-events was not found in shredded/good"

Hi all,

My last EMR cluster failed at the rdb_load step with the following logs:

15:13:01.199: Consistency check failed. Making another attempt
15:13:11.306: Consistency check failed. Making another attempt
15:13:21.442: Consistency check failed. Making another attempt
15:13:31.529: Consistency check did not pass after 5 attempts
15:13:31.538: Data discovery error with following issues:
Folder with atomic-events was not found in [s3:/bucket/shredded/good/run=2021-05-06-12-00-21/]

I couldn’t find a solution for this error.

I tried running a new job from rdb_load but it failed with the same error.

Any help will be appreciated!

@guillaume, I suspect you have lots of “empty” (0 bytes) directories and/or files in the s3:/bucket/shredded/good/ location. You need to delete them so that the app can see your data in the run=2021-05-06-12-00-21 folder (provided you do have data in that folder). It’s a good idea to do a periodic clean-up to prevent this error.
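One way to look for such leftovers is to scan an object listing for zero-byte entries. This is a minimal sketch, assuming the listing has already been fetched into (key, size) pairs, for example with boto3's list_objects_v2 paginator. The function name and the sample keys are hypothetical and are not taken from the real bucket.

```python
def find_empty_objects(listing, prefix):
    """Return the sorted keys under `prefix` whose size is 0 bytes.

    `listing` is an iterable of (key, size_in_bytes) pairs.
    """
    return sorted(
        key for key, size in listing
        if key.startswith(prefix) and size == 0
    )


# Hypothetical listing, for illustration only.
sample = [
    ("shredded/good/run=2021-05-06-12-00-21/atomic-events/part-00000.txt.gz", 81920),
    ("shredded/good/run=2021-04-01-00-00-00_$folder$", 0),
    ("shredded/good/run=2021-04-02-00-00-00/", 0),
    ("enriched/good/empty", 0),
]

empty_keys = find_empty_objects(sample, "shredded/good/")
for key in empty_keys:
    print(key)
```

Note that _SUCCESS markers are normally 0 bytes as well, so they would show up in this list too. Review the result before deleting anything.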

I have only 194 objects in s3:/bucket/shredded/good/ (I cleaned the bucket a few weeks ago).

In run=2021-05-06-12-00-21, I have 3 folders:

  • atomic-events
  • shredded-tsv
  • shredded-types

In atomic-events, I have:

  • a file _SUCCESS
  • 46 files part-000xx-5a3c5f8b-c2ee-4142-be72-1fe66a9cde56-c000.txt.gz

where xx goes from 00 to 45. Each of those files is about 80 KB.
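The manual check described above can also be done by counting keys per sub-folder of the run. This is only a sketch: it is not RDB Loader's actual discovery logic, the helper name is hypothetical, and the keys are rebuilt from the file-name pattern quoted above rather than read from S3. The contents of shredded-tsv and shredded-types are not known, so they are left out.

```python
from collections import Counter


def count_by_subfolder(keys, run_prefix):
    """Count the objects in each first-level sub-folder under `run_prefix`."""
    counts = Counter()
    for key in keys:
        if not key.startswith(run_prefix):
            continue
        rest = key[len(run_prefix):]
        # Only keys that sit inside a sub-folder are counted.
        if "/" in rest:
            counts[rest.split("/", 1)[0]] += 1
    return dict(counts)


run_prefix = "shredded/good/run=2021-05-06-12-00-21/"
file_id = "5a3c5f8b-c2ee-4142-be72-1fe66a9cde56"

# _SUCCESS plus part-00000 ... part-00045, as described in the post.
keys = [run_prefix + "atomic-events/_SUCCESS"] + [
    f"{run_prefix}atomic-events/part-{i:05d}-{file_id}-c000.txt.gz"
    for i in range(46)
]

summary = count_by_subfolder(keys, run_prefix)
has_atomic_events = "atomic-events" in summary
print(summary, has_atomic_events)
```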

@guillaume, that looks good, provided there are no other files or folders in s3:/bucket/shredded/good/. It could be the infamous AWS eventual consistency issue, where a listing does not yet reflect the actual state of the bucket.

You could try resuming with --skip consistency_check.

I emptied s3:/bucket/shredded/good/ completely, resumed from shred and it worked.

Thanks!

I’m going to add a lifecycle rule on the bucket to automatically delete files older than 24 hours.
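Such a rule could look roughly like the sketch below. It makes two assumptions that the thread does not confirm: the rule is scoped to the shredded/good/ prefix rather than the whole bucket, and it is applied with boto3. The helper name and the rule ID are hypothetical. S3 expiration is set in whole days, so 1 day is the closest match to 24 hours.

```python
import json


def build_expiration_rule(prefix, days=1):
    """Build an S3 lifecycle configuration that expires objects under `prefix`."""
    if days < 1:
        raise ValueError("S3 expiration needs a positive number of days")
    rule_id = "expire-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


lifecycle = build_expiration_rule("shredded/good/")
print(json.dumps(lifecycle, indent=2))

# Applying it needs boto3 and AWS credentials; "bucket" is a placeholder name:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="bucket", LifecycleConfiguration=lifecycle
# )
```

Lifecycle expiration runs asynchronously, so objects may stay around somewhat longer than 24 hours before they are actually removed.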