i don’t get this. we had that AWS Redshift outage last night for an hour, so EMR failed at the load step. usually i can just run this command to reload the data stuck in the shredded folder, but this time it’s stuck on enriched, and the good folder it’s complaining about is right there in the enriched folder. also i remember an older release wrote a file to S3 that listed which files had been put into enriched processing. if i could find that file again i could just start over and re-process those files from scratch, but i can’t find it anymore.
@anton or @alex, any ideas? i was going to rerun from scratch if i knew which files enrichment had processed, so i could re-process them, but i can’t find the list. the other problem is that DynamoDB already has the shredded items on file from going through the shred step.
Sorry, I’m not sure I entirely follow the problem, but what command are you using to run recovery? If your previous run failed at the load step, then you should try this:
./snowplow-emr-etl-runner --resume-from load ...
This should load data from shredded good and then archive both enriched and shredded.
just to close this out: i had no choice but to clean out the shredded and enriched folders, move the raw files in the archive directory back into processing, and then run with --skip staging. i also had to disable DynamoDB (cross-batch dedupe) for that batch since those events had already been run through it.
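for anyone hitting the same thing later, the rough sequence was something like this (the S3 paths below are just placeholders, use whatever is in your config.yml):

# clear out the half-processed enriched and shredded good folders
aws s3 rm s3://ga-snowplow-production/snowplow-enriched/good/ --recursive
aws s3 rm s3://ga-snowplow-production/snowplow-shredded/good/ --recursive
# move the already-archived raw files back into processing (paths are placeholders)
aws s3 mv s3://ga-snowplow-production/archive/raw/ s3://ga-snowplow-production/processing/ --recursive
# re-run, skipping staging since the raw files are already in processing
./snowplow-emr-etl-runner --skip staging ...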
btw our dev cluster had the same exact problem as prod.
E, [2018-06-01T11:59:16.549000 #7229] ERROR -- : No run folders in [s3://ga-snowplow-production/snowplow-enriched/good/] found
This error occurs when EmrEtlRunner can’t extract the latest run ID because of the large number of empty *$folder$ placeholder files left in the snowplow-enriched and snowplow-shredded buckets. These files are left behind by the S3DistCp step. There’s an open issue (#3439) to add a maintenance step for cleaning them up.
For now, you can create a script to remove these files regularly, or remove them manually with the aws s3 rm command when the need arises. Once the files are removed, --resume-from rdb_load should work as expected.
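For example, something along these lines should clear them out (the enriched bucket is taken from your error above; the shredded path is an assumption, adjust to your config.yml; run with --dryrun first to confirm only the empty placeholder files match):

# preview which keys would be deleted
aws s3 rm s3://ga-snowplow-production/snowplow-enriched/good/ --recursive --exclude '*' --include '*$folder$' --dryrun
# then remove them for real in both buckets
aws s3 rm s3://ga-snowplow-production/snowplow-enriched/good/ --recursive --exclude '*' --include '*$folder$'
aws s3 rm s3://ga-snowplow-production/snowplow-shredded/good/ --recursive --exclude '*' --include '*$folder$'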