i don’t understand why this won’t resume from rdb-load.
it failed to load because redshift was in maintenance mode. but when i run the EMR command to resume from rdb-load, it shouldn’t care about the enriched folder, no? EMR definitely failed on the load step.
would upgrading help my stuck state though? i can fix it by clearing enriched/shred folders and letting it process from scratch to get things going again which is fine, i’ve done it before. or would upgrading to R95/latest be the best?
@anton the problem now is that shredding completed, so all those records are in dynamodb but not in redshift. so i would have to turn dynamodb dedup off in the config, let those records in twice, and then run the dedup SQL scripts
One thing about cross-batch deduplication: if you start from shred, i.e. process the same enriched data with the same etl_tstamp, it won’t do any harm. When cross-batch deduplication in the shred step encounters an event_id:event_fingerprint pair with the same etl_tstamp, it lets the event go through shredding (in other words, it does not de-duplicate it).
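To make the rule above concrete, here is a rough sketch of the decision as I understand it from the description. It is not the actual shred job code: the in-memory dict stands in for the DynamoDB table, and the function name, dict layout and timestamps are all made up for illustration.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for the DynamoDB duplicates table:
# (event_id, event_fingerprint) -> etl_tstamp of the run that first stored it.
SeenKey = Tuple[str, str]


def should_keep(seen: Dict[SeenKey, str], event_id: str,
                event_fingerprint: str, etl_tstamp: str) -> bool:
    """Return True if the event should go through shredding."""
    key = (event_id, event_fingerprint)
    stored = seen.get(key)
    if stored is None:
        # First time this pair is seen: record it and keep the event.
        seen[key] = etl_tstamp
        return True
    # Same etl_tstamp means the same batch is being reprocessed: keep it.
    # A different etl_tstamp means another batch already had it: drop it.
    return stored == etl_tstamp


seen: Dict[SeenKey, str] = {}
first_run = should_keep(seen, "e1", "fp1", "2017-12-01 10:00:00")
rerun = should_keep(seen, "e1", "fp1", "2017-12-01 10:00:00")
later_batch = should_keep(seen, "e1", "fp1", "2017-12-02 10:00:00")
print(first_run, rerun, later_batch)
```

So re-running shred over the same enriched data keeps every event, while the same event arriving in a later batch would be dropped.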
It is a bit strange that on R92 you still use StorageLoader. Since R90, RDB Loader is the preferred way to load data into Redshift.
The fact that you tried resuming from the “shred” step and got an error related to the “enrich” step
Snowplow::EmrEtlRunner::UnexpectedStateError (No run folders in [s3://ga-snowplow-production/snowplow-enriched/good/] found
is likely an indication of a bug which has been resolved in later versions, R95+. You might also get the “No run folders” error due to a large number of empty files (with names ending in _$folder$) accumulated in the “enriched/good” and “enriched/shredded” buckets, which S3DistCp leaves behind when it is used to move files. You might wish to delete those files manually.
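If there are too many of those files to delete by hand, something along these lines might help. It is only a sketch: it assumes the marker files are the keys ending in `_$folder$`, the function names are made up, and the bucket and prefix you pass in are your own. It uses boto3 (`list_objects_v2` and `delete_objects`) and defaults to a dry run, so nothing is deleted until you ask for it.

```python
from typing import Iterable, List

# Assumption: the empty marker objects have keys ending in this suffix.
MARKER_SUFFIX = "_$folder$"


def folder_marker_keys(keys: Iterable[str]) -> List[str]:
    """Pick out the keys that look like empty folder markers."""
    return [k for k in keys if k.endswith(MARKER_SUFFIX)]


def delete_folder_markers(bucket: str, prefix: str,
                          dry_run: bool = True) -> List[str]:
    """List marker keys under a prefix; delete them unless dry_run is set."""
    import boto3  # imported here so the filter above works without boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    markers: List[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [obj["Key"] for obj in page.get("Contents", [])]
        batch = folder_marker_keys(keys)
        markers.extend(batch)
        if batch and not dry_run:
            # A page holds at most 1000 keys, which is also the
            # per-request limit of delete_objects.
            s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": k} for k in batch]},
            )
    return markers


# Hypothetical keys, just to show what the filter selects.
sample = [
    "snowplow-enriched/good/run=2017-12-01-10-00-00_$folder$",
    "snowplow-enriched/good/run=2017-12-01-10-00-00/part-00000",
]
print(folder_marker_keys(sample))
```

Run it with `dry_run=True` first and check the returned list before switching to `dry_run=False`.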