I’m using EmrEtlRunner (Beanstalk/Clojure + S3 + EMR + Redshift). I had an issue where the EMR job ran for 2 days (it normally takes <20 mins), so I had to terminate it.
Once it was terminated, I ran “snowplow-runner-and-loader.sh” again, but first had to move files out of the S3 buckets (processing, shredded, enriched) because it throws an error if those folders aren’t empty, which is fine.
Anyway, when it all ran successfully again, I found I was missing a couple of days of data. How would I go about getting that back? I have all the files from processing, shredded and enriched (I didn’t delete anything, just moved it).
Also, I run it 6 times per day - would re-running it for part of a day cause duplicates in the atomic.events table?
Regarding the 6 times per day bit, I meant: if I copied the raw files back into processing for a time period I had already imported, would it cause duplicate rows in Redshift for that time period?
Actually, I just realised those dates are missing from the raw folder as well. They wouldn’t still be on the collector or anywhere else, would they (it’s from a few days ago)? And if so, could you advise how I would pull them?
Currently yes, it would load duplicates into Redshift. This will change in the future when we have cross-batch dedupe for Redshift, but this is still a way off.
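If a time period does get loaded twice, a rough way to spot the affected rows is to count how often each event_id appears. The sketch below is only an illustration, not Snowplow's own deduplication: it assumes event_id identifies an event, that you filter on collector_tstamp, and the names `find_duplicate_event_ids` and `DUPLICATE_CHECK_SQL` are made up for this example.

```python
from collections import Counter

# Hypothetical query: lists event_ids that occur more than once in a time window.
# Assumes the standard atomic.events columns event_id and collector_tstamp.
DUPLICATE_CHECK_SQL = """
SELECT event_id, COUNT(*) AS n
FROM atomic.events
WHERE collector_tstamp >= %(start)s
  AND collector_tstamp < %(end)s
GROUP BY event_id
HAVING COUNT(*) > 1;
"""


def find_duplicate_event_ids(rows):
    """Return {event_id: count} for every event_id seen more than once.

    `rows` is any iterable of dicts with an "event_id" key, for example
    rows already fetched from atomic.events for the re-run period.
    """
    counts = Counter(row["event_id"] for row in rows)
    return {event_id: n for event_id, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up sample rows: event "a" was loaded twice.
    sample = [{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}]
    print(find_duplicate_event_ids(sample))
```

The SQL string can be run with any Redshift-compatible client that accepts named parameters; the Python function does the same check on rows you have already pulled down.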
Afraid not - if staging ran, moved files from your collectors’ S3 bucket to staging, and then you deleted those files from staging, those events are irretrievably gone.