Processing logs for a specific time period

Hi guys,

I’m using EmrEtlRunner (Beanstalk/Clojure + S3 + EMR + Redshift) and I had an issue where an EMR job was running for 2 days (it normally takes <20 mins), so I had to terminate it.

Once it was terminated, I ran it again, but I had to move the files out of the S3 buckets (processing, shredded, enriched) first, because otherwise it throws an error that the folders aren’t empty - which is fine.

Anyway, when it all ran successfully again, I found I was missing a couple of days of data. How would I go about getting that back? I still have all the files from processing, shredded and enriched (I didn’t delete anything, just moved it).

Also, I run it 6 times per day - would re-running it for part of a day cause duplicate rows in the table?



It should be pretty easy to recover the missing 2 days of data:

  • Wait till the latest run has fully completed
  • Pause the regular schedule of processing
  • Move the raw files that you had to move out of the S3 bucket back into processing (there’s a rough sketch of this after the list)
  • Run the pipeline with --skip staging
  • Confirm the pipeline runs through and loads the missing data into Redshift
  • Un-pause the regular schedule of processing
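
For the “move the raw files back” step, a rough boto3 sketch like the one below is usually all it takes. The bucket name and prefixes are placeholders - swap in whatever your config.yml actually points at:

```python
# Rough sketch only - the bucket name and prefixes here are placeholders;
# adjust them to match the raw locations in your own config.yml.
import boto3

s3 = boto3.client("s3")

BUCKET = "my-snowplow-etl"              # hypothetical ETL bucket
SOURCE_PREFIX = "archive/manual-move/"  # wherever you parked the raw files
DEST_PREFIX = "processing/"             # the raw processing folder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        new_key = DEST_PREFIX + key[len(SOURCE_PREFIX):]
        # Copy each raw file back into processing, then remove the parked copy
        s3.copy_object(Bucket=BUCKET, Key=new_key,
                       CopySource={"Bucket": BUCKET, "Key": key})
        s3.delete_object(Bucket=BUCKET, Key=key)
```

Once the files are back in processing, kick off EmrEtlRunner with --skip staging (the exact invocation depends on your version and wrapper script) and it will pick them up without trying to stage anything new.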

I’m not sure I understand the second question, though - could you clarify?

Thank you @alex, I’ll give it a go.

Regarding the 6 times per day bit: I meant that if I copied the raw files back into processing for a time period I had already loaded, would it cause duplicate rows in Redshift for that period?

Thank you.

Actually, I just realised those dates are missing from the raw folder as well. They wouldn’t still be on the collector or anywhere else, would they (it’s from a few days ago)? And if so, could you advise how I would pull them?

Hi Tim -

Currently yes, it would load duplicates into Redshift. This will change in the future when we have cross-batch dedupe for Redshift, but this is still a way off.
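
If you do end up loading an overlapping window twice, one way to see the damage is to count repeated event_ids for that period before deciding how to clean up. A rough sketch, assuming psycopg2 for the Redshift connection and the standard atomic.events table - the connection details and dates are placeholders:

```python
# Rough sketch - connection details are placeholders; assumes the standard
# Snowplow atomic.events table with event_id and collector_tstamp columns.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="snowplow", user="loader", password="...")

DUPES_SQL = """
    SELECT event_id, COUNT(*) AS copies
    FROM atomic.events
    WHERE collector_tstamp BETWEEN %s AND %s
    GROUP BY event_id
    HAVING COUNT(*) > 1
    ORDER BY copies DESC
    LIMIT 100;
"""

with conn.cursor() as cur:
    # Check the window you re-loaded for event_ids that now appear twice
    cur.execute(DUPES_SQL, ("2016-01-01", "2016-01-03"))
    for event_id, copies in cur.fetchall():
        print(event_id, copies)
```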

Afraid not - if staging ran, moved files from your collectors’ S3 bucket to staging, and then you deleted those files from staging, those events are irretrievably gone.

Thanks @alex, that explains it.