Processing logs for a specific time period

timgriffinau · November 8, 2016, 2:09am

Hi guys,

I’m using EmrEtlRunner (Beanstalk/Clojure + S3 + EMR + Redshift) and I had an issue where EMR was running for 2 days (normally takes <20 mins) so I had to terminate it.

Once it was terminated, I ran “snowplow-runner-and-loader.sh” again, but first had to move files out of the S3 folders (processing, shredded, enriched) because it throws an error if those folders aren’t empty, which is fine.

Anyway, when it all ran successfully again, I found I was missing a couple of days of data. How would I go about getting that back? I have all the files from processing, shredded and enriched (I didn’t delete anything, just moved it).

Also, I run it 6 times per day - would re-running it for part of a day cause duplicates in the atomic.events table?

Thanks!

Cheers,
Tim

alex · November 8, 2016, 10:02pm

It should be pretty easy to recover the missing 2 days of data:

1. Wait till the latest run has fully completed
2. Pause the regular schedule of processing
3. Move the raw files that you had to move out of the S3 bucket back into processing
4. Run the pipeline with --skip staging
5. Confirm the pipeline runs through and loads the missing data into Redshift
6. Un-pause the regular schedule of processing
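For the step that moves the raw files back into processing, it can help to plan the moves before touching S3. The following is only a minimal sketch: the prefixes `backup/processing/` and `processing/` and the helper `plan_moves` are hypothetical names, not part of EmrEtlRunner, and the actual copy would still have to be done with whatever S3 tool you normally use.

```python
from typing import List, Tuple


def plan_moves(keys: List[str], src_prefix: str, dst_prefix: str) -> List[Tuple[str, str]]:
    """Pair each key under src_prefix with its destination key under dst_prefix."""
    moves = []
    for key in keys:
        if not key.startswith(src_prefix):
            continue  # ignore anything outside the set-aside folder
        if key.endswith("/"):
            continue  # skip folder placeholder objects
        relative = key[len(src_prefix):]
        moves.append((key, dst_prefix + relative))
    return moves


# Hypothetical key names, for illustration only.
set_aside_keys = [
    "backup/processing/2016-11-05-log.gz",
    "backup/processing/",
    "other/file.gz",
]
moves = plan_moves(set_aside_keys, "backup/processing/", "processing/")
for src, dst in moves:
    print(src, "->", dst)
```

Reviewing the printed plan first makes it easier to check that only the set-aside raw files end up in processing before running with --skip staging.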

As for running it 6 times per day: I’m not sure I understand the question?

timgriffinau · November 9, 2016, 12:49am

Thank you @alex I’ll give it a go.

Regarding the 6 times per day bit, I meant: if I copied the raw files back into processing for a time period I had already imported, would it cause duplicate rows in Redshift for that time period?

Thank you.

timgriffinau · November 9, 2016, 1:05am

Actually, I just realised those dates are missing from the raw folder as well. They wouldn’t still be on the collector or anywhere else, would they (it’s from a few days ago)? And if so, could you advise how I would pull them?

alex · November 9, 2016, 10:25am

Hi Tim -

On duplicates: currently yes, it would load duplicates into Redshift. This will change in the future when we have cross-batch dedupe for Redshift, but this is still a way off.
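To make the duplication concrete, here is a small sketch of what loading the same batch twice looks like, and one way duplicates could be filtered afterwards. It assumes each event carries a unique `event_id`; the helper `dedupe_by_event_id` and the sample rows are hypothetical, and this is not the cross-batch dedupe mentioned above.

```python
from typing import Dict, List


def dedupe_by_event_id(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first row seen for each event_id, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        event_id = row["event_id"]
        if event_id in seen:
            continue  # a later copy of an event we already kept
        seen.add(event_id)
        unique.append(row)
    return unique


# Hypothetical sample rows standing in for one batch of events.
batch = [
    {"event_id": "a1", "collector_tstamp": "2016-11-05 01:00:00"},
    {"event_id": "b2", "collector_tstamp": "2016-11-05 01:05:00"},
]

# Loading the same batch twice simply appends the rows again.
table = batch + batch
deduped = dedupe_by_event_id(table)
print(len(table), len(deduped))
```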

On recovering the raw files: afraid not - if staging ran, moved files from your collectors’ S3 bucket to staging, and then you deleted those files from staging, those events are irretrievably gone.

timgriffinau · November 14, 2016, 10:03pm

Thanks @alex, that explains it.
