How to reprocess raw events for particular dates

My EMR jobs failed during the enrich process. I tried to resume the run, but had to empty the enrich/good folder first. The process finished, but for some dates there are only a few events (e.g. 100 visits instead of 50k). I suppose I did something wrong when resuming the enrich process. Is there a way to replay all events starting from a given date, i.e. take the raw events and run them through the whole pipeline again?

@sphinks, how to resume correctly depends on which dataflow step the failure took place in and which mode the pipeline runs in; see https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps. For example, if the failure happened during the enrich step, you can typically resume as in the sketch below.
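A minimal sketch of that case, assuming the raw files were already moved into the processing bucket by the staging step; the exact invocation (the `run` subcommand, config and resolver paths) varies by EmrEtlRunner release, so adjust to yours:

```bash
# Sketch only: resume a run that failed during the enrich step.
# Staging already moved the raw files into the processing bucket,
# so we skip staging and let EMR pick them up again.
# Paths and the `run` subcommand depend on your EmrEtlRunner release.
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging
```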

It is possible to reprocess the data, depending on your pipeline architecture. However, it is not clear to me what exactly happened here and what state your batch pipeline is in. Do you run the pipeline in Stream Enrich mode or in pure batch mode? Typically, the pipeline keeps the data for each intermediate state in dedicated S3 locations (raw, enriched, and shredded), which is what allows you to resume or reprocess. You can inspect those locations to see how far the failed run got, as in the sketch below.
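A quick way to check, assuming placeholder bucket names (substitute the values from the `aws -> s3 -> buckets` section of your own config.yml):

```bash
# Sketch: inspect the intermediate S3 locations of the batch pipeline.
# Bucket/prefix names below are placeholders; take the real ones
# from your config.yml.
aws s3 ls s3://my-snowplow-bucket/raw/processing/   # raw events mid-run
aws s3 ls s3://my-snowplow-bucket/enriched/good/    # enriched output
aws s3 ls s3://my-snowplow-bucket/shredded/good/    # shredded output
aws s3 ls s3://my-snowplow-archive/raw/             # archived raw runs
```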

@ihor I resumed the failed pipeline job and it finished, but it loaded too few rows into the database. The EMR jobs are now running as expected. I’m using batch mode and want to reprocess all events for a particular date from the very beginning of the pipeline (from the raw folder on S3). How can I do that?

@ihor or anyone?

@sphinks, you would need to move the files from the archive:raw bucket to the processing bucket and run EmrEtlRunner with the --skip staging option.
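A minimal sketch of that recipe, assuming the default run=... folder layout in the raw archive and placeholder bucket names (take the real ones from your config.yml); copy only the run folders covering the dates you want to replay:

```bash
# Sketch: replay archived raw events through the whole pipeline.
# 1. Copy the archived raw run(s) for the dates you care about back
#    into the processing bucket (bucket names are placeholders).
aws s3 cp s3://my-snowplow-archive/raw/run=2017-03-01-00-00-00/ \
          s3://my-snowplow-bucket/raw/processing/ --recursive

# 2. Rerun EmrEtlRunner, skipping the staging step so it picks up
#    the files already sitting in raw/processing. The exact paths
#    and subcommand depend on your EmrEtlRunner release.
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging
```

Note that loading the same events a second time can create duplicates in the target database, so you may want to delete the previously loaded rows for those dates before rerunning.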