Hi!
I managed to fix my payload using the snowplow-event-recovery tool and got a bunch of part-r-XXXXX.lzo files in an S3 bucket. What should I do next to get them into the database (we use Redshift)?
Thanks.
@gherolyants, you process them as usual in your Snowplow pipeline. The difference is that the bucket the files have been recovered into serves as the processing bucket. That means when running the pipeline you skip the staging step (step 1 in the dataflow diagram) by using the `--skip staging` option.
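For reference, an EmrEtlRunner invocation skipping staging might look like the sketch below. The config and resolver file names are placeholders for your own files, and depending on your EmrEtlRunner version the `run` subcommand may or may not be needed:

```
# Run the pipeline against the recovery bucket, skipping step 1 (staging)
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json \
  --skip staging
```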
It is also wise to create a separate schema in Redshift, say `recovery`, to load the recovered events into. You would review the recovered events there and, once you have confirmed all is good, copy them over into the `atomic` schema.
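As a rough sketch (the `recovery` schema name is just an example, and `atomic.events` is the standard Snowplow events table), the holding schema could be set up like this before pointing the loader at it for the recovery run:

```sql
-- Dedicated schema to hold recovered events for review
CREATE SCHEMA IF NOT EXISTS recovery;

-- Events table mirroring the structure of the production table
CREATE TABLE recovery.events (LIKE atomic.events);
```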
Do note that the recovered data might also contain good data that has already been successfully processed and is present in the `atomic` schema. You might want to filter that data out before copying the recovered data over (if duplicate events are an issue for you). This is because a `POST` payload goes to bad as a whole, together with the good events that passed validation; there is simply no easy mechanism to separate the good events from the bad ones within the payload.
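A hedged sketch of that filtered copy, assuming the standard `atomic.events` layout where `event_id` identifies an event (adapt the join key to however you detect duplicates in your setup):

```sql
-- Copy over only the recovered events that are not already in atomic
INSERT INTO atomic.events
SELECT r.*
FROM recovery.events r
LEFT JOIN atomic.events a ON a.event_id = r.event_id
WHERE a.event_id IS NULL;
```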