What to do next with recovered *.lzo files?


I managed to fix my payloads using the snowplow-event-recovery tool and now have a bunch of part-r-XXXXX.lzo files in an S3 bucket. What should I do next to get them into the database (we use Redshift)?


@gherolyants, you process them as usual in your Snowplow pipeline. The difference is that the bucket the files were recovered into serves as the processing bucket. That means you run the pipeline skipping the staging step (step 1 in the dataflow diagram) by passing the --skip staging option.
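As a rough sketch, the run might look like the following (the config and resolver file names are placeholders for your own setup; the only essential part is the --skip staging flag mentioned above):

```bash
# Re-run EmrEtlRunner over the recovered files. Skipping "staging"
# (step 1) makes it read straight from the processing bucket, which
# your config should point at the recovery output bucket.
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver resolver.json \
  --skip staging
```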

It is also wise to create a separate schema in Redshift (say, recovery) to load the recovered events into. Once the run completes successfully, review the recovered events and, when you have confirmed all is good, copy them over into the atomic schema.
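The setup side of that is just a scratch schema (you would then point your Redshift storage target's schema setting at it instead of atomic — the exact config key depends on your loader version):

```sql
-- One-off setup: a scratch schema to load recovered events into,
-- kept separate from production (atomic) until reviewed.
CREATE SCHEMA IF NOT EXISTS recovery;
```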

Do note that the recovered data might also contain good events that have already been processed successfully and are present in the atomic schema. You might want to filter those out before copying the recovered data over (if duplicate events are an issue for you). This happens because a failed POST payload goes to the bad bucket as a whole, together with the good events in it that passed validation; there is simply no easy mechanism to separate the good events from the bad ones within the payload.
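One way to do that filtering at copy time is an anti-join on the event ID, sketched below (assuming the standard Snowplow atomic.events layout with its event_id column, and the recovery schema suggested above; adjust names to your deployment, and remember any shredded/self-describing-event tables need the same treatment):

```sql
-- Copy recovered events into production, skipping any event IDs
-- that are already present in atomic.events.
INSERT INTO atomic.events
SELECT r.*
FROM recovery.events r
LEFT JOIN atomic.events a
  ON r.event_id = a.event_id
WHERE a.event_id IS NULL;
```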