Hi!
I managed to fix my payload using the snowplow-event-recovery tool and got a bunch of part-r-XXXXX.lzo files in an S3 bucket. What should I do next to get them into the database (we use Redshift)?
Thanks.
@gherolyants, you process them as usual in your Snowplow pipeline. The difference is that the bucket the files have been recovered into serves as the processing bucket. That means when running the pipeline you skip the staging step (step 1 in the dataflow diagram) by using the `--skip staging` option.
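For reference, an EmrEtlRunner invocation skipping staging might look like the sketch below. The config and resolver file names are placeholders for your own files, and depending on your EmrEtlRunner version the `run` subcommand may or may not be needed:

```
# Run the pipeline against the recovery bucket, skipping step 1 (staging)
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json \
  --skip staging
```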
It is also wise to create a separate schema in Redshift, say `recovery`, to load the recovered events into. You would review the recovered events there and, once you have confirmed all is good, copy them over into the `atomic` schema.
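As a rough sketch (the `recovery` schema name is just an example, and `atomic.events` is the standard Snowplow events table), the holding schema could be set up like this before pointing the loader at it for the recovery run:

```sql
-- Dedicated schema to hold recovered events for review
CREATE SCHEMA IF NOT EXISTS recovery;

-- Events table mirroring the structure of the production table
CREATE TABLE recovery.events (LIKE atomic.events);
```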
Do note that the recovered data might also contain good data that has already been successfully processed and is present in the `atomic` schema. You might want to filter that data out before copying the recovered data over (if duplicate events are an issue for you). This is because a `POST` payload goes to bad as a whole, together with the good events that passed validation; there is simply no easy mechanism to separate the good events from the bad ones within the payload.
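A hedged sketch of that filtered copy, assuming the standard `atomic.events` layout where `event_id` identifies an event (adapt the join key to however you detect duplicates in your setup):

```sql
-- Copy over only the recovered events that are not already in atomic
INSERT INTO atomic.events
SELECT r.*
FROM recovery.events r
LEFT JOIN atomic.events a ON a.event_id = r.event_id
WHERE a.event_id IS NULL;
```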