Snowplow Event Recovery on GCP

Hi everyone,
we want to use Snowplow Event Recovery on GCP to recover enrichment failures. My question is about the recovered events: is there a field in BigQuery to recognize the events that were recovered by this process?


Hi @Stefania_Iellamo,

there is no such field. You need to compare the load_tstamp column with collector_tstamp to validate the recovery. Additionally, you can see in the recovery job itself how many events were successfully recovered. Unfortunately, event recovery for GCP is not properly documented IMO, and I found it pretty hard to get it running. I am planning to write a short guide about it.
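To illustrate the timestamp comparison: since recovered events are loaded long after they were originally collected, a large gap between collector_tstamp and load_tstamp is a reasonable heuristic. Here is a minimal Python sketch of that idea on in-memory rows; the field names match the BigQuery columns, but the `flag_recovered` helper and the 6-hour threshold are assumptions you would tune to your pipeline's normal loading latency.

```python
from datetime import datetime, timedelta

# Hypothetical helper: flag rows whose load_tstamp lags collector_tstamp
# by more than a chosen threshold. Recovered events are reloaded well
# after collection, so they show a much larger lag than live events.
def flag_recovered(rows, threshold=timedelta(hours=6)):
    """rows: dicts with 'collector_tstamp' and 'load_tstamp' datetimes."""
    return [r for r in rows if r["load_tstamp"] - r["collector_tstamp"] > threshold]

rows = [
    # live event: loaded minutes after collection
    {"event_id": "a",
     "collector_tstamp": datetime(2023, 5, 1, 10, 0),
     "load_tstamp": datetime(2023, 5, 1, 10, 2)},
    # likely recovered event: loaded two days later
    {"event_id": "b",
     "collector_tstamp": datetime(2023, 5, 1, 10, 0),
     "load_tstamp": datetime(2023, 5, 3, 9, 0)},
]

recovered = flag_recovered(rows)
print([r["event_id"] for r in recovered])  # only "b" lags far behind
```

The same comparison translates directly into a WHERE clause on the BigQuery table if you want to inspect recovered events there.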

Hope that helps.

Hi @davidher_mann,
thanks for the reply. It helps!

Can I ask how you usually manage the recovery? I need to recover enrichment failures, and during my tests I noticed that the files used as input for the recovery are not deleted from the “badrows” bucket. To avoid recovering them multiple times, do you move them to another bucket before running the recovery command?

Thanks in advance.

Hi Stefania,

I think recovering events multiple times is only a real issue if a lot of people are doing recoveries on the same pipeline, and I would ignore it for now. In general, it would be possible to move the files in GCP to a dedicated folder, since you can select a bucket/folder of your choice via the inputDirectory parameter. I think it's not worth the effort because:

  • If you recover the same events multiple times the duplicates will be filtered out in the data modeling process, since it is normal to have duplicates in the raw table anyway.
  • To run the job you need to specify the folders in the bad-events bucket (inputDirectory parameter). Since the bucket is partitioned by year/month/day/hour, the start and end date/hour can be specified. Additionally, you can include a filter, e.g. on the collector timestamp (config parameter). Combined with a proper config, you can select the bad events for your recovery job very specifically, which reduces the risk.
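As a concrete example of scoping the input by partition: the sketch below generates the hourly GCS prefixes between a start and end hour, which you could then pass as the job's input. The `gs://<bucket>/YYYY/MM/DD/HH/` layout shown here is an assumption based on the year/month/day/hour partitioning described above; check it against your actual bad-rows bucket, and the `bad_row_prefixes` helper name is made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical helper: build the hourly partition prefixes of a bad-rows
# bucket for a given time window, assuming a YYYY/MM/DD/HH folder layout.
def bad_row_prefixes(bucket, start, end):
    """Return one gs:// prefix per hour from start to end, inclusive."""
    prefixes = []
    current = start
    while current <= end:
        prefixes.append(f"gs://{bucket}/{current:%Y/%m/%d/%H}/")
        current += timedelta(hours=1)
    return prefixes

# Window spanning midnight: 22:00 on May 1st through 01:00 on May 2nd.
paths = bad_row_prefixes("my-bad-rows",
                         datetime(2023, 5, 1, 22),
                         datetime(2023, 5, 2, 1))
print(paths)
```

Enumerating the prefixes like this keeps the job from scanning (and re-recovering) partitions outside the window you actually care about.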

You mentioned that you did a test: have you already managed to successfully create a recovery job in dataflow or where are you currently stuck?


Hi David,
thanks for your reply.

I already managed a successful recovery in a test. I'm thinking about the best procedure to recover a large number of failures we had before removing an enrichment that was causing errors, and I think we will not need to recover often in the future.

Thanks for your support.