Hello!
This is more of a discussion than a real problem.
We’re about to move from an AWS Batch pipeline to a realtime pipeline in GCP.
In AWS, we used to have an archive of all our raw events that the collector logged.
I can see how to replicate this in GCP (with a GCS Loader, although it could be a bit expensive), but these raw events (straight out of the collector) are Thrift records, and I can’t find a schema that lets me decode them.
I’ve tried with:
CollectorPayload (https://github.com/snowplow/snowplow/blob/master/2-collectors/thrift-schemas/collector-payload-1/src/main/thrift/collector-payload.thrift)
&
SnowplowRawEvents (https://github.com/snowplow/snowplow/blob/master/2-collectors/thrift-schemas/snowplow-raw-event/src/main/thrift/snowplow-raw-event.thrift),
both without luck.
I think CollectorPayload is the schema for decoding events later on (after the enrichment step, mainly the bad rows), while SnowplowRawEvents is for another kind of collector. Is that correct?
So first: did I mess something up when decoding, and should one of those schemas actually work?
Otherwise, I’m curious what everyone’s approach is to raw events in GCP. For us, archiving them would bring a feeling of security, knowing we could replay things after a catastrophic failure, like we could in AWS.
Is it something you’re abandoning? Are you building custom solutions? Is archiving raw events not considered a “good practice” anymore? Or am I just missing something dumb?
Thanks in advance for your help and input!