Hello!
This is more of a discussion than a real problem.
We’re about to move from an AWS Batch pipeline to a realtime pipeline in GCP.
In AWS, we used to have an archive of all our raw events that the collector logged.
I can see how to replicate this in GCP (with a GCS Loader, although it could be a bit expensive), but these raw events (straight out of the collector) are Thrift records, and I can’t find a schema that lets me decode them.
I’ve tried with:
CollectorPayload (https://github.com/snowplow/snowplow/blob/master/2-collectors/thrift-schemas/collector-payload-1/src/main/thrift/collector-payload.thrift)
&
SnowplowRawEvents (https://github.com/snowplow/snowplow/blob/master/2-collectors/thrift-schemas/snowplow-raw-event/src/main/thrift/snowplow-raw-event.thrift),
both without luck.
I think CollectorPayload is the schema for decoding events later on (after the enrichment step, mainly the bad rows), while SnowplowRawEvents is for another kind of collector. Is that correct?
So first: did I mess something up when decoding, and should one of those schemas actually work?
Otherwise, I’m curious what everyone’s approach is to raw events in GCP. For us, archiving them would bring a feeling of security, knowing we could replay things after a catastrophic failure, like we could in AWS.
Is it something you’re abandoning? Are you building custom solutions? Is archiving raw events not considered a “good practice” anymore? Or am I just missing something dumb?
Thanks in advance for your help and input!