Hello Snowplow Community,
I’d like to propose a discussion regarding the data format in our Snowplow pipeline. Currently, collectors ingest data in Thrift, which is converted to TSV after validation and enrichment. This adds complexity to our downstream pipelines: consumers must decode the TSV and depend on the Iglu schema registry. Shouldn’t we use schema-rich formats in the enrichment stage? That could be very efficient for stream processing use cases.
I believe simplifying our data format could both improve processing efficiency and reduce our dependence on the schema registry.
I think this depends on what you have in mind as “schema-rich”. Most systems or serialization frameworks are pretty similar to Snowplow in that they’ll point to a schema rather than embed the schema within the event. This saves a significant number of bytes and a lot of processing by allowing consumers to resolve schemas when they need to, rather than making every event larger.
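To make the size trade-off concrete, here is a minimal sketch (the schema URI, field names, and payloads are hypothetical, but the self-describing shape mirrors Snowplow's convention of referencing a schema by an Iglu URI rather than embedding it):

```python
import json

# An event that references its schema by an Iglu-style URI
# (hypothetical vendor/name, but the shape follows Snowplow's
# self-describing JSON convention).
event_with_reference = {
    "schema": "iglu:com.acme/button_click/jsonschema/1-0-0",
    "data": {"button_id": "signup", "page": "/pricing"},
}

# The same event with the full JSON Schema embedded inline,
# i.e. what a "schema-rich" envelope would look like.
event_with_embedded_schema = {
    "schema": {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "type": "object",
        "properties": {
            "button_id": {"type": "string"},
            "page": {"type": "string"},
        },
        "required": ["button_id"],
    },
    "data": {"button_id": "signup", "page": "/pricing"},
}

# The reference version is smaller on the wire; the gap grows
# with schema complexity and is paid on every single event.
ref_size = len(json.dumps(event_with_reference))
embedded_size = len(json.dumps(event_with_embedded_schema))
print(ref_size, embedded_size)
```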
I agree on the storage and retrieval part @mike
I think the bigger challenge is how schemas are defined in isolation and later stitched together into the payload in Snowplow. This makes it difficult for downstream consumers to fetch a single schema that can deserialize the entire payload.
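A short sketch of the stitching problem (the event and entity schemas here are hypothetical except for `web_page`, which follows Snowplow's published schema URI; the point is that each entity carries its own reference, so there is no one schema for the whole payload):

```python
# A simplified enriched event: the self-describing event plus a list
# of independently versioned context entities, each with its own
# schema reference.
enriched_event = {
    "event": {
        "schema": "iglu:com.acme/button_click/jsonschema/1-0-0",
        "data": {"button_id": "signup"},
    },
    "contexts": [
        {
            "schema": "iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0",
            "data": {"id": "a86c42e5-b831-45c8-b706-e214c26b4b3d"},
        },
        {
            "schema": "iglu:com.acme/user/jsonschema/2-0-1",
            "data": {"plan": "pro"},
        },
    ],
}

def schemas_needed(payload: dict) -> set:
    """Every schema a consumer must resolve from Iglu before it can
    fully deserialize this one payload."""
    refs = {payload["event"]["schema"]}
    refs.update(ctx["schema"] for ctx in payload["contexts"])
    return refs

print(sorted(schemas_needed(enriched_event)))
```

So a consumer must resolve three schemas for this one event, and the set varies per event depending on which entities are attached.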