Streamline Data Format In Enrich [TSV -> Avro/Thrift]

Jayant_Kumar · October 28, 2023, 9:20am

Hello Snowplow Community,

I’d like to propose a discussion regarding the data format in our Snowplow pipeline. Currently, collectors ingest data in Thrift, which is converted to TSV after validation and enrichment. This adds complexity to our downstream pipelines, which have to decode the TSV and depend on the Iglu schema registry. Shouldn’t we use a schema-rich format in the enrichment stage? This could be very efficient for stream processing use cases.

I believe simplifying our data format could both improve processing efficiency and reduce dependencies on the schema registry.

mike · October 29, 2023, 10:48pm

I think this depends on what you have in mind as “schema-rich”. Most systems or serialization frameworks are pretty similar to Snowplow in that they’ll point to a schema rather than embed the schema within the event. This saves a significant number of bytes and processing by allowing consumers to resolve schemas when they need to, rather than making the events larger.
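
To make the size trade-off concrete, here is a rough Python sketch comparing an event that points to its schema with one that embeds it. The `com.acme/checkout` schema and the payload are hypothetical, made up purely for illustration; only the `{"schema": "iglu:...", "data": ...}` self-describing shape is taken from Snowplow.

```python
import json

# Hypothetical schema, invented for this illustration (not a real Iglu Central schema).
CHECKOUT_SCHEMA = {
    "self": {
        "vendor": "com.acme",
        "name": "checkout",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["order_id", "total"],
    "additionalProperties": False,
}


def schema_uri(schema):
    """Build the iglu: reference from the schema's own 'self' block."""
    s = schema["self"]
    return f"iglu:{s['vendor']}/{s['name']}/{s['format']}/{s['version']}"


data = {"order_id": "A-1001", "total": 42.5}

# Event that only points to its schema (self-describing JSON style).
by_reference = {"schema": schema_uri(CHECKOUT_SCHEMA), "data": data}

# Event that carries the full schema definition inline.
embedded = {"schema": CHECKOUT_SCHEMA, "data": data}

ref_bytes = len(json.dumps(by_reference).encode("utf-8"))
emb_bytes = len(json.dumps(embedded).encode("utf-8"))

print(f"by reference: {ref_bytes} bytes, embedded: {emb_bytes} bytes")
```

The gap grows with the size of the schema and repeats for every event, whereas a consumer that resolves the reference can look the schema up once and cache it.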

Jayant_Kumar · October 30, 2023, 5:50am

I agree on the storage and retrieval part, @mike.
I think the bigger challenge is that schemas are defined in isolation and only later stitched together into a single payload in Snowplow. This makes it difficult for downstream consumers to fetch one schema later on and deserialize the entire payload.
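
As a rough illustration of the stitching problem, the Python sketch below lists every schema a consumer would have to resolve for a single contexts value. It assumes the contexts field holds a self-describing JSON whose `data` is an array of further self-describing JSONs; the `com.acme` entities and the helper name `schemas_needed` are hypothetical.

```python
import json


def schemas_needed(contexts_field):
    """Return the distinct schema references found in one contexts value.

    Assumes the wrapper is {"schema": ..., "data": [ {"schema": ..., "data": ...}, ... ]}.
    """
    wrapper = json.loads(contexts_field)
    uris = {wrapper["schema"]}
    for entity in wrapper["data"]:
        uris.add(entity["schema"])
    return sorted(uris)


# Hypothetical contexts value with two independently defined entities.
contexts = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0",
    "data": [
        {"schema": "iglu:com.acme/user/jsonschema/2-0-1", "data": {"id": "u-17"}},
        {"schema": "iglu:com.acme/cart/jsonschema/1-0-0", "data": {"items": 3}},
    ],
})

needed = schemas_needed(contexts)
for uri in needed:
    print(uri)
```

Each attached entity adds another schema to look up, so there is no single schema that describes the whole payload up front.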
