Avro as event serialization format to reduce the number of formats?

I am currently looking into Snowplow. From what I understand from the documentation and the code repos, the following formats are used to encode an (enriched) event (a sketch of step 1 follows the list):

  1. query string + JSON between client → collector
  2. Thrift between collector → enrich
  3. TSV between enrich → storage / processors
  4. JSON between storage / processors → downstream consumers
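
To make step 1 concrete, here is a minimal sketch of what I think a tracker request looks like. Parameter names such as `e`, `p`, `tv` and `ue_px` are from my reading of the tracker protocol docs, so treat them as illustrative rather than authoritative:

```python
# Sketch of a step-1 payload: tracker -> collector as query string + JSON.
import base64
import json
from urllib.parse import urlencode

# A self-describing JSON event, base64-encoded into the query string.
# The inner schema URI is a hypothetical example.
unstruct_event = {
    "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
    "data": {
        "schema": "iglu:com.acme/example_event/jsonschema/1-0-0",
        "data": {"value": 42},
    },
}

params = {
    "e": "ue",         # event type: self-describing ("unstructured") event
    "p": "web",        # platform
    "tv": "js-3.0.0",  # tracker version (illustrative)
    "ue_px": base64.urlsafe_b64encode(
        json.dumps(unstruct_event).encode()
    ).decode(),
}

print("GET /i?" + urlencode(params))
```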

I am wondering if my picture is correct / complete.

I saw comments in the code that there is/was an intent to move to Avro, I assume in order to reduce the number of formats currently in use.

Is there up-to-date information about that transition somewhere?
I am wondering if this transition is still intended to happen?

Hi @Jan-Eric_Duden, that is a pretty accurate picture. The TSV in step 3 is a mix of plain-vanilla TSV and JSON (some of the values are JSON blobs), as the sketch below shows.
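
Here is a rough sketch of what pulling a JSON blob out of an enriched TSV line looks like. The two-column-plus-JSON layout is illustrative only; the real enriched format has 100+ columns, which is why the Analytics SDKs are the supported way to parse it:

```python
# Sketch: an enriched event is one TSV line where a few columns are JSON.
import json

# Build an illustrative line: two plain values plus one JSON blob column.
line = "com.acme\tweb\t" + json.dumps(
    {
        "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1",
        "data": [],
    }
)

fields = line.split("\t")
app_id, platform, contexts_json = fields  # plain TSV values...
contexts = json.loads(contexts_json)      # ...with a JSON blob inside
print(app_id, platform, contexts["schema"])
```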

We are constantly thinking about how to improve the formats, but we don’t currently have any short-term plans for Avro support. Different formats have different strengths depending on the use case, so even though streamlining and simplifying the pipeline is something we are very keen to do, it’s not necessarily the main driving force for these decisions. E.g., adding support for Parquet in step 3 would further diversify the formats, but it would also unlock use cases for data in data lakes and the like.
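
For anyone curious what Avro would change in practice, here is a hedged sketch of event data carried as Avro (using the `fastavro` library, with a made-up two-field schema; this is not something we ship):

```python
# Sketch: the same kind of event fields carried as Avro instead of TSV/JSON.
# The record schema below is invented for illustration.
# Requires `pip install fastavro`.
import io

import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "EnrichedEvent",  # hypothetical name
    "namespace": "com.example",
    "fields": [
        {"name": "app_id", "type": "string"},
        {"name": "platform", "type": "string"},
    ],
})

# Write an Avro container file to an in-memory buffer.
buf = io.BytesIO()
fastavro.writer(buf, schema, [{"app_id": "com.acme", "platform": "web"}])

# Read it back: the schema travels with the data, unlike bare TSV.
buf.seek(0)
for record in fastavro.reader(buf):
    print(record)
```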
