Avro as event serialization format to reduce the number of formats?

I am currently looking into Snowplow. From what I understand from the documentation and the code repos, the following formats are used to encode an (enriched) event (a sketch of step 1 follows the list):

  1. query string + JSON between client → collector
  2. Thrift between collector → enrich
  3. TSV between enrich → storage / processors
  4. JSON between storage / processors → downstream consumers
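
To make step 1 concrete, here is a minimal sketch of what I think a tracker request looks like. Parameter names such as `e`, `p`, `tv` and `ue_px` are from my reading of the tracker protocol docs, so treat them as illustrative rather than authoritative:

```python
# Sketch of a step-1 payload: tracker -> collector as query string + JSON.
import base64
import json
from urllib.parse import urlencode

# A self-describing JSON event, base64-encoded into the query string.
# The inner schema URI is a hypothetical example.
unstruct_event = {
    "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
    "data": {
        "schema": "iglu:com.acme/example_event/jsonschema/1-0-0",
        "data": {"value": 42},
    },
}

params = {
    "e": "ue",         # event type: self-describing ("unstructured") event
    "p": "web",        # platform
    "tv": "js-3.0.0",  # tracker version (illustrative)
    "ue_px": base64.urlsafe_b64encode(
        json.dumps(unstruct_event).encode()
    ).decode(),
}

print("GET /i?" + urlencode(params))
```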

I am wondering if my picture is correct / complete.

I saw comments in the code that there is/was an intent to move to Avro, I assume in order to reduce the number of formats currently in use.

Is there up-to-date information about that transition somewhere?
I am wondering if this transition is still intended to happen?

Hi @Jan-Eric_Duden, that is a pretty accurate picture. The TSV in step 3 is a mix of plain-vanilla TSV and JSON (some of the values are JSON blobs), as the sketch below shows.
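
Here is a rough sketch of what pulling a JSON blob out of an enriched TSV line looks like. The two-column-plus-JSON layout is illustrative only; the real enriched format has 100+ columns, which is why the Analytics SDKs are the supported way to parse it:

```python
# Sketch: an enriched event is one TSV line where a few columns are JSON.
import json

# Build an illustrative line: two plain values plus one JSON blob column.
line = "com.acme\tweb\t" + json.dumps(
    {
        "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1",
        "data": [],
    }
)

fields = line.split("\t")
app_id, platform, contexts_json = fields  # plain TSV values...
contexts = json.loads(contexts_json)      # ...with a JSON blob inside
print(app_id, platform, contexts["schema"])
```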

We are constantly thinking about how to improve the formats, but we don’t currently have any short-term plans for Avro support. Different formats have different strengths depending on the use case, so even though streamlining and simplifying the pipeline is something we are very keen to do, it’s not necessarily the main driving force for these decisions. E.g., adding support for Parquet in step 3 would further diversify the formats, but it would also unlock use cases for data in data lakes and the like.
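
For anyone curious what Avro would change in practice, here is a hedged sketch of event data carried as Avro (using the `fastavro` library, with a made-up two-field schema; this is not something we ship):

```python
# Sketch: the same kind of event fields carried as Avro instead of TSV/JSON.
# The record schema below is invented for illustration.
# Requires `pip install fastavro`.
import io

import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "EnrichedEvent",  # hypothetical name
    "namespace": "com.example",
    "fields": [
        {"name": "app_id", "type": "string"},
        {"name": "platform", "type": "string"},
    ],
})

# Write an Avro container file to an in-memory buffer.
buf = io.BytesIO()
fastavro.writer(buf, schema, [{"app_id": "com.acme", "platform": "web"}])

# Read it back: the schema travels with the data, unlike bare TSV.
buf.seek(0)
for record in fastavro.reader(buf):
    print(record)
```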
