Hi Folks,
As part of our custom JavaScript event (trackSelfDescribingEvent), we want to configure the “AVRO” file format. This way the JavaScript analytics pixel takes care of AVRO serialization to an S3 location, and any further downstream ETLs, Enrichment, Storage directly work on AVRO format.
Any idea how to create a custom JSON schema for an AVRO data set, and how to implement it on the JavaScript side?
At the moment all of our trackers only support JSON to express self-describing events and custom contexts. We are starting some work to understand how we can automatically translate JSONs (based on their JSON Schemas) to Avro, but this would only occur initially very downstream of the original collection.
Moving Avro upstream into the trackers is an interesting idea - I see a few challenges with it:
Avro binary is basically impossible for non-data-engineers to successfully generate
Avro JSON is not well-supported in most programming languages
Avro JSON has a few sharp edges which could trip up an engineer (e.g. handling of union types)
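
To make the union-type point concrete, here is a minimal sketch of what Avro's JSON encoding expects for a field declared as the union ["null", "string"]. The record, its field names and the helper functions are hypothetical, invented only for illustration; they are not part of any Snowplow tracker.

```typescript
// Avro's JSON encoding of a union: a null value is written as a bare JSON
// null, while any non-null value is wrapped in a single-key object whose
// key is the name of the branch type.
type AvroNullableString = null | { string: string };

// Hypothetical helper for a field declared as ["null", "string"].
function encodeNullableString(value: string | null | undefined): AvroNullableString {
  if (value === null || value === undefined) return null;
  return { string: value };
}

// Hypothetical record: "action" is a plain string, "label" is the union.
interface PageAction {
  action: string;
  label?: string | null;
}

function toAvroJson(e: PageAction): Record<string, unknown> {
  return {
    action: e.action,                      // not a union, so written as-is
    label: encodeNullableString(e.label),  // union, so wrapped or null
  };
}

console.log(JSON.stringify(toAvroJson({ action: "click", label: "buy" })));
console.log(JSON.stringify(toAvroJson({ action: "click" })));
```

An engineer who writes the ordinary JSON form, "label": "buy", produces a document that looks reasonable but is not valid Avro JSON for that schema, which is the kind of sharp edge meant above.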
More deeply, even if we did support Avro from trackers onwards, I don’t think it would solve your problem, because Snowplow enriched events are hard to represent in Avro (or Protobuf or Thrift - basically any “strict” schema technology). The Snowplow enriched event is very heterogeneous: a given event can contain 10 or 20 discrete entities, all independently versioned.
This means that, if you sent in your self-describing event as Avro, you would not easily gain “any further ETLs, Enrichment, Storage directly work[ing] on AVRO format” for free.
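
As a rough illustration of that heterogeneity, the sketch below models one event plus a few attached entities as self-describing JSONs, each carrying its own Iglu schema reference and version. The vendor, schema names, versions and data fields are all made up, and `parseIgluUri` and `shapeKey` are hypothetical helpers, not tracker or pipeline APIs.

```typescript
// A self-describing JSON: a schema reference plus the data it describes.
interface SelfDescribingJson {
  schema: string;
  data: Record<string, unknown>;
}

// Hypothetical event and entities (all names and versions are invented).
const event: SelfDescribingJson = {
  schema: "iglu:com.acme/checkout_step/jsonschema/1-0-2",
  data: { step: 2 },
};
const entities: SelfDescribingJson[] = [
  { schema: "iglu:com.acme/user/jsonschema/2-1-0", data: { plan: "pro" } },
  { schema: "iglu:com.acme/cart/jsonschema/1-3-0", data: { items: 3 } },
  { schema: "iglu:com.acme/experiment/jsonschema/1-0-0", data: { variant: "b" } },
];

// Split an Iglu URI of the form iglu:vendor/name/format/MODEL-REVISION-ADDITION.
function parseIgluUri(uri: string) {
  const m = /^iglu:([^/]+)\/([^/]+)\/([^/]+)\/(\d+-\d+-\d+)$/.exec(uri);
  if (!m) throw new Error(`Not an Iglu schema URI: ${uri}`);
  return { vendor: m[1], name: m[2], format: m[3], version: m[4] };
}

// The overall "shape" of one enriched event depends on every schema version
// attached to it, so a single fixed record schema would have to anticipate
// each combination.
function shapeKey(ev: SelfDescribingJson, ents: SelfDescribingJson[]): string {
  return [ev, ...ents]
    .map((e) => {
      const p = parseIgluUri(e.schema);
      return `${p.vendor}/${p.name}@${p.version}`;
    })
    .sort()
    .join("|");
}

console.log(shapeKey(event, entities));
```

Bumping any one entity to a new version, or attaching one more entity, yields a different shape key. With 10 or 20 entities per event the number of possible combinations grows quickly, which is why a strict, fixed schema is awkward here.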
Sorry the answer isn’t more positive. Interesting idea though!