Setting additionalProperties to true in Iglu JSON Schemas

I was recently asked this question by a client:

I have a question about additionalProperties attribute in json schema. I’m wondering whether it can give us more flexibility as we could add more tracked properties ad-hoc on the client side. Do you have any example how it could be used? Are the additional properties ignored when the StorageLoader runs?

It’s a great question, so I’ve posted it here in case others are wondering. There’s a trade-off:

Go for a more ‘relaxed’ schema, i.e. set additionalProperties to true

  • Pro: developers can add new properties to an event / context without breaking validation. (So the data will still be successfully processed and e.g. loaded into Redshift.)
  • Con: the extra properties won’t be accessible to downstream processes (e.g. they won’t be loaded into Redshift).

The alternative is to set additionalProperties to false i.e. go for a ‘stricter’ schema

  • Con: events with extra properties will now fail validation
  • Pro: you should see that something’s gone wrong (an increase in the number of bad rows, e.g. by checking Kibana) and can:
    • update your schema to accommodate the new data
    • reprocess the data that has failed validation. (In practice this is fiddly / time consuming at the moment, but we’re working on a toolset to make it easier.)
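
The trade-off above can be sketched in a few lines of Python. This is a simplified illustration, not Iglu's actual validator: the check_event helper, the schemas, and the property names are all hypothetical, and it only models the boolean form of additionalProperties (no types, enums, or required fields).

```python
def check_event(schema, event):
    """Return (is_valid, kept, extra) for a flat event.

    Simplified stand-in for JSON Schema validation: it only models the
    boolean form of the additionalProperties keyword, nothing else.
    """
    declared = set(schema.get("properties", {}))
    # Properties the schema knows about, i.e. what downstream could rely on.
    kept = {k: v for k, v in event.items() if k in declared}
    # Properties the schema does not declare.
    extra = {k: v for k, v in event.items() if k not in declared}
    # JSON Schema treats a missing additionalProperties as true.
    allow_extra = schema.get("additionalProperties", True)
    is_valid = allow_extra is not False or not extra
    return is_valid, kept, extra


# Hypothetical schemas that differ only in additionalProperties.
properties = {"page": {"type": "string"}, "button": {"type": "string"}}
relaxed = {"properties": properties, "additionalProperties": True}
strict = {"properties": properties, "additionalProperties": False}

# A developer adds "colour" on the client side without touching the schema.
event = {"page": "home", "button": "signup", "colour": "green"}

print(check_event(relaxed, event))
print(check_event(strict, event))
```

With the relaxed schema the event passes, but colour ends up in extra rather than kept. With the strict schema the event is rejected, which is what would surface as a bad row.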

Hi Yali,

Waking up a dead thread right after Halloween :wink:

We are experimenting with additionalProperties: true for our AB context, where we keep running experiments and the visitor’s allocation in Test and Control groups.

There are a few tests that we run all the time and those we explicitly define in the schema with an enumeration of the possible values, but the ability to launch a new experiment and start capturing in Snowplow without a schema change is enticing.
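
A schema of the kind described above might look roughly like this. It is only a sketch: the vendor, schema name, experiment names, and group values are hypothetical, and the dict is written in Python purely so it can be printed as JSON.

```python
import json

# Hypothetical A/B context schema: vendor, name, experiments and groups are
# made up for illustration.
ab_context_schema = {
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "self": {
        "vendor": "com.example",
        "name": "ab_context",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {
        # Long-running tests are declared explicitly, with their allowed groups.
        "checkout_flow": {"type": "string", "enum": ["test", "control"]},
        "homepage_banner": {"type": "string", "enum": ["test", "control"]},
    },
    # Ad-hoc experiments may appear as extra keys without a schema bump.
    "additionalProperties": True,
}

# A context carrying one declared experiment and one ad-hoc experiment.
context = {"checkout_flow": "test", "new_pricing_page": "control"}
undeclared = sorted(set(context) - set(ab_context_schema["properties"]))

print(json.dumps(ab_context_schema, indent=2))
print("Undeclared experiments:", undeclared)
```

Here new_pricing_page would pass validation, but per the trade-off described in the original post it would not be available downstream until it is added to the schema.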

I was thinking of extracting the values from the raw JSON, but it seems that after shredding, ue_json / ue_properties no longer makes it to Redshift (which makes perfect sense).

Are there any new thoughts or tooling in this regard? Honestly, it is not a huge ordeal to introduce the AB experiment in the schema first, but bumping the AB context schema every time we run an experiment sounds like unnecessary overhead.