I’ve looked at a couple of schemas and I’m wondering if there are any best practices. My example would be an email tool with a couple of events:
- email open
- soft bounce
- hard bounce
Basically these events could have some additional data:
There are at least three different ways to build these schemas:
- One schema with an event key that differentiates the type of event that happened
- One schema holding the data, A second schema that holds the event key (As seen in the Enhanced Ecommerce Schema)
- One schema for each event repeating the same data points (For example: https://github.com/snowplow/iglu-central/tree/master/schemas/com.mailgun)
Do you have any recommendations or thoughts around this? I am leaning towards option 1 for now. Maybe I will have some fields that are only filled in specific events, but it shouldn’t be a big problem as far as I understand for now.
I’ll go through your options one by one:
I’m not a fan of this idea. It means you end up having to
WHERE event_key=... all the time to narrow things down in the warehouse. I like my schemas to describe a single event type, however I’m well aware many webhook APIs don’t follow this principle and often return entirely different responses based on some key/type field within the response. Urgh.
This is an improvement, as it reduces the duplication of the “entity” data. You can build a single “Entity” schema which contains all the common properties (campaign, source, send_time, etc.) and then a single “Email Event” schema which has a type field. However, I think this could be improved by mixing 2 and 3 together.
Create a seperate schema for each event type (or key). So separate schemas for
soft_bounce, etc. These schemas might have 0 data points beyond the self-describing properties, but could also have a couple of properties that are specific to those event types too - this offers nice flexibility for the future. You then have the same “entity” schema I described above, which you attach to each event.
With 3. the queries become quite nice in the warehouse too. You end up looking at the specific column or table for the event type you want to analyse and then join to the entity/context information. This only becomes tricky if you want to analyse all the different event types as a single “thing” as you’d need to join them altogether but perhaps thats less likely - if that is your main way of analysing the data though, than maybe option 2 is better.
Ultimately, the way you’re going to use the data downstream is going to be the driver for what style of schema you apply. All 3 would work, but hopefully my explanations are enough to help you make a decision for your use case
thank you so much for the reply and the explanations and your support in this forum.
This makes a lot sense and I will choose the path you recommended. For the separate event type schemas (
soft_bounce …) you would not send any data other than attaching the basic schema? If I understand it correctly this would mean querying for
is not null to get the right combination?
Yeah I think thats how I’d go. It also leaves you open to add specific data properties to those schemas in the future should you require a property that is only for
soft_bounce for example.
How you query it depends a little on the warehouse you’re using but you’re on the right track. You should also find the
event_name is populated with the schema name in the atomic.events table too which might help.