Event validation

dcow · August 26, 2016, 12:30am

I’m working on revamping an existing analytics flow. We use Snowplow. The pain point is keeping events consistent between different client implementations. For example, a screen may be called RegistrationActivity on one client and LoginController on another, or an event may be button_edit_click or edit_button_click. So, naturally, one might wish to enumerate acceptable values for events in a schema.

From my initial understanding, it appears Snowplow supports custom schema-validated events. However, I’d like to add stricter validation to the built-in events, e.g. screen and track. Is this possible?

I guess the alternative is to track the world and build more complex queries to sift through all the data (the approach Snowplow advocates). It is more enticing the more I think about it…

alex · August 27, 2016, 4:35pm

Hi @dcow,

It’s not possible to add stricter validation to the built-ins. Our structured event is modeled after a Google Analytics event, and these are deliberately very “loosely typed”. The screen view event is similarly permissive. But if you look at the screen view event’s schema:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a screen view event",
  "self": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "screen_view",
    "format": "jsonschema",
    "version": "1-0-0"
  },

  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "id": {
      "type": "string"
    }
  },
  "minProperties": 1,
  "additionalProperties": false
}

com.snowplowanalytics.snowplow/screen_view/jsonschema/1-0-0

From the schema you can see that it would be possible to create your own com.dcow version of the screen view, which you could make much more strongly typed. For example, instead of allowing the name of the screen to be a free-form string, you could make it a JSON Schema enum and thus enforce that the screen name comes from a pre-agreed list of legal values.
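
To make the enum idea concrete, here is a minimal sketch in TypeScript. The schema mirrors the built-in one above, with the vendor changed to com.dcow and an enum added to name. The screen names are invented for illustration, and conformsToStrictScreenView is a hypothetical hand-rolled check that covers only the keywords used here. It stands in for the validation the pipeline would perform and is not part of any Snowplow library.

```typescript
// Hypothetical stricter schema under the com.dcow vendor.
// The enum values are made-up examples of a pre-agreed list of screen names.
const strictScreenViewSchema = {
  $schema: "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  description: "Schema for a screen view event with a fixed list of screen names",
  self: {
    vendor: "com.dcow",
    name: "screen_view",
    format: "jsonschema",
    version: "1-0-0",
  },
  type: "object",
  properties: {
    name: { type: "string", enum: ["registration", "login", "profile_edit"] },
    id: { type: "string" },
  },
  minProperties: 1,
  additionalProperties: false,
} as const;

// Hand-rolled check for the keywords used above only; not a general JSON Schema validator.
function conformsToStrictScreenView(data: Record<string, unknown>): boolean {
  const allowedKeys: readonly string[] = Object.keys(strictScreenViewSchema.properties);
  const allowedNames: readonly string[] = strictScreenViewSchema.properties.name.enum;
  const keys = Object.keys(data);

  // minProperties: 1 and additionalProperties: false
  if (keys.length < strictScreenViewSchema.minProperties) return false;
  if (!keys.every((key) => allowedKeys.indexOf(key) !== -1)) return false;

  // name, when present, must be one of the enumerated strings
  const screenName = data["name"];
  if (screenName !== undefined) {
    if (typeof screenName !== "string" || allowedNames.indexOf(screenName) === -1) return false;
  }

  // id, when present, must be a string
  const screenId = data["id"];
  return screenId === undefined || typeof screenId === "string";
}
```

With this schema, { name: "login" } passes, while { name: "LoginController" } is rejected because the value is not in the agreed list.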

dcow · August 27, 2016, 10:24pm

Thanks! Do the clients validate events before sending, or does that only happen in the ETL layer?

alex · August 27, 2016, 10:40pm

Validation only happens in the ETL layer. This ensures that all validation failures are captured within the Snowplow pipeline - if a client were to do validation before sending, then there would be nowhere for a validation failure to go…

This said, in a strongly typed environment like Android or Obj-C, there’s no reason why you couldn’t mandate that all self-describing events and contexts should be created via pre-defined classes/structs (with a helper method to convert them into JSON dicts). This would give you compile time guarantees around all of your entities. It’s not something we’ve tried - let us know how you get on if you give it a go!
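
A rough sketch of the pre-defined-class idea, written in TypeScript for brevity rather than Java or Obj-C. The ScreenName list, the ScreenView class and the com.dcow schema URI are all assumptions made for illustration. Only the general shape of a self-describing JSON (a schema URI plus a data object) comes from Snowplow itself.

```typescript
// Hypothetical list of agreed screen names; a typo such as "logni" fails to compile.
type ScreenName = "registration" | "login" | "profile_edit";

// Shape of a self-describing JSON: a schema URI plus the data it describes.
interface SelfDescribingJson<T> {
  schema: string;
  data: T;
}

interface ScreenViewData {
  name: ScreenName;
  id?: string;
}

// Pre-defined event class; client code would build events only through classes like this.
class ScreenView {
  // Assumed Iglu URI for a custom com.dcow schema.
  static readonly SCHEMA = "iglu:com.dcow/screen_view/jsonschema/1-0-0";

  constructor(
    private readonly screenName: ScreenName,
    private readonly screenId?: string,
  ) {}

  // Helper that converts the typed event into a plain JSON-ready object.
  toSelfDescribingJson(): SelfDescribingJson<ScreenViewData> {
    const data: ScreenViewData = { name: this.screenName };
    if (this.screenId !== undefined) {
      data.id = this.screenId;
    }
    return { schema: ScreenView.SCHEMA, data };
  }
}

const loginView = new ScreenView("login", "screen-42").toSelfDescribingJson();
// new ScreenView("LoginController") would be rejected by the compiler.
```

The resulting object can then be handed to whatever tracker call sends self-describing events. The pipeline still validates it against the schema, so the types are an extra guard, not a replacement.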

dcow · August 27, 2016, 11:32pm

It’s an idea I’ve been juggling around but I’m not sure it’s smart for exactly the reasons you mention. At most you could add client logging when invalid JSON appears. The real win would be compile time validation of event structures, as you suggest. But once you start going down that road you want to generate those data classes from your JSON Schema anyway. A tool that lets you feed self-describing JSON into existing JSON code generator utilities might be useful.

alex · August 28, 2016, 9:39am

Exactly - if you do runtime validation, you still have to somehow get the validation failures to a back-end for analysis, which would involve adding some other kind of “logging side-pipe”. It’s easier just to pass them un-validated to the Snowplow pipeline and get all the failure reporting in one place. Plus it means you have the option of recovering the failures, using Hadoop Event Recovery.

Yes - we have plans for an Iglu registry to handle that auto-generation itself - e.g. for Android/Java/Scala, your Iglu registry would also host a Maven-compatible repository containing POJOs/Scala case classes for all your entities. The ticket to follow here is Placeholder for Maven repo inside Iglu Server #88. This is useful both for enforcing correctness in your tracking instrumentation and for making analytics at the other end easier (e.g. writing AWS Lambda functions that operate on the data).

Thinking about this some more - there’s no reason why we couldn’t auto-generate TypeScript-compatible classes from an Iglu registry too. This would get us the same kind of correctness guarantees for the browser environment - ticket created: Placeholder for auto-generating TypeScript case classes from schemas #205.
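
As a rough illustration of what such generation could look like, the sketch below turns a small subset of a schema into the source text of a TypeScript interface. It is not the tool described in the ticket. The function names and the handled subset (string/number/boolean properties, string enums, an optional required list) are assumptions, and property names are assumed to be valid identifiers.

```typescript
// Minimal subset of JSON Schema handled by this sketch.
interface PropertySchema {
  type: "string" | "number" | "boolean";
  enum?: readonly string[];
}

interface ObjectSchema {
  self: { name: string };
  properties: { [key: string]: PropertySchema };
  required?: readonly string[];
}

// screen_view -> ScreenView
function toPascalCase(snake: string): string {
  return snake
    .split("_")
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
}

// A string enum becomes a union of string literals; otherwise the JSON type name is reused.
function propertyType(prop: PropertySchema): string {
  if (prop.enum !== undefined && prop.enum.length > 0) {
    return prop.enum.map((value) => JSON.stringify(value)).join(" | ");
  }
  return prop.type;
}

// Emits the source text of an interface; properties not listed as required are optional.
function generateInterface(schema: ObjectSchema): string {
  const required: readonly string[] = schema.required !== undefined ? schema.required : [];
  const lines: string[] = [];
  for (const key of Object.keys(schema.properties)) {
    const prop = schema.properties[key];
    if (prop === undefined) continue;
    const optional = required.indexOf(key) === -1 ? "?" : "";
    lines.push("  " + key + optional + ": " + propertyType(prop) + ";");
  }
  return ["interface " + toPascalCase(schema.self.name) + " {"].concat(lines, ["}"]).join("\n");
}

// Hypothetical stricter screen view schema, trimmed to the fields the generator reads.
const generatedSource = generateInterface({
  self: { name: "screen_view" },
  properties: {
    name: { type: "string", enum: ["registration", "login"] },
    id: { type: "string" },
  },
  required: ["name"],
});
```

Here generatedSource holds an interface named ScreenView whose name field is the union "registration" | "login" and whose id field is an optional string.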

dcow · August 28, 2016, 7:42pm

I may be interested in getting the ball rolling on a jsonschema → Swift library too.

alex · August 28, 2016, 7:55pm

That would be cool @dcow! It might be worth checking out this project too, it seems to be actively developed: https://github.com/cknadler/nidyx
