Defining Snowplow Iglu Schema For Generic Map/Object

Hi Team,

I am trying to figure out a way to define a Snowplow schema for a generic map or object type to allow setting arbitrary key-value pairs. Can you please share your thoughts on this?

Here is the Avro schema which I am trying to translate into an equivalent Snowplow schema;

please look at the event_prop property:

{
  "namespace": "com.qb.avro",
  "name": "QBKafkaRecord",
  "type": "record",
  "fields": [
    {
      "name": "event_id",
      "type": "string"
    },
    {
      "name": "event_timestamp",
      "type": "long",
      "default": 0
    },{
      "name": "event_prop",
      "type": {
        "type": "map",
        "values": [
          "null",
          "string",
          "int",
          "long",
          "double",
          "bytes",
          "float",
          "boolean",
          {
            "type": "array",
            "items": [
              "null",
              "string"
            ]
          }
        ]
      },
      "default": {}
    }
]
}

I’d do something like

{
  "type": "object",
  "properties": {
    "custom_props": {
      "type": "array",
      "properties": {
        "key": {
          "type": "string"
        },
        "value": {
          "type": "string"
        }
      }
    }
  }
}

with the caveat that you’ll need to stringify any values in the value column (depending on your warehouse though you might be able to relax this).

I am not sure if it will allow array types as well. But I was wondering if we can define an object type loosely without properties to infer that as a generic type.

As a semi-structured type, it will help if we can preserve the actual type at the raw layer.

Thank you for the response, @mike

I think this would depend on what your use case is.

In theory JSON schema will allow you to do this (as additionalProperties defaults to true anyway for an object) but ultimately depending on what warehouse you are loading into in order for the column to be created the type needs to be known ahead of time. Something like Snowflake will be fine (because properties can be sent through as part of a VARIANT) but that isn’t going to work the same way in Redshift and BigQuery / Databricks where the types are more structured.

If you define a field as simply "object": {}, with no fields specified, then you will get the behaviour you want.

In Snowflake this will be loaded as VARIANT type. In the other databases it’ll be cast to JSON string and you’ll need to parse it when you query it.

Note that the data in this object won’t get validated, we recommend specifying fields int he schema where possible.

1 Like

Okay, I want to confirm if this is a valid schema for Snowplow.
@Colm @mike

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Schema for ParaphraserVisited",
    "self": {
        "vendor": "com.copilot.write",
        "name": "WriterVisited",
        "format": "jsonschema",
        "version": "1-0-0"
    },
    "type": "object",
    "additionalProperties": false,
    "properties": {
        "event_name": {
            "description": "A string representing the paraphraser interaction event that was emitted.",
            "type": "string"
        },
        "event_id": {
            "description": "Unique identifier for the event",
            "type": "string"
        },
        "event_timestamp": {
            "description": "Time at which the event was emitted",
            "type": "string"
        },
        "event_props": {
            "description": "Unstructured JSON object containing event properties",
            "type": "object"
        },
        "user_props": {
            "description": "Unstructured JSON object containing user properties",
            "type": "object"
        }
    },
    "required": [
        "event_name",
        "event_id",
        "event_timestamp",
        "event_props",
        "user_props"
    ]
}

Superb @Colm That’s exactly what I am looking forward to.
Thank you so much, Let me try and test.

will this work?

"user_props": {
            "description": "Unstructured JSON object containing user properties",
            "type": "object"
        }

Yeah - that’s fine. This is effectively treated as user_props having additionalProperties set to true.

1 Like