Snowbridge Enriched JSON format - why is the major schema version used as a suffix?

First of all, I want to mention that I like Snowbridge!
We just rolled out our first realtime recommender, built with Snowbridge → Kafka. What I would like to know is why the major schema version is appended as a suffix to all entities / events, for example:

contexts_com_mycompany_product_1

In my opinion, this generates far more pain than added value, because breaking schema changes must be handled in multiple components of the data pipeline:

1. Snowbridge filtering / transform functions:

  • the out-of-the-box functions can only reference one major version; no wildcard matching is available
  • it is possible to build a solution via JS transformations (example below), but it is cumbersome and incurs an unnecessary performance cost
function main(input) {
    var spData = input.Data;

    // Build the candidate entity keys for major versions 1 - 10
    var entities = new Set();
    for (var majorVersion = 1; majorVersion <= 10; majorVersion++) {
        entities.add(`contexts_com_mycompany_product_${majorVersion}`);
    }

    // Check whether any product entity with major version 1-10 exists in spData
    var hasProductEntity = Array.from(entities).some(entity => entity in spData);

    if (
        spData["event"] == "page_view" &&
        ("user_id" in spData) &&
        typeof spData["page_urlpath"] == "string" &&
        spData["page_urlpath"].includes("/product/") &&
        hasProductEntity
    ) {
        return {
            FilterOut: false,
            Data: spData
        };
    } else {
        return {
            FilterOut: true
        };
    }
}
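
For completeness, a slightly tighter variant of the same workaround (just a sketch, matching the same entity name with a regex instead of enumerating major versions) would look like this, but it is still a workaround for something that feels like it should be a one-line filter:

// Sketch only: same filter as above, but matching any major version suffix via regex
function main(input) {
    var spData = input.Data;

    // Matches contexts_com_mycompany_product_1, _2, _3, ...
    var productEntityPattern = /^contexts_com_mycompany_product_\d+$/;
    var hasProductEntity = Object.keys(spData).some(function (key) {
        return productEntityPattern.test(key);
    });

    if (
        spData["event"] == "page_view" &&
        ("user_id" in spData) &&
        typeof spData["page_urlpath"] == "string" &&
        spData["page_urlpath"].includes("/product/") &&
        hasProductEntity
    ) {
        return { FilterOut: false, Data: spData };
    }
    return { FilterOut: true };
}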

2. Downstream consumers (Kafka, Pub/Sub, GTM server-side, etc.):
All consumers need to find a way to handle breaking changes, which is complex and/or costs performance, as sketched below.
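
For illustration, here is roughly what every downstream consumer ends up writing today (a sketch only, assuming the same entity naming as above):

// Sketch: a downstream consumer has to probe every possible major-version key
// just to get at "the product entity", whatever its current major version is
function getProductEntities(event) {
    var entities = [];
    Object.keys(event).forEach(function (key) {
        // Naming convention: contexts_com_mycompany_product_<major>
        if (/^contexts_com_mycompany_product_\d+$/.test(key)) {
            entities = entities.concat(event[key]);
        }
    });
    return entities;
}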

Conclusion

  • Including the major version as a suffix would make sense if a breaking change should stop event forwarding or downstream consumption, but in reality that is extremely unlikely!
  • I think quite a lot of breaking changes happen because of additional required fields. If downstream consumers only need, say, properties a, b and c, and not the additional required fields, they don’t care about the breaking change.

Suggestion:
Including the full schema version as a property, similar to what some warehouse loaders do, would make working with Snowbridge way easier. Example for version 1-0-5:

	"contexts_com_mycompany_product": [
		{
		"_schema": "1-0-5",
		"id": "1234567",
		"productTypeId": 614,
		...
		}
	],
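
With that format, a consumer that only needs a few stable properties reads a single key, and version handling becomes opt-in. A sketch, reusing the property names from the example above:

// Sketch: with a single key plus a "_schema" property, consumers opt in to version handling
function getProducts(event) {
    var products = event["contexts_com_mycompany_product"] || [];
    return products.map(function (product) {
        // The major version is still available if a consumer ever needs to branch on it
        var major = parseInt(product["_schema"].split("-")[0], 10);
        return { major: major, id: product["id"], productTypeId: product["productTypeId"] };
    });
}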

I am looking forward to your thoughts, suggestions and solutions.
Kind regards.
David

I’ll leave this to @Ada and @Colm, who can probably confirm, but I suspect it might be due to the nature of the analytics SDKs only ever emitting model versions rather than the full SchemaVer. This is something that’s been rectified in more modern versions of the loader (by explicitly adding the schema version), but I think all of the analytics SDKs (Snowbridge uses the Go one) only emit the model version.
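
To illustrate roughly what I mean, using David's entity as an example:

// Input entity on the enriched event, where the full SchemaVer is still known:
var inputContext = {
    schema: "iglu:com.mycompany/product/jsonschema/1-0-5",
    data: { id: "1234567", productTypeId: 614 }
};

// Output of the analytics SDK's transform-to-JSON step, keyed by model (major) version only:
var transformedOutput = {
    contexts_com_mycompany_product_1: [{ id: "1234567", productTypeId: 614 }]
};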

Hey @davidher_mann!

I love to hear you’re working with and engaging with Snowbridge!

As Mike points out, what you’re describing is the output of the Go analytics SDK, rather than a Snowbridge-specific thing. The Snowbridge transformation simply runs the analytics SDK on the data.

The Scala SDK and loaders do include the full schema version metadata in the object (I would be happy to review a PR adding this to the Go SDK), but I believe they still separate major versions by appending the major version to the key - which seems to me a behaviour we’d want to preserve.

The SDKs do this simply because the contract for a major version bump is that it signifies a breaking change - so, by definition, at least in theory, the data belongs in a separate place.

I understand the use case you describe though, and you’re right, it is a bit clunky to handle that. I think perhaps the crux of this issue is that for some cases - like the one you’ve described here - it is more convenient to treat a major version update as non-breaking.

I would worry that treating this as the default use case has much more difficult-to-handle implications. A major version can mean changing the type of an object, or having completely different data structures (there's a quick illustration after the list below). So ultimately the choice is between:

  • The default behaviour always keeps conflicting definitions separate, but sometimes separates those that could be combined
    vs.
  • The default behaviour can combine conflicting data structures
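
To make that concrete with a made-up example (borrowing productTypeId from your snippet): suppose it were a number in 1-* and became a string in 2-*.

// Hypothetical illustration: a major version bump that changes a field's type
var v1Product = { id: "1234567", productTypeId: 614 };     // 1-0-5: number
var v2Product = { id: "1234567", productTypeId: "614-A" }; // 2-0-0: string

// Keyed separately (current behaviour), a consumer of contexts_com_mycompany_product_1
// can rely on productTypeId always being a number. Combined under one key, the same
// array could contain both shapes, and every consumer has to defend against mixed types.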

I think my own view is that it’s easier to build reliable consumers under the first contract than the second - but for some specific use cases like this one, there is a convenience trade-off.

^^ This applies to the default behaviour of the ‘transform to JSON’ functionality. We are in the design phase for a separate feature that I believe would give you a better option to handle this use case via configuration - so watch this space. :slight_smile:


I think it might help to understand the options better if I address some of your points specifically:

the out-of-the-box functions can only reference one major version; no wildcard matching is available

This is true - there’s a limit to how complex or nuanced we can reasonably make the default filter functions - their original inception was just to provide a way to filter by app_id - but we’ve tried to make them such that you at least have a way to satisfy more cases than that.

In this case, it might be possible to make a change to suit the problem you’re describing, I’m chatting with our product team about it.

it is possible to build a solution via JS transformations (example below), but it is cumbersome and incurs an unnecessary performance cost

This kind of thing is the exact reason we introduced JS transformations - there’s a limit to the nuance of built-in filters, but we can support arbitrary logic via a transformation.

On performance - running the JS engine is significantly slower than the built-in Go transformations, but this difference is typically in the tens of milliseconds. When we were introducing these features ~2 years ago (or maybe more? man I feel old now :smiley: ), we did consider whether we would need to reduce the latency further when using this feature - but we haven’t yet encountered a use case that needs it.

For a bit of context - variance in network speeds generally has as big an impact on the overall system as running JS instead of Go for a filter.

If you do have a super-low latency requirement and are concerned about performance here, please reach out & let us know! We’ll help find solutions.


All of this is just one guy’s opinion btw! We’ll keep thinking about these suggestions - tagging @stanch for a product FYI. :slight_smile:
