Currently I have the collector set up in AWS, and everything is working fine. Now I was wondering: is it possible to validate a custom schema during the collection process? I have the following defined as a browser tracking event:
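(For context, a call of the kind described probably looks something like the sketch below. The bucket name, vendor, field values, and the v3 JavaScript tracker syntax are all assumptions, not the original code; the notable detail is that the schema is referenced by a direct HTTPS URL.)

```javascript
// Hypothetical sketch of a custom entity attached to a page view.
// Bucket name, vendor, and field values are placeholders.
const pageViewContext = {
  // Schema referenced by a direct HTTPS URL into the S3 bucket:
  schema: "https://my-schema-bucket.s3.amazonaws.com/schemas/com.acme/page_view/jsonschema/1-0-0",
  data: {
    campaign_id: "summer-sale",
    customer_id: "c-123",
    id: "evt-456",
    page: "/products"
  }
};

// Attached when tracking the page view (v3 JS tracker syntax assumed):
// snowplow('trackPageView', { context: [pageViewContext] });
```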
This sends the event to the collector fine (it ends up in the good-data-bucket), but it produces the following console error in the browser:
The URL to the JSON in the bucket is valid, and the data is also properly defined in the JSON:
"description": "Pageview event data schema",
"required": ["campaign_id", "customer_id", "id", "page"],
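(For reference, the full self-describing schema around those fragments would look roughly like this; the vendor, name, and property types below are placeholders:)

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Pageview event data schema",
  "self": {
    "vendor": "com.acme",
    "name": "page_view",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "campaign_id": { "type": "string" },
    "customer_id": { "type": "string" },
    "id": { "type": "string" },
    "page": { "type": "string" }
  },
  "required": ["campaign_id", "customer_id", "id", "page"],
  "additionalProperties": false
}
```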
I was wondering if it’s possible to validate it using the S3 bucket or do I need to look for another solution?
All validation happens in the enrichment process (and downstream) rather than in the collector itself. In this instance you are sending an HTTPS URL as the schema reference, whereas the enricher expects an Iglu URI instead. That Iglu URI is resolved during enrichment using the Iglu resolver configuration, which will point to your static S3 repository.
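As a rough sketch (the bucket URL and vendor prefix below are placeholders), an Iglu resolver configuration pointing the enricher at a static S3 repository looks something like:

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "My static S3 repository",
        "priority": 0,
        "vendorPrefixes": ["com.acme"],
        "connection": {
          "http": {
            "uri": "https://my-schema-bucket.s3.amazonaws.com"
          }
        }
      }
    ]
  }
}
```

The event would then reference the schema as an Iglu URI, e.g. `iglu:com.acme/page_view/jsonschema/1-0-0`, rather than the full S3 URL.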
Thanks. So then could you explain the purpose of the schema attribute in the trackPageView call? I already set up the enrichment process with schema validation successfully, and I know that all schema references need to be set up there as well.
Sure - so in this case the reference to the schema is what allows the enricher to validate the data against that schema. Without the reference, Snowplow wouldn't know how the data payload should be validated; by specifying a URI that acts as a pointer (rather than sending the full schema with each event), it knows that this instance of the data should be validated against that specific schema. The URI isn't used until enrichment time.
From a flow point of view this looks like:

event (with data and schema reference) => collector (no validation) => enricher (validation via the Iglu client) => destination
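The resolution step can be sketched as a simple mapping from an Iglu URI to a path inside whichever repository wins (the bucket URL and vendor are placeholders, and a real Iglu client also handles repository priorities and caching):

```javascript
// Sketch of how an Iglu client turns a URI into a repository lookup path.
// The "iglu:" scheme marks a reference to be resolved, not fetched directly.
function igluUriToPath(igluUri, repoBase) {
  const match = igluUri.match(/^iglu:([^/]+)\/([^/]+)\/([^/]+)\/(.+)$/);
  if (!match) throw new Error(`Not a valid Iglu URI: ${igluUri}`);
  const [, vendor, name, format, version] = match;
  // Static repositories (e.g. an S3 bucket) lay schemas out under /schemas/
  return `${repoBase}/schemas/${vendor}/${name}/${format}/${version}`;
}

// Example (hypothetical vendor and bucket):
console.log(igluUriToPath(
  "iglu:com.acme/page_view/jsonschema/1-0-0",
  "https://my-schema-bucket.s3.amazonaws.com"
));
// → https://my-schema-bucket.s3.amazonaws.com/schemas/com.acme/page_view/jsonschema/1-0-0
```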
The enricher uses an Iglu client to resolve the schema reference against a repository or location - this is why an Iglu URI is used rather than a direct URL to the schema. This lets you set different repository priorities, publish schemas to different repositories (public and private), and control things like caching behaviour and permissions.