Currently I have the collector set up in AWS, and everything is working fine. Now I was wondering: is it possible to validate a custom schema during the collection process? I have the following defined as a browser tracking event:
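(For context, a call of the kind described probably looks something like the sketch below. The bucket name, vendor, field values, and the v3 JavaScript tracker syntax are all assumptions, not the original code; the notable detail is that the schema is referenced by a direct HTTPS URL.)

```javascript
// Hypothetical sketch of a custom entity attached to a page view.
// Bucket name, vendor, and field values are placeholders.
const pageViewContext = {
  // Schema referenced by a direct HTTPS URL into the S3 bucket:
  schema: "https://my-schema-bucket.s3.amazonaws.com/schemas/com.acme/page_view/jsonschema/1-0-0",
  data: {
    campaign_id: "summer-sale",
    customer_id: "c-123",
    id: "evt-456",
    page: "/products"
  }
};

// Attached when tracking the page view (v3 JS tracker syntax assumed):
// snowplow('trackPageView', { context: [pageViewContext] });
```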
This sends the event to the collector fine (it ends up in the good-data-bucket), but it produces the following console error in the browser:
The URL to the JSON in the bucket is valid, and the data is also properly defined in the JSON:
"description": "Pageview event data schema",
"required": ["campaign_id", "customer_id", "id", "page"],
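(For reference, the full self-describing schema around those fragments would look roughly like this; the vendor, name, and property types below are placeholders:)

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Pageview event data schema",
  "self": {
    "vendor": "com.acme",
    "name": "page_view",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "campaign_id": { "type": "string" },
    "customer_id": { "type": "string" },
    "id": { "type": "string" },
    "page": { "type": "string" }
  },
  "required": ["campaign_id", "customer_id", "id", "page"],
  "additionalProperties": false
}
```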
I was wondering if it’s possible to validate it using the S3 bucket or do I need to look for another solution?
All validation happens in the enrichment process (and downstream) rather than in the collector itself. In this instance you are sending an HTTPS URL as the schema reference, whereas the enricher expects an Iglu URI instead. That Iglu URI is resolved during enrichment using the Iglu resolver configuration, which will point to your static S3 repository.
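As a rough sketch (the bucket URL and vendor prefix below are placeholders), an Iglu resolver configuration pointing the enricher at a static S3 repository looks something like:

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "My static S3 repository",
        "priority": 0,
        "vendorPrefixes": ["com.acme"],
        "connection": {
          "http": {
            "uri": "https://my-schema-bucket.s3.amazonaws.com"
          }
        }
      }
    ]
  }
}
```

The event would then reference the schema as an Iglu URI, e.g. `iglu:com.acme/page_view/jsonschema/1-0-0`, rather than the full S3 URL.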
Thanks. So then could you explain the purpose of the schema attribute in the trackPageView call? I already set up the enrichment process with schema validation successfully, and I know that all schema references need to be set up there as well.
Sure - so in this case the reference to the schema is what allows the enricher to validate the data against that schema. Without the reference, Snowplow wouldn't know how the data payload should be validated; by specifying a URI that acts as a pointer (rather than sending the full schema with each event), it knows that this instance of the data should be validated against that specific schema. The URI isn't used until enrichment time.
From a flow point of view this looks like:

event (with data and schema reference) => collector (no validation) => enricher (validation via the Iglu client) => destination
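The resolution step can be sketched as a simple mapping from an Iglu URI to a path inside whichever repository wins (the bucket URL and vendor are placeholders, and a real Iglu client also handles repository priorities and caching):

```javascript
// Sketch of how an Iglu client turns a URI into a repository lookup path.
// The "iglu:" scheme marks a reference to be resolved, not fetched directly.
function igluUriToPath(igluUri, repoBase) {
  const match = igluUri.match(/^iglu:([^/]+)\/([^/]+)\/([^/]+)\/(.+)$/);
  if (!match) throw new Error(`Not a valid Iglu URI: ${igluUri}`);
  const [, vendor, name, format, version] = match;
  // Static repositories (e.g. an S3 bucket) lay schemas out under /schemas/
  return `${repoBase}/schemas/${vendor}/${name}/${format}/${version}`;
}

// Example (hypothetical vendor and bucket):
console.log(igluUriToPath(
  "iglu:com.acme/page_view/jsonschema/1-0-0",
  "https://my-schema-bucket.s3.amazonaws.com"
));
// → https://my-schema-bucket.s3.amazonaws.com/schemas/com.acme/page_view/jsonschema/1-0-0
```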
The enricher uses an Iglu client to resolve the schema reference against a repository or location - this is why an Iglu URI is used rather than a direct URL to the schema. This lets you set different repository priorities, publish schemas to different repositories (public and private), and control things like caching behaviour and permissions.