Unable to get schema validation in the Enrichment process working

I am trying to set it up using a custom Iglu repository, but nothing has worked and I’ve been at it for days.

I have the following URL set up for the custom schema:
https://bucket-name-here.s3.eu-west-1.amazonaws.com/schemas/vendor-name-here/pageview_event/jsonschema/1-0-0

This URL has the following file:

    {
      "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
      "description": "Pageview event data schema",
      "self": {
        "vendor": "vendor-name-here",
        "name": "pageview_event",
        "format": "jsonschema",
        "version": "1-0-0"
      },
      "type": "object",
      "properties": {
        "campaign_id": {
          "type": "number"
        },
        "customer_id": {
          "type": "number"
        },
        "id": {
          "type": "number"
        },
        "page": {
          "type": "string"
        }
      },
      "required": [
        "campaign_id",
        "customer_id",
        "id",
        "page"
      ],
      "additionalProperties": false
    }

Then I have the following Iglu config file:

    {
      "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
      "data": {
        "cacheSize": 500,
        "repositories": [
          {
            "name": "Iglu Central",
            "priority": 0,
            "vendorPrefixes": [ "com.snowplowanalytics" ],
            "connection": {
              "http": {
                "uri": "http://iglucentral.com"
              }
            }
          },
          {
            "name": "Custom Iglu Server",
            "priority": 1,
            "vendorPrefixes": [ "vendor-name-here" ],
            "connection": {
              "http": {
                "uri": "https://bucket-name-here.s3-website-eu-west-1.amazonaws.com"
              }
            }
          }
        ]
      }
    }

After pushing to the stream, it always ends up in the bad stream with the following error:

        {
           "schemaKey":"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0",
           "error":{
              "error":"ValidationError",
              "dataReports":[
                 {
                    "message":"$[0].schema: does not match the regex pattern ^iglu:[a-zA-Z0-9-_.]+/[a-zA-Z0-9-_]+/[a-zA-Z0-9-_]+/[0-9]+-[0-9]+-[0-9]+$",
                    "path":"$[0].schema",
                    "keyword":"pattern",
                    "targets":[
                       "^iglu:[a-zA-Z0-9-_.]+/[a-zA-Z0-9-_]+/[a-zA-Z0-9-_]+/[0-9]+-[0-9]+-[0-9]+$"
                    ]
                 }
              ]
           }
        }

I can’t wrap my head around this, can you guys help?

Can you share your tracking code too? The error suggests to me that there is an issue with a schema property on one of your contexts. Perhaps missing iglu: at the start or something?
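To make that concrete, here is a minimal sketch that tests a schema string against the pattern quoted in the error message. The helper name `isIgluUri` is hypothetical, and the hyphens inside the character classes are escaped here purely for readability; the example URL and vendor are the placeholders from the original post.

```javascript
// Pattern copied from the ValidationError message above, with the hyphens
// inside the character classes escaped so they are clearly literal characters.
const IGLU_URI_PATTERN =
  /^iglu:[a-zA-Z0-9\-_.]+\/[a-zA-Z0-9\-_]+\/[a-zA-Z0-9\-_]+\/[0-9]+-[0-9]+-[0-9]+$/;

// Hypothetical helper: true when `schema` has the
// iglu:vendor/name/format/version shape that the contexts schema requires.
function isIgluUri(schema) {
  return typeof schema === 'string' && IGLU_URI_PATTERN.test(schema);
}

// A plain HTTPS URL to the schema file does not match the pattern...
console.log(isIgluUri(
  'https://bucket-name-here.s3.eu-west-1.amazonaws.com/schemas/vendor-name-here/pageview_event/jsonschema/1-0-0'
)); // false

// ...while an Iglu URI for the same schema does.
console.log(isIgluUri('iglu:vendor-name-here/pageview_event/jsonschema/1-0-0')); // true
```

If the `schema` value on a context fails this check, Enrich would be expected to reject the event with the same `pattern` error shown above.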

Currently I have the following tracking code:

            trackPageView({
              context: [{
                schema: 'https://s3-bucket-name.s3.eu-west-1.amazonaws.com/schemas/vendor.package/pageview_event/jsonschema/1-0-0',
                data: snowplowTrack
              }]
            });

I also tried to use the following with iglu: included at the start:

            trackPageView({
              context: [{
                schema: 'iglu:vendor.package/pageview_event/jsonschema/1-0-0',
                data: snowplowTrack
              }]
            });

But then I received another kind of error:

    "error":{
      "error":"ResolutionError",
      "lookupHistory":[
        {
          "repository":"Custom Iglu Server",
          "errors":[
            {
              "error":"NotFound"
            }
          ],
          "attempts":1,
          "lastAttempt":"2021-11-23T08:38:54.624Z"
        },
        {
          "repository":"Iglu Central",
          "errors":[
            {
              "error":"NotFound"
            }
          ],
          "attempts":1,
          "lastAttempt":"2021-11-23T08:38:54.717Z"
        },
        {
          "repository":"Iglu Client Embedded",
          "errors":[
            {
              "error":"NotFound"
            }
          ],
          "attempts":1,
          "lastAttempt":"2021-11-23T08:38:54.632Z"
        }
      ]
    }

So the second example is correct:

            trackPageView({
              context: [{
                schema: 'iglu:vendor.package/pageview_event/jsonschema/1-0-0',
                data: snowplowTrack
              }]
            });

(assuming snowplowTrack is an object with your properties in it).

The way it works is that your iglu: schemas are looked up through the Iglu resolvers (repositories) which you specify in the config file.

The fact you’re getting a NotFound suggests something is going wrong when Enrich tries to find your schema in your S3 bucket.

I’m not entirely sure why it’s not finding the schema, though, given what you’ve described. If sending a GET request to http://s3-bucket-name.s3.eu-west-1.amazonaws.com/schemas/vendor.package/pageview_event/jsonschema/1-0-0 correctly returns the JSON Schema then I’d expect it to work fine.
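As a rough illustration of that lookup, the sketch below builds the URL a static repository would need to serve for a given Iglu URI. It assumes the resolver simply appends /schemas/ plus the vendor/name/format/version path to the repository's uri, which matches the URL layout used in this thread; the helper name `schemaUrl` is made up for the example.

```javascript
// Hypothetical helper: builds the URL a static Iglu repository is expected to
// serve a schema from, i.e. <repository uri>/schemas/<vendor>/<name>/<format>/<version>.
function schemaUrl(repositoryUri, igluUri) {
  if (!igluUri.startsWith('iglu:')) {
    throw new Error(`Not an Iglu URI: ${igluUri}`);
  }
  const path = igluUri.slice('iglu:'.length);     // vendor/name/format/version
  const base = repositoryUri.replace(/\/+$/, ''); // drop any trailing slashes
  return `${base}/schemas/${path}`;
}

console.log(schemaUrl(
  'http://s3-bucket-name.s3.eu-west-1.amazonaws.com',
  'iglu:vendor.package/pageview_event/jsonschema/1-0-0'
));
// http://s3-bucket-name.s3.eu-west-1.amazonaws.com/schemas/vendor.package/pageview_event/jsonschema/1-0-0
```

Note that the base passed in should be the uri from the resolver config, since that is the host Enrich queries. If it differs from the host you test by hand, a successful manual GET does not prove the resolver can reach the schema.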

Thanks, and yes, snowplowTrack is an object with all the properties listed in our custom schema :slight_smile:

I tried it once again and I’m still getting the errors “Custom Iglu Server NotFound”, “Iglu Central NotFound” and “Iglu Client Embedded NotFound”.

Maybe I’m an idiot, but do we need to have a paid subscription to the Snowplow services or?

Nope, no reason to pay for anything. It’s Open Source and Iglu Central is a service we offer to everyone for free.

If you have curl installed on your machine, can you try running:

curl --request GET \
  --url http://s3-bucket-name.s3.eu-west-1.amazonaws.com/schemas/vendor.package/pageview_event/jsonschema/1-0-0

Does that return the content of your schema?

It does actually … :sweat_smile: I’m totally clueless at this point, no idea what’s going on :upside_down_face: Anyway, I’ll keep trying and will post my resolution if I manage to find the problem.

Hopefully someone else comes along and spots what might be the issue, it’s evading me too.

One other option is to try and set up a full Iglu Server, rather than a static repo.

You can do that from the Open Source Quick Start Terraform modules: Quick Start Installation Guide on AWS - Snowplow Docs

Specific part on Iglu Servers: Quick Start Installation Guide on AWS - Snowplow Docs

Hey @lfib,

This might be a cache issue. Schema caching is described here if you would like more details.

Could you restart Enrich and try to send the event again, please?


I finally got it working. In my iglu.json, where I define my custom schema repository, I needed to use http instead of https :man_facepalming:

So now it looks like this:

    {
      "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
      "data": {
        "cacheSize": 500,
        "repositories": [
          {
            "name": "Iglu Central",
            "priority": 0,
            "vendorPrefixes": [ "com.snowplowanalytics" ],
            "connection": {
              "http": {
                "uri": "http://iglucentral.com"
              }
            }
          },
          {
            "name": "Custom Iglu Server",
            "priority": 1,
            "vendorPrefixes": [ "com.vendor" ],
            "connection": {
              "http": {
                "uri": "http://my-bucket.s3-website-eu-west-1.amazonaws.com"
              }
            }
          }
        ]
      }
    }
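A likely explanation, which is an assumption rather than something confirmed in this thread, is that Amazon S3 static website endpoints (the s3-website hosts) serve plain HTTP only, so an https:// uri on such a host cannot be fetched. A minimal sketch of a check for that situation follows; the function name `findHttpsWebsiteEndpoints` is hypothetical and the sample config reuses the placeholder values from the first post.

```javascript
// Hypothetical lint: returns the names of repositories whose uri uses https
// on an S3 static website host (assumed here to be reachable over HTTP only).
function findHttpsWebsiteEndpoints(resolverConfig) {
  return resolverConfig.data.repositories
    .filter((repo) => {
      const uri = new URL(repo.connection.http.uri);
      return uri.protocol === 'https:' && uri.hostname.includes('.s3-website');
    })
    .map((repo) => repo.name);
}

// Resolver config from the first post, trimmed to the fields the check reads.
const resolverConfig = {
  data: {
    repositories: [
      {
        name: 'Iglu Central',
        connection: { http: { uri: 'http://iglucentral.com' } }
      },
      {
        name: 'Custom Iglu Server',
        connection: {
          http: { uri: 'https://bucket-name-here.s3-website-eu-west-1.amazonaws.com' }
        }
      }
    ]
  }
};

console.log(findHttpsWebsiteEndpoints(resolverConfig)); // [ 'Custom Iglu Server' ]
```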

Thanks everyone for the support, it all came together in the end!

The processed output looks a bit weird, though. Is this how it’s supposed to be? Can’t it be saved as JSON or something?

This is copied directly from the file:

campaign-view web 2021-11-24 13:49:49.359 2021-11-24 13:28:09.688 2021-11-24 13:28:09.544 page_view 5435d5f1-6ef4-4456-ab32-9e92584f8cc3 pageview_tracker js-3.1.5 snowplow-stream-collector-sqs-2.4.1-sqs streamCommon-2.0.3-common-2.0.3 152.115.82.162 e611feba-86e2-4ab9-ab45-943028c32017 19 87a491cb-f970-4380-af4f-78d5d3b3844c http://url-xxxxxxx active | xxx xxx http://url-xxx http localhost 8086 /url-xxx http localhost 8086 /url-xxx {"schema":"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0","data":[{"schema":"iglu:com.vendor/pageview_event/jsonschema/1-0-0","data":{"campaign_id":339,"customer_id":2,"id":1223,"page":"/url-xxx"}},{"schema":"iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0","data":{"id":"6aaa869f-5fa7-446e-ada6-fcd54481a988"}}]} Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 en-GB 1 24 2560 574 2560 1440 UTF-8 2545 574 2021-11-24 13:28:09.546 0e6f9e02-5acb-4fd5-9e8a-353554ce84dc 2021-11-24 13:28:09.686 com.snowplowanalytics.snowplow page_view jsonschema 1-0-0

It has some JSON in it, but it would be nice to have the whole thing wrapped in JSON or some object, not just plain text.

Enriched events are written in TSV (tab-separated values) format. You can get more info about the enriched event TSV format here.

If you want to convert enriched events from TSV to JSON, you can use one of the Analytics SDKs listed here. The Analytics SDKs have methods for parsing the enriched TSV format into JSON. You can find more information about how to do it in the respective SDK documentation.

You can also use one of the loaders to load enriched events into a supported storage target. Currently, we support Redshift, BigQuery, and Snowflake as warehouse storage targets. Additionally, we have loaders for Postgres and Elasticsearch. You can get more information about the loaders here.
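The Analytics SDKs are the route to use in practice. Purely to illustrate what the TSV structure means, here is a hand-rolled sketch that splits one enriched line on tabs and labels its first twelve columns. The column names are taken from the Snowplow canonical event model and should be checked against the docs for your Enrich version; `parseLeadingFields` is a hypothetical helper, not an SDK function. The sample line is rebuilt with real tabs from values in the output pasted above, since the tabs appear as spaces there.

```javascript
// Names of the first twelve enriched-event columns (assumed from the canonical
// event model; the full enriched event has many more columns than these).
const LEADING_FIELDS = [
  'app_id', 'platform', 'etl_tstamp', 'collector_tstamp', 'dvce_created_tstamp',
  'event', 'event_id', 'txn_id', 'name_tracker', 'v_tracker', 'v_collector', 'v_etl'
];

// Hypothetical helper: splits one TSV line on tabs and labels its leading
// columns. Empty or missing columns become null.
function parseLeadingFields(tsvLine) {
  const values = tsvLine.split('\t');
  const record = {};
  LEADING_FIELDS.forEach((name, index) => {
    const value = values[index];
    record[name] = value === undefined || value === '' ? null : value;
  });
  return record;
}

// Sample line built from the first columns of the event pasted above.
const sampleLine = [
  'campaign-view', 'web', '2021-11-24 13:49:49.359', '2021-11-24 13:28:09.688',
  '2021-11-24 13:28:09.544', 'page_view', '5435d5f1-6ef4-4456-ab32-9e92584f8cc3',
  '', 'pageview_tracker', 'js-3.1.5', 'snowplow-stream-collector-sqs-2.4.1-sqs',
  'streamCommon-2.0.3-common-2.0.3'
].join('\t');

console.log(JSON.stringify(parseLeadingFields(sampleLine), null, 2));
```

A real parser also has to handle the remaining columns, type conversion, and the embedded JSON fields such as the contexts, which is the work the SDKs do for you.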
