What is the expected behavior if a custom context's schema doesn't exist?

When we test our custom contexts against the public-facing schema registry, iglucentral.com, they are still labelled as GOOD by our enricher. Is this expected?

Hi abrhim,

The expected behaviour if your custom context's schema doesn’t exist is for all events sent with that context attached to end up in bad rows, with an error message along the lines of ‘could not find schema with key…’.

I’m not entirely sure what you mean by testing your contexts against Iglu Central though - perhaps you could outline the steps you took?

Best,

Hey Colm. Thanks for the reply.

We have an internal schema registry that we have provisioned. However, we are having some trouble with it, so we pointed our enricher’s resolver.js to iglucentral.com, like so:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
} 

and custom structured events that have custom contexts attached are labeled as GOOD and appear in our GOOD stream.

Ah ok I follow you now, thanks for explaining.

Assuming your enricher previously had an iglu resolver which pointed to your own schema registry, the most likely case is that the schemas are still in the cache.

If that’s the case, rebooting the enricher should produce expected behaviour here.
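
For reference, once your internal registry is reachable again, a resolver that points at both it and Iglu Central would look something along these lines - the internal registry’s name, URI and vendor prefix below are placeholders rather than values we can fill in for you:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Internal registry",
        "priority": 1,
        "vendorPrefixes": [ "com.ourCompany" ],
        "connection": {
          "http": {
            "uri": "http://your-registry.example.com/api"
          }
        }
      }
    ]
  }
}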

Yes, we rebooted it. After changing the resolver.js I ran this command:

sudo supervisorctl stop enrich; sudo supervisorctl start enrich;

Our internal registry doesn’t work at all for some reason, but that’s another topic, so it wouldn’t make sense that it would be caching those results. Our internal registry marks everything as bad.

Could you show an example of the custom data you send that is being marked as “good” (but shouldn’t be)? Does it happen to reference an event/contexts JSON schema which is also present in Iglu Central?

Sure. Here is the JS code that fires the event and the custom context that is attached.

const productContext = {
    "schema": "iglu:com.ourCompany/product/jsonschema/1-0-0",
    "data": {
        "productId": 702,
        "name": "Argus All-Weather Tank",
        "sku": "MT07",
        "description": "<p>The Argus All-Weather Tank is sure to become your favorite base layer or go-to cover for hot outdoor workouts. With its subtle reflective safely trim, you can even wear it jogging on urban evenings.</p>\n<p>&bull; Dark gray polyester spandex tank.<br />&bull; Reflective details for nighttime visibility. <br />&bull; Stash pocket.<br />&bull; Anti-chafe flatlock seams.</p>",
        "shortDescription": null,
        "specialFromDate": null,
        "specialToDate": null,
        "attributeSetId": 9,
        "metaTitle": null,
        "metaKeywords": null,
        "metaDescription": null,
        "newFromDate": null,
        "newToDate": null,
        "createdAt": "2019-05-10 01:50:01",
        "updatedAt": "2019-05-10 01:50:01",
        "manufacturer": false,
        "countryOfManufacture": " ",
        "categories": [
            "Tanks",
            "Eco Friendly",
            "Default Category"
        ],
        "productType": "configurable",
        "specialPrice": 0,
        "tierPricing": 0,
        "price": 0,
        "basePrice": 22,
        "currencyCode": "USD",
        "canonicalUrl": "https://ourWebsite.test/argus-all-weather-tank.html",
        "mainImageUrl": "https://ourWebsite.test/media/catalog/product/m/t/mt07-gray_main_1.jpg"
    }
}

const contexts = {
       schema: baseSchemaUrl,
       data: [productContext]
};


snowplow_events(
              "trackStructEvent",
              "product",
              "edit-quantity",
              item.product_sku,
              item.product_name,
              null,
              contexts
            );
          });

Hey @abrhim,

To me it feels as if this is most likely something simple we’re missing in the process. It’s all quite involved for open-source users, and if you’re not used to the workflow there are a lot of things that can easily be missed. So let’s see if we can dig that up by looking at what I think are the strongest possibilities, step by step.

Our internal registry doesn’t work at all for some reason, but that’s another topic, so it wouldn’t make sense that it would be caching those results. Our internal registry marks everything as bad.

When you say it doesn’t work at all, what specific behaviour do you mean? That it marks all the events as bad, or that there’s some failure of communication between enrich and Iglu?

If it’s marking everything as bad, what are the error messages it produces?

After changing the resolver.js I ran this command:

That’s probably a typo, but just in case it’s not - the Iglu resolver file needs to be .json.

sudo supervisorctl stop enrich; sudo supervisorctl start enrich;

I’m not hugely familiar with supervisorctl, but if it’s parallel to stopping or starting AWS resources by other means, then stopping the process may not have been enough to clear the cache - a full reboot is what normally does the trick.

Finally,


const contexts = {
       schema: baseSchemaUrl,
       data: [productContext]
};


snowplow_events(
              "trackStructEvent",
              "product",
              "edit-quantity",
              item.product_sku,
              item.product_name,
              null,
              contexts
            );
          });

Looks like the ‘contexts’ object is a JSON object containing an array, rather than an array of self-describing JSONs. The context objects you pass through the tracker should resolve to this format:

[{
  "schema": "iglu:com.acme_company/movie_poster/jsonschema/2-1-1",
  "data": {
    "movie_name": "Solaris",
    "poster_country": "JP"
  }
}]

Each item in the context array needs to reference its own schema.
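
In your snippet that would mean dropping the wrapping object and passing the array straight through. A rough sketch, assuming your snowplow_events wrapper simply forwards its arguments to the tracker unchanged:

// Pass an array of self-describing JSONs directly - no { schema, data } wrapper.
const contexts = [productContext];

snowplow_events(
  "trackStructEvent",
  "product",
  "edit-quantity",
  item.product_sku,
  item.product_name,
  null,
  contexts
);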

Have you looked at the ‘good’ data that has come through the pipeline, and seen your custom contexts in it? I’m not entirely sure what the behaviour of the tracker or the validation process would be if you send a different object in as a context. I would expect bad rows but there’s a chance the whole object is being ignored completely, and your data is passing validation without the contexts.

Hope that helps pin down the issue, let us know if any of these turn out to be the problem.

Internal Schema Registry: Everything is marked as bad. It doesn’t produce any error messages that we are aware of - where would we find those? We know it is marked as bad by reading the ./logs/enrich.log file.

Reboot: I will test this out and let you know.

Contexts: I have found this documentation, which says to format it as above (an object with an array of self-describing objects), and this documentation, which states it needs to be an array of JSONs (similar to what you said) and gives an example in the third code block. Which one is right? I believe they are conflicting, unless I misunderstand their context and use case.

The event goes through the entire pipeline and the attached context is there as well.

Internal Schema Registry: Everything is marked as bad. It doesn’t produce any error messages that we are aware of - where would we find those? We know it is marked as bad by reading the ./logs/enrich.log file.

When data fails validation it’s not lost - it goes to the bad events stream. Depending on what you’ve set up, you would commonly use Elasticsearch or Athena to debug.

Each bad row will have one or more error messages, which will give you an indication as to why the data failed validation. Taking a look at those will likely make the whole setup process easier for you.

Contexts: I have found this documentation, which says to format it as above (an object with an array of self-describing objects), and this documentation, which states it needs to be an array of JSONs (similar to what you said) and gives an example in the third code block. Which one is right? I believe they are conflicting, unless I misunderstand their context and use case.

Both do agree as I read them, but I can see where you’re coming from - the first states that each individual context should be a self-describing JSON, then gives a description of the output, which is a slightly different format. The second example is probably the better one to follow, since it gives the specific tracking code, whereas the first is more about definitions.
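
To make the distinction concrete - this is only a sketch, and the exact envelope version may vary by tracker release - the plain array is what you hand to the tracker, and the wrapped object the first doc describes is what the tracker builds for you on the way out:

// Input to the tracker: a plain array of self-describing JSONs.
const trackerInput = [productContext];

// Output the tracker/pipeline produces (the format the first doc describes) -
// the contexts envelope is added for you; you never construct it yourself.
const pipelineOutput = {
  schema: "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1",
  data: [productContext]
};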

To quickly add, we have a tool called Snowplow Mini, which is quick and easy to set up.

It’s a small-scale instance of Snowplow in a box, with Elasticsearch querying capabilities for the good and bad streams.

It’s a pretty useful resource when it comes to testing custom schemas and tracking setup - it’ll give you a faster feedback loop.


I have used it before, but didn’t consider it for that use case. We will consider using Snowplow Mini as a testing environment for contexts.