Understanding schema validation and caching in Snowplow

alex · May 13, 2016, 11:09am

Recently we have had some Snowplow users and customers reporting unexpected behaviors around schema validation in Snowplow. This thread is a brief explanation of schema validation and caching to help explain those behaviors.

Schema validation in Snowplow

Components which perform schema validation are:

Hadoop Enrich - which validates unstructured events and custom contexts. Derived contexts which are added to the event by Hadoop Enrich itself (such as with the new API Request Enrichment) are not currently validated by Hadoop Enrich
Hadoop Shred - which validates unstructured events, custom contexts and derived contexts prior to loading into Redshift. Very little fails validation here - typically only derived contexts added by Hadoop Enrich, or, in very rare cases, entities whose schema was changed between the Hadoop Enrich and Hadoop Shred steps
Stream Enrich - which validates unstructured events and custom contexts. Like Hadoop Enrich, derived contexts which are added to the event by Stream Enrich itself are not currently validated by Stream Enrich
Snowplow Mini - as Snowplow Mini uses Stream Enrich under the hood, the schema validation behavior is the same

The exact specifics of schema validation in Snowplow are out of scope of this guide; we’ll post a separate guide on this in the future.

Schema caching in Snowplow

The four components that perform schema validation above all cache the schemas that they retrieve from Iglu registries.

Remember that a Snowplow event stream can consist of many millions of entities (unstructured events and custom contexts) which must all be validated; without schema caching Snowplow would effectively be launching a denial of service attack against the specified Iglu registries.

Schema caching in Snowplow uses in-memory LRU caches which evict the Least Recently Used schemas in favor of schemas which are being more actively referenced. This prevents the LRU cache from growing to an unlimited size.
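
To make the eviction behavior concrete, here is a minimal sketch of an LRU cache in Python. It is not the Snowplow implementation; the class name `LruCache`, the capacity of 2 and the `com.acme` schema keys are all made up for illustration.

```python
from collections import OrderedDict


class LruCache:
    """Illustrative LRU cache holding at most `capacity` entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used

    def keys(self):
        return list(self._entries)


# Hypothetical schema keys, for illustration only
KEY_A = "iglu:com.acme/a/jsonschema/1-0-0"
KEY_B = "iglu:com.acme/b/jsonschema/1-0-0"
KEY_C = "iglu:com.acme/c/jsonschema/1-0-0"

cache = LruCache(capacity=2)
cache.put(KEY_A, {"title": "a"})
cache.put(KEY_B, {"title": "b"})
cache.get(KEY_A)                  # A is now more recently used than B
cache.put(KEY_C, {"title": "c"})  # cache is full, so B is evicted
```

After the last call the cache holds A and C only; B would have to be fetched from the registry again the next time it is referenced.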

Understanding cache scope and lifetime

It’s important to understand the scope and lifetime of the schema caches. These vary by Snowplow component:

Hadoop Enrich & Hadoop Shred

There is a cache for each Hadoop worker node - not one cache shared between nodes
Although they both run on the same EMR cluster, Hadoop Enrich and Hadoop Shred have independent caches
The caches will live for as long as that EMR jobflow step is running - e.g. when the Hadoop Enrich jobflow step completes, the cache is lost

Stream Enrich

There is a cache for each instance of the Stream Enrich app - and we recommend running one app per server, so there will effectively be one cache per server running Stream Enrich
The cache will live as long as that Stream Enrich app instance is not terminated and restarted (e.g. by a server reboot) - the LRU algorithm means that the cache can happily go on adding and evicting values for many months or years

Snowplow Mini

Under the hood a Snowplow Mini instance has a single Stream Enrich app running, so the same rules apply

Where cached schemas can cause problems

In theory Iglu schemas should be immutable, but there are two relatively common scenarios where caching schemas can cause problems:

Late added schemas: if events referencing a schema arrive before the schema has been uploaded into the Iglu registry, then Snowplow will cache that schema as being unavailable for the lifetime of that cache
Patched schemas: sometimes a schema already uploaded to Iglu is found to be incorrect and is therefore patched. This breaks the immutability guarantee around schemas in Iglu, and any Snowplow schema cache will continue to hold the old version of the schema for the lifetime of that cache
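
Both scenarios can be reproduced with a toy model. The sketch below is an assumption-laden simplification: a plain dict stands in for the Iglu registry, another dict stands in for a component's schema cache, and clearing the cache stands in for a restart. The `lookup` function and the `com.acme` schema key are hypothetical.

```python
registry = {}       # stands in for an Iglu registry
cache = {}          # stands in for one component's schema cache
MISSING = object()  # marker for "looked up, not found"


def lookup(key):
    """Return the schema for `key`, consulting the registry only on a cache miss."""
    if key not in cache:
        cache[key] = registry.get(key, MISSING)  # failures are cached too
    result = cache[key]
    return None if result is MISSING else result


KEY = "iglu:com.acme/checkout/jsonschema/1-0-0"  # hypothetical schema key

# Late added schema: the event arrives before the schema is uploaded
before_upload = lookup(KEY)              # not found, cached as missing
registry[KEY] = {"version": "original"}  # schema uploaded afterwards
after_upload = lookup(KEY)               # still missing: served from the cache

cache.clear()                            # simulates restarting the component
after_restart = lookup(KEY)              # now found

# Patched schema: the registry copy is replaced in place
registry[KEY] = {"version": "patched"}
after_patch = lookup(KEY)                # still the original: served from the cache

cache.clear()
after_second_restart = lookup(KEY)       # patched version picked up
```

In both cases the registry is correct but the component keeps answering from its cache until the cache is thrown away.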

Resolving problems with cached schemas

With Snowplow Mini and Stream Enrich, you will need to restart the relevant servers to clear the caches.

With the Snowplow batch pipeline, because the caches are short-lived, things are more straightforward: the next batch pipeline run will re-build the caches from scratch, picking up the latest schemas.

Recovering events which failed validation before the schema caching problem was resolved is out of scope of this guide; we’ll post a separate guide on this in the future.

christoph-buente · June 30, 2016, 1:05pm

Hi Alex,

We upgraded our stream enricher to the latest version, and it stopped working because of schema validation. The error message says the schema could not be found in any repository.

We have published our schemas here: http://b-iglu.liadm.com/ but the message is:

Could not find schema with key iglu:com.retentiongrid/content_details/jsonschema/1-0-0 in any repository

But it’s clearly available.

This is our resolver conf:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [
          "com.snowplowanalytics"
        ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },{
        "name": "Iglu LiveIntent",
        "priority": 5,
        "vendorPrefixes": [
          "com.retentiongrid",
          "com.liveintent"
        ],
        "connection": {
          "http": {
            "uri": "http://b-iglu.liadm.com"
          }
        }
      }
    ]
  }
}

Cheers, Chris

alex · June 30, 2016, 2:47pm

You are right, the file is available: http://b-iglu.liadm.com/schemas/com.retentiongrid/content_details/jsonschema/1-0-0

Have you tried:

Bouncing the Stream Enrich boxes
Confirming the URI is accessible from the Stream Enrich boxes
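
For the second check, a small script run on the Stream Enrich box can help. This is a sketch only: it assumes the registry is a static HTTP repository laid out like the URL quoted above (`<uri>/schemas/<vendor>/<name>/<format>/<version>`), and the helper names `schema_url` and `is_reachable` are made up for this example.

```python
import urllib.error
import urllib.request


def schema_url(registry_uri, schema_key):
    """Build the HTTP URL for a schema key, assuming a /schemas/ path layout."""
    prefix = "iglu:"
    if not schema_key.startswith(prefix):
        raise ValueError(f"not an Iglu schema key: {schema_key}")
    path = schema_key[len(prefix):]
    return f"{registry_uri.rstrip('/')}/schemas/{path}"


def is_reachable(url, timeout=5):
    """Return True if an HTTP GET on `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


url = schema_url(
    "http://b-iglu.liadm.com",
    "iglu:com.retentiongrid/content_details/jsonschema/1-0-0",
)
print(url)
# On the Stream Enrich box itself you would then call: is_reachable(url)
```

If `is_reachable(url)` returns True from the box and the error persists, a stale cache entry is the more likely cause.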

christoph-buente · June 30, 2016, 3:28pm

Yes, the schemas are publicly available and I can fetch them from the enricher boxes. What do you mean by:

Bouncing the Stream Enrich boxes

alex · June 30, 2016, 3:30pm

I mean restarting the box, in case your Stream Enrich cached the schema as not existing before you uploaded it.

christoph-buente · June 30, 2016, 3:32pm

I restarted the service, which did not seem to have any effect. However, after really stopping and starting the enrichment process, the cache was emptied and I saw those error messages pop up. Still, unsuccessful lookups should maybe not be cached.

alex · June 30, 2016, 7:23pm

If we don’t cache unsuccessful lookups, then a single missing schema will slow enrichment to a crawl and launch a DDoS on every Iglu registry in your resolver (because every event will have to make HTTP requests to every registry looking for the schema)…
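
A rough back-of-the-envelope simulation of that trade-off, assuming every event references the same missing schema and each failed lookup tries every registry once. The function `count_requests` and the numbers are illustrative, not measurements from a real pipeline.

```python
def count_requests(num_events, num_registries, cache_failures):
    """Count HTTP requests made while enriching events that all reference
    one schema which exists in none of the registries."""
    known_missing = False
    requests = 0
    for _ in range(num_events):
        if known_missing:
            continue                # answered from the cache, no HTTP traffic
        requests += num_registries  # try every registry, all of them fail
        if cache_failures:
            known_missing = True    # remember the failure
    return requests


with_failure_cache = count_requests(100_000, 2, cache_failures=True)
without_failure_cache = count_requests(100_000, 2, cache_failures=False)
print(with_failure_cache, without_failure_cache)
```

With two registries and 100,000 such events, caching the failure costs 2 requests in total, while not caching it costs 200,000.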

christoph-buente · July 1, 2016, 9:58am

Killing the instance did the trick, thx.

alex · July 1, 2016, 10:47am

Ah great! Thanks for letting us know… We are thinking about putting a TTL on cache entries so that a missing schema is re-checked in the registries every hour or so…
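
To illustrate the idea only: the sketch below is a hypothetical time-to-live cache, not something that exists in Snowplow at the time of this post. The class `TtlCache`, the one-hour TTL and the fake clock are all assumptions made for the example.

```python
class TtlCache:
    """Illustrative cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # function returning the current time in seconds
        self._entries = {}   # key -> (time stored, value)

    def get(self, key, fetch):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]              # still fresh, even if it is a failure
        value = fetch(key)               # expired or never seen: ask the registry
        self._entries[key] = (now, value)
        return value


now = [0]        # fake clock so the example does not need to sleep
registry = {}    # stands in for an Iglu registry
KEY = "iglu:com.acme/checkout/jsonschema/1-0-0"  # hypothetical schema key

cache = TtlCache(ttl_seconds=3600, clock=lambda: now[0])

at_start = cache.get(KEY, registry.get)      # not found, failure cached
registry[KEY] = {"title": "checkout"}        # schema uploaded late

now[0] = 1800
after_30_min = cache.get(KEY, registry.get)  # failure still cached

now[0] = 3600
after_60_min = cache.get(KEY, registry.get)  # entry expired, schema found
```

With such a scheme a late added schema would be picked up within one TTL period without restarting anything, at the cost of one extra round of registry lookups per expired entry.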
