Recently some Snowplow users and customers have reported unexpected behavior around schema validation in Snowplow. This thread briefly explains schema validation and caching to help make sense of that behavior.
Schema validation in Snowplow
Components which perform schema validation are:
- Hadoop Enrich - which validates unstructured events and custom contexts. Derived contexts which are added to the event by Hadoop Enrich itself (such as with the new API Request Enrichment) are not currently validated by Hadoop Enrich
- Hadoop Shred - which validates unstructured events, custom contexts and derived contexts prior to loading into Redshift. Very little fails validation here - typically only derived contexts added by Hadoop Enrich, or, very rarely, entities whose schema was changed between the Hadoop Enrich and Hadoop Shred runs
- Stream Enrich - which validates unstructured events and custom contexts. Like Hadoop Enrich, derived contexts which are added to the event by Stream Enrich itself are not currently validated by Stream Enrich
- Snowplow Mini - as Snowplow Mini uses Stream Enrich under the hood, the schema validation behavior is the same
The specifics of schema validation in Snowplow are out of scope for this guide; we’ll post a separate guide on this in the future.
Schema caching in Snowplow
The four components that perform schema validation above all cache the schemas that they retrieve from Iglu registries.
Remember that a Snowplow event stream can consist of many millions of entities (unstructured events and custom contexts) which must all be validated; without schema caching Snowplow would effectively be launching a denial of service attack against the specified Iglu registries.
Schema caching in Snowplow uses in-memory LRU caches which evict the Least Recently Used schemas in favor of schemas which are being more actively referenced. This prevents the LRU cache from growing to an unlimited size.
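The eviction behavior described above can be sketched in a few lines. This is only an illustration of the LRU idea, not Snowplow's actual implementation: the class name `LruSchemaCache`, the `max_size` parameter and the `fetch` callback (standing in for a lookup against an Iglu registry) are all hypothetical.

```python
from collections import OrderedDict


class LruSchemaCache:
    """Illustrative bounded cache; all names here are hypothetical."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._entries = OrderedDict()  # ordered from least to most recently used

    def get(self, schema_key, fetch):
        """Return the cached schema, calling fetch(schema_key) only on a miss."""
        if schema_key in self._entries:
            # Cache hit: mark this schema as the most recently used
            self._entries.move_to_end(schema_key)
            return self._entries[schema_key]
        schema = fetch(schema_key)  # stand-in for a registry lookup
        self._entries[schema_key] = schema
        if len(self._entries) > self._max_size:
            # Over capacity: evict the least recently used schema
            self._entries.popitem(last=False)
        return schema

    def keys(self):
        """Cached keys, least recently used first."""
        return list(self._entries)
```

With `max_size=2`, looking up schemas `a`, `b`, `a` and then `c` evicts `b` rather than `a`, because `a` was referenced more recently; the second lookup of `a` never reaches the registry.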
Understanding cache scope and lifetime
It’s important to understand the scope and lifetime of the schema caches. These vary by Snowplow component:
Hadoop Enrich & Hadoop Shred
- There is a cache for each Hadoop worker node - not one cache shared between nodes
- Although they both run on the same EMR cluster, Hadoop Enrich and Hadoop Shred have independent caches
- The caches will live for as long as that EMR jobflow step is running - e.g. when the Hadoop Enrich jobflow step completes, the cache is lost
Stream Enrich
- There is a cache for each instance of the Stream Enrich app - and we recommend running one app per server, so there will effectively be one cache per server running Stream Enrich
- The cache will live until that Stream Enrich app instance is terminated or restarted (e.g. by a server reboot) - the LRU algorithm means that the cache can happily go on adding and evicting values for many months or years
Snowplow Mini
- Under the hood a Snowplow Mini instance has a single Stream Enrich app running, so the same rules apply
Where cached schemas can cause problems
In theory Iglu schemas should be immutable, but there are two relatively common scenarios where caching schemas can cause problems:
- Late-added schemas: if events referencing a schema arrive before the schema has been uploaded into the Iglu registry, then Snowplow will cache that schema as being unavailable for the lifetime of that cache
- Patched schemas: sometimes a schema already uploaded to Iglu is found to be incorrect and is therefore patched. This breaks the immutability guarantee around schemas in Iglu, and any Snowplow schema cache will continue to hold the old version of the schema for the lifetime of that cache
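Both scenarios follow from the same property: whatever a cache saw first is what it keeps returning. The toy model below illustrates this under some loud assumptions: a plain dict stands in for the Iglu registry, `SchemaCache` is a hypothetical class rather than Snowplow code, and the `com.acme` schema key and its contents are made up for the example.

```python
class SchemaCache:
    """Toy cache that remembers every lookup result, including misses."""

    def __init__(self, registry):
        self._registry = registry  # a dict standing in for an Iglu registry
        self._entries = {}

    def lookup(self, schema_key):
        if schema_key not in self._entries:
            # A miss (None) is stored exactly like a hit
            self._entries[schema_key] = self._registry.get(schema_key)
        return self._entries[schema_key]


KEY = "iglu:com.acme/checkout/jsonschema/1-0-0"  # hypothetical schema key
registry = {}

# Late-added schema: the first event arrives before the schema is uploaded
cache = SchemaCache(registry)
before_upload = cache.lookup(KEY)           # None, and the miss is now cached
registry[KEY] = {"required": ["order_id"]}  # schema uploaded afterwards
after_upload = cache.lookup(KEY)            # still None for this cache

# Patched schema: a cache that already holds the schema keeps the old version
patched_cache = SchemaCache(registry)
old_version = patched_cache.lookup(KEY)
registry[KEY] = {"required": ["order_id", "currency"]}  # patched in place
still_old = patched_cache.lookup(KEY)       # the pre-patch schema

# Only a brand new cache sees the current state of the registry
fresh = SchemaCache(registry).lookup(KEY)
```

In this model the only way to see the uploaded or patched schema is to start with a new cache, which is why the lifetime of each cache matters so much.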
Resolving problems with cached schemas
With Snowplow Mini and Stream Enrich, you will need to restart the relevant servers to clear the caches.
With the Snowplow batch pipeline, because the caches are short-lived, things are more straightforward: the next batch pipeline run will re-build the caches from scratch, picking up the latest schemas.
Recovering events which failed validation before the schema caching problem was resolved is out of scope of this guide; we’ll post a separate guide on this in the future.