Iglu JSON caching

Hello,
First-time Snowplow/Iglu user here.

My main goal is to set up Snowplow for production use inside my company. While experimenting, I was thinking of using S3 as a static Iglu repository. I am very interested in building a robust, highly available and cheap(!) solution.

I am wondering whether every incoming event is validated by the Iglu Schema Validator. If so, does this mean that the load on the static server will be proportional to the incoming data? Is there any kind of caching?

I am not very familiar with Scala, but I have searched the Enricher code and the Iglu client repository to find out whether any caching is done. I haven’t found anything.

Do you think this affects performance? Should we put a load balancer in front of the repository, or is that overkill?

Please share your experience with and insights into Iglu load requirements.

Thank you.

Hey @alexopoulos7,

Sure thing, the Scala Iglu client uses a cache! Its size can be configured with the cacheSize setting in the resolver configuration, and its TTL with cacheTtl. Under the hood this is an LRU cache, which I believe is the most efficient approach here.
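
To illustrate the idea, here is a rough Python sketch of an LRU cache with a TTL. It is not the Iglu client's actual Scala code: the class name TtlLruCache and its cache_size / cache_ttl parameters are made up for this example and only mirror the roles of cacheSize and cacheTtl.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """Hypothetical LRU cache with per-entry expiry (illustration only)."""

    def __init__(self, cache_size, cache_ttl, clock=time.monotonic):
        self.cache_size = cache_size  # maximum number of cached entries
        self.cache_ttl = cache_ttl    # entry lifetime in seconds
        self.clock = clock            # injectable so expiry can be tested
        self._entries = OrderedDict() # key -> (stored_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.cache_ttl:
            # Expired: drop it so the caller fetches a fresh copy.
            del self._entries[key]
            return None
        # Mark as most recently used.
        self._entries.move_to_end(key)
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)
        self._entries.move_to_end(key)
        if len(self._entries) > self.cache_size:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
```

With cache_size=2, putting a third key evicts whichever of the first two was used least recently, and any entry older than cache_ttl seconds is treated as a miss.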

Most likely, your registry will receive about as many HTTP requests as you have distinct schemas in your dataset (plus a few auxiliary schemas, times the number of nodes). That is usually a very small number, so I don’t think this will be a real performance concern.
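
A small Python simulation of that claim, under the assumption that a single node keeps one in-memory cache: 10,000 events that reference only three schemas should cause three registry fetches. The function names, the com.acme schema URIs and the maxsize value are all hypothetical, and functools.lru_cache stands in for the client's own cache.

```python
from functools import lru_cache

fetch_count = 0  # number of simulated HTTP requests to the registry


def fetch_from_registry(schema_uri):
    """Stand-in for an HTTP GET against a static registry (hypothetical)."""
    global fetch_count
    fetch_count += 1
    return {"self": schema_uri}


@lru_cache(maxsize=500)  # plays the role of cacheSize in this sketch
def lookup_schema(schema_uri):
    # Only reached on a cache miss; hits are served from memory.
    return fetch_from_registry(schema_uri)


def validate_events(events):
    """Look up the schema for every event and return the total fetch count."""
    for schema_uri in events:
        lookup_schema(schema_uri)
    return fetch_count


# Hypothetical schema URIs, reused across a large event stream.
schemas = [
    "iglu:com.acme/page_view/jsonschema/1-0-0",
    "iglu:com.acme/checkout/jsonschema/1-0-0",
    "iglu:com.acme/checkout/jsonschema/1-0-1",
]
events = [schemas[i % len(schemas)] for i in range(10_000)]
```

Calling validate_events(events) touches the registry once per distinct schema rather than once per event. Each additional node would repeat those few fetches with its own cache.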

@anton beat me to it!

If you’re interested in the logic of the LRU cache, it lives here.

Hi everyone, hi @mike. After checking the source code, I have two questions:

  1. Is the cacheTtl value in seconds? For one hour, would it be "cacheTtl": 3600?

  2. Is cacheSize just the number of schemas, e.g. "cacheSize": 1000 for 1000 schemas, or something else?

Could anyone point me to the source code where this is used?

Yes, this is in seconds.

Yes, the LRU cache stores entries under a key composed of vendor, name, format and full Iglu schema version, so each cached entry is one schema. The code for parts of the resolver can be found here.
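
As a rough illustration of what such a key looks like, here is a Python sketch that splits an Iglu schema URI into those four parts. SchemaKey and parse_iglu_uri are names invented for this example, not the client's API, and com.acme is a made-up vendor.

```python
from typing import NamedTuple


class SchemaKey(NamedTuple):
    """Hypothetical cache key: one entry per vendor/name/format/version."""
    vendor: str
    name: str
    format: str
    version: str


def parse_iglu_uri(uri):
    """Split 'iglu:vendor/name/format/version' into a SchemaKey."""
    prefix = "iglu:"
    if not uri.startswith(prefix):
        raise ValueError(f"not an Iglu URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected vendor/name/format/version in {uri!r}")
    return SchemaKey(*parts)


# Two versions of the same schema yield two different keys,
# so they would occupy two separate slots in the cache.
key_a = parse_iglu_uri("iglu:com.acme/checkout/jsonschema/1-0-0")
key_b = parse_iglu_uri("iglu:com.acme/checkout/jsonschema/1-0-1")
```

Because the full version is part of the key, a cacheSize of 1000 in this picture would mean 1000 distinct schema versions, not 1000 schema names.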

Awesome. Thanks @mike

cc @mpeychet