My main goal is to set up Snowplow for production use inside my company. While playing around, I was thinking of using S3 to host a static Iglu repository. I am very interested in building a robust, highly available and cheap(!) solution.
I am wondering whether every incoming event is validated by the Iglu schema validator. Does this mean that the load on the static server will be proportional to the volume of incoming data? Is there any kind of caching?
I am not very familiar with Scala, but I have searched the enrichment code and the Iglu client repository for signs of caching and haven't found anything.
Do you think this affects performance? Should we use something like a load balancer, or is that overkill?
Please share your experience and insights on Iglu load requirements.
Sure thing, the Scala Iglu client uses a cache! Its size can be configured with the cacheSize setting in the resolver configuration, and its TTL with cacheTtl. Under the hood this is an LRU cache, which I believe is the most efficient approach here.
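To illustrate the idea (this is not the Iglu client's actual code, which is written in Scala), here is a minimal Python sketch of an LRU cache with a TTL. The names LruTtlCache, lookup and fetch are hypothetical; max_size and ttl_seconds only play the role that cacheSize and cacheTtl play in the resolver configuration.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """Illustrative LRU cache with an optional time-to-live per entry."""

    def __init__(self, max_size, ttl_seconds=None, clock=time.monotonic):
        self.max_size = max_size          # analogous to cacheSize
        self.ttl_seconds = ttl_seconds    # analogous to cacheTtl (None = never expire)
        self.clock = clock                # injectable so expiry can be tested
        self._entries = OrderedDict()     # key -> (stored_at, value), oldest first

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.ttl_seconds is not None and self.clock() - stored_at >= self.ttl_seconds:
            # Entry is too old: drop it so the caller fetches a fresh copy.
            del self._entries[key]
            return None
        # Mark as most recently used.
        self._entries.move_to_end(key)
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)
        self._entries.move_to_end(key)
        # Evict least recently used entries once the size limit is exceeded.
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)


def lookup(cache, key, fetch):
    """Return the cached schema, calling fetch (e.g. an HTTP GET) only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    schema = fetch(key)
    cache.put(key, schema)
    return schema
```

With a cache like this, the registry only sees a request the first time a schema is needed, after that schema has been evicted, or after its TTL has expired, regardless of how many events use it.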
Most likely, your registry will receive about as many HTTP requests as there are schemas in your dataset (plus a few auxiliary schemas, times the number of nodes). That is usually a very small number, so I don't think this will be a real performance concern.
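As a rough back-of-the-envelope check, the estimate can be written down as below. This is a sketch under my own assumptions: each node keeps its own cache, the cache is large enough that nothing gets evicted, and the count is per cache lifetime (one TTL period). The function name and the example numbers are made up for illustration.

```python
def max_registry_fetches(distinct_schemas, auxiliary_schemas, nodes):
    """Upper bound on registry requests per cache lifetime.

    Assumes every node fetches each schema it needs at most once,
    independently of how many events reference that schema.
    """
    return (distinct_schemas + auxiliary_schemas) * nodes


# Hypothetical pipeline: 40 custom schemas, 5 auxiliary ones, 4 enrichment nodes.
estimate = max_registry_fetches(distinct_schemas=40, auxiliary_schemas=5, nodes=4)
print(estimate)  # 180
```

The point is that event volume does not appear in the formula at all, so the load on the static registry stays flat as traffic grows.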
Yes, the LRU cache stores entries under a key composed of the vendor, name, format and full Iglu schema version. The code for parts of the resolver can be found here.
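For illustration only, such a key could be modelled like this in Python. SchemaKey and parse_iglu_uri are hypothetical names, not the Scala client's API, and the com.acme schemas are made up; the only thing taken as given is the iglu:vendor/name/format/version layout of an Iglu URI.

```python
from typing import NamedTuple


class SchemaKey(NamedTuple):
    """The four parts that together identify one schema version."""
    vendor: str
    name: str
    format: str
    version: str


def parse_iglu_uri(uri):
    """Split an Iglu URI such as iglu:com.acme/checkout/jsonschema/1-0-0."""
    prefix = "iglu:"
    if not uri.startswith(prefix):
        raise ValueError(f"not an Iglu URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected vendor/name/format/version in {uri!r}")
    return SchemaKey(*parts)


# Two versions of the same schema give two distinct cache keys.
v100 = parse_iglu_uri("iglu:com.acme/checkout/jsonschema/1-0-0")
v101 = parse_iglu_uri("iglu:com.acme/checkout/jsonschema/1-0-1")
cache = {v100: "schema body for 1-0-0", v101: "schema body for 1-0-1"}
```

Because the full version is part of the key, publishing a new version of a schema causes one extra fetch per node rather than invalidating what is already cached.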