A few hours ago our Enrich suddenly started diverting 100% of its input to its bad stream, tagging each record with:
"error: Unexpected exception fetching iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0 in HTTP Iglu repository IgluCentral: java.io.IOException: Server returned HTTP response code: 500 for URL: http://iglucentral.com/schemas/com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0\n level: \"error\"\n"
Enrich continued rejecting all of its input with this same message until an hour later when we spotted the problem and restarted it. We’re resubmitting those bad stream records to Enrich now, and it is accepting them this time.
Any suggestions as to what might have caused this, and how we could prevent a recurrence?
(in case it matters, we’re running Enrich 0.10.0. And yes I realize that this is kind of old and we need to upgrade it someday…)
Editing to add: I should also have mentioned that I looked in my /var/log/enrich.log for anything potentially related, but didn’t find anything.
Unfortunately, S3 outages (iglucentral.com is hosted there) do happen sometimes, and they result in this kind of error. Probably the best way to avoid it is to set the cacheTtl property in the resolver configuration to a not-very-high value. But cacheTtl is only available since R93 (Stream Enrich 0.11.0), so you’ll have to upgrade your stack first. CloudWatch alarms can also be useful for getting timely notifications - after you get one you can just restart the enrich process and it will clean up the cache.
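For reference, a resolver configuration with cacheTtl set would look roughly like the sketch below - the schema version, cacheSize and the 60-second TTL are just illustrative placeholders here, so check them against the setup guide for the version you upgrade to:

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-2",
  "data": {
    "cacheSize": 500,
    "cacheTtl": 60,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}
```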
Thanks for your suggestions, Anton. We already have CloudWatch and other alarms, and they’re useful. But I’m hoping for a way to avoid this situation entirely. I’m OK with transient schema lookup failures losing us 1 or 10 or even 100 records every now and then; that’s no big deal. What I really don’t like is that 1 or 2 seconds’ worth of lookup failures can cause our Stream Enrich to suddenly start discarding everything.
And I don’t know whether tweaking cacheTtl would help. Setting a low cacheTtl value would reduce the damage we’d take each time this happened, but it would also increase the number of lookups, and therefore the number of times one of these transient lookup failures could ambush us.
Is this something that other data engineers here have simply learned to live with?
Our enrich went live several months ago and ran problem-free for most of that time, but the incident described above was the third time this month that it started discarding all of its incoming traffic and needed to be defibrillated. Others must have run into this before. How far have people gone to deal with it? For example, has anyone completely automated the process of (1) detecting the problem, (2) restarting enrich, and (3) resubmitting the incorrectly discarded records? I’m thinking I may have to do that, but it feels like the wrong thing to be doing.
I didn’t create this resolver configuration, and this is the first time I’ve really looked at it. The first thing I see is that the cacheSize seems very small. I just asked, and apparently we use considerably more than 10 schemas overall. I’m guessing that if our cache is too small, that would greatly increase the number of HTTP lookups our enrich has to do, which in turn would increase its vulnerability to this sort of error?
Hi @drm.tgam - yes, your cache is definitely too small.
Also - you don’t have our Iglu Central mirror, which is hosted on Google Cloud, as an alternate Iglu Central provider. This is going to increase your vulnerability to outages.
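To give a rough idea, the updated resolver would bump cacheSize and add the mirror as a second repository at a lower priority, something like the sketch below (the schema version, the cacheSize value and the mirror URI are from memory here, so verify them against the release notes):

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Iglu Central - GCP Mirror",
        "priority": 1,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": {
          "http": {
            "uri": "http://mirror01.iglucentral.com"
          }
        }
      }
    ]
  }
}
```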
For details on this, check out section 5.3 Updating your Iglu resolver in the R95 Ellora release: