Enrich schema resolver did not restart

mbondarenko · February 7, 2018, 6:22pm

We had a glitch on AWS for around 15 or 20 seconds that caused DNS lookups to stop working. Enrich was unable to write to Kinesis during that time, and after a short while the Kinesis client restarted; however, it seems that the Enrich schema resolver never recovered from the glitch. So for the next little while we saw good data going to the bad stream with messages like:

"errors": [
  {
    "level": "error",
    "message": "error: Could not find schema with key iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0 in any repository, tried:\n    level: \"error\"\n    repositories: [\"TGAMIgluRepo [HTTP]\",\"Iglu Client Embedded [embedded]\",\"IgluCentral [HTTP]\"]\n"
  },
  {
    "level": "error",
    "message": "error: Unknown host issue fetching iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0 in HTTP Iglu repository IgluCentral: iglucentral.com\n    level: \"error\"\n"
  }
]

Has anyone experienced this before? How can we prevent it from happening again?
If we were not monitoring the bad stream on a constant basis, we could have had hours of good data go missing.

mike · February 7, 2018, 8:40pm

This may have been caused by the enricher caching those schemas as ‘bad’ when they didn’t resolve during the brief DNS outage. After DNS lookups had recovered these Iglu lookups still would have been hitting the bad cache. The easiest way to evict this cache is to restart the enrichment process.

If you haven’t already, I’d also recommend setting CloudWatch alarms on your bad Kinesis stream (both for excessive traffic and for no traffic), which would help alert you to this issue in the future.
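
As a rough sketch of the two alarms, the helper below builds the keyword arguments one could pass to boto3's `put_metric_alarm`. The function name, alarm names, threshold, and period are illustrative assumptions and not taken from any existing setup; the `AWS/Kinesis` namespace, `IncomingRecords` metric, and `StreamName` dimension are the standard Kinesis stream metrics. Notification targets (`AlarmActions`) are left out.

```python
def bad_stream_alarms(stream_name, high_threshold, period_seconds=300):
    """Build parameters for two CloudWatch alarms on a Kinesis stream.

    Hypothetical helper: names and default values are assumptions.
    """
    common = {
        "Namespace": "AWS/Kinesis",
        "MetricName": "IncomingRecords",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
    }
    # Fires when more records than expected land in the bad stream.
    too_many = dict(
        common,
        AlarmName=f"{stream_name}-excessive-traffic",
        ComparisonOperator="GreaterThanThreshold",
        Threshold=float(high_threshold),
        TreatMissingData="notBreaching",
    )
    # Fires when nothing arrives at all; missing data counts as a breach.
    no_traffic = dict(
        common,
        AlarmName=f"{stream_name}-no-traffic",
        ComparisonOperator="LessThanOrEqualToThreshold",
        Threshold=0.0,
        TreatMissingData="breaching",
    )
    return [too_many, no_traffic]


# With boto3 installed and credentials configured, something like:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   for params in bad_stream_alarms("enrich-bad", high_threshold=1000):
#       cloudwatch.put_metric_alarm(**params)
```

The right threshold depends on how many bad rows the pipeline normally produces, so it is worth tuning against a few days of real traffic.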

anton · February 8, 2018, 6:16am

Hey @mbondarenko,

Another way to avoid sending data to bad due to network issues is to use the cache TTL. It has been available in the RT pipeline since R93.
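
To illustrate why a TTL helps, here is a minimal sketch of a lookup cache that stores failures as well as successes. It is not the actual Iglu client code; the class `SchemaCache`, its parameters, and the use of `OSError` to stand in for a DNS failure are all assumptions made for the example. Without a TTL a failed lookup stays cached until the process restarts; with a TTL the entry expires and the lookup is retried.

```python
import time


class SchemaCache:
    """Toy cache that remembers both successful and failed lookups."""

    def __init__(self, fetch, ttl_seconds=None, clock=time.monotonic):
        self._fetch = fetch        # function: key -> schema, may raise OSError
        self._ttl = ttl_seconds    # None means entries never expire
        self._clock = clock
        self._entries = {}         # key -> (stored_at, result)

    def lookup(self, key):
        entry = self._entries.get(key)
        if entry is not None:
            stored_at, result = entry
            # Serve from cache unless a TTL is set and the entry is too old.
            if self._ttl is None or self._clock() - stored_at < self._ttl:
                return result
        try:
            result = ("ok", self._fetch(key))
        except OSError as err:
            # The failure is cached too, which is what causes the problem
            # when there is no TTL.
            result = ("error", str(err))
        self._entries[key] = (self._clock(), result)
        return result
```

Once the network recovers, a cache built with `ttl_seconds=60` starts returning good results again after at most a minute, whereas one built with `ttl_seconds=None` keeps returning the cached error.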

mbondarenko · February 20, 2018, 1:37pm

Thanks for confirming! @mike this is pretty much the setup we have (CloudWatch on the Kinesis stream), which is why we caught it early, but we did not restart for a good 90 minutes as we thought it should recover automatically. That was wishful thinking, and we had to reprocess those 90 minutes of data as a result.

@anton Thank you for the tip! We do need to upgrade to take advantage of that feature and we will be upgrading soon as a result of the glitch.

jimy2004king · February 18, 2020, 10:33am

Hey @anton,

I tried to find it in the docs but couldn’t. What is the unit of time for cacheTTL? Is it minutes, seconds, or hours?

anton · February 18, 2020, 11:05am

Hi @jimy2004king. It is seconds.
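
For reference, a resolver configuration with a TTL might look roughly like the structure below, built here as a Python dict and printed as JSON. This assumes the field is spelled `cacheTtl` and that the `1-0-2` version of the resolver-config schema is in use; the values `500` and `600` and the single repository entry are placeholders, so check them against the version you are running.

```python
import json

# Assumed shape of an Iglu resolver config; values are placeholders.
resolver_config = {
    "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-2",
    "data": {
        "cacheSize": 500,
        "cacheTtl": 600,  # seconds, so cached entries expire after 10 minutes
        "repositories": [
            {
                "name": "Iglu Central",
                "priority": 0,
                "vendorPrefixes": ["com.snowplowanalytics"],
                "connection": {"http": {"uri": "http://iglucentral.com"}},
            }
        ],
    },
}

print(json.dumps(resolver_config, indent=2))
```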
