Enrich 3.2.1 - WARN with ERROR

kfitzpatrick · July 6, 2022, 3:46pm

Hey Guys,

Versions 3.2.0 and earlier (back to 3.1.3) were giving us issues, mostly around Kinesis shards; the patch notes for each new version mirror our experience. We have now load tested 3.2.1 heavily and found it very much improved.

However, from an automation point of view (i.e. the pod gets killed and restarted based on log parsing/polling), the following messages should probably be logged at a level other than “error”:

ERROR software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher
ERROR com.snowplowanalytics.snowplow.enrich.kinesis.Sink
ERROR software.amazon.kinesis.coordinator.Scheduler - Worker.run caught exception

They aren’t really errors while Enrich is running; they are more like informational warnings. A full log example would be (note the warn, then the error):

[pool-1-thread-2] WARN com.snowplowanalytics.snowplow.enrich.kinesis.KinesisRun - Skipping checkpointing of shard shardId-000000000011 because this worker no longer owns the lease
[prefetch-cache-shardId-000000000011-0000] ERROR software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher -

A suggested change would be to move away from “error” to something like “caught exception”:

[pool-1-thread-2] WARN com.snowplowanalytics.snowplow.enrich.kinesis.KinesisRun - Skipping checkpointing of shard shardId-000000000011 because this worker no longer owns the lease
[prefetch-cache-shardId-000000000011-0000] CAUGHT EXCEPTION software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher -

This came about because, when we parse the logs, we get a warn (yellow) followed by an error (red), which makes log parsing and polling difficult. Generally, if we see an error we trigger a restart, while a warn is just logged.
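
As a rough sketch of the rule described above (not our actual tooling), a parser could read the level and logger from each line and downgrade ERROR lines from specific loggers to "just log it". The regex assumes the `[thread] LEVEL logger - message` format shown in the examples, and `DOWNGRADED_LOGGERS` and `classify` are hypothetical names made up for illustration:

```python
import re

# Assumes the "[thread] LEVEL logger - message" format quoted in this post.
LOG_LINE = re.compile(
    r"^\[(?P<thread>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+-\s*(?P<message>.*)$"
)

# Hypothetical allowlist: loggers whose ERROR lines are treated like warnings.
# The names are the ones listed earlier in this post.
DOWNGRADED_LOGGERS = {
    "software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher",
    "com.snowplowanalytics.snowplow.enrich.kinesis.Sink",
    "software.amazon.kinesis.coordinator.Scheduler",
}


def classify(line: str) -> str:
    """Return "restart", "log", or "ignore" for a single log line."""
    match = LOG_LINE.match(line)
    if match is None:
        # Stack traces, blank lines, or anything in another format.
        return "ignore"
    level = match.group("level")
    logger = match.group("logger")
    if level == "ERROR":
        # ERROR normally triggers a restart, unless the logger is allowlisted.
        return "log" if logger in DOWNGRADED_LOGGERS else "restart"
    if level == "WARN":
        return "log"
    return "ignore"
```

The obvious downside is that an allowlist like this would also hide any genuinely fatal error coming from those loggers, which is why a level change at the source would be cleaner.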

Thanks
Kyle

istreeter · July 6, 2022, 4:36pm

Hi @kfitzpatrick, thanks for sharing details of your setup! I had not heard of anyone using the log message level to trigger a restart of Snowplow apps, so it’s always interesting to hear about different production setups.

Regarding the error message in com.snowplowanalytics.snowplow.enrich.kinesis.Sink (this one), I think you are correct that this could be a warning. Enrich is able to tolerate and recover from a failure to publish messages to Kinesis, so an informative warning is more appropriate.

Regarding the other error messages you mentioned: those come from a third-party library that we don’t control, e.g. here and here. I don’t think there’s anything that we (Snowplow) can do to stop the Kinesis consumer from logging those errors.

The more I think about it, the more I think you should consider disabling your log parsing. Enrich-kinesis is actually pretty good at shutting itself down and exiting cleanly whenever a critical error occurs. You just need to make sure your pod restarts whenever the app terminates. I’m not aware of any critical error that would require external intervention to trigger a restart.

I would love to hear your thoughts on this.
