Enrich 3.2.1 - WARN with ERROR

Hey Guys,

Versions 3.2.0 and earlier (back to 3.1.3) were giving us issues, mostly around Kinesis shards; the patch notes for each new version mirror our experience. We have now load tested 3.2.1 heavily and found it very much improved.

However, from an automation point of view (i.e. killing and restarting the pod based on log parsing/polling), the following errors should probably be logged at a level other than “error”:

  1. ERROR software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher
  2. ERROR com.snowplowanalytics.snowplow.enrich.kinesis.Sink
  3. ERROR software.amazon.kinesis.coordinator.Scheduler - Worker.run caught exception

They aren’t really errors while Enrich is running; they are more like informational warnings. A full log example would be (note the warn, then the error):

[pool-1-thread-2] WARN com.snowplowanalytics.snowplow.enrich.kinesis.KinesisRun - Skipping checkpointing of shard shardId-000000000011 because this worker no longer owns the lease
[prefetch-cache-shardId-000000000011-0000] ERROR software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher -

A suggested change would be to move away from ERROR to something like CAUGHT EXCEPTION:

[pool-1-thread-2] WARN com.snowplowanalytics.snowplow.enrich.kinesis.KinesisRun - Skipping checkpointing of shard shardId-000000000011 because this worker no longer owns the lease
[prefetch-cache-shardId-000000000011-0000] CAUGHT EXCEPTION software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher -

This came about because when we parse the logs we get a warn (yellow) followed by an error (red), which makes log parsing and polling difficult. Generally, if we see an error we trigger a restart, while a warn is just logged.
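
A minimal sketch of such a rule, with the three loggers listed above treated as warnings, might look like the following. The line format is inferred only from the sample lines in this post, and `classify`, `LINE_RE` and `BENIGN_ERROR_LOGGERS` are hypothetical names for illustration, not part of Enrich or the Kinesis Client Library.

```python
import re

# Assumed line shape, inferred from the sample lines in this post:
#   [thread-name] LEVEL fully.qualified.Logger - message
LINE_RE = re.compile(
    r"^\[(?P<thread>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+-\s*(?P<message>.*)$"
)

# Loggers whose ERROR lines are treated as warnings (assumption: taken from
# the list in this post). This is broader than the single Scheduler message
# quoted there, since it downgrades every ERROR from these loggers.
BENIGN_ERROR_LOGGERS = {
    "software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher",
    "com.snowplowanalytics.snowplow.enrich.kinesis.Sink",
    "software.amazon.kinesis.coordinator.Scheduler",
}


def classify(line):
    """Return "restart", "log" or "ignore" for a single log line."""
    match = LINE_RE.match(line)
    if match is None:
        # Lines that do not fit the assumed shape (e.g. stack traces) are skipped.
        return "ignore"
    level = match.group("level")
    logger = match.group("logger")
    if level == "ERROR":
        # Only an ERROR from a logger outside the benign set triggers a restart.
        return "log" if logger in BENIGN_ERROR_LOGGERS else "restart"
    if level == "WARN":
        return "log"
    return "ignore"
```

With this, the WARN/ERROR pair from the example above would both be classified as "log", while an ERROR from any other logger would still be classified as "restart".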

Thanks
Kyle

Hi @kfitzpatrick, thanks for sharing details of your setup! I had not heard of anyone using log message levels to trigger a restart of Snowplow apps, so it’s always interesting to hear about different production setups.

Regarding the error message in com.snowplowanalytics.snowplow.enrich.kinesis.Sink (this one), I think you are correct that this could be a warning. Enrich is able to tolerate and recover from a failure to publish messages to Kinesis, so an informative warning is more appropriate.

Regarding the other error messages you mentioned… those come from a third-party library that we don’t control, e.g. here and here. I don’t think there’s anything that we (Snowplow) can do to stop the Kinesis consumer from logging those errors.

The more I think about it… maybe you should consider disabling your log parsing. Enrich-kinesis is actually pretty good at shutting itself down and exiting cleanly whenever a critical error occurs. You just need to make sure your pod restarts whenever the app terminates. I’m not aware of any critical error that would require external intervention to trigger a restart.

I would love to hear your thoughts on this.
