Enrich wouldn't read from Kinesis due to connection pool error and acquire timeout

Recently, our Enrich instances running on EC2 with Kinesis input and output stalled and were throwing just this one exception over and over:

[prefetch-cache-shardId-000000000113-0000] ERROR software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher - tracking.snowplow.raw:shardId-000000000113 :  Exception thrown while fetching records from Kinesis

software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.

Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.

[...]

The weird thing is that we didn’t see a surge in traffic beforehand; it happened out of the blue on a day with normal traffic. The error message is detailed and quite helpful, but based on my web searches it seems to be a very rare issue to run into.

I increased the timeouts for checkpointing to DynamoDB and for writing to the good and bad Kinesis output streams. I also reduced the maximum batch size from the default 10,000 to 5,000 events, just to be sure. After applying the changes the error was gone and Enrich continued to consume from Kinesis, so either the new settings fixed it, or the HTTP client's connection pool was in a bad state and redeploying with a fresh pool fixed it.
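For reference, the kinds of settings I changed look roughly like this in the enrich-kinesis HOCON config. The key names are from my reading of the configuration reference and may differ between Enrich versions, and the values below are placeholders rather than my exact numbers, so please treat this as a sketch:

    "input": {
      "type": "Kinesis"
      # Polling retrieval: maxRecords caps the GetRecords batch size
      # (default 10000; halved to 5000 here)
      "retrievalMode": {
        "type": "Polling"
        "maxRecords": 5000
      }
      # Retry/backoff for checkpointing to DynamoDB (illustrative values)
      "checkpointBackoff": {
        "minBackoff": "100 milliseconds"
        "maxBackoff": "10 seconds"
        "maxRetries": 10
      }
    }
    "output": {
      "good": {
        "type": "Kinesis"
        # Retry/backoff for writes to the enriched stream (illustrative values)
        "backoffPolicy": {
          "minBackoff": "100 milliseconds"
          "maxBackoff": "10 seconds"
          "maxRetries": 10
        }
      }
      "bad": {
        "type": "Kinesis"
        # Same shape for the bad stream
        "backoffPolicy": {
          "minBackoff": "100 milliseconds"
          "maxBackoff": "10 seconds"
          "maxRetries": 10
        }
      }
    }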

Usually, Enrich will print errors when writing to Kinesis fails. Since I don’t see those in the logs, I’m thinking Enrich couldn’t even consume unprocessed events from the raw Kinesis stream and the checkpointing to DynamoDB failed, which left Enrich in this idle state.

Any suggestions on how I could confirm this is the case? Or is this one of those situations where many smaller issues led to a bigger failure, making it hard to pinpoint a specific cause?

Hi @lenn4rd ,

Thank you for providing all the details.

It’s weird that all your instances stalled at the same time with the same error message; that sounds like a network glitch. Was this happening again and again until you changed the configuration?

Have you tried to put the settings back as before to see if the problem comes back?

Yes, that must be what happened, as the error states: Exception thrown while fetching records from Kinesis.

Indeed if we can’t reproduce the error it’s hard to understand exactly what happened.

Hey @BenB, apologies for the late response. I didn’t undo the configuration changes because I got pulled into other topics.

This weekend the issue happened again, with the increased timeouts active. We had one Enrich instance running and it wasn’t processing anything, just idling and throwing the above error over and over again.

I started another Enrich instance and it took over the leases and began processing events. I left the old instance running so I can inspect it, and it hasn’t self-healed.

The instance was up for 4 days before it stalled. Looking at the auto scaling activity, instances are typically refreshed every 5-7 days: we deploy once a week with a custom base image, so instances usually don’t live longer than 7 days.

This smells like a resource leak to me, but I didn’t see anything obvious on the instance: memory usage, file descriptors, and disk space are all fine. It looks like the Netty HTTP connection pool is sized at double the number of CPU cores, so if there is indeed some sort of leak we would hit a limit in the application code rather than on the instance.
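In case it's useful, this is roughly the kind of check I have in mind for the stalled instance. The commands are standard Linux tooling; the pgrep pattern and the "2 x vCPUs" pool ceiling are assumptions on my part rather than something I've confirmed in the code, so adjust as needed:

    # Find the Enrich JVM (assumes it's the only java process on the box)
    ENRICH_PID=$(pgrep -f java | head -n 1)

    # File descriptors in use vs. the per-process limit
    sudo ls /proc/$ENRICH_PID/fd | wc -l
    cat /proc/$ENRICH_PID/limits | grep "open files"

    # Established TLS connections held by the process (Kinesis/DynamoDB use port 443)
    sudo ss -tnp state established '( dport = :443 )' | grep -c "pid=$ENRICH_PID"

    # Expected ceiling if the pool really is 2 x vCPUs
    echo $(( $(nproc) * 2 ))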

The instance is still running if you’d like me to run some checks on it. Also I’m happy to move this to a GitHub issue if it makes things easier.