Recently our Enrich instances, running on EC2 with Kinesis input and output, stalled and kept throwing this one exception over and over:
[prefetch-cache-shardId-000000000113-0000] ERROR software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher - tracking.snowplow.raw:shardId-000000000113 : Exception thrown while fetching records from Kinesis software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate. Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate. Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout. [...]
The weird thing is that we didn’t see a surge in traffic beforehand; it happened out of the blue on a day with normal traffic. The error message is very detailed and quite helpful, but judging by my web searches this seems to be a rare issue to run into.
I increased the timeouts for checkpointing to DynamoDB and for writing to the good and bad Kinesis output streams. I also reduced the maximum batch size from the default of 10,000 events to 5,000, just to be sure. After applying the changes, the error was gone and Enrich resumed consuming from Kinesis. So either the new settings fixed it, or the HTTP client’s connection pool was in a bad state and creating a new instance with a fresh pool fixed it.
Usually, Enrich prints errors when writing to Kinesis fails. Since I don’t see those in the logs, my guess is that Enrich couldn’t even consume unprocessed events from the raw Kinesis stream and that checkpointing to DynamoDB failed as well, which left Enrich in this idle state.
Any suggestions on how I could confirm that this is the case? Or is this one of those situations where many smaller issues added up to a bigger failure, making it hard to pinpoint a specific cause?
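One thing I could try on my side is counting how many of these acquire timeouts each shard logged, to see whether every shard was affected or only a few. Below is a minimal sketch of that idea. The regular expression is an assumption based on the single log line above, and the function name `count_acquire_timeouts` is just something I made up for illustration, so it may need adjusting for other log layouts.

```python
import re
from collections import Counter

# Pattern derived from the one KCL log line quoted above (an assumption:
# other Enrich/KCL versions may format this line differently).
SHARD_ERROR = re.compile(
    r"ERROR software\.amazon\.kinesis\.retrieval\.polling\.PrefetchRecordsPublisher - "
    r"(?P<stream>\S+):(?P<shard>shardId-\d+) : Exception thrown while fetching records"
)
ACQUIRE_TIMEOUT = "Acquire operation took longer than the configured maximum time"


def count_acquire_timeouts(lines):
    """Return a Counter mapping shard ID -> number of acquire-timeout errors."""
    counts = Counter()
    for line in lines:
        match = SHARD_ERROR.search(line)
        # Only count fetch errors that were caused by the pool acquire timeout.
        if match and ACQUIRE_TIMEOUT in line:
            counts[match.group("shard")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_timeouts.py < enrich.log
    for shard, n in sorted(count_acquire_timeouts(sys.stdin).items()):
        print(shard, n)
```

If the counts are spread across all shards the instance owned, that would at least be consistent with the whole connection pool being exhausted rather than one shard misbehaving, though it would not prove it.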