We are using Snowbridge and we are experiencing the following error:
Failed to pull next Kinesis record from Kinsumer client: shard error (shardId-000000000324) in getRecords: RequestError: send request failed\ncaused by: Post \"https://kinesis.eu-west-1.amazonaws.com/\": read tcp 10.93.97.96:49068->99.80.34.228:443: read: connection reset by peer" error="Failed to pull next Kinesis record from Kinsumer client: shard error (shardId-000000000324) in getRecords: RequestError: send request failed\ncaused by: Post \"https://kinesis.eu-west-1.amazonaws.com/\": read tcp 10.93.97.96:49068->99.80.34.228:443: read:
I found a related issue raised here by @Colm. Is there a recommended fix to prevent this from occurring?
That particular problem was mitigated by increasing the number of ports available on the box that Snowbridge is running on.
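For reference, a minimal sketch of that mitigation, assuming a Linux host where Snowbridge runs directly on the box (if it runs in a container, the limits may need to be raised at the container/pod level instead):

```sh
# Check the current ephemeral port range and open-file limit
sysctl net.ipv4.ip_local_port_range
ulimit -n

# Widen the ephemeral port range (persist via a file in /etc/sysctl.d/ if needed)
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
```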
But I’m not sure this looks 100% like the same thing - the error you posted is formatted differently from what I’ve seen in that scenario (“too many open files” is typical of that issue), and it reads to me like an indication that something went wrong at the network level. (Although neither of these is evidence against it being the same thing.)
That might just be a transient issue - eg, the network connection to kinesis is slow for a period, this error gets thrown once or twice, then it recovers.
Or it might be something to do with the network around the deployment - ie. if something changed or went wrong in VPCs/gateways/nginx or any other network-y bits relevant to it.
Sorry I can’t be more specific, but hopefully that at least helps to debug & find root cause.
Some questions that might spark ideas:
Are you consistently getting this error?
Are any pods/instances successfully processing data?
Are you seeing any other errors?
Yes, this is happening much more consistently in environments under constant load, but it is even happening in some under very little load.
Are any pods/instances successfully processing data?
Yes, nearly all data is being processed successfully. Very rarely, the container gets this response, which triggers a container restart.
Are you seeing any other errors?
I see some errors relating to the DynamoDB endpoint, but most relate to the Kinesis endpoint. All of them are connection reset by peer.
The pods all look well within healthy ranges. We also have some istio-proxy sidecars running alongside them that look healthy too. I also get the feeling this isn’t related to Snowbridge.
So both contacting Kinesis and contacting DDB happen via kinsumer - which is where that issue about not closing connections comes from. If you’re seeing similar errors related to both, then to me that indicates it might be something network-y, and quite possibly specific to the part of the network that communicates with other AWS services (unless you get similar target errors too).
It’s quite possible that it is a different symptom of a similar issue, and is something to do with the number of available ports.
Leader action frequency determines how often we check both DDB and kinesis to identify if the shard count we’re working from is up to date.
Shard check frequency (counterintuitively) determines how often each kinsumer client checks DDB for the shard count and number of clients.
They’re both about detecting when the app needs to re-balance which client owns which shard. Extending them may lead to increases in latency when either the app or the source stream scales - but this would only have a meaningful impact if your use case is sensitive to very low latency.
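To make that concrete, here’s a sketch of where those two settings sit in a Snowbridge kinesis source block. The option names and values below are from memory rather than checked against a specific release, so treat them as an assumption and confirm against the configuration docs for your version:

```hcl
source {
  use "kinesis" {
    stream_name = "my-enriched-stream"   # example stream name
    region      = "eu-west-1"
    app_name    = "snowbridge-example"   # example kinsumer app name

    # Both values are in seconds. Raising them means each client polls
    # DynamoDB/Kinesis less often for rebalancing checks, at the cost of
    # reacting more slowly when the app or the source stream scales.
    leader_action_freq_seconds = 600
    shard_check_freq_seconds   = 30
  }
}
```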
So I think if it were me I’d want to look for evidence of what the root issue might be at the network level, and also experiment to see if amending those values mitigates the problem.
Hi, just an update on this topic for anyone who encounters similar issues. The problem was due to how Istio handles traffic going outside of the k8s cluster. I believe the cause may be related to the following thread.
The Kinesis and DynamoDB requests were going through the PassthroughCluster, a general cluster added to sidecars for traffic considered “not inside the mesh”; its settings are not optimized, since Istio does not know what kind of traffic to expect.
Adding ServiceEntry resources for these domains has significantly improved things, although the issue is not completely resolved.
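For anyone in the same situation, this is the shape of ServiceEntry meant here - the resource name is an example, and the hosts should match the regional endpoints your deployment actually calls:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: aws-kinesis-dynamodb   # example name
spec:
  hosts:
    - kinesis.eu-west-1.amazonaws.com
    - dynamodb.eu-west-1.amazonaws.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 443
      name: tls
      protocol: TLS
```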