Snowbridge - Kinesis - Failed to pull next Kinesis record from Kinsumer client: connection reset by peer

Hi,

We are using Snowbridge and we are experiencing the following error:

Failed to pull next Kinesis record from Kinsumer client: shard error (shardId-000000000324) in getRecords: RequestError: send request failed\ncaused by: Post \"https://kinesis.eu-west-1.amazonaws.com/\": read tcp 10.93.97.96:49068->99.80.34.228:443: read: connection reset by peer

I found a related issue raised here by @Colm. Is there a recommended fix to prevent this from occurring?

Hey @Rob_Ellison ,

That particular problem was mitigated by increasing the number of ports available on the box that Snowbridge is running on.
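
(For anyone hitting that variant: this is a rough sketch, assuming a Linux host, of how you might check and widen the ephemeral port range and see how many outbound connections are in flight. The values are illustrative, not a recommendation.)

```bash
# Check the current ephemeral port range available for outbound connections
sysctl net.ipv4.ip_local_port_range

# Widen it (illustrative values; persist via /etc/sysctl.d/ if it helps)
sudo sysctl -w net.ipv4.ip_local_port_range="15000 65000"

# Rough count of connections currently held open to HTTPS endpoints (Kinesis/DynamoDB)
ss -tan | grep ':443' | wc -l
```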

But I’m not sure this 100% looks like the same thing - the error you posted is formatted differently from what I’ve seen in that scenario (“too many open files” is typical of that issue), and it looks like the kind of error I’d read as an indication that something went wrong at the network level. (Although neither of these is evidence against it being the same thing.)

That might just be a transient issue - eg, the network connection to kinesis is slow for a period, this error gets thrown once or twice, then it recovers.

Or it might be something to do with the network around the deployment - ie. if something changed or went wrong in VPCs/gateways/nginx or any other network-y bits relevant to it.

Sorry I can’t be more specific, but hopefully that at least helps to debug & find root cause.

Some questions that might spark ideas:

Are you consistently getting this error?
Are any pods/instances successfully processing data?
Are you seeing any other errors?

Hi @Colm ,

Thanks for your quick response!

Are you consistently getting this error?

Yes, this is happening much more consistently in environments under constant load, and it’s even happening in some under very little load.

Are any pods/instances successfully processing data?

Yes, nearly all data is being processed successfully. Very rarely, the container gets this response, triggering a container restart.

Are you seeing any other errors?

I see some errors relating to the DynamoDB endpoint, but most relate to Kinesis. All are connection reset by peer.

Pods all look well within healthy ranges. We also have some istio-proxy sidecars running alongside, which also look healthy. I get the feeling this isn’t related to Snowbridge.

Interesting.

So both contacting Kinesis and contacting DDB happen via kinsumer - which is where that issue about not closing connections comes from. If you’re seeing similar errors related to both, then to me that indicates that it might be something network-y, and quite possibly specific to the part of the network that communicates with other AWS services (unless you get similar target errors too).

It’s quite possible that it is a different symptom of a similar issue, and is something to do with the number of available ports.

There are some relevant options in the kinesis source.

Leader action frequency determines how often we check both DDB and kinesis to identify if the shard count we’re working from is up to date.

Shard check frequency (counterintuitively) determines how often each kinsumer client checks DDB for the shard count and number of clients.

They’re both about detecting when the app needs to re-balance which client owns which shard. Extending them may lead to increases in latency when either the app or the source stream scales - but this would only have a meaningful impact if your use case is sensitive to very low latency.

So I think if it were me I’d want to look for evidence of what the root issue might be at the network level, and also experiment to see if amending those values mitigates the problem.
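
As a starting point for experimenting, here’s a minimal sketch of what that looks like in the Snowbridge HCL config. The option names and values are from memory and may differ between versions, so please check them against the kinesis source docs for your release:

```hcl
source {
  use "kinesis" {
    # Required basics - replace with your own stream/app details
    stream_name = "enriched-good"
    region      = "eu-west-1"
    app_name    = "snowbridge-kinesis"

    # How often the leader checks Kinesis + DynamoDB for shard count changes
    # (illustrative value - extend to reduce polling frequency)
    leader_action_freq_seconds = 300

    # How often each kinsumer client checks DynamoDB for shard/client counts
    # (illustrative value)
    shard_check_freq_seconds = 60
  }
}
```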

Hope that’s helpful! :slight_smile:

Hi, just an update on this topic for those who may encounter similar issues. The issue was due to how Istio handles traffic going outside of the k8s cluster. I believe the cause may be related to the following thread.

The Kinesis and DynamoDB requests were going through Istio’s PassthroughCluster, which is a general cluster added to sidecars for traffic considered “not inside the mesh”; the settings on it are not optimized, since it does not know what kind of traffic to expect.

Adding ServiceEntries for these domains significantly improved things, although the issues are not completely resolved.
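
For anyone in a similar setup, this is roughly the shape of the ServiceEntry involved (a sketch only - the hostnames match the eu-west-1 endpoints in the error above, and the resource name is a placeholder). As I understand it, with entries like this in place Istio gives the traffic its own cluster instead of routing it through the PassthroughCluster:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: aws-kinesis-dynamodb
spec:
  hosts:
    - kinesis.eu-west-1.amazonaws.com
    - dynamodb.eu-west-1.amazonaws.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 443
      name: tls
      protocol: TLS
```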
