Stream collector health check fails with a slight increase in Kafka producer latency

Hey there,

We are using stream collector 0.17.0 with Kafka. We are observing that when there is an increase in Kafka producer latency, the health API of the collector starts to fail. The latency of the entire collector, including the health check API, goes up to minutes. This makes the whole infra (all the nodes) go into an unhealthy state, bringing the entire pipeline to a halt.

We are currently handling somewhere around 2k req/s at peak (roughly a third of the day). There are 50 partitions for the collector topic. Everything goes well when the producer latency is at the 2 ms level, but everything falls apart when the latency increases to 4-5 ms. I understand the latency goes 2X, but that doesn’t seem high enough to bring everything to a halt.

I have also noticed that when this unhealthy situation happens, the system.disk.read_time_pct metric goes up 30X. The increase in system.disk.read_time_pct lines up with some activity like a partition reassignment, or the EBS burst balance falling due to a sudden surge of incoming messages on some other topic. While all other systems (producers and consumers) stay unaffected, the Snowplow pipeline comes to a halt.

What could be the reason for this? Do you think we need to make some configuration adjustments to make the above scenario work? We haven’t made any changes to the default configuration.

Thanks for the help

Hi @jimy2004king,

My understanding is that:

  1. There is some Kafka activity like a partition reassignment, as indicated by the system.disk.read_time_pct metric going up 30X on the brokers
  2. As a result, producer latency increases a bit on the collector
  3. The collector starts to misbehave

1 and 2 are normal behavior but 3 is not.

To better understand what’s happening I have a few questions:

a. Where/how do you measure this latency?
b. Does this happen only when there is a burst, or can it happen randomly?
c. How many instances of the collector do you have?
d. How do CPU and memory usage of the collector behave when this happens?

e. So when an event is sent to the collector, this event makes it to Kafka a few minutes later?

f. You mean that the collector can’t receive any events anymore?

We have a Datadog agent running on the Kafka host, and it keeps reporting the producer latency.

This only happens when system.disk.read_time_pct is increased, either on all Kafka volumes or on any one Kafka volume on which the Snowplow topics reside.

It’s on autoscaling, with containers of 200 CPU and 300 memory (in AWS ECS terms). The number of containers goes from 2 to 12, with an average of 6.

The CPU and memory utilization show a 3X drop for that period, and due to the drop in CPU, scale-in kicks in.

Actually, no events come to the collector. As I mentioned, the health check starts failing because of the high latency, so all of a sudden all the targets in the target group are marked unhealthy and all requests start failing with 503.

Yes, due to the above-stated reason.

@BenB Any update here ?

Hi @jimy2004king, sorry, we were not able to investigate the issue yet; we plan to do it at the beginning of next week.

Hi @jimy2004king,

You mean that one container has 200 CPU cores and 300 MB of memory?

Probably due to the fact that the collector is not receiving events any more.

Higher latency observed on the broker is normal if there is a rebalancing in Kafka. The only reason I can think of for the health check not responding is that all the threads are busy waiting to write to Kafka, but this seems unlikely, as we use this function of KafkaProducer to send events to Kafka and the documentation states: “The send is asynchronous and this method will return immediately once the record has been stored in the buffer of records waiting to be sent. This allows sending many records in parallel without blocking to wait for the response after each one.” So an increase in latency should not block threads. Also, if the collector is waiting for Kafka to be available, it should continue to receive events and its memory usage should increase, not drop. It might be that there is an unhandled exception when trying to write to Kafka while it’s rebalancing.
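
For reference, here is a minimal sketch of what an asynchronous send with an error callback looks like with the Kafka client (this is not the collector’s actual code; the topic name, serializers and broker address are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

object AsyncSendSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // placeholder broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

  val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)

  // send() only appends the record to the producer's in-memory buffer and returns
  // immediately; the network write happens on the background sender thread, so
  // broker-side latency should not block the calling thread here.
  producer.send(
    new ProducerRecord[Array[Byte], Array[Byte]]("collector-good", "payload".getBytes("UTF-8")),
    new Callback {
      override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
        // A broker-side failure (e.g. during a reassignment) surfaces here,
        // not as an exception thrown by send() itself.
        if (exception != null) println(s"Send failed: ${exception.getMessage}")
    }
  )

  producer.close()
}
```

If an error delivered to such a callback went unhandled during the rebalancing, that could line up with the hypothesis above.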

Could you paste the collector logs (DEBUG level would be better) from when this is happening, please?

Also, when there is a rebalancing on Kafka, does the collector crash right after, or does it take some time?

By 200 CPU I mean (1 vCPU / 1024) * 200, so it is ~1/5 of a vCPU.

This could be because the ALB is marking the target as unhealthy, due to the health check failure, and not sending any new requests to it.

I doubt this is true, because we have a replication factor of 3, so even if Kafka is rebalancing, no partition would be left without a leader. Also, this scenario also happens when Kafka is rebalancing some other topic and not the collector topic.
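
For what it’s worth, something like this rough AdminClient sketch (the topic name and bootstrap servers are placeholders, not our real setup) should be enough to confirm that every partition of the collector topic keeps a leader while the reassignment is running:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.AdminClient

object LeaderCheckSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // placeholder broker address

  val admin = AdminClient.create(props)
  try {
    // Describe the collector topic (the name here is a placeholder) and check that
    // every partition still reports a leader during the reassignment.
    val description = admin
      .describeTopics(java.util.Collections.singletonList("collector-good"))
      .all()
      .get()
      .get("collector-good")

    description.partitions().asScala.foreach { p =>
      val leader = Option(p.leader()).map(_.idString).getOrElse("NONE")
      println(s"partition ${p.partition()}: leader=$leader, isr=${p.isr().size()}")
    }
  } finally admin.close()
}
```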

I will try to replicate this in a staging environment and provide you the logs.

It crashes right after.