Bottleneck while testing with Avalanche

Hello everyone,

I am trying to run a load simulation with Avalanche to test the performance of my collector.
I am using a Kafka stream collector hosted on a local instance (single-core CPU). Kafka is running on another instance and my Avalanche VM on a third. All instances are on the same local network.

I have already run many simulations with Avalanche, using 1000 baseline users and 5000 peak users.
Kafka is receiving messages without any problem and is not using a large amount of resources.

My problem is that some of the Avalanche requests at peak time result in a timeout caused by the collector (the timeout is 60 seconds), yet the collector never uses more than 35% of the machine’s CPU!

I also ran simulations with different collector configurations inside Akka
(dispatchers, executors, throughput). Some of the metrics changed, such as req/sec, response time and total requests, but every time I face the same problem: timeouts with about 30% CPU usage by the collector.

This is an example of the avalanche results with the default akka configuration and a throughput of 5.

Am I missing something in the collector configuration? I can't figure out whether there is a bottleneck somewhere that keeps the collector stuck at 35% CPU.

Thank you in advance.

Hi @bambachas79,

There’s a good degree of guesswork in this answer so take it with a pinch of salt - normally a collector which expects to handle a production level of traffic would consist of multiple distinct collector instances which sit behind a load balancer. We normally provision enough availability such that there’s enough network capacity to handle all of the traffic even if an entire instance, or an entire region goes down.

The collector doesn’t do much work - it accepts what it receives and forwards that on. So I probably wouldn’t expect CPU usage to be the weak link in the chain, the hard work it handles is mostly on the network side of things. The aim of the above configuration is to provide high availability for incoming requests.

Since you’ve mentioned the collector you set up is on a local instance, my ‘stab in the dark’ guess as to why you’re seeing these results is just that you’re flooding ports with more traffic than the ports on your local instance can deal with. Like I said it’s a bit of a guess, but 700-1000 requests per second is certainly more than I would expect my laptop, for example, to be able to deal with on its own, so it seems to make sense in my head.
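As a rough sanity check on those numbers (a sketch, not a measurement), Little's law relates concurrent users, throughput and mean response time. The snippet below assumes each simulated user waits for its response before sending the next request, with no think time in between; I don't know how Avalanche actually paces its users, so treat that as an assumption. The function name `mean_response_time` is just illustrative.

```python
def mean_response_time(concurrent_users, throughput_per_sec):
    """Little's law for a closed loop with no think time: R = N / X.

    ASSUMPTION: every user always has exactly one request in flight.
    """
    return concurrent_users / throughput_per_sec


# 5000 peak users (from the original post) at 700-1000 requests per second.
for rps in (700, 1000):
    seconds = mean_response_time(5000, rps)
    print(f"{rps} req/s -> mean response time of about {seconds:.1f} s")
```

Under those assumptions the mean response time would be roughly 5 to 7 seconds, well below the 60 second timeout. That would suggest a subset of requests is stalling somewhere rather than every request being uniformly slow, but it is only an estimate.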

I hope that makes sense/helps get your head around what you’re seeing?

Thank you very much for your quick and illustrative response!

I've made some changes to the setup and moved some of the processes to the cloud.
I am running Avalanche on my local computer and I transferred Kafka and the collector to two separate EC2 instances with the same resources. But I am getting the same results: similar numbers with low CPU usage. I am going to check the network load on the ports to see whether traffic is causing the bottleneck.


The ports seem to be OK with the incoming traffic. When I moved the collector to an EC2 instance I ran a test with many more users, but the results seem a little strange to me.

During the baseline period, I can see these failure peaks every minute (the timeout period), but I can't understand why this is happening. Also, I am setting baseline_users to 10,000, but does that mean there are 10,000 open connections? Or are connections constantly opened and closed after every request?

Here you can see the response time diagram, which seems quite strange to me too.
Thank you!

@bambachas79 are you using a load balancer?

No, I am not… This is a single collector instance.

Ah - hold on, I think I may have gone the wrong direction here.

If your Avalanche instance is on a local machine, then the bottleneck may be there. Generally these load testing tools, if sending a lot of requests, would be configured with concurrency across multiple instances. For example, I’ve used Locust in the past, and achieving ~1000-1200 requests per second required 3 t3.medium AWS EC2 instances.

Like I said before, this isn’t something that I’m an expert in, so I’m not 100% certain but I think this might be a good explanation for what you’re seeing.

One thing that might be worth looking into in terms of finding evidence for this is to look at the collector logs, and see if the responses give you a hint.

If that is the case then apologies for leading you down a blind alley!

For clarity, by this I mean that the bottleneck might be on how many network requests your machine can handle at a time.
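One way to look for evidence of this is to count TCP sockets by state on the machine running the load test (or on the collector) while the test is running. Below is a minimal sketch, assuming a Linux machine where `/proc/net/tcp` is available and assuming the collector listens on port 8080 (adjust to whatever port yours uses). The helper name `count_states` is mine, not part of any tool. Note that `/proc/net/tcp` only lists IPv4 sockets; IPv6 ones are in `/proc/net/tcp6`.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp (Linux).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, port):
    """Count sockets per TCP state whose local or remote port equals `port`."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # Addresses look like "0100007F:1F90" (hex IP, then hex port).
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        if port in (local_port, remote_port):
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


# Usage on a Linux machine (port 8080 is an ASSUMPTION, use your own):
#   with open("/proc/net/tcp") as f:
#       print(count_states(f.read(), port=8080))
```

If the ESTABLISHED count sits far below the number of simulated users, or TIME_WAIT keeps piling up on the load-testing machine, that would point towards the client side rather than the collector. It would also show whether connections are being kept open or reopened for every request.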

Thank you very much @Colm, I will check it!