Is it possible to scale Enricher vertically?
We are using EKS pods and streams are handled through Kinesis.
The CPU utilization of the collector and loader seems fine, but the Enricher is creating a bottleneck.
I have a similar issue: the enricher cannot keep up with the rate at which events arrive and seems to be the bottleneck in my setup. Both the enricher and collector run on ECS, and we use Kinesis streams with on-demand capacity between the collector and enricher.
The maxRecords value in the enricher config is already set to 10,000, which I believe is the maximum number of records a single read from a Kinesis stream can return anyway.
Hi @Alexandre5602, both the Collector and Enrich can scale horizontally, so you should be able to add auto-scaling rules based on CPU for both of them to add extra pods as demand increases. Scaling up at 60-70% CPU across the group is generally a good rule of thumb, and scaling down when you drop below 20% CPU should work well.
The only caveat here is that for Enrich you shouldn’t have more pods than you have shards in the Kinesis stream. Even though you are using on-demand mode, under the hood a certain number of shards is still allocated and needs to be distributed across the consumers. If you have more pods than shards, the extra pods will not be able to grab any shards to process, so their CPU will be artificially low and will skew the auto-scaling logic.
This caveat is true for any Kinesis consumer application.
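To make the rule of thumb above concrete, here is a minimal sketch of a threshold-based scaling decision that is capped at the shard count. The function name, the 65% scale-up default (picked from the middle of the 60-70% range), and the one-task-per-step behaviour are assumptions for illustration only; this is not how ECS, EKS, or Enrich implement scaling.

```python
def desired_task_count(current_tasks, avg_cpu_percent, shard_count,
                       scale_up_at=65.0, scale_down_at=20.0, min_tasks=1):
    """Return the task count a simple threshold policy would choose.

    Hypothetical helper: thresholds follow the rule of thumb in the
    thread (scale up around 60-70% CPU, scale down below 20%).
    """
    if avg_cpu_percent >= scale_up_at:
        target = current_tasks + 1   # group is busy: add one consumer
    elif avg_cpu_percent < scale_down_at:
        target = current_tasks - 1   # group is idle: remove one consumer
    else:
        target = current_tasks       # inside the comfortable band

    # Cap at the shard count: a consumer without a shard does no work
    # and would only drag the group's average CPU down.
    return max(min_tasks, min(target, shard_count))
```

With 3 shards, a busy group of 3 tasks stays at 3 rather than growing to 4, which is the cap described above. In a real setup the same cap would normally be expressed as the maximum capacity of the auto-scaling policy.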
Thanks a lot for your response. I had thought about scaling the ECS task vertically but not horizontally; it does make sense to run several tasks in parallel for more read throughput.
The only issue with what you mentioned above is with the auto-scaling, right? Let’s say I fix the Kinesis stream at 3 shards while the number of enricher tasks is determined by CPU. I might sometimes end up with only 1 ECS task reading 3 shards (when traffic is low) and sometimes 3 ECS tasks reading the 3 shards (which would be optimal, of course).
Hi @Alexandre5602, this is the whole point of auto-scaling! During peaks you have more consumers, and during lulls in traffic it reduces the number of tasks. This is how we configure scaling to work internally for our customers and it works quite well.
I would start by implementing the auto-scaling policies and experimenting with the CPU thresholds that trigger scaling until you achieve the stability and throughput you need.
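As a rough illustration of the 3-shard scenario discussed in this thread, the sketch below spreads shards round-robin across a given number of tasks. The function name and the round-robin strategy are assumptions chosen for simplicity; a real Kinesis consumer balances shard leases with its own logic, so treat this only as a picture of the outcome, not of the mechanism.

```python
def distribute_shards(shard_ids, task_count):
    """Assign shards to tasks round-robin; returns one list per task.

    Hypothetical helper for illustration. task_count must be >= 1.
    """
    assignments = [[] for _ in range(task_count)]
    for i, shard in enumerate(shard_ids):
        assignments[i % task_count].append(shard)
    return assignments


shards = ["shard-0", "shard-1", "shard-2"]

low_traffic = distribute_shards(shards, 1)     # one task reads all 3 shards
peak_traffic = distribute_shards(shards, 3)    # one shard per task
over_scaled = distribute_shards(shards, 4)     # the fourth task gets nothing
```

The over-scaled case shows why the task count should not exceed the shard count: the last task ends up with an empty assignment and sits idle.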