Handling more events per second

Hey Guys,

In the Snowplow Open Source guide, it is described that the Quick Start handles up to ~100 events per second. If I want to receive more events per second, which services should I scale, and how should I size them for a given amount of events?

And if I send 200 events per second, where will the other 100 events go? To the bad events?

Thanks in advance,
Fernando

Hey @nando_roz, so the Snowplow pipeline is horizontally scalable, without too many complications, up to huge volumes (we have load tested up to tens of thousands of RPS). To increase throughput you scale the components of the pipeline until each one can keep up and latency is maintained.

For example, to receive more events at the Collector you might add 4 collector servers, which will increase the amount of data that can be received → but then you will also need to look at the “raw” stream (at least on AWS) that the Collector pushes to, and increase the number of Kinesis shards allocated to it. Once that increases you might find next that Enrich is struggling to keep up and will also need more servers allocated to it.
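
If you do end up resharding the raw stream by hand, a minimal sketch with boto3 could look something like this - the stream name is just a placeholder, use whatever your stack actually calls the raw stream:

```python
import boto3

# Placeholder stream name for illustration - substitute your raw stream's name.
RAW_STREAM = "snowplow-raw"

kinesis = boto3.client("kinesis")

# Check the current shard count before deciding how far to scale.
summary = kinesis.describe_stream_summary(StreamName=RAW_STREAM)
current = summary["StreamDescriptionSummary"]["OpenShardCount"]
print(f"{RAW_STREAM} currently has {current} open shard(s)")

# Double the shard count to roughly double write capacity
# (each shard accepts ~1 MB/s or ~1,000 records/s).
kinesis.update_shard_count(
    StreamName=RAW_STREAM,
    TargetShardCount=current * 2,
    ScalingType="UNIFORM_SCALING",
)
```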

This continues down the chain until everything from Collector → Warehouse has sufficient capacity to keep up with your traffic volume. Auto-Scaling can be used for a lot of this to make changes in traffic easier to handle - so rather than manually scaling the services you would instead trigger scaling based on CPU utilization.
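
One way to wire that up on AWS, assuming the collectors run in an EC2 Auto Scaling group (the group name below is made up), is a target-tracking policy on CPU, e.g.:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical Auto Scaling group name - substitute your collector ASG.
ASG_NAME = "snowplow-collector-asg"

# Target-tracking keeps average CPU around the target by adding/removing
# instances, so traffic changes are absorbed without manual intervention.
autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="collector-cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        # Keep the target conservative so instances scale out well
        # before they are saturated.
        "TargetValue": 50.0,
    },
)
```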

Hi @josh thanks for your reply.

And about the events that couldn’t be sent: in my example, if I send 200 events in a second, will 100 be collected and the other 100 be lost? Or are they kept in some kind of waiting list?

And if I just scale the server horizontally, will we automatically receive more events, or does some configuration need to be done so this new server is taken into account?

I’ll jump in here as I think I can help you understand what you’re dealing with.

And about the events that couldn’t be sent: in my example, if I send 200 events in a second, will 100 be collected and the other 100 be lost? Or are they kept in some kind of waiting list?

100 events per second isn’t much, so I doubt it’d break anything. I don’t think that’s what you mean - I realise that you just picked a number to illustrate the question - just clarifying in case someone in future gets the wrong impression. I’ll describe what happens when you do send enough volume to break a collector (which doesn’t scale up).

If the collector cannot accept an event, then it will either return a non-2XX response, or it won’t respond within the request timeout. In that situation, all of the trackers that we maintain will store the event in a queue and retry it later (unless configured not to do so).

This means that as long as the user returns to the website or app, the data is not lost. If the user never returns, or clears their cache before returning, then the data would be lost.
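
To make that concrete, here’s a rough Python sketch of the “queue on failure, resend later” idea - it is not the trackers’ actual code, and the collector URL is a placeholder:

```python
from collections import deque

import requests

# Placeholder endpoint - the real trackers implement this logic internally.
COLLECTOR_URL = "https://collector.example.com/com.snowplowanalytics.snowplow/tp2"

retry_queue = deque()

def send(event: dict) -> None:
    try:
        resp = requests.post(COLLECTOR_URL, json=event, timeout=5)
        if resp.status_code // 100 != 2:
            # Non-2XX response: keep the event and try again later.
            retry_queue.append(event)
    except requests.RequestException:
        # Timeout or connection failure: same story - queue it for retry.
        retry_queue.append(event)

def flush_retries() -> None:
    # Called later (e.g. on the next page view / app start) to resend
    # anything that previously failed.
    for _ in range(len(retry_queue)):
        send(retry_queue.popleft())
```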

For things like webhooks or server-side tracking, it depends on the behaviour that you’ve set up.

So, if the collector goes down, there is a risk of some data loss, but that risk is somewhat minimised by the trackers’ retry behaviour.

Once the data reaches the collector successfully the risk of data loss is next to nil.

For this reason, we always recommend setting up collectors with multiple instances, and lots of memory & CPU overhead, to deal with spikes in traffic.

And if I just scale the server horizontally, will we automatically receive more events, or does some configuration need to be done so this new server is taken into account?

Most of our apps are compatible with - and built for - horizontal scaling, including the collector. (I think the Databricks loader is the only one that isn’t - but it’s early in its life and we’re scoping it.)

So you won’t need to change anything in the configuration; you just need to worry about configuring the autoscaling correctly.
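
As an illustration, a freshly launched collector instance only needs to pass the load balancer’s health check before it starts receiving traffic - the collector exposes a /health endpoint you can point that check at (the host below is a placeholder):

```python
import requests

# Placeholder host - in practice the load balancer / ASG health check
# hits this endpoint on each new collector instance.
resp = requests.get("http://collector.example.com/health", timeout=2)
print("collector healthy" if resp.ok else "collector not ready")
```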

I hope that’s helpful!

Hi @Colm thanks a lot for your explanation.

The number of 100 is what is described in the documentation, and what the Terraform module was configured for, just to keep infrastructure costs low. That was my doubt. I know each company has its own particular volume of events and that a t3.micro supports 100 events per second. I am checking with the dev team how many requests we get, to estimate the likely amount of events. But, in your experience, is 100 events per second not that much?

So I can choose to keep the t3.micro and auto scale it horizontally, watch whether the infra is constantly scaling up, and if it is, upgrade the t3.micro to a larger instance type - all the while monitoring my environment to see the CPU and memory usage.

Is this a good way?

PS: (if I understood correctly, should I configure auto scaling for the collector, Kinesis, and the enrich server?)

The number of 100 is what is described in the documentation, and what the Terraform module was configured for, just to keep infrastructure costs low.

Ah, I spoke out of turn then! I’m not very familiar with these Terraform modules, so trust those docs more than me! Apologies for the confusion.

So I can choose to keep the t3.micro and auto scale it horizontally, watch whether the infra is constantly scaling up, and if it is, upgrade the t3.micro to a larger instance type - all the while monitoring my environment to see the CPU and memory usage.

Seems like a sound strategy to me - just make sure to set the CPU and Memory thresholds low - it’s best to over-provision collectors, since they’re the riskiest part of the pipeline (due to what I described in the previous answer).
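
If it helps, here’s a small boto3 sketch for pulling the CPU numbers you’d be watching (the group name is made up; note that memory isn’t reported by EC2 out of the box - you’d need the CloudWatch agent for that):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical Auto Scaling group name - use your collector group.
ASG_NAME = "snowplow-collector-asg"

# Average CPU over the last hour in 5-minute buckets.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "%")
```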

PS: (if I understood correctly, should I configure auto scaling for the collector, Kinesis, and the enrich server?)

Yes - Kinesis is slow to scale, and causes knock-on effects through the system. So a bottleneck in enrich can slow things down both downstream and upstream, as Josh describes. (Josh is much more of an expert on this topic than me, I just thought I could help out with additional context for his answer.) :slight_smile:
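
A handy signal for spotting those knock-on effects is the stream’s iterator age - if it keeps growing, whatever consumes that stream (e.g. Enrich) is falling behind. A quick boto3 sketch (the stream name is illustrative):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder stream name - the raw (or enriched) stream you want to watch.
STREAM = "snowplow-raw"

# GetRecords.IteratorAgeMilliseconds shows how far consumers lag behind
# the head of the stream: a steadily growing value means the downstream
# stage can't keep up.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": STREAM}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Maximum"]), "ms behind")
```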

Hi @Colm, perfect. Thanks again for your help.

I’ll try it, and if I run into any problems I’ll come back to you guys.

@nando_roz I’d like to clarify:

So I can choose to keep the t3.micro and auto scale it horizontally, watch whether the infra is constantly scaling up, and if it is, upgrade the t3.micro to a larger instance type - all the while monitoring my environment to see the CPU and memory usage.

My answer above assumed that you meant setting up autoscaling. If by this you mean “I can choose to scale manually instead of setting up autoscaling” - then the answer is that yes you’re free to do whatever you like, but it’s not how we would recommend doing it.

If you’re planning on managing it manually, it’s easy to end up with outages and data loss. So only choose this if you’re OK with accepting that risk. In my opinion it’s worth the effort to just set up autoscaling. :slight_smile: