Scala Stream Collector - scaling

kjcsb · August 17, 2016, 10:00am

Just getting started with Snowplow. Congratulations, it looks amazing.

What are the recommendations for running the Scala Stream Collector at scale? I see the Clojure Collector has a recipe for Elastic Beanstalk. Does the same approach apply to the Scala Collector?

alex · August 17, 2016, 10:24am

Hi @kjcsb - it’s a good question.

With the AWS real-time pipeline, you have lots of workers running all the time - not just the collectors but also Stream Enrich, ES Sink, S3 Sink etc. We have found that Elastic Load Balancers plus Auto-Scaling Groups have been a good fit for these - using these directly has all the upside of Elastic Beanstalk but with less magic to go wrong.

Shin · August 17, 2016, 10:50am

I’ve been meaning to ask the same question for a while but in terms of Enrich and the other Kinesis apps.

Autoscaling the collectors makes sense to me because they all write to the same stream (so I just need to make sure there’s enough capacity). But how does running multiple workers of Enrich work?
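As a rough way to reason about "enough capacity", here is a minimal sketch that estimates a shard count from expected collector traffic. It assumes the published per-shard write limits for a provisioned Kinesis stream (about 1 MB/s and 1,000 records/s). The function name `shards_needed` and the `headroom` parameter are hypothetical, made up for illustration, and not part of Snowplow or the AWS SDK.

```python
import math

# Per-shard write limits for a provisioned Kinesis stream (assumed here;
# check the current AWS quotas before relying on these numbers).
SHARD_WRITE_BYTES_PER_SEC = 1_000_000
SHARD_WRITE_RECORDS_PER_SEC = 1_000


def shards_needed(events_per_sec, avg_event_bytes, headroom=0.25):
    """Estimate how many shards the collectors' output stream needs.

    headroom is a hypothetical safety margin (0.25 = 25% spare capacity).
    """
    # A shard is throttled by whichever limit is reached first.
    by_records = events_per_sec / SHARD_WRITE_RECORDS_PER_SEC
    by_bytes = events_per_sec * avg_event_bytes / SHARD_WRITE_BYTES_PER_SEC
    raw = max(by_records, by_bytes) * (1 + headroom)
    # A stream always has at least one shard.
    return max(1, math.ceil(raw))


if __name__ == "__main__":
    # 5,000 events/s at roughly 1.5 kB each is bound by bytes, not records.
    print(shards_needed(5000, 1500))
```

This ignores traffic spikes and uneven partition keys, so treat the result as a starting point rather than a guarantee.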

Is it as simple as making sure I shard the Kinesis streams, run multiple workers and let the KCL library do the rest?

alex · August 17, 2016, 11:43am

Hi @Shin,

Pretty much this, though read “servers” where you wrote “workers.” You run one KCL instance (i.e. one Stream Enrich or similar) per server, and each KCL instance may run more than one worker. You can have more workers than shards, but at most one worker processes a given shard at any time, so the extra workers sit idle.
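To make the worker/shard rule concrete, here is a toy sketch of the invariant only. It is not the KCL's actual lease algorithm (the KCL coordinates leases through a shared lease table), and `assign_shards` is a hypothetical helper using simple round-robin. It shows that every shard ends up with exactly one worker and that surplus workers get nothing to do.

```python
def assign_shards(shards, workers):
    """Toy round-robin assignment: each shard is owned by exactly one worker.

    This only illustrates the invariant; it is not how the KCL balances leases.
    """
    assignment = {worker: [] for worker in workers}
    if not workers:
        return assignment
    for index, shard in enumerate(shards):
        owner = workers[index % len(workers)]
        assignment[owner].append(shard)
    return assignment


if __name__ == "__main__":
    # More workers than shards: one worker ends up with an empty list.
    result = assign_shards(["shard-0", "shard-1"], ["w-a", "w-b", "w-c"])
    for worker, owned in result.items():
        print(worker, owned)
```

With two shards and three workers, adding a fourth worker would not increase throughput. To scale further you would need to split the stream into more shards.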

The whole thing is a bit more complicated than it should be - we have developed an in-house scaling and monitoring platform for real-time called Tupilak, which we hope to open-source later this year. We’ll do a preview post on this new tech (it’s pretty exciting) in a month or so…

kjcsb · August 18, 2016, 6:52pm

Thanks, that clarifies it.

spatialy · January 25, 2017, 7:51pm

Hi Alex

Any updates on the Tupilak release?

We are running tests with Snowplow, and we are sure that in production we will need a similar solution to manage scaling.

Best

alex · January 25, 2017, 10:37pm

Hi @spatialy,

We’ve been using Tupilak in production with our Managed Service RT customers since last year - it’s been working well. You can find out more about Tupilak here:

Tupilak is one of the core components of the Managed Service RT so it’s unclear to us at this point if/when we’ll open-source it.

spatialy · January 25, 2017, 10:38pm

Hi @alex

Thanks for the info
