Just getting started with Snowplow - congratulations, it looks amazing.
What are the recommendations for running the Scala Stream Collector at scale? I see the Clojure Collector has a recipe for Elastic Beanstalk. Does the same approach apply to the Scala Collector?
With the AWS real-time pipeline you have lots of workers running all the time - not just the collectors but also Stream Enrich, the ES Sink, the S3 Sink and so on. We have found that Elastic Load Balancers plus Auto Scaling groups are a good fit for these - used directly, they give you all the upside of Elastic Beanstalk with less magic to go wrong.
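For reference, here is a minimal sketch of that setup using the AWS SDK for Java from Scala. Everything named here is a hypothetical placeholder (the AMI ID, `sp-collector-lc`, `sp-collector-asg`, `sp-collector-elb`), and it assumes you already have an AMI with the Scala Stream Collector baked in and an ELB created:

```scala
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder
import com.amazonaws.services.autoscaling.model.{CreateAutoScalingGroupRequest, CreateLaunchConfigurationRequest}

object CollectorAsgSketch {
  def main(args: Array[String]): Unit = {
    val autoScaling = AmazonAutoScalingClientBuilder.defaultClient()

    // Launch configuration: an AMI with the Scala Stream Collector baked in
    autoScaling.createLaunchConfiguration(
      new CreateLaunchConfigurationRequest()
        .withLaunchConfigurationName("sp-collector-lc")
        .withImageId("ami-12345678") // hypothetical collector AMI
        .withInstanceType("t2.medium")
    )

    // Auto Scaling group registered behind an existing ELB; the ELB health
    // check means unhealthy collectors get replaced automatically
    autoScaling.createAutoScalingGroup(
      new CreateAutoScalingGroupRequest()
        .withAutoScalingGroupName("sp-collector-asg")
        .withLaunchConfigurationName("sp-collector-lc")
        .withLoadBalancerNames("sp-collector-elb")
        .withAvailabilityZones("us-east-1a", "us-east-1b")
        .withMinSize(2)
        .withMaxSize(8)
        .withHealthCheckType("ELB")
        .withHealthCheckGracePeriod(300)
    )
  }
}
```

From there a scaling policy driven by a CloudWatch alarm (on CPU, or on request count at the ELB) takes care of scaling out and in.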
I’ve been meaning to ask the same question for a while, but in terms of Enrich and the other Kinesis apps.
Autoscaling the collectors makes sense to me because they all write to the same stream (so I just need to make sure there’s enough capacity). But how does running multiple Enrich workers work?
Is it as simple as making sure I shard the Kinesis streams, run multiple workers and let the KCL do the rest?
Pretty much this - though substitute “servers” for “workers.” You have one KCL instance (i.e. one Stream Enrich or similar) per server, but you may have more than one worker inside each KCL instance. You can have more workers than shards, but only one worker can process a given shard at a time.
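To make the lease mechanics concrete, here is a minimal sketch of a KCL-based consumer in Scala, in the spirit of Stream Enrich (this is illustrative, not Stream Enrich’s actual code; the app name `my-enrich-app` and stream name `raw-events` are placeholders). Each server runs one `Worker`; all workers coordinate through a shared DynamoDB lease table keyed on the application name, and each KCL instance spins up one record processor per shard lease it holds:

```scala
import java.util.UUID

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.{IRecordProcessor, IRecordProcessorFactory}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.{InitialPositionInStream, KinesisClientLibConfiguration, Worker}
import com.amazonaws.services.kinesis.clientlibrary.types.{InitializationInput, ProcessRecordsInput, ShutdownInput}

import scala.collection.JavaConverters._

object EnrichWorkerSketch {

  // One record processor is created per shard lease held by this KCL instance
  class ShardProcessor extends IRecordProcessor {
    def initialize(input: InitializationInput): Unit =
      println(s"Leased shard ${input.getShardId}")

    def processRecords(input: ProcessRecordsInput): Unit = {
      // Placeholder for the actual enrichment step
      input.getRecords.asScala.foreach(r => println(s"Got ${r.getData.remaining} bytes"))
      input.getCheckpointer.checkpoint()
    }

    def shutdown(input: ShutdownInput): Unit = ()
  }

  def main(args: Array[String]): Unit = {
    val config = new KinesisClientLibConfiguration(
      "my-enrich-app",                        // app name = shared DynamoDB lease table
      "raw-events",                           // source Kinesis stream
      new DefaultAWSCredentialsProviderChain(),
      UUID.randomUUID().toString              // unique worker id per server
    ).withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON)

    val factory: IRecordProcessorFactory = new IRecordProcessorFactory {
      def createProcessor(): IRecordProcessor = new ShardProcessor
    }

    // One Worker per server; the KCL balances shard leases across every
    // worker registered in the lease table
    new Worker.Builder().recordProcessorFactory(factory).config(config).build().run()
  }
}
```

If you need more parallelism than your current shard count allows, reshard the stream first (e.g. with Kinesis’s UpdateShardCount API, or by splitting shards) - workers beyond the shard count will simply sit idle until a lease frees up.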
The whole thing is a bit more complicated than it should be - we have developed an in-house scaling and monitoring platform for the real-time pipeline, called Tupilak, which we hope to open-source later this year. We’ll do a preview post on this new tech (it’s pretty exciting) in a month or so…
We’ve been using Tupilak in production with our Managed Service RT customers since last year - it’s been working well. You can find out more about Tupilak here:
Tupilak is one of the core components of the Managed Service RT so it’s unclear to us at this point if/when we’ll open-source it.