I was wondering whether it was possible to make the scala stream enricher highly-available using an autoscaling group. Can this be done in principle without getting duplicate records?
Essentially yes, but the benefit of scaling the stream enricher is going to depend on the sharding of your stream.
There’s a little bit of detail that Alex has posted before about this here.
You won’t get duplicate records for this reason - only one worker can work on one shard.
Yes, we run the enricher with an autoscaling group, even though we just have one instance (which is idling with 20MM events per day). But even if we had only one shard and more than one enricher: Each of them gets a “lease” on the stream for a given time. Once the time is up, the lease is released and can be grabbed by another worker (or the same one again).
On the other hand: What is the purpose of availability? AWS guarantees that your data stays in there for 24h. In case your ec2 instances crashes or terminated, you are save. Simple bring up a new instance, either manually or using an autoscaling group to resume the stream.
Thanks, @mike!
Which stream should the shards be matched on - the one for the collected events, the enriched ones, or should they both have the same number of shards? Also - would this magically work, i.e. let’s say i have X shards and X instances in the autoscaling group listening and writing to the same two streams - should it just work?
Thanks @christoph-buente,
Our use case is in the realtime space, so CloudWatch alerts and Nagios aside, we should always have something up and running.
@vivricanopy If that is your usecase, then I’d recommend at least 2 instances. The number of shards depends on the throughput. One Snowplow event is roughly bout 2kb in size. It can be bigger, depending on the custom contexts and derived contests. The Limits for Kinesis are as follow:
- Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second.
- Each shard can support up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). This write limit applies to operations such as PutRecord and PutRecords.
So you do the math for whatever capacity you need. Also the number of clients is important. If you were running a kinesis-to-s3 sink to store the raw events additionally to the enricher. You have double the read events on the raw stream and thus you need to have more shards.
@christoph-buente, thanks! the way we’re using it is by a lambda at the receiving end of each shard, and by deploying a library to autoscale the kinesis shards as needed.
That is very interesting @vivricanopy Can you share some details on the kinesis autoscaling part?
Definitely!
The lib we’re using is by awslabs: https://github.com/awslabs/amazon-kinesis-scaling-utils
We’re sticking it into an elastic beanstalk application, works as advertised so far.
One thing to keep in mind - if you’re using lambda to read off the shard and process, is that Lambda doesn’t have exactly-once semantics. That is, your data/batch will be processed at least once by a Lambda function, but it may be processed more than once.
That’s interesting @mike ; I haven’t seen that in practice or heard about it. AFAIK lambdas attach themselves 1-1 to a shard, and using the LATEST
cursor process Kinesis events exactly once. Will it be the multiple enricher instances that will loosen those guarantees?
It’s more so a property (internally) of Lambda itself it seems
Invocations occur at least once in response to an event and functions must be idempotent to handle this.
Hey @mike, from what I read (and experienced, so far), this is just a property of direct lambda invocations - not the managed kinesis link. If I’ll notice dups anytime in the future, I’ll definitely write here for everyone’s benefit.