Making the Stream Enricher Highly Available (autoscaling group)

vivricanopy · November 8, 2016, 6:43pm

I was wondering whether it is possible to make the Scala Stream Enricher highly available using an autoscaling group. Can this be done in principle without getting duplicate records?

mike · November 8, 2016, 9:23pm

Essentially yes, but the benefit of scaling the stream enricher is going to depend on the sharding of your stream.

There’s a little bit of detail that Alex has posted before about this here.

You won’t get duplicate records, because only one worker can work on a given shard at a time.

christoph-buente · November 8, 2016, 9:48pm

Yes, we run the enricher with an autoscaling group, even though we just have one instance (which is idling with 20MM events per day). But even if we had only one shard and more than one enricher: each of them gets a “lease” on the shard for a given time. Once the time is up, the lease is released and can be grabbed by another worker (or the same one again).

On the other hand: what is the purpose of availability? AWS guarantees that your data stays in the stream for 24h. If your EC2 instance crashes or is terminated, you are safe. Simply bring up a new instance, either manually or using an autoscaling group, to resume reading the stream.
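To make the lease idea above concrete, here is a toy in-memory model of it. It only illustrates the behaviour described in this post (one holder per shard, and the lease can be taken over once it has expired). The `LeaseTable` class and its method names are hypothetical and are not how the enricher or its client library actually implement this.

```python
class LeaseTable:
    """Toy model: each shard has at most one lease holder until the lease expires."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self._leases = {}  # shard_id -> (owner, expires_at)

    def try_acquire(self, shard_id, worker, now):
        """Return True if `worker` holds the lease on `shard_id` after this call."""
        owner, expires_at = self._leases.get(shard_id, (None, 0.0))
        # Another worker still holds an unexpired lease: refuse.
        if owner is not None and owner != worker and now < expires_at:
            return False
        # Free, expired, or already ours: take (or renew) the lease.
        self._leases[shard_id] = (worker, now + self.lease_seconds)
        return True

    def owner(self, shard_id, now):
        """Return the current lease holder, or None if the lease has expired."""
        owner, expires_at = self._leases.get(shard_id, (None, 0.0))
        return owner if now < expires_at else None


table = LeaseTable(lease_seconds=10)
table.try_acquire("shard-0", "enricher-A", now=0)   # A takes the lease
table.try_acquire("shard-0", "enricher-B", now=5)   # refused, A still holds it
table.try_acquire("shard-0", "enricher-B", now=10)  # lease expired, B takes over
```

With one shard and two enrichers, the second instance simply waits until it can take the lease, which is why running extra instances gives failover rather than duplicate processing.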

vivricanopy · November 8, 2016, 10:02pm

Thanks, @mike!

Which stream should the shards be matched on: the one for the collected events, the enriched ones, or should they both have the same number of shards? Also, would this magically work? I.e. let’s say I have X shards and X instances in the autoscaling group listening and writing to the same two streams: should it just work?

vivricanopy · November 8, 2016, 10:04pm

Thanks @christoph-buente,

Our use case is in the realtime space, so CloudWatch alerts and Nagios aside, we should always have something up and running.

christoph-buente · November 8, 2016, 10:17pm

@vivricanopy If that is your use case, then I’d recommend at least 2 instances. The number of shards depends on the throughput. One Snowplow event is roughly 2 KB in size. It can be bigger, depending on the custom contexts and derived contexts. The limits for Kinesis are as follows:

Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second.
Each shard can support up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). This write limit applies to operations such as PutRecord and PutRecords.

So you do the math for whatever capacity you need. The number of clients is also important. If you were running a kinesis-to-s3 sink to store the raw events in addition to the enricher, you would have double the reads on the raw stream and thus need more shards.
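As a rough sketch of that math, the function below estimates a shard count from the per-shard limits quoted above. The function name `shards_needed` and its parameters are made up for illustration. It assumes a steady average rate, treats 1 MB as 10^6 bytes, and assumes every consumer reads the full stream. It does not model the limit of 5 read transactions per second per shard, nor traffic peaks, so treat the result as a lower bound.

```python
import math

# Per-shard limits as quoted in the post (1 MB taken as 10**6 bytes: an assumption).
WRITE_BYTES_PER_SEC = 1_000_000
WRITE_RECORDS_PER_SEC = 1_000
READ_BYTES_PER_SEC = 2_000_000


def shards_needed(events_per_sec, event_bytes=2_000, consumers=1):
    """Estimate the minimum number of shards for a steady event rate.

    event_bytes defaults to the ~2 KB per Snowplow event mentioned above.
    consumers is the number of applications that each read the whole stream
    (for example the enricher plus a raw-events S3 sink would be 2).
    """
    by_write_bytes = events_per_sec * event_bytes / WRITE_BYTES_PER_SEC
    by_write_records = events_per_sec / WRITE_RECORDS_PER_SEC
    by_read_bytes = events_per_sec * event_bytes * consumers / READ_BYTES_PER_SEC
    # The tightest of the three limits decides; always at least one shard.
    return max(1, math.ceil(max(by_write_bytes, by_write_records, by_read_bytes)))


# 20MM events per day is about 231 events per second on average.
print(shards_needed(20_000_000 / 86_400))
# 1,000 events per second read by three separate consumers.
print(shards_needed(1_000, consumers=3))
```

With these numbers the write limit dominates until there are more than two consumers, after which the read limit takes over.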

vivricanopy · November 9, 2016, 3:12pm

@christoph-buente, thanks! The way we’re using it is with a Lambda at the receiving end of each shard, and by deploying a library to autoscale the Kinesis shards as needed.

christoph-buente · November 9, 2016, 3:15pm

That is very interesting, @vivricanopy. Can you share some details on the Kinesis autoscaling part?

vivricanopy · November 9, 2016, 4:57pm

Definitely!
The lib we’re using is by awslabs: https://github.com/awslabs/amazon-kinesis-scaling-utils
We’re sticking it into an Elastic Beanstalk application; it works as advertised so far.

mike · November 9, 2016, 10:09pm

One thing to keep in mind if you’re using Lambda to read off the shard and process: Lambda doesn’t have exactly-once semantics. That is, your data/batch will be processed at least once by a Lambda function, but it may be processed more than once.

vivricanopy · November 9, 2016, 10:29pm

That’s interesting, @mike; I haven’t seen that in practice or heard about it. AFAIK Lambdas attach themselves 1-1 to a shard and, using the LATEST cursor, process Kinesis events exactly once. Would it be the multiple enricher instances that loosen those guarantees?

mike · November 9, 2016, 11:04pm

It seems to be more a property of Lambda itself (internally):

Invocations occur at least once in response to an event and functions must be idempotent to handle this.
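One common way to make a handler idempotent is to deduplicate on a unique event identifier. The sketch below assumes each record is a dict with an `event_id` key, and `make_idempotent_handler` is a hypothetical helper, not part of any AWS or Snowplow API. It also keeps the set of seen IDs in memory, which would not survive across separate Lambda invocations. A real deployment would need a shared store for that, so read this only as an illustration of the idea.

```python
def make_idempotent_handler(process, seen=None):
    """Wrap `process` so that each event_id is processed at most once.

    `seen` is the set of already-processed IDs. It is in-memory here purely
    for illustration; a durable shared store would be needed in practice.
    """
    seen = set() if seen is None else seen

    def handle(batch):
        processed = 0
        for record in batch:
            key = record["event_id"]  # assumed unique per event
            if key in seen:
                continue  # redelivered record: skip it
            process(record)
            seen.add(key)  # mark as done only after processing succeeded
            processed += 1
        return processed

    return handle


results = []
handle = make_idempotent_handler(results.append)
batch = [{"event_id": "a"}, {"event_id": "b"}]
handle(batch)  # processes both records
handle(batch)  # same batch delivered again: nothing is processed twice
```

Marking an ID as seen only after `process` returns means a failure mid-batch leads to a retry rather than a silently dropped event.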

vivricanopy · November 10, 2016, 4:36pm

Hey @mike, from what I’ve read (and experienced, so far), this is just a property of direct Lambda invocations, not the managed Kinesis link. If I notice dups anytime in the future, I’ll definitely write here for everyone’s benefit.
