Kinesis stream in front of collector

We’re discussing how to set up our Snowplow pipeline, and the idea came up to add a Kinesis stream in front of the Collector as a buffer, to make sure we’re not losing events in case the collector is unavailable for a while. Our tracking happens from within a mobile app, so we don’t really need to reply to received events with a cookie. Is anyone doing this, or are there good reasons not to set it up this way? Any advice is appreciated! :slight_smile:

Hi @boba,

It’s an interesting approach, but I can still see bottlenecks: you need something in front of the Kinesis stream in order to put data into it (so literally a collector for the collector), and I don’t see what you gain there. Moreover, you would need a Kinesis stream consumer to push the data on to the current collector, which IMHO doesn’t make sense. Of course, you could rebuild the tracker to push data directly into the raw Kinesis stream, but in that case you don’t need anything additional.
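For what it’s worth, here is a minimal boto3 sketch of what “pushing directly into the raw Kinesis stream” could look like, with a hypothetical stream name and region. Note that the raw stream written by the stream collector normally carries Thrift-serialised collector payloads, so a plain JSON put like this would not be understood by the enricher without extra work; the point is only to show that the tracker would then talk to Kinesis itself, with nothing else in between.

```python
import json
import boto3

# Hypothetical stream name and region, for illustration only.
RAW_STREAM = "snowplow-raw-good"
kinesis = boto3.client("kinesis", region_name="eu-west-1")

def push_event_directly(event: dict) -> None:
    """Write a tracker payload straight into the raw stream.

    In a real pipeline the record would need to be serialised in the
    format the enricher expects, not plain JSON.
    """
    kinesis.put_record(
        StreamName=RAW_STREAM,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event.get("user_id", "anonymous"),
    )
```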

TBH I would go for an HA/HR setup for the collector (load balancer + autoscaling). Any data loss you might observe would be statistically negligible.


I pretty much agree with @grzegorzewald.

The key is this part:

to make sure we’re not losing events in case of the collector being unavailable for a while

If you set up more than one collector, each in a different availability zone (at least two, and more AZs = more availability), plus a load balancer and autoscaling, then the chances of what you’re concerned about happening are negligible.
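To make that concrete, here is a rough boto3 sketch of the kind of setup described above: an application load balancer spread over two subnets in different AZs, a target group pointing at the collector port, and an autoscaling group that keeps at least two collector instances healthy. All of the IDs, names, ports and the /health path are placeholders; in practice you would likely express this in Terraform or CloudFormation rather than imperative API calls.

```python
import boto3

elbv2 = boto3.client("elbv2")
autoscaling = boto3.client("autoscaling")

# Placeholder network IDs -- substitute your own VPC, subnets (in different AZs)
# and security group.
SUBNETS = ["subnet-aaaa1111", "subnet-bbbb2222"]
VPC_ID = "vpc-0123456789abcdef0"
SECURITY_GROUPS = ["sg-0123456789abcdef0"]

# Application load balancer spanning two availability zones.
lb = elbv2.create_load_balancer(
    Name="collector-lb",
    Subnets=SUBNETS,
    SecurityGroups=SECURITY_GROUPS,
    Scheme="internet-facing",
    Type="application",
)["LoadBalancers"][0]

# Target group for the collector instances (port and health path are assumptions).
tg = elbv2.create_target_group(
    Name="collector-tg",
    Protocol="HTTP",
    Port=8080,
    VpcId=VPC_ID,
    HealthCheckPath="/health",
)["TargetGroups"][0]

elbv2.create_listener(
    LoadBalancerArn=lb["LoadBalancerArn"],
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroupArn"]}],
)

# Autoscaling group: never fewer than two collectors, spread across the subnets,
# with instance health judged by the load balancer's health checks.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="collector-asg",
    LaunchTemplate={"LaunchTemplateName": "collector-launch-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=8,
    VPCZoneIdentifier=",".join(SUBNETS),
    TargetGroupARNs=[tg["TargetGroupArn"]],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)
```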

We run hundreds of pipelines and have been doing so for years, and AFAIK we haven’t once had a collector availability issue with this strategy.


To add to what @Colm said, a bigger worry is that Kinesis cannot scale up quickly enough when traffic from the collector spikes. We’re currently experimenting with adding SQS as a buffer for overflow traffic, and hopefully we’ll be able to address this in a forthcoming release.
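Purely as an illustration of the idea (not of how the collector actually implements it), a sketch of that fallback pattern might look something like this in boto3, with made-up stream and queue names: try the Kinesis put first, and if the stream is throttled, park the record on an SQS queue so a separate consumer can drain it back into Kinesis once capacity catches up.

```python
import base64
import boto3

kinesis = boto3.client("kinesis")
sqs = boto3.client("sqs")

# Hypothetical names for illustration.
STREAM = "snowplow-collected-good"
OVERFLOW_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/collector-overflow"

def forward(record: bytes, partition_key: str) -> None:
    """Write to Kinesis, spilling to SQS when the stream is throttled."""
    try:
        kinesis.put_record(StreamName=STREAM, Data=record, PartitionKey=partition_key)
    except kinesis.exceptions.ProvisionedThroughputExceededException:
        # The stream can't absorb the spike right now; buffer the record in SQS.
        # A separate consumer would replay queue messages into Kinesis later.
        sqs.send_message(
            QueueUrl=OVERFLOW_QUEUE_URL,
            MessageBody=base64.b64encode(record).decode("ascii"),
        )
```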


To add to what has already been mentioned, putting a Kinesis stream in front of the collector would only increase the likelihood of a failure scenario.

Depending on the tracker you’re using: most trackers keep a local buffer, which is important because mobile devices frequently go offline and won’t be able to send events. Likewise, almost all trackers will attempt to queue and resend events if the collector responds with a non-200 status.
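As a toy illustration of that buffering behaviour (not any particular tracker’s implementation, and with a placeholder collector URL): events stay in a local list and are only dropped once the collector has answered with a 200; anything else, including network failures while offline, leaves them queued for the next flush.

```python
import requests

# Placeholder endpoint; a real tracker knows the collector's actual POST path.
COLLECTOR_URL = "https://collector.example.com/events"

class EventBuffer:
    """Keep events locally until the collector acknowledges them with a 200."""

    def __init__(self) -> None:
        self._pending: list[dict] = []

    def track(self, event: dict) -> None:
        self._pending.append(event)

    def flush(self) -> None:
        still_pending = []
        for event in self._pending:
            try:
                resp = requests.post(COLLECTOR_URL, json=event, timeout=5)
                if resp.status_code != 200:
                    still_pending.append(event)   # non-200: retry on the next flush
            except requests.RequestException:
                still_pending.append(event)       # offline / network error: keep it
        self._pending = still_pending
```

Real mobile trackers also cap the buffer size and persist it to local storage so events survive app restarts.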

In ~5 years I’ve only seen one instance where data loss was at increased risk, and that was due to API issues that impacted an entire region (and all services within it). To mitigate this you can run in multiple regions, or even multi-cloud, but there are cost implications in doing so.
