Question on the Kinesis hops

:wave: Guys,

I have SP configured on Beanstalk with 3 hops over Kinesis, including the good & bad streams.

Are there any plans to reduce the number of hops required for the SP implementation?

Are there any plans to change the Thrift format, e.g. turn it into CSV? Doing this may allow us to architect something a little different.

Edit - is it possible, with consulting, to reduce the hops?

Thanks
Kyle

Hi Kyle,

I’m not too sure exactly what you mean by hops here, but a standard implementation typically looks like:

User => Load balancer => Collector(s) => Kinesis raw => Stream enrich => Kinesis good / bad

Although the hops add a small degree of latency, they follow a microservices-style design that allows each part of the pipeline to scale independently and helps protect against backpressure.
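If it helps to make that topology concrete, standing up those streams is just a handful of Kinesis calls. Here’s a minimal sketch with boto3 - the stream names, shard counts, and region are only examples, not anything Snowplow mandates:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

# One stream per hop in the diagram above - names are illustrative.
streams = {
    "snowplow-raw": 2,       # collector output (raw payloads)
    "snowplow-enriched": 2,  # stream enrich "good" output
    "snowplow-bad": 1,       # events failing validation / enrichment
}

for name, shard_count in streams.items():
    kinesis.create_stream(StreamName=name, ShardCount=shard_count)
```

Because each hop has its own stream, each one can be resharded and scaled independently of the others.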

Re: Thrift - I don’t think this will get changed any time soon, but if it did I think the likely option would be another serialisation format (e.g., Protobuf) that is typed and can be serialised to a binary format. Unfortunately text formats like CSV / JSON result in larger messages (in Kinesis) and have an increased deserialisation cost at the enricher (increased CPU cost + latency).
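As a toy illustration of the size point (this is not the real Snowplow Thrift schema - just three made-up fields, comparing JSON against a fixed-width binary encoding):

```python
import json
import struct

# Made-up event fields, purely for the size comparison.
timestamp_ms = 1614556800000
event_type = 3           # pretend enum value, e.g. page_view
user_id = 987654321

as_json = json.dumps({
    "timestamp_ms": timestamp_ms,
    "event_type": event_type,
    "user_id": user_id,
}).encode("utf-8")

# Typed binary layout: 8-byte timestamp, 2-byte type, 8-byte user id.
as_binary = struct.pack(">qhq", timestamp_ms, event_type, user_id)

print(len(as_json))    # ~70 bytes on the wire
print(len(as_binary))  # 18 bytes
```

Multiply that difference across every event on every stream and it shows up both in Kinesis payload costs and in enricher CPU.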

Thanks for the response Mike.

That’s exactly what I mean by hops: the Kinesis streams. As the implementation grows out to more domains, I need to reserve more Kinesis bandwidth, somewhere around the midpoint between low and peak traffic. Reserving this much bandwidth has a cost, and then there is the cost per “record”. For 3 domains these days it’s roughly 200-300K hits (PageViews) per day; COVID impacts those volumes, and it is usually significantly higher. I’ve well over 30 domains to push SP to, with maybe half that volume again in additional native apps. I haven’t fully crunched all the numbers yet, but the reserved Kinesis bandwidth is going to become costly.
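As a rough sketch of the provisioning maths I’m doing (the peak rate and event size below are made-up placeholders; the per-shard limits are the AWS-documented 1 MB/s and 1,000 records/s for provisioned streams at the time of writing):

```python
import math

# Hypothetical numbers - swap in your own peak rate and event size.
PEAK_EVENTS_PER_SEC = 50    # assumed peak across all domains
AVG_EVENT_BYTES = 2_000     # assumed average event size on the stream

# AWS-documented per-shard ingest limits for provisioned streams.
shards_by_records = math.ceil(PEAK_EVENTS_PER_SEC / 1_000)
shards_by_bytes = math.ceil(PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES / 1_000_000)

shards_needed = max(shards_by_records, shards_by_bytes, 1)
print(f"Shards to reserve per stream for peak: {shards_needed}")
```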

Records repeat data too; the useragent (one example) is always sent regardless of struct, unstruct, pageview, or ping. The storage of this is fine, but it’s the processing to get it to Snowflake that has the main cost. I’ve asked about this before, but this won’t change anytime soon.

My thinking around removing Thrift was to cut one of the streams and do something else with the enricher. I was also wondering whether Snowplow themselves had considered reducing the number of required Kinesis streams in v3.0 or beyond?

My main struggle ultimately is cost, as the implementation itself is straightforward. If I can reduce the number of streams, or aggregate (like with X, Y offset), that’s what I’m going to need to do.

Hey @kfitzpatrick,

To give you a bit of context, the principal design priorities around the product that are relevant to this discussion are completeness and reliability.

A kinesis stream between the collector and enricher, for example, ensures that the collector is self-contained, and there is minimal risk of data loss as long as the collector is up. In this respect, where there’s a trade-off between cost and reliability, the design favours reliability.

Within that design, we do optimise to keep cost down in terms of operations. For the pipelines we run as part of the Snowplow Insights product (for those unfamiliar - we run the infrastructure in the customer’s cloud), we have proprietary tech that we’ve built to manage scaling kinesis (and other components), so we don’t always have to over-provision resources.

Having said all that, in our experience even with that cost trade-off, the cost to run doesn’t normally land on a very high number. There’s a minimum provisioning which means that below a certain volume it’s expensive per-event - 200-300k events per day is just below that minimum scale. Just doing some back of the envelope maths, I would expect kinesis costs to fall somewhere near the hundred dollar mark for 4 kinesis streams (2 bad, 1 good, 1 raw). There is scope to bring this down if you choose retention periods of less than 7 days (7 days is generally well above how long you’d realistically need to ensure ‘safe’ recovery from issues - especially if you have s3 sinks).
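To show the working behind that estimate (these are us-east-1 provisioned-mode prices at the time of writing, so treat the numbers as a sketch rather than a quote):

```python
# Assumed AWS prices - check the current Kinesis pricing page.
SHARD_HOUR = 0.015               # $/shard-hour
EXTENDED_RETENTION_HOUR = 0.02   # $/shard-hour beyond 24h, up to 7 days
PUT_UNIT_PER_MILLION = 0.014     # $ per million 25 KB PUT payload units
HOURS_PER_MONTH = 730

streams = 4                      # 1 raw, 1 good, 2 bad
shards_per_stream = 1            # the minimum provisioning
events_per_month = 300_000 * 30  # ~9M events

shard_cost = streams * shards_per_stream * HOURS_PER_MONTH * (
    SHARD_HOUR + EXTENDED_RETENTION_HOUR
)
put_cost = (events_per_month / 1_000_000) * PUT_UNIT_PER_MILLION * streams

print(f"shard-hours: ${shard_cost:.2f}/month")  # ~$102 with 7-day retention
print(f"PUT units:   ${put_cost:.2f}/month")    # well under a dollar
```

Dropping back to the default 24-hour retention removes the extended-retention line, which takes the shard cost down to around $44 a month.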

I believe the GCP pipeline can work out cheaper to run than AWS at lower volumes, because PubSub is natively flexible, so that minimum provisioning/over-provisioning problem disappears. I’m risking spending a lot of time on this comment, so forgive me for not pulling the numbers on that one. :slight_smile:

Apologies for the essay. We have had a lot of recent activity on Discourse and across other forums from people who are just getting started with Snowplow, so I’m conscious of an audience that may not have all the context.

TL;DR: The direct answer to your question is that no, we don’t plan on changing the design to reduce the number of kinesis streams. But we do actively work to reduce cost to run on an ongoing basis where possible.

I hope that’s helpful. :slight_smile:

@Colm Very much so and I appreciate the detailed response.