Recent cost information for Snowplow

groodt · March 3, 2019, 3:12am

I’ve seen the old https://snowplowanalytics.com/blog/2013/09/27/how-much-does-snowplow-cost-to-run/

I’m just wondering if there is more recent cost information that anybody is willing to share. It doesn’t need to be a detailed cost model. Anecdotal cost information is still interesting to me.

I’d love to see some cost information for an AWS stack.

Batch stack
Realtime stack
Batch cost per 1 Million events
Realtime cost per 1 Million events

I know it’s not that simple, and there are lots of caveats.

christoph-buente · March 7, 2019, 11:11am

Hi @groodt,

I checked our costs for the AWS realtime pipeline.
We pay around $150/day to run the stack.

Split by service:

40% traffic and load balancer cost
25% Kinesis
20% S3
15% EC2

We ingest around 250,000,000 events a day, so it comes down to about $0.60 per million events. Those costs do not include the people who maintain the cluster and make sure it’s up and running.
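
The arithmetic behind that figure can be checked with a minimal sketch. It reads the daily volume as 250 million events and treats the percentage split as exact; the function and variable names are illustrative and do not come from the thread.

```python
# Figures quoted in the post; the event volume is read as 250 million/day.
DAILY_COST_USD = 150.0
EVENTS_PER_DAY = 250_000_000

# Approximate split by service, as listed in the post.
SERVICE_SHARE = {
    "traffic and load balancer": 0.40,
    "Kinesis": 0.25,
    "S3": 0.20,
    "EC2": 0.15,
}


def cost_per_million(daily_cost_usd: float, events_per_day: int) -> float:
    """Return the infrastructure cost in USD per one million events."""
    return daily_cost_usd / (events_per_day / 1_000_000)


def daily_cost_by_service(daily_cost_usd: float, shares: dict[str, float]) -> dict[str, float]:
    """Split a daily bill across services according to their share."""
    return {name: daily_cost_usd * share for name, share in shares.items()}


per_million = cost_per_million(DAILY_COST_USD, EVENTS_PER_DAY)
by_service = daily_cost_by_service(DAILY_COST_USD, SERVICE_SHARE)

print(f"${per_million:.2f} per million events")
for name, usd in by_service.items():
    print(f"{name}: ${usd:.2f}/day")
```

Swapping in your own daily bill and event volume gives a comparable per-million figure, with the same caveat that staffing costs are excluded.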

As you can see, there are no costs for data storage except S3, as we move that data on to other streaming systems. We also run an Elasticsearch cluster with a short retention time, which adds another $150/day on top.

Hope that was helpful.

jakethomas · March 7, 2019, 1:55pm

I was going to say something similar for larger setups. The primary Snowplow pipeline I work on sees 230-260M events/day (running at ~350k events/min for a sustained period of time) and @christoph-buente’s breakdown is almost exactly what I’ve accounted for.

For smaller pipelines I’ve found that Kinesis is the most expensive component, due to its shard-hour pricing model.

Kinesis:
$0.36/day/shard in us-east-1. Running six one-shard streams (collector good/bad, enricher good/bad, enricher pii, s3 sink bad) immediately puts you at ~$70/month. If you want to scale the primary streams (collector good, enricher good) up a couple of shards each, you’ll pay ~$100/month for Kinesis alone.
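
As a rough sketch of the shard-hour arithmetic above: the snippet assumes a 30-day month, counts shard charges only (per-request charges are left out), and interprets "a couple of shards each" as two extra shards on each primary stream. The stream keys and function name are illustrative.

```python
SHARD_COST_PER_DAY_USD = 0.36  # us-east-1 figure quoted in the post
DAYS_PER_MONTH = 30            # assumption for a round monthly estimate


def kinesis_shard_cost_per_month(shards_by_stream: dict[str, int]) -> float:
    """Monthly shard cost in USD for a set of streams (shard charges only)."""
    total_shards = sum(shards_by_stream.values())
    return total_shards * SHARD_COST_PER_DAY_USD * DAYS_PER_MONTH


# Six one-shard streams, as listed in the post.
baseline = {
    "collector_good": 1,
    "collector_bad": 1,
    "enricher_good": 1,
    "enricher_bad": 1,
    "enricher_pii": 1,
    "s3_sink_bad": 1,
}

# Assumption: two extra shards on each of the two primary streams.
scaled = {**baseline, "collector_good": 3, "enricher_good": 3}

baseline_cost = kinesis_shard_cost_per_month(baseline)
scaled_cost = kinesis_shard_cost_per_month(scaled)

print(f"baseline: ${baseline_cost:.2f}/month")
print(f"scaled:   ${scaled_cost:.2f}/month")
```

Under these assumptions the two results land close to the ~$70 and ~$100 figures quoted above.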

ALB
Application load balancers are billed based on load balancer hours (the time the ALB is running) and LCU-hours. An LCU (“load balancer capacity unit”) is measured on four dimensions: new connections, active connections, processed bytes, and rule evaluations.
You’ll pay $20/month in us-east-1 to keep the load balancer up, and not much thereafter until your traffic really starts heating up. For smaller installations (millions of events per month), $10/month on top of that is an overestimate, but when running a lot of traffic (hundreds of millions of events per month) through an ALB, this increases drastically and becomes a real part of the equation.

EC2/ECS
This varies depending on the reliability/redundancy/risk profile you want, and if you’re using on-demand or reserved resources. To keep it simple:

Running three on-demand t3.small collector nodes costs ~$50/month in us-east-1 based on $0.0208/hr pricing.

Running three on-demand t3.small enricher nodes costs the same, while running a single on-demand m5.large enricher node costs ~$75.

These costs can be drastically reduced by switching to reserved instances and by building to your risk profile and nothing more.
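
The t3.small figure above can be reproduced with a short sketch. It assumes a 730-hour average month and on-demand pricing with no discounts; the function name is illustrative. The m5.large hourly rate is not quoted in the post, so it is left out rather than guessed.

```python
HOURS_PER_MONTH = 730         # assumption: average hours in a month
T3_SMALL_HOURLY_USD = 0.0208  # us-east-1 on-demand rate quoted in the post


def on_demand_monthly_cost(hourly_rate_usd: float, node_count: int,
                           hours: int = HOURS_PER_MONTH) -> float:
    """Monthly on-demand cost in USD for identical nodes running all month."""
    return hourly_rate_usd * hours * node_count


collectors = on_demand_monthly_cost(T3_SMALL_HOURLY_USD, node_count=3)
enrichers = on_demand_monthly_cost(T3_SMALL_HOURLY_USD, node_count=3)

print(f"collectors: ${collectors:.2f}/month")
print(f"collectors + enrichers: ${collectors + enrichers:.2f}/month")
```

Passing a different hourly rate or node count shows how quickly the compute line moves with the redundancy you choose.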

S3
While storage in S3 is cheap, this data definitely piles up fast. Pricing here is all over the place, and mostly depends on tracking/site volume. For low-traffic, high-margin companies this is barely even factored into the equation.

Storage
This varies, and all depends on how you want to access events. A single-node Redshift dc1 is cheap; a Snowflake data warehouse is (usually) pretty pricey.

Conclusion:
A very rough approximation I’ve found to work pretty well for small-to-medium-volume sites is $200 per month, pipeline infra only. With this being said, you can pay $50/month if you’re a thrill-seeker and $5000+/month if your site has a lot of traffic/event volume. Again, pipeline infra only.

Notes

There are definitely ways to make this more efficient: ECS or ASGs are great for cutting costs if your traffic profile is spiky or if you just want to have a cool system. If you know the system will be up long term, reserving resources drastically cuts cost. If you don’t need everything in S3, you can merge objects and roll to Glacier, etc.
I’ve intentionally left out monitoring/instrumentation infra and engineering costs.
I’ve also intentionally left out all costs (explicit or implicit) associated with navigating points of scale, and knowing what to do when things happen.

groodt · March 11, 2019, 2:32am

Thanks very much for the detailed responses. That pricing looks very attractive.

Valdym · November 13, 2020, 9:27am

Hello Christoph,

First of all, thanks for the detailed information on the running costs of a Snowplow pipeline.
We are in the middle of deciding which stack to use (GCP or AWS). On Google Analytics, one of our customers sends 1.7 million hits per day. If we have 10 customers of this size, which tech stack would you recommend?
All answers appreciated, thank you.

robkingston · November 13, 2020, 12:34pm

For smaller volumes, like ours, you might incur a slightly higher cost.

We run both AWS (batch) and GCP (realtime Pub/Sub with Beam Enrich) pipelines at Mint Metrics, each with 100M events/month.

Our GCP bill for Snowplow ranges between US$125-200/month (including light BigQuery usage, storage and streaming inserts). Call it ~$1.50 / million events.
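
A minimal sketch of how the monthly bill range above maps to a per-million figure, assuming exactly 100M events/month (the function name is illustrative):

```python
def cost_per_million_range(monthly_low_usd: float, monthly_high_usd: float,
                           events_per_month: int) -> tuple[float, float]:
    """Convert a monthly bill range in USD into a USD-per-million-events range."""
    millions = events_per_month / 1_000_000
    return monthly_low_usd / millions, monthly_high_usd / millions


# Bill range and volume as quoted in the post.
low, high = cost_per_million_range(125.0, 200.0, 100_000_000)
print(f"${low:.2f} to ${high:.2f} per million events")
```

The quoted ~$1.50 sits inside that range.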

Haven’t got our AWS (batch) bill handy, but it’s some multiple more expensive.

anton · November 13, 2020, 11:27pm

Hi @robkingston,

I’m wondering if you considered our latest Enrich FS2 as a replacement for Beam Enrich. It follows the same approach as the recent Stream BQ Loader: just a Docker image, without Dataflow.

Valdym · November 17, 2020, 1:27am

Hey Rob,

Thanks for your reply. Hugely appreciated. I think we are gonna go with GCP pipeline.

Take Care

robkingston · November 20, 2020, 6:46am
