Recent cost information for Snowplow

I’ve seen the old post at https://snowplowanalytics.com/blog/2013/09/27/how-much-does-snowplow-cost-to-run/

I’m just wondering if there is more recent cost information that anybody is willing to share. It doesn’t need to be a detailed cost model. Anecdotal cost information is still interesting to me.

I’d love to see some cost information for an AWS stack.

  • Batch stack
  • Realtime stack
  • Batch cost per 1 Million events
  • Realtime cost per 1 Million events

I know it’s not that simple, and there are lots of caveats.

1 Like

Hi @groodt,

I checked our costs for the AWS realtime pipeline.
We pay around $150/day to run the stack.

Split by service:

  • 40% traffic and load balancer cost
  • 25% kinesis
  • 20% S3
  • 15% EC2

We ingest around 250,000,000 events a day, so it comes down to $0.60 per million events. Those costs do not include the people who maintain the cluster and make sure it’s up and running.

As you can see, there are no costs for data storage except S3, as we move that data on to other streaming systems. We also run an Elasticsearch cluster with a short retention time, which adds another $150/day on top.
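For anyone who wants to plug in their own numbers, here is a minimal sketch of the arithmetic above. It only uses the figures quoted in this post; the function and variable names are made up for illustration.

```python
# Figures quoted in the post above (AWS realtime pipeline).
DAILY_COST_USD = 150.0
ELASTICSEARCH_DAILY_COST_USD = 150.0  # the extra short-retention cluster
EVENTS_PER_DAY = 250_000_000
SERVICE_SHARE = {
    "traffic and load balancer": 0.40,
    "Kinesis": 0.25,
    "S3": 0.20,
    "EC2": 0.15,
}


def cost_per_million(daily_cost_usd, events_per_day):
    """Daily cost divided by the number of millions of events per day."""
    return daily_cost_usd / (events_per_day / 1_000_000)


def split_by_service(daily_cost_usd, shares):
    """Turn percentage shares into dollars per day."""
    return {name: daily_cost_usd * share for name, share in shares.items()}


pipeline_only = cost_per_million(DAILY_COST_USD, EVENTS_PER_DAY)
with_elasticsearch = cost_per_million(
    DAILY_COST_USD + ELASTICSEARCH_DAILY_COST_USD, EVENTS_PER_DAY
)
daily_split = split_by_service(DAILY_COST_USD, SERVICE_SHARE)

print(f"pipeline only: ${pipeline_only:.2f} per million events")
print(f"with Elasticsearch: ${with_elasticsearch:.2f} per million events")
for name, usd in daily_split.items():
    print(f"  {name}: ${usd:.2f}/day")
```

People costs are still excluded here, same as in the numbers above.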

Hope that was helpful.

5 Likes

:wave:

I was going to say something similar for larger setups. The primary Snowplow pipeline I work on sees 230-260M events/day (running at ~350k events/min for sustained periods), and @christoph-buente’s breakdown is almost exactly what I’ve accounted for.

For smaller pipelines I’ve found that Kinesis is the most expensive component, due to its shard-hour pricing model.

Kinesis:
$0.36/day/shard in us-east-1. Running six one-shard streams (collector good/bad, enricher good/bad, enricher PII, S3 sink bad) immediately puts you at ~$70/month. If you want to scale the primary streams (collector good, enricher good) up by a couple of shards each, you’ll pay ~$100/month for Kinesis alone.
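A rough sketch of that shard math, in case it helps. The per-shard price is the one quoted above; the 30-day month and the stream labels are my own assumptions, and this counts shard-hours only (PUT payload charges are not included).

```python
SHARD_COST_PER_DAY_USD = 0.36  # quoted above for us-east-1
DAYS_PER_MONTH = 30            # assumption: a 30-day month


def kinesis_monthly_cost(shards_per_stream):
    """Shard-hour cost only, for a mapping of stream label -> shard count."""
    total_shards = sum(shards_per_stream.values())
    return total_shards * SHARD_COST_PER_DAY_USD * DAYS_PER_MONTH


# Hypothetical labels for the six streams listed above, one shard each.
baseline = {
    "collector_good": 1,
    "collector_bad": 1,
    "enricher_good": 1,
    "enricher_bad": 1,
    "enricher_pii": 1,
    "s3_sink_bad": 1,
}

# Same streams, with the two primary streams scaled up by two shards each.
scaled = {**baseline, "collector_good": 3, "enricher_good": 3}

print(f"baseline: ${kinesis_monthly_cost(baseline):.2f}/month")
print(f"scaled:   ${kinesis_monthly_cost(scaled):.2f}/month")
```

That lands at roughly $65 and $108 per month, which is where the ~$70 and ~$100 figures come from.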

ALB
Application load balancers are billed on load balancer hours (the time the ALB is running) and LCU-hours. An LCU (“load balancer capacity unit”) measures four dimensions: new connections, active connections, processed bytes, and rule evaluations.
You’ll pay ~$20/month in us-east-1 to keep the load balancer up, and not much beyond that until your traffic really starts heating up. For smaller installations (millions of events per month), $10/month in LCU charges is an overestimate, but when you run a lot of traffic (hundreds of millions of events per month) through an ALB, this grows drastically and becomes a real part of the equation.
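Here is a hedged sketch of how the LCU part could be estimated. AWS bills on whichever of the four dimensions is highest in a given hour. The per-LCU capacities and the $0.008 LCU-hour rate below are what I recall from the AWS pricing page for us-east-1, so treat them as assumptions and check current pricing. The traffic numbers in the example are entirely hypothetical.

```python
# Assumptions (verify against the AWS pricing page): what one LCU covers.
NEW_CONNECTIONS_PER_SEC_PER_LCU = 25
ACTIVE_CONNECTIONS_PER_MIN_PER_LCU = 3000
PROCESSED_GB_PER_HOUR_PER_LCU = 1.0
RULE_EVALUATIONS_PER_SEC_PER_LCU = 1000

LCU_HOUR_USD = 0.008     # assumption: us-east-1 rate
BASE_MONTHLY_USD = 20.0  # base charge quoted above
HOURS_PER_MONTH = 730    # assumption: average month


def lcus_used(new_conn_per_sec, active_conn_per_min, gb_per_hour, rule_evals_per_sec):
    """LCUs consumed: the highest of the four dimensions is the one billed."""
    return max(
        new_conn_per_sec / NEW_CONNECTIONS_PER_SEC_PER_LCU,
        active_conn_per_min / ACTIVE_CONNECTIONS_PER_MIN_PER_LCU,
        gb_per_hour / PROCESSED_GB_PER_HOUR_PER_LCU,
        rule_evals_per_sec / RULE_EVALUATIONS_PER_SEC_PER_LCU,
    )


def alb_monthly_cost(average_lcus):
    """Base charge plus LCU-hours at a constant average load."""
    return BASE_MONTHLY_USD + average_lcus * LCU_HOUR_USD * HOURS_PER_MONTH


# Hypothetical steady traffic profile, for illustration only.
example_lcus = lcus_used(
    new_conn_per_sec=50,
    active_conn_per_min=2000,
    gb_per_hour=0.5,
    rule_evals_per_sec=0,
)
print(f"LCUs: {example_lcus:.2f}, monthly: ${alb_monthly_cost(example_lcus):.2f}")
```

Real traffic is rarely flat, so an hourly average will understate the bill on a spiky site.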

EC2/ECS
This varies depending on the reliability/redundancy/risk profile you want, and on whether you’re using on-demand or reserved resources. To keep it simple:

Running three on-demand t3.small collector nodes costs ~$50/month in us-east-1 based on $0.0208/hr pricing.

Running three on-demand t3.small enricher nodes costs the same, while running a single on-demand m5.large enricher node costs ~$75.

These costs can be drastically reduced by switching to reserved instances, and by building to your risk profile and nothing more.
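The on-demand numbers above can be reproduced with a few lines. The t3.small rate is the one quoted above; the m5.large rate of $0.096/hr and the 730-hour month are my assumptions, and the `discount` parameter is a hypothetical knob for reserved pricing, not a real AWS figure.

```python
HOURS_PER_MONTH = 730    # assumption: average month
T3_SMALL_HOURLY = 0.0208  # quoted above, us-east-1 on-demand
M5_LARGE_HOURLY = 0.096   # assumption: us-east-1 on-demand


def ec2_monthly_cost(hourly_usd, nodes, discount=0.0):
    """Monthly cost for `nodes` instances; `discount` is a fraction in [0, 1)."""
    return hourly_usd * nodes * HOURS_PER_MONTH * (1.0 - discount)


collectors = ec2_monthly_cost(T3_SMALL_HOURLY, nodes=3)
small_enrichers = ec2_monthly_cost(T3_SMALL_HOURLY, nodes=3)
single_large_enricher = ec2_monthly_cost(M5_LARGE_HOURLY, nodes=1)

print(f"3x t3.small collectors: ${collectors:.2f}/month")
print(f"3x t3.small enrichers:  ${small_enrichers:.2f}/month")
print(f"1x m5.large enricher:   ${single_large_enricher:.2f}/month")
```

Pass something like `discount=0.3` to see what a hypothetical reserved-instance saving would do to the total.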

S3
While storage in S3 is cheap, this data definitely piles up fast. Pricing here is all over the place, and mostly depends on tracking/site volume. For low-traffic, high-margin companies this is barely even factored into the equation.

Storage
This varies, and it all depends on how you want to access events. A single-node Redshift dc1 is cheap; a Snowflake data warehouse is (usually) pretty pricey :slight_smile:.

Conclusion:
A very rough approximation I’ve found to work pretty well for small-to-medium-volume sites is $200 per month, pipeline infra only. That said, you can pay $50/month if you’re a thrill-seeker and $5000+/month if your site has a lot of traffic/event volume. Again, pipeline infra only.
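As a sanity check, adding up the rounded per-component figures from the sections above gets close to that $200 number. This is a sketch using only figures already quoted in this post; S3 and warehouse storage are left out because no dollar amounts were given for them.

```python
# Rounded monthly figures from the sections above (USD, us-east-1).
components = {
    "Kinesis (six one-shard streams)": 70,
    "ALB (base charge)": 20,
    "collectors (3x t3.small)": 50,
    "enrichers (3x t3.small)": 50,
}

total = sum(components.values())

for name, usd in components.items():
    print(f"{name}: ${usd}/month")
print(f"total: ${total}/month")
```

The remaining gap to $200 is roughly where S3 and LCU charges would sit for a small site.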

Notes

  • There are definitely ways to make this more efficient. ECS or ASGs are great for cutting costs if your traffic profile is spiky or if you just want to have a cool system. If you know the system will be up long term, reserving resources drastically cuts cost. If you don’t need everything in S3, you can merge objects and roll them to Glacier, etc.

  • I’ve intentionally left out monitoring/instrumentation infra and engineering costs :slight_smile:

  • I’ve also intentionally left out all costs (explicit or implicit) associated with navigating points of scale, and knowing what to do when things happen.

6 Likes

Thanks very much for the detailed responses. That pricing looks very attractive.

Hello Christoph,

First of all, thanks for the detailed information on the running cost of a Snowplow pipeline.
We are in the middle of deciding which stack to use (GCP or AWS). In Google Analytics, one of our customers sends 1.7 million hits per day. If we have 10 customers of this size, which tech stack would you recommend to us?
All answers are appreciated, thank you :slight_smile:

For smaller volumes, like ours, you might incur a slightly higher cost.

We run both AWS (batch) and GCP (realtime pub/sub beam enrich) pipelines at Mint Metrics, each with 100M events/month.

Our GCP bill for Snowplow ranges between US$125-200/month (including light BigQuery usage, storage and streaming inserts). Call it ~$1.50 / million events.
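For the volume asked about above (10 customers at 1.7 million hits/day each), a very rough projection could look like the sketch below. It assumes one Google Analytics hit maps to one Snowplow event, a 30-day month, and that our per-million cost scales linearly. None of that is guaranteed, and per-event cost often drops at higher volume, so read it as a ballpark only.

```python
HITS_PER_CUSTOMER_PER_DAY = 1_700_000  # from the question above
CUSTOMERS = 10
DAYS_PER_MONTH = 30                    # assumption

# Our bill of $125-200/month over 100M events/month, per million events.
OUR_MONTHLY_EVENTS = 100_000_000
COST_PER_MILLION_RANGE_USD = (
    125 / (OUR_MONTHLY_EVENTS / 1_000_000),
    200 / (OUR_MONTHLY_EVENTS / 1_000_000),
)

# Assumption: one GA hit becomes one Snowplow event.
monthly_events = HITS_PER_CUSTOMER_PER_DAY * CUSTOMERS * DAYS_PER_MONTH

# Assumption: cost scales linearly with event volume.
low_usd, high_usd = (
    monthly_events / 1_000_000 * rate for rate in COST_PER_MILLION_RANGE_USD
)

print(f"events/month: {monthly_events:,}")
print(f"projected GCP bill: ${low_usd:,.2f} to ${high_usd:,.2f} per month")
```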

Haven’t got our AWS (batch) bill handy, but it’s some multiple more expensive.

2 Likes

Hi @robkingston,

I’m wondering if you’ve considered our latest Enrich FS2 as a replacement for Beam Enrich. It follows the same approach as the recent Stream BQ Loader: just a Docker image, without Dataflow.

3 Likes

Hey Rob,

Thanks for your reply. Hugely appreciated. I think we are gonna go with the GCP pipeline.

Take Care

2 Likes

:exploding_head: