Batch versus real-time: comparing infrastructure costs

travisdevitt · August 16, 2016, 3:47pm

I was hoping someone with experience of both the Snowplow batch and real-time pipelines could chime in on the difference in infrastructure costs. What was your approximate step up in cost to run the real-time pipeline instead of the batch pipeline (2X? 5X? 10X?)? We are on the batch pipeline right now but want to understand what the cost might look like if we decide to go real-time later.

Thanks!

Simon_Rumble · August 17, 2016, 12:09am

The biggest difference is that there’s a bunch of stuff you’ll be running
all the time, not just the collectors. So you need to work out how many
Kinesis shards each stream needs, multiplied by the number of streams
between the steps in the pipeline. Then add the processing steps themselves,
which also run continuously. That’s where the big costs come in.
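
As a rough illustration of the shard arithmetic above, here is a minimal sketch. It assumes provisioned Kinesis streams with the commonly documented write limits of about 1 MB/s and 1,000 records/s per shard. The function names, the number of streams, and the shard-hour price are all hypothetical inputs; check current AWS pricing for real figures, and note that this ignores PUT payload charges and the instances running each processing step.

```python
import math

# Approximate per-shard write limits for provisioned Kinesis streams.
# Treat these as assumptions and confirm against the current AWS docs.
SHARD_MAX_RECORDS_PER_SEC = 1_000
SHARD_MAX_BYTES_PER_SEC = 1_000_000


def shards_needed(peak_events_per_sec, avg_event_bytes):
    """Shards required to absorb peak write traffic on one stream."""
    by_records = math.ceil(peak_events_per_sec / SHARD_MAX_RECORDS_PER_SEC)
    by_bytes = math.ceil(
        peak_events_per_sec * avg_event_bytes / SHARD_MAX_BYTES_PER_SEC
    )
    # A stream always has at least one shard.
    return max(1, by_records, by_bytes)


def monthly_shard_cost(shards_per_stream, num_streams, shard_hour_price,
                       hours_per_month=730):
    """Always-on shard cost: shards x streams x hourly price x hours.

    shard_hour_price is a placeholder the caller supplies; no real
    price is baked in here.
    """
    return shards_per_stream * num_streams * shard_hour_price * hours_per_month


if __name__ == "__main__":
    shards = shards_needed(peak_events_per_sec=2500, avg_event_bytes=1000)
    # num_streams=2 is only an example; count the streams in your own pipeline.
    print(shards, monthly_shard_cost(shards, num_streams=2, shard_hour_price=1.0))
```

The point of the sketch is that the cost scales with shards times streams times hours, regardless of whether any events are flowing, which is why the real-time pipeline has a fixed floor that batch does not.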

Batch is ridiculously cheap, especially if you use spot instances for extra
nodes in the ETL. I wrote a simple script that calculated roughly how many
task nodes were needed to process the batch sitting in the incoming bucket
so it scaled up when needed.
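
The script itself is not shown in the thread. The following is only a minimal sketch of the idea, not the original: it sizes the number of task nodes from the total bytes waiting in the incoming bucket. The function name, the `bytes_per_node` throughput figure, and the node limits are hypothetical and would need tuning against your own job run times. The object sizes would come from an S3 listing of the incoming bucket, which is left out here.

```python
import math


def task_nodes_needed(object_sizes_bytes, bytes_per_node,
                      min_nodes=0, max_nodes=20):
    """Estimate extra (spot) task nodes for the next batch run.

    object_sizes_bytes: sizes of the files waiting in the incoming bucket.
    bytes_per_node: hypothetical amount of input one task node can process
        within the target run time; measure this for your own cluster.
    min_nodes / max_nodes: hypothetical bounds so the cluster never
        scales past what you are willing to pay for.
    """
    total_bytes = sum(object_sizes_bytes)
    wanted = math.ceil(total_bytes / bytes_per_node)
    # Clamp the estimate to the allowed range.
    return max(min_nodes, min(max_nodes, wanted))


if __name__ == "__main__":
    # Two waiting files of 5 GB and 3 GB, assuming 2 GB per node per run.
    waiting = [5 * 10**9, 3 * 10**9]
    print(task_nodes_needed(waiting, bytes_per_node=2 * 10**9))
```

Sizing by bytes rather than file count is a design choice for this sketch; a count-based rule would work the same way if the incoming files are of similar size.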

13scoobie · August 18, 2016, 10:26pm

How did you calculate the number of task nodes per file count? Mind sharing the rough numbers you used? Very cool idea, and a great cost saver using spot instances.
