Understanding GCP Cloud Storage Costs

Hi guys,

I’m trying to understand how it can be viable to store events on GCP Cloud Storage. On our current platform (not using Snowplow yet), we are ingesting ~500M events per month.

Maybe there is something I don’t understand, but the GCP cost per million inserts is about $5.
That would mean $2,500 USD per month?! And that’s without reads and network taken into consideration.
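
Here is the back-of-the-envelope maths I’m doing, assuming one object write per event and a Class A rate of roughly $0.005 per 1,000 operations (~$5 per million):

```python
# Back-of-the-envelope estimate, assuming one GCS object write per event and a
# Class A (write) rate of roughly $0.005 per 1,000 operations, i.e. ~$5 per million.
events_per_month = 500_000_000
cost_per_million_writes = 5.0            # USD, assumed Class A rate
monthly_cost = events_per_month / 1_000_000 * cost_per_million_writes
print(f"~${monthly_cost:,.0f} USD per month")   # -> ~$2,500 USD per month
```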

At that point I would rather store them in something like BigTable, which sounds way cheaper.
Are you guys batching events, or is it actually one insert per event?

Hi Jimmy - are you able to link the pricing documents you are referring to?

Generally, GCP pipelines store data in BigQuery, and you can optionally store this data on Google Cloud Storage as well.

For BigQuery, the streaming insert cost is based on bytes inserted rather than the number of operations - 1 TB of streaming inserts would cost you approximately $50 USD / month.
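
As a rough sketch of that estimate (assuming a streaming-insert price of about $0.05 per GB, i.e. $0.01 per 200 MB - check the current pricing page for exact figures):

```python
# Rough streaming-insert estimate, assuming ~$0.05 per GB inserted
# ($0.01 per 200 MB); actual pricing may differ by region and over time.
gb_per_month = 1_000                     # ~1 TB streamed into BigQuery
price_per_gb = 0.05                      # USD, assumed rate
print(f"~${gb_per_month * price_per_gb:,.0f} USD per month")  # -> ~$50 USD per month
```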

If you’re using BigQuery, I think it’s worth asking whether you really need streaming inserts in the first place. It’s not really a real-time database, so micro-batching every 30 minutes (or even every 10 minutes) might be more affordable, as long as you stay within the writes-per-table-per-day limit.
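
To illustrate what micro-batching could look like outside of the Snowplow loaders, here is a minimal sketch using the google-cloud-bigquery client, buffering rows and flushing them with a (free) batch load job - the table name and flush interval are placeholders:

```python
# Minimal micro-batching sketch (not the Snowplow BQ loader): buffer rows in
# memory and flush them to BigQuery with a batch load job every few minutes.
# Load jobs are free; the trade-off is latency and the daily per-table load quota.
import time
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.events"   # placeholder
FLUSH_INTERVAL_SECONDS = 10 * 60            # e.g. flush every 10 minutes

buffer = []

def flush(rows):
    if not rows:
        return
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_json(rows, TABLE_ID, job_config=job_config)
    job.result()                            # wait for the load job to finish
    print(f"loaded {len(rows)} rows")

last_flush = time.time()
while True:
    # ... append incoming events (dicts matching the table schema) to `buffer` ...
    if time.time() - last_flush >= FLUSH_INTERVAL_SECONDS:
        flush(buffer)
        buffer = []
        last_flush = time.time()
    time.sleep(1)
```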

Basically, for GCS, writes are Class A operations and reads are Class B operations.

Oh, I thought everything was going to GCS, as per the documentation:

Events hit the Collector application and it saves the raw events to storage (S3 on AWS and Google Cloud Storage on GCP) and then sends them down the pipeline to Enrich.

I’ve dug a little into the GCS loader code, and correct me if I’m wrong, but I see windowing being used, so I would imagine it’s in fact batching events together? (I’m not familiar with Dataflow.)

From what you just said, and looking at the pipeline documentation, there is no intermediary storage between the collectors and the end of the pipeline (RT GOOD STREAM) - it all goes through PubSub and Dataflow? Except for the non-happy path: it does look like bad events go to GCS, so are those batched? I would also assume that BigQuery is optional, since everything can just be put on PubSub?

GCS is optional - some users don’t sink good data to GCS at all, only failed events or failed inserts. This data is batched and is typically only a small fraction of overall volume, so costs here are generally a couple of dollars a month, depending on volume. The GCS loader takes a windowDuration parameter which defines how often you would like to write these files out.

Yes - that’s correct: the GCS loaders, irrespective of what they read from, can be (and typically are) batched, and the output is then partitioned on GCS according to timestamp.
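
The actual loader is a Scala Dataflow job, but as a rough illustration of the same idea in the Beam Python SDK (names and the subscription are placeholders), events are assigned to fixed time windows and each window is written out as one batch rather than one write per event:

```python
# Illustrative Beam (Python SDK) sketch of windowed batching - not the actual
# Snowplow GCS loader. Each fixed window's events are grouped and written out
# together instead of one write per event.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

SUBSCRIPTION = "projects/MY_PROJECT/subscriptions/MY_SUB"  # placeholder

def write_batch(events):
    # In the real loader this would be one timestamp-partitioned file on GCS
    # per window; here we just report the batch size.
    print(f"writing a batch of {len(events)} events")

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Window" >> beam.WindowInto(FixedWindows(5 * 60))   # e.g. 5-minute windows
        | "KeyAll" >> beam.Map(lambda e: (None, e))
        | "GroupPerWindow" >> beam.GroupByKey()
        | "WriteBatch" >> beam.Map(lambda kv: write_batch(list(kv[1])))
    )
```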

This is a higher-level diagram rather than an infra diagram, but yes - the data goes through PubSub, and then an application (typically the BQ loader running in a Docker container) inserts the data from PubSub into BigQuery.
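
Conceptually that application does something like the following - this is only an illustrative sketch, not the BQ loader itself, and the subscription and table names are placeholders:

```python
# Conceptual sketch of a Pub/Sub -> BigQuery consumer (not the Snowplow BQ
# loader itself): pull messages from a subscription and stream them into a
# BigQuery table. All names are placeholders.
import json
from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("MY_PROJECT", "enriched-good-sub")  # placeholder
TABLE_ID = "MY_PROJECT.snowplow.events"                                         # placeholder

def callback(message):
    row = json.loads(message.data)                 # assumes one JSON event per message
    errors = bq.insert_rows_json(TABLE_ID, [row])  # streaming insert
    if not errors:
        message.ack()
    else:
        message.nack()                             # let Pub/Sub redeliver

streaming_pull = subscriber.subscribe(subscription, callback=callback)
streaming_pull.result()                            # block and keep consuming
```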

Yes - these are batched as well.

BigQuery is indeed optional, but most users tend to use it as the final destination for their data, as it is typically where they perform analysis and run data models. There is nothing to stop you loading from GCS directly (this operation is free), but the loaders handle a lot of this logic and mutation for you.
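
If you did want to load from GCS yourself, a minimal sketch with the google-cloud-bigquery client looks like this (bucket path and table are placeholders, and it skips all the schema handling the loaders do for you):

```python
# Minimal sketch of loading files already sitting in GCS straight into
# BigQuery with a (free) load job. Bucket, path and table are placeholders,
# and this does none of the schema-mutation handling the Snowplow loaders do.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                               # or supply an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
job = client.load_table_from_uri(
    "gs://my-bucket/enriched/2023/01/01/*.json",   # placeholder path
    "my-project.my_dataset.events",                # placeholder table
    job_config=job_config,
)
job.result()                                       # wait for completion
print(f"loaded {job.output_rows} rows")
```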


Okay! Thanks, that pretty much answers all my questions.
