I’m trying to understand how it can be viable to store events on GCP Cloud Storage. Our current platform (not using Snowplow yet) ingests ~500m events per month.
Maybe there is something I don’t understand, but the GCP cost for inserts is around $5 per million.
That would mean $2,500 USD per month? And that’s without taking reads and network into account.
At that point I would rather store them in something like BigTable, which sounds way cheaper.
Are you batching events, or is it actually one insert per event?
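The arithmetic behind that figure, as a rough sketch: it assumes one insert per event and the ~$5 per million insert price mentioned above. The function name is made up for illustration.

```python
# Back-of-the-envelope estimate, assuming every event is its own GCS insert.
EVENTS_PER_MONTH = 500_000_000     # ~500m events per month, from the post
USD_PER_MILLION_INSERTS = 5.0      # assumed price per million insert operations


def monthly_insert_cost(events: int, usd_per_million: float) -> float:
    """Cost in USD if each event is written as a separate object."""
    return events / 1_000_000 * usd_per_million


cost = monthly_insert_cost(EVENTS_PER_MONTH, USD_PER_MILLION_INSERTS)
print(f"${cost:,.0f} per month")  # prints "$2,500 per month"
```

The estimate only holds if nothing is batched, which is the question being asked here.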
Hi Jimmy - are you able to link the pricing documents you are referring to?
Generally GCP pipelines store data in BigQuery and you can optionally store this on Google Cloud Storage as well.
For BigQuery the streaming insert cost is based on bytes inserted rather than on the number of operations - 1 TB of streaming inserts per month would cost you approximately $50 USD.
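As a rough illustration of byte-based pricing, here is a minimal sketch. The ~$50 per TB figure is the approximate one quoted above; the average event size of 2,000 bytes is purely an assumption for the example, so substitute your own measurement.

```python
# Rough BigQuery streaming-insert estimate, priced by bytes rather than by operation.
USD_PER_TB_STREAMED = 50.0   # approximate figure quoted in this thread
BYTES_PER_TB = 10**12        # decimal terabyte, assumed for simplicity


def streaming_cost(events: int, avg_event_bytes: int) -> float:
    """Monthly streaming cost in USD for a given event count and average size."""
    total_bytes = events * avg_event_bytes
    return total_bytes / BYTES_PER_TB * USD_PER_TB_STREAMED


# 500m events at an assumed 2,000 bytes each is 1 TB in total.
estimate = streaming_cost(500_000_000, 2_000)
print(f"${estimate:,.2f} per month")  # prints "$50.00 per month"
```

Under these assumptions the number of events does not matter on its own, only the total volume in bytes.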
If you’re using BigQuery, I think it’s worth asking whether you really need streaming inserts in the first place, as it’s not really a real-time database. Micro-batching every 30 minutes, or even every 10 minutes, might be more affordable as long as you stay within the writes per table per day limit.
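To check whether a micro-batching interval fits under a per-table daily write limit, a small sketch like the following can help. The limit value used here (1,500) is an assumption from memory, not something stated in this thread, so confirm it against the current BigQuery quota documentation.

```python
MINUTES_PER_DAY = 24 * 60


def loads_per_day(interval_minutes: int) -> int:
    """Number of batch loads per table per day at a fixed interval."""
    return MINUTES_PER_DAY // interval_minutes


def within_limit(interval_minutes: int, daily_limit: int) -> bool:
    """True if loading at this interval stays at or under the daily limit."""
    return loads_per_day(interval_minutes) <= daily_limit


ASSUMED_DAILY_LIMIT = 1_500  # assumption: verify against the quota docs

for minutes in (30, 10):
    print(minutes, loads_per_day(minutes), within_limit(minutes, ASSUMED_DAILY_LIMIT))
```

A 30 minute interval gives 48 loads per day and a 10 minute interval gives 144, both per table.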
I’ve dug a little into the GCS loader code and, correct me if I’m wrong, I see windowing being used, so I would imagine that it is in fact batching events together? (I’m not familiar with Dataflow.)
With what you just said and looking at the pipeline documentation, there is no intermediary storage between the collectors and the end of the pipeline (RT GOOD STREAM), so it’s all just going through PubSub and Dataflow? Except for the non-happy path: it does look like bad events are going to GCS, so are those batched? I would also assume that BigQuery is optional, since everything can just be put on PubSub?
GCS is optional - some users don’t sink good data to GCS at all but instead only failed events or failed inserts. This is batched and typically only a small fraction of overall data so costs here are generally a couple of dollars a month depending on volume. The GCS loader takes a windowDuration parameter which helps define how often you would like to write these files out.
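To see why batched writes stay cheap, here is a rough sketch that counts files written per month for a given window. It assumes the window is expressed in minutes, that each window produces one file per shard, and that inserts cost ~$5 per million as mentioned earlier in the thread; `shards` and the function names are hypothetical.

```python
def files_per_month(window_minutes: int, shards: int = 1, days: int = 30) -> int:
    """Files written in a month if every window emits one file per shard."""
    windows = days * 24 * 60 // window_minutes
    return windows * shards


def monthly_write_cost(window_minutes: int, shards: int = 1,
                       usd_per_million: float = 5.0) -> float:
    """Insert cost in USD at the assumed per-million price."""
    return files_per_month(window_minutes, shards) / 1_000_000 * usd_per_million


# A 5 minute window with a single shard gives 8,640 files per month.
print(files_per_month(5))
print(f"${monthly_write_cost(5):.4f}")
```

With these assumptions the insert cost depends on the number of windows, not on the number of events inside them.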
Yes - that’s correct. The GCS loader, irrespective of what you read from, can batch its output (and typically does), which is then partitioned on GCS according to timestamp.
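A minimal sketch of the idea of fixed-window batching with timestamp-based partitioning follows. This is not the loader's actual code: the path layout, file naming and function names are all hypothetical and only illustrate the technique.

```python
from collections import defaultdict
from datetime import datetime, timezone


def window_start(ts: datetime, window_minutes: int) -> datetime:
    """Floor a timestamp to the start of its fixed window."""
    epoch_seconds = int(ts.timestamp())
    size = window_minutes * 60
    return datetime.fromtimestamp(epoch_seconds - epoch_seconds % size, tz=timezone.utc)


def batch_by_window(events, window_minutes):
    """Group (timestamp, payload) pairs into one batch per window."""
    batches = defaultdict(list)
    for ts, payload in events:
        start = window_start(ts, window_minutes)
        # Hypothetical partitioned path; the real loader's layout may differ.
        key = start.strftime("%Y/%m/%d/%H/batch-%H%M.txt")
        batches[key].append(payload)
    return dict(batches)


events = [
    (datetime(2020, 1, 1, 12, 1, tzinfo=timezone.utc), "a"),
    (datetime(2020, 1, 1, 12, 4, tzinfo=timezone.utc), "b"),
    (datetime(2020, 1, 1, 12, 7, tzinfo=timezone.utc), "c"),
]
result = batch_by_window(events, 5)
print(result)
```

Here three events end up in two files, one for the 12:00 window and one for the 12:05 window, so the number of writes is decoupled from the number of events.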
This is a higher level diagram rather than an infra diagram but yes - the data goes through PubSub and then an application (typically the BQ loader running in a Docker container) inserts data from PubSub into BigQuery.
Yes - these are batched as well.
BigQuery is indeed optional but most users tend to use this as the final destination for their data as it is typically where they perform analysis and run data models. There is nothing to stop you loading from GCS directly (as this operation is free) but the loaders handle a lot of this logic and mutation for you.