I set up the Snowplow open source pipeline on GCP using GKE, and streaming data to BigQuery. Thing is, our business needs to monitor multiple clients and currently, the StreamLoader and the associated services can only be configured to use one BigQuery table.
As we need to separate the data for each client, I’m wondering if there is any way to do it directly with Snowplow applications, or if I need to create a new table for each client and periodically export the data from the main one (used by Snowplow) ?
There are a couple of ways to do this but it tends to depend on business requirements and volumes. The easiest way is to view / model the data in BigQuery and separate out by something like app_id - in views this will give you up to date data and in tables you can easily incrementally refresh this.
If the data needs to be in different storage locations / datasets entirely then running multiple loaders with separate app_id subscriptions could achieve this by sinking into multiple tables.
Our business requires us to store data in separate google cloud projects, one for each of our client. From what I understand in your answer, if we replicate the bigQuery loader for each client, we will also need to replicate the mutator and the repeater, and as our number of clients grow, this will considerably impact the cost of the kubernetes cluster.
That is why I was wondering if keeping only one “main” bigquery table for one snowplow bigquery loader and exporting data periodically from this table to our client’s table based on the app_id could be a valid option.
I think again this is going to come down to your business requirements.
If you must separate client data in separate GCP projects then you need to do this as earlier as possible (e.g., at the load balancer) rather than anything downstream of this as by this stage you will have already processed and stored client data in the project.
If you aren’t actively mutating each client table independently then you would likely not need this service. There’s very little compute associated with these however if you did want to run them independently.