Multiple BigQuery tables

Ltsrc · July 11, 2022, 1:01pm

Hello,

I set up the Snowplow open source pipeline on GCP using GKE, and I am streaming data to BigQuery. The thing is, our business needs to monitor multiple clients, and currently the StreamLoader and its associated services can only be configured to use one BigQuery table.

As we need to separate the data for each client, I’m wondering whether there is a way to do this directly with the Snowplow applications, or whether I need to create a new table for each client and periodically export the data from the main one (used by Snowplow)?

Thanks

mike · July 11, 2022, 10:33pm

There are a couple of ways to do this, but it tends to depend on business requirements and volumes. The easiest way is to view / model the data in BigQuery and separate it out by something like app_id. With views this gives you up-to-date data, and with tables you can easily refresh incrementally.
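As a rough illustration of the view-per-client idea, here is a minimal Python sketch that only builds the SQL text for one view per app_id. The project, dataset, table, and view names are made up for the example, and the `events_<app_id>` naming scheme is an assumption, not anything Snowplow prescribes. The function does not call BigQuery; you would run the resulting statement with whatever client you already use.

```python
import re

# Hypothetical whitelist for app_id values, so that the value can be
# embedded safely in a SQL literal and in a view name.
_APP_ID = re.compile(r"[A-Za-z0-9_-]+")


def client_view_sql(project: str, dataset: str, events_table: str, app_id: str) -> str:
    """Build a CREATE OR REPLACE VIEW statement exposing one client's events."""
    if not _APP_ID.fullmatch(app_id):
        raise ValueError(f"unexpected characters in app_id: {app_id!r}")
    # Assumed naming scheme: events_<app_id>, with dashes turned into underscores.
    view_name = "events_" + app_id.replace("-", "_")
    return (
        f"CREATE OR REPLACE VIEW `{project}.{dataset}.{view_name}` AS\n"
        f"SELECT *\n"
        f"FROM `{project}.{dataset}.{events_table}`\n"
        f"WHERE app_id = '{app_id}'"
    )


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    print(client_view_sql("my-project", "snowplow", "events", "client-a"))
```

Because a view is evaluated at query time, it always reflects what the loader has written so far, which is the "up to date" property mentioned above.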

If the data needs to be in different storage locations / datasets entirely, then running multiple loaders with separate app_id subscriptions could achieve this by sinking into multiple tables.

Ltsrc · July 18, 2022, 7:03am

Thanks for the reply!

Our business requires us to store data in separate Google Cloud projects, one for each of our clients. From what I understand of your answer, if we replicate the BigQuery loader for each client, we will also need to replicate the mutator and the repeater, and as our number of clients grows, this will considerably increase the cost of the Kubernetes cluster.

That is why I was wondering whether keeping only one “main” BigQuery table for a single Snowplow BigQuery loader, and periodically exporting data from this table to each client’s table based on the app_id, could be a valid option.
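For reference, the periodic export described here could look roughly like the sketch below. It only builds the SQL text. It assumes that the target table already exists with the same schema as the main table, that both tables are in the same BigQuery location, and that collector_tstamp is an acceptable watermark (events that are loaded late with an older timestamp would be missed). The table names are placeholders, and `@app_id` is a named query parameter to be supplied when the statement is run.

```python
def incremental_export_sql(source_table: str, target_table: str) -> str:
    """Build an INSERT ... SELECT that copies one client's new rows.

    source_table and target_table are fully qualified names of the form
    project.dataset.table (hypothetical values in the example below).
    """
    return (
        f"INSERT INTO `{target_table}`\n"
        f"SELECT *\n"
        f"FROM `{source_table}`\n"
        f"WHERE app_id = @app_id\n"
        # Only copy rows newer than the newest row already exported.
        f"  AND collector_tstamp > (\n"
        f"    SELECT IFNULL(MAX(collector_tstamp), TIMESTAMP '1970-01-01')\n"
        f"    FROM `{target_table}`\n"
        f"  )"
    )


if __name__ == "__main__":
    # Placeholder table names, not real resources.
    print(incremental_export_sql(
        "main-project.snowplow.events",
        "client-a-project.snowplow.events",
    ))
```

A statement like this could be run on a schedule once per client, with the app_id passed in as the parameter value.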

mike · July 18, 2022, 8:11am

I think again this is going to come down to your business requirements.

If you must separate client data into separate GCP projects, then you need to do this as early as possible (e.g., at the load balancer) rather than anywhere downstream, because by that stage you will already have processed and stored client data in the main project.

If you aren’t actively mutating each client table independently, then you would likely not need that service (the mutator). There’s very little compute associated with these services, however, if you did want to run them independently.
