I have a couple of questions regarding the BigQuery loader and mutator.
I have a pipeline in production that I provision entirely with Terraform. It has one VM group for the collector and another VM group that runs Beam Enrich and the BigQuery Loader & Mutator.
Whenever I update my Snowplow BigQuery Loader, the VM group restarts and re-executes the mutator's create step, which creates the table in BigQuery.
First question: if my table already exists, will the mutator overwrite it?
Secondly, I would like to use the Snowplow deployment to collect, enrich and store the data of multiple websites. Is it possible to separate the data into multiple tables while still using the same pipeline?
No, the mutator shouldn't overwrite your existing table. Restarting it will clear its internal cache of which columns it has created, but that cache is refreshed on initialisation.
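For intuition, the create step behaves like an idempotent create-if-not-exists rather than a destructive create. The mutator itself is a Scala app, so the snippet below is only a hedged Python analogy using the google-cloud-bigquery client; the project, dataset and schema are made-up placeholders, not your pipeline's real names.

```python
# Hedged analogy only: the real mutator is a Scala app, and this is not its
# code. It just illustrates "create if not exists" semantics with the
# google-cloud-bigquery client. Project, dataset and schema are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # assumed project id
table_ref = "my-project.snowplow.events"         # assumed table path

schema = [
    bigquery.SchemaField("app_id", "STRING"),
    bigquery.SchemaField("collector_tstamp", "TIMESTAMP"),
]

table = bigquery.Table(table_ref, schema=schema)
# exists_ok=True turns the call into a no-op when the table already exists,
# mirroring the "mutator does not overwrite" behaviour described above.
client.create_table(table, exists_ok=True)
```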
If you are running a single pipeline there are a few different options:
Run multiple collectors / pipelines (this might be desirable for first-party cookie setting / ITP)
Split the data out at enrich time, e.g. with one BQ loader per app_id (see the first sketch after this list)
Split the data once it has been sunk into BigQuery, creating an incremental table or view per app_id (you probably want a materialised view) that refreshes on a frequent basis (see the second sketch after this list)
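A hedged sketch of the enrich-time split: a small consumer that fans enriched events out to one Pub/Sub topic per app_id, so each topic can feed its own BQ loader. The project, topic and subscription names are assumptions; the only Snowplow-specific fact relied on is that app_id is the first tab-separated field of an enriched event.

```python
# Hedged sketch, not Snowplow code: fan enriched events out to one topic per
# app_id so each topic can feed its own BQ loader. Project, topic and
# subscription names are assumptions. The one Snowplow-specific fact used is
# that app_id is the first tab-separated field of an enriched event.
from google.cloud import pubsub_v1

PROJECT = "my-project"                                        # assumed
SUBSCRIPTION = f"projects/{PROJECT}/subscriptions/enriched"   # assumed

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

def route(message: pubsub_v1.subscriber.message.Message) -> None:
    # app_id is the first field of the enriched-event TSV.
    app_id = message.data.decode("utf-8").split("\t", 1)[0] or "unknown"
    topic = publisher.topic_path(PROJECT, f"enriched-{app_id}")  # assumed naming
    publisher.publish(topic, message.data)  # at-least-once; ack after handing off
    message.ack()

future = subscriber.subscribe(SUBSCRIPTION, callback=route)
future.result()  # blocks; interrupt to stop
```

Each `enriched-<app_id>` topic would then get its own loader (and its own table), which answers the "multiple tables from one pipeline" question directly.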
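And a hedged sketch of the in-warehouse split: one filtered materialised view per app_id on top of the loaded events table, created here via the Python BigQuery client. All names are placeholders. Note that BigQuery restricts the SQL allowed in materialised views (no SELECT *, among other limits); if a non-aggregated materialised view isn't available to you, a scheduled query writing into an incremental table per app_id achieves the same split.

```python
# Hedged sketch, with made-up names: one filtered materialised view per
# app_id over the loaded events table. BigQuery restricts materialised-view
# SQL (no SELECT *, among other limits), so explicit columns are listed; if
# a non-aggregated materialised view isn't available in your project, a
# scheduled query writing into an incremental table per app_id is the
# equivalent fallback.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project id
APP_IDS = ["site-a", "site-b"]                  # assumed app_ids

for app_id in APP_IDS:
    ddl = f"""
    CREATE MATERIALIZED VIEW IF NOT EXISTS
      `my-project.snowplow.events_{app_id.replace('-', '_')}`
    AS
    SELECT app_id, event_id, collector_tstamp, domain_userid, network_userid
    FROM `my-project.snowplow.events`
    WHERE app_id = '{app_id}'
    """
    client.query(ddl).result()  # DDL runs as an ordinary query job
```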
Having given this point a second thought: if I have multiple tags in different GTM containers sending to the same collector (same IP & domain), aren't those cookies no longer first-party?
These can still be first-party cookies (for domain_userid), but the question becomes whether you want to stitch network_userids across these sites.
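To make "stitching" concrete, here's a hedged sketch of the kind of query that links users across sites: it pairs domain_userids from different app_ids that share a network_userid (the identifier from the collector's own cookie). Column names follow the standard Snowplow events schema; the project and dataset are assumptions.

```python
# Hedged sketch of cross-site stitching: pair domain_userids from different
# app_ids that share a network_userid (the ID from the collector's own
# cookie). Column names are from the standard Snowplow events table; the
# project and dataset are assumptions.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project id

STITCH_SQL = """
SELECT
  a.network_userid,
  a.app_id        AS app_id_a,
  a.domain_userid AS domain_userid_a,
  b.app_id        AS app_id_b,
  b.domain_userid AS domain_userid_b
FROM `my-project.snowplow.events` a
JOIN `my-project.snowplow.events` b
  ON  a.network_userid = b.network_userid
  AND a.app_id < b.app_id          -- distinct sites, deduped pairs
GROUP BY 1, 2, 3, 4, 5
"""

for row in client.query(STITCH_SQL).result():
    print(row.network_userid, row.app_id_a, row.app_id_b)
```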
Re: multiple tags on different domains - the more domains you run through a single collector, the greater the risk that ITP / WebKit will flag the collector domain as engaged in cross-site tracking.