Partitioning/Clustering with BigQuery Loader

ral · January 27, 2022, 1:19pm

Hi all,

I have a question about BigQuery Loader.
I can see in the documentation that there is available specification for column partition.

BigQuery also supports table clustering that can provide some performance and cost improvement.

Do you know if this cluster specification is with any modifications available?
Are there any technical limitations here that I might miss?

Thanks in advance.

jrluis · January 27, 2022, 1:40pm

Hi ral,

The clustered column are configured the moment the table is configured.

Is possible to specify up to four top-level columns and should be specified in descending priority order.

If you already have the table created, it will need to be recreated, is the only limitation I can see at the moment, other than that, it should work without further modifications.

Regards

EddieM · January 27, 2022, 5:18pm

Hi @ral
Welcome to the Snowplow Community.
Cheers,
Eddie

ral · January 28, 2022, 7:59am

Thanks for the response @jrluis

As I can see, one of the options to introduce clustering to the destination table would be to create it ready before inserting data into it. Do I get this one correctly?

Is it somehow also possible to define column clustering order as part of BigQuery Loaders mutator (sub-command create)?

Thanks @EddieM !

jrluis · January 28, 2022, 9:35am

Hi ral,

Yes, is a good option to manualy create the table before inserting the data.

Regarding the change in the mutator sub command, I believe is not currently on the roadmap, but we are happy to review any PR that would add that feature.

Regards

ral · January 31, 2022, 8:36am

Just for those stumbling up on this question at later point.

In scenario where you have already set up schema with partitioning, but would like to add clustering, the following can be considered.

It is possible to use DDL to create both BigQuery Partitions and Clusters.
In this Stackoverflow question an approach to derive created tables DDL from metadata can be seen.

SELECT
 table_name, ddl
FROM
 `bigquery-public-data`.census_bureau_usa.INFORMATION_SCHEMA.TABLES;

By this you can find the DDL for generating same table schema.
In order to add clustering, follow the Google Clouds documentation.

-- Set up a table with clustering.
CREATE TABLE myDataset.data (column1 INT64, column2 INT64)
PARTITION BY _PARTITIONDATE
CLUSTER BY column1, column2;

sdbeuf · March 1, 2022, 9:01am

Regarding the clustering, you can enable the clustering after you’ve created the table. Here is the command you can use in your cli.

bq update --clustering_fields=${CLUSTERING_FIELD} ${DATASET}.${TABLE}

mike · March 1, 2022, 8:58pm

This definitely works well - but it should be noted that this will only cluster data after this action has been performed - any data that is already in that table will remain unclustered.

Topic		Replies	Views
BigQuery Loader - Time partitioned table GCP pipeline	2	1415	April 23, 2020
Failed inserts for time Partitioned table Storage targets	4	1080	September 9, 2021
BigQuery Loader - Mutator GCP pipeline	6	1602	May 7, 2020
Snowplow BigQuery Loader 1.0.0 released New releases	7	1741	November 15, 2022
Multiple BigQuery tables GCP pipeline	3	865	July 18, 2022

Partitioning/Clustering with BigQuery Loader

Related topics