We have identified a security vulnerability in BigQuery Repeater in this version, which we’ve fixed in version 0.4.2. Please do not use this version; upgrade straight to 0.4.2 instead. The upgrade notes below still apply when you do.
We have released version 0.2.0 of BigQuery Loader, our family of apps that load Snowplow data into BigQuery.
This release brings two key additions and an important bugfix.
Repeater can now be deployed instead of Forwarder
Forwarder is the tool in the Snowplow BigQuery Loader app family that has until now been the only option for retrying failed inserts. (For more on how mutation lag can lead to failed inserts, check out the documentation.)
Forwarder is a Google Cloud Dataflow job, which makes it well suited for processing large amounts of data. However, it has several drawbacks:
- It can idle for 99.9% of the time, which can make it very expensive to run. The alternative is to manually launch it any time failed inserts appear.
- There’s no way to tell Forwarder to pause before re-inserting rows. Without the pause there’s a risk that Mutator doesn’t get a chance to alter the table.
- It keeps retrying all inserts indefinitely (default behaviour for streaming Dataflow jobs).
- In order to debug a problem with Forwarder, you need to inspect Stackdriver logs.
From 0.2.0 we’re adding a new component that can be used instead of Forwarder, called Repeater.
Repeater is a JVM app, which offers several advantages over Forwarder:
- It pauses by default to allow Mutator to do its job.
- It sends rows that repeatedly fail insertion to a dead-end bucket instead of retrying them forever.
- It can be more easily debugged by inspecting the contents of the dead-end bucket, which are all valid Snowplow bad rows. (For more on bad rows, see next section.)
For more information on how to set up Repeater, consult the setup guide.
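To make the contrast with Forwarder concrete, here is a minimal Python sketch of the behaviour listed above: pause, retry, and give up after a fixed number of attempts. It is not Repeater's actual code (Repeater is a JVM app), and `repeat_inserts`, `insert`, `dead_end`, `max_attempts` and `pause_seconds` are hypothetical names chosen for illustration only.

```python
import time
from typing import Callable, Iterable, List


def repeat_inserts(
    rows: Iterable[str],
    insert: Callable[[str], bool],
    dead_end: List[str],
    max_attempts: int = 3,
    pause_seconds: float = 0.0,
) -> int:
    """Retry rows after a pause; move rows that keep failing to dead_end.

    `insert` is a hypothetical callback that returns True on success.
    Returns the number of rows that were inserted successfully.
    """
    all_rows = list(rows)
    pending = list(all_rows)
    for _ in range(max_attempts):
        if not pending:
            break
        # Pause first, so that a table mutation has time to complete.
        time.sleep(pause_seconds)
        # Keep only the rows whose insert failed again.
        pending = [row for row in pending if not insert(row)]
    # Rows still pending after the last attempt are not retried forever.
    dead_end.extend(pending)
    return len(all_rows) - len(pending)
```

The point of the sketch is the bounded loop: unlike a streaming job that retries indefinitely, every row ends up either inserted or in the dead-end collection, where it can be inspected.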
New bad row format integration
In Snowplow R118 Morgantina, our first ever beta release, we introduced a new format for “bad rows” in the Scala Stream Collector and in Enrich jobs. Version 0.2.0 now brings the new format to the BigQuery Loader family of tools as well.
(For more details on the new bad row format, check out the RFC and the R118 release post.)
Fixing bug in Schema DDL library leads to new behaviour in Loader, Mutator
This release includes a number of dependency bumps, of which the upgrade of the Schema DDL library to 0.9.0 is particularly important.
Schema DDL is a library from the Snowplow ecosystem which exposes a set of Abstract Syntax Trees and generators for producing various DDL and Schema formats. Version 0.9.0 fixes a bug that affected the creation of BigQuery table DDLs in cases where one of the fields in the schema was a nullable array, i.e. a property defined as having:
"type": ["array", "null"]
In older versions, Loader would have cast those fields to STRING and Mutator would have created columns for them of type NULLABLE STRING rather than REPEATED RECORD, which is what we want for arrays.
Upgrading
This bug is fixed in the latest versions of the two components. However, if you already have nullable array-typed fields in your schemas, some incompatibility might have been introduced.
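To check whether any of your schemas are affected, you could scan them for properties declared as nullable arrays. The following is a minimal sketch, assuming the schemas are available as parsed JSON; `nullable_array_fields` is a hypothetical helper, not part of any Snowplow tool, and it only looks at `properties` and `items`.

```python
from typing import Any, Dict, List


def nullable_array_fields(schema: Dict[str, Any], path: str = "") -> List[str]:
    """Return dotted paths of properties whose type is exactly ["array", "null"]."""
    found: List[str] = []
    for name, prop in schema.get("properties", {}).items():
        if not isinstance(prop, dict):
            continue
        field_path = f"{path}.{name}" if path else name
        types = prop.get("type")
        # Match the nullable-array union in either order.
        if isinstance(types, list) and set(types) == {"array", "null"}:
            found.append(field_path)
        # Recurse into nested objects.
        found.extend(nullable_array_fields(prop, field_path))
        # Recurse into array items that are themselves objects.
        items = prop.get("items")
        if isinstance(items, dict):
            found.extend(nullable_array_fields(items, field_path + "[]"))
    return found
```

Any path this returns points at a field that an older Loader may have written as a string.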
It is possible that an older version of Loader has cast those fields to STRING and that Mutator has created NULLABLE STRING columns for them. After upgrading to 0.2.0, Loader will no longer cast the values in those fields to STRING, so they cannot be inserted into the existing columns.
There are two ways this can be handled:
- by introducing a new schema version that gets rid of the [array, null] type;
- by migrating all the data in the BigQuery table to a new table, with a schema that fixes the “stringified” column.
Introducing a new schema
You can upgrade the affected schemas to a new version, without really changing anything in the schemas. The new version of Loader will not cast these values to STRING. Because the schemas have new versions, Mutator will create new columns for them and they will have the desired type of REPEATED RECORD.
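As an illustration, a version bump that leaves the schema body untouched could be sketched as follows. This assumes a self-describing schema with a `self.version` field in MODEL-REVISION-ADDITION form; `bump_addition` is a hypothetical helper, and bumping the ADDITION component rather than another one is our assumption for this example.

```python
import copy
from typing import Any, Dict


def bump_addition(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the schema with only the ADDITION component incremented."""
    bumped = copy.deepcopy(schema)
    model, revision, addition = (
        int(part) for part in bumped["self"]["version"].split("-")
    )
    bumped["self"]["version"] = f"{model}-{revision}-{addition + 1}"
    return bumped
```

For example, a schema at version 1-0-0 would come back as 1-0-1 with every property definition unchanged.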
Migrating the data
Both the type and the mode of the affected columns need to be changed (so we go from NULLABLE STRING to REPEATED RECORD).
Changing the type of a column is not currently supported in BigQuery. To do it manually, you can:
- use a SQL query that casts the data to the desired type and use the output of the query to create a new table (but this won’t work for changing the mode, see below);
- unload the data from the table to GCS and use it to create a new BigQuery table with the desired schema.
Changing the mode of a column is currently only supported for going from REQUIRED to NULLABLE. Any other changes can only be done by unloading the data to GCS and then loading it into a new table with the desired schema.
(For the full details, refer to the GCP documentation: https://cloud.google.com/bigquery/docs/manually-changing-schemas.)
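If you take the unload-and-reload route, the exported rows need their “stringified” column turned back into an array before loading. Below is a minimal sketch of that step, assuming the table was exported as newline-delimited JSON and that the old Loader stored the array as a JSON-encoded string (both are assumptions worth verifying against your own data). `unstringify_column` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def unstringify_column(lines: Iterable[str], column: str) -> Iterator[str]:
    """Rewrite NDJSON rows so that `column` holds an array instead of a string."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the export
        row = json.loads(line)
        if column in row:
            value = row[column]
            if isinstance(value, str):
                # Assumed: the string is the JSON encoding of the original array.
                parsed = json.loads(value)
                row[column] = parsed if isinstance(parsed, list) else []
            elif value is None:
                # Represent a missing array as an empty one.
                row[column] = []
        yield json.dumps(row)
```

The rewritten lines can then be written back to GCS and loaded into a new table whose schema declares the column as REPEATED RECORD.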