We’re very excited to have released version 1.0.0 of Snowplow BigQuery Loader, our family of apps that load Snowplow data into BigQuery.
The highlight of this release is the StreamLoader app, which has shed its experimental status and can now be deployed in anger. We have significantly improved its performance from the earlier version and it can now more than hold its own compared with the Dataflow-based Loader.
If you’re new to Snowplow and want to understand what the different apps do, the documentation pages are a good place to start.
New configuration format
This release brings a breaking change to the configuration format for all applications. Rather than a self-describing JSON, the apps now expect their configuration as a HOCON file.
See the setup guide and upgrade guide for more information.
New load_tstamp column
We’ve added a much-requested change by introducing a load_tstamp field to all events loaded into BigQuery. This timestamp represents the time when the data arrived in the warehouse and can be used for incremental processing of new data in data modeling.
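As a rough illustration of incremental processing, the sketch below builds a query that selects only rows whose load_tstamp is later than a stored watermark, then advances that watermark. The table name and both function names are hypothetical, and `@last_load_tstamp` is a BigQuery named query parameter that you would bind when running the query with your client of choice.

```python
from datetime import datetime, timezone
from typing import Iterable


def incremental_query(table: str) -> str:
    """Build a query selecting only rows loaded after the watermark.

    `table` is a hypothetical fully qualified table name; the value of
    @last_load_tstamp must be supplied as a query parameter at run time.
    """
    return (
        f"SELECT * FROM `{table}` "
        "WHERE load_tstamp > @last_load_tstamp "
        "ORDER BY load_tstamp"
    )


def next_watermark(previous: datetime, loaded: Iterable[datetime]) -> datetime:
    """Advance the watermark to the newest load_tstamp seen, never backwards."""
    return max([previous, *loaded])


# Example: the watermark moves forward to the latest load_tstamp in the batch.
start = datetime(2021, 1, 1, tzinfo=timezone.utc)
batch = [datetime(2021, 1, 2, tzinfo=timezone.utc),
         datetime(2021, 1, 3, tzinfo=timezone.utc)]
watermark = next_watermark(start, batch)
```

Rows loaded by versions before 1.0.0 have a null load_tstamp, so a filter like this one skips them; they would need to be handled separately, for example in an initial full run.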
This change is backwards compatible. If you downgrade to 0.6.4, the load_tstamp column will remain in your table but any data loaded will have a null value for it.
The new column is created by Mutator automatically on startup. It can occasionally take some time for it to become visible to all workers trying to write to the table. For this reason, we recommend that you upgrade Mutator first, before you upgrade the loader app (regardless of whether you’re using Loader or StreamLoader).
Mutator can now create partitioned tables
You can now use Mutator’s create command to set up partitioned BigQuery tables by specifying a partition column and optionally enforcing a partition filter on all queries.
See the Mutator documentation for details.
Handling of unallowed characters in BigQuery column names
We’ve integrated a change from our schema-ddl library, which improved handling of invalid field names, and in particular fields that start with a numeric character.
This fixes a known problem when trying to load an openweathermap event.
Now, if your schema contains a field called 1h, it will be loaded with the name _1h, whereas previously it would not be loaded at all.
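The renaming rule can be sketched as below. This is not the schema-ddl implementation (that library is written in Scala); the function name is hypothetical, and replacing other disallowed characters with underscores is an assumption based on BigQuery's basic rule that column names start with a letter or underscore and contain only letters, digits and underscores.

```python
import re


def normalize_field_name(name: str) -> str:
    """Turn a schema field name into a BigQuery-friendly column name."""
    # Assumption: any character outside [A-Za-z0-9_] becomes an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Names starting with a digit get an underscore prefix, as in 1h -> _1h.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


renamed = normalize_field_name("1h")
```

Here `renamed` is `_1h`, matching the behaviour described above, while a name such as `temp` passes through unchanged.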
Metrics
StreamLoader and Repeater emit metrics using the StatsD protocol. The available metrics are:
- number of events loaded into BigQuery by StreamLoader (good)
- number of failed events (bad)
- number of failed inserts (failed_inserts)
- number of events that Repeater could not ultimately load into BigQuery (uninsertable)
- max time elapsed between collector_tstamp and now(), measured when StreamLoader receives a response to its insert request from BigQuery (latency)
To see what these look like, you can start StreamLoader or Repeater locally with monitoring settings in the config along the lines of:
"monitoring": {
  "statsd": {
    "hostname": "localhost"
    "port": 1024
    "tags": {}
    "period": "5 sec"
    "prefix": "snowplow.monitoring"
  }
}
In a separate tab, run Netcat to listen to UDP traffic:
$ nc -z -v -u localhost 1024   # connect to port
$ nc -l -u 1024                # listen on port
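If you would rather see the metrics parsed than as raw datagrams, a small UDP listener is one alternative to Netcat. The sketch below assumes the apps send standard StatsD lines of the form `<name>:<value>|<type>`, optionally followed by DogStatsD-style tags (`|#key:value,...`); the exact wire format is an assumption, and `parse_statsd_line` and `listen` are hypothetical helper names.

```python
import socket


def parse_statsd_line(line: str) -> dict:
    """Parse one '<name>:<value>|<type>[|#tag:val,...]' StatsD line."""
    name, _, rest = line.strip().partition(":")
    parts = rest.split("|")
    value = float(parts[0])
    metric_type = parts[1] if len(parts) > 1 else ""
    tags = {}
    for part in parts[2:]:
        if part.startswith("#"):  # DogStatsD-style tag section (assumed)
            for tag in part[1:].split(","):
                key, _, val = tag.partition(":")
                if key:
                    tags[key] = val
    return {"name": name, "value": value, "type": metric_type, "tags": tags}


def listen(host: str = "localhost", port: int = 1024) -> None:
    """Print every metric received over UDP until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            data, _addr = sock.recvfrom(65535)
            for line in data.decode("utf-8", errors="replace").splitlines():
                if line:
                    print(parse_statsd_line(line))
```

Calling `listen()` with the hostname and port from the config above prints one dictionary per metric received; stop it with Ctrl+C.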
Other improvements and changes
Alongside small bugfixes and dependency bumps, we’ve also now started publishing arm64 and amd64 Docker images.
Forwarder, which was deprecated in 0.5.0, has now been completely removed.
For the full list of changes and jar files, see the release notes.
Thanks
Many thanks to Alex Fainshtein for contributing to this release.