RDB Loader 5.4.0 released

We’re pleased to announce the release of RDB Loader version 5.4.0.

This release brings a few features that improve the stability and observability of RDB Loader.

Kinesis/PubSub sink for bad data

Transformation of enriched events to the desired output format can fail (e.g. when Iglu servers are unavailable and the schemas referenced by events therefore cannot be resolved).
When transformation fails, the transformer creates bad data. This bad data conforms to one of the generic JSON schemas designed for the bad rows generated by all Snowplow loaders.

Before version 5.4.0, the transformer would write unpartitioned bad data directly to S3 (on AWS) or GCS (on GCP).
Starting from version 5.4.0, as with other Snowplow pipeline components such as Enrich, it is possible to configure a Kinesis or PubSub output for bad data.
This approach makes loading bad data to blob storage more flexible, allowing you to use the features of s3-loader or gcs-loader, such as partitioning by bad row type.
It also increases visibility of errors produced by the transformer, which makes troubleshooting much easier.

To preserve compatibility, blob storage output is used by default for the transformer’s bad data. To enable Kinesis/PubSub output, see the config reference in the docs or the samples in the repository.

Bug fix for Databricks schema evolution

This bug could affect you if you transform events to Parquet format (e.g. for loading to Databricks) and you have a schema in which a field evolved from required to optional.

The bug was introduced in version 5.3.0: events using such a schema would go to the failed events folder (success=bad). Starting from version 5.4.0, events using this type of schema are correctly transformed and loaded to the warehouse using a nullable field.
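To make the required-to-optional case concrete, here is a minimal Python sketch of how a column's nullability could be derived from several versions of a schema. It is an illustration only, not the transformer's actual code: the function name column_nullability and the simplified schema dictionaries are assumptions made for this example.

```python
def column_nullability(schema_versions):
    """Return {field_name: is_nullable} merged across schema versions.

    Hypothetical helper: a column is non-nullable only if the field is
    listed as required in every version of the schema.
    """
    # Collect every field seen in any version, preserving first-seen order.
    all_fields = []
    for version in schema_versions:
        for field in version["properties"]:
            if field not in all_fields:
                all_fields.append(field)

    nullable = {}
    for field in all_fields:
        required_everywhere = all(
            field in version.get("required", []) for version in schema_versions
        )
        nullable[field] = not required_everywhere
    return nullable


# "name" is required in the first version and optional in the second.
v1_0_0 = {"properties": {"id": {}, "name": {}}, "required": ["id", "name"]}
v1_0_1 = {"properties": {"id": {}, "name": {}}, "required": ["id"]}

merged = column_nullability([v1_0_0, v1_0_1])
print(merged)
```

In this sketch, "name" ends up nullable because one version allows it to be absent, while "id" stays non-nullable.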

Configurable temporary credentials session duration

This may affect you if you explicitly set the TempCreds value for the storage.loadAuthMethod.type option in the loader’s (Redshift, Databricks, Snowflake) configuration file.

The RDB Loader 5.3.2 release brought some improvements to the management of temporary credentials, which are used by all Loaders to access transformed input data stored in a configured S3 bucket.

Before version 5.4.0, the requested session duration (the period of time during which fetched credentials are valid) was equal to the value of the configured timeout for loading data and folder monitoring.

Version 5.4.0 adds one more improvement: a new storage.loadAuthMethod.credentialsTtl option (with a default value of 1 hour), which makes the session duration fully configurable and independent of timeout settings.
The minimum value for the new setting is 15 minutes and the maximum is 1 hour (restrictions imposed by AWS role chaining).
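The bounds on the session duration can be sketched in Python as follows. This is an illustration of the rule described above, not the loader's implementation: the function name resolve_credentials_ttl and the choice to raise an error on an out-of-range value are assumptions made for this example.

```python
from datetime import timedelta

# Limits and default as stated in the release notes.
MIN_TTL = timedelta(minutes=15)
MAX_TTL = timedelta(hours=1)
DEFAULT_TTL = timedelta(hours=1)


def resolve_credentials_ttl(configured=None):
    """Return the session duration to request for temporary credentials.

    Hypothetical helper: falls back to the default when nothing is
    configured and rejects values outside the allowed range.
    """
    if configured is None:
        return DEFAULT_TTL
    if not (MIN_TTL <= configured <= MAX_TTL):
        raise ValueError(
            f"credentialsTtl must be between {MIN_TTL} and {MAX_TTL}, got {configured}"
        )
    return configured


print(resolve_credentials_ttl())                       # default: 1 hour
print(resolve_credentials_ttl(timedelta(minutes=30)))  # explicit: 30 minutes
```

The point of the sketch is that the duration no longer depends on any timeout setting: it comes only from the configured value or the default.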

Metrics improvements

Before the 5.4.0 release, RDB Loader would report metrics (e.g. in StatsD format) containing details about the number of successfully processed events. There were no details about the number of bad rows produced by transformation.

Starting from version 5.4.0, RDB Loader receives the number of bad rows from the transformer and reports it alongside the number of successfully processed events.
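As a rough illustration of what reporting both counts in StatsD format could look like, here is a small Python sketch. The metric names, the prefix, and the use of the counter type are assumptions made for this example; check the docs for the names RDB Loader actually emits.

```python
def statsd_lines(good_count, bad_count, prefix="snowplow.rdbloader"):
    """Format good and bad event counts as StatsD datagram lines.

    Hypothetical helper: uses the plain StatsD "name:value|type" layout
    with the counter type ("c"). Metric names are illustrative.
    """
    return [
        f"{prefix}.count_good:{good_count}|c",
        f"{prefix}.count_bad:{bad_count}|c",
    ]


for line in statsd_lines(good_count=9950, bad_count=50):
    print(line)
```

Having both numbers side by side makes it possible to alert on the share of bad rows rather than only on throughput.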

Other improvements

  • Improved the Spark Parquet transformer with an alternative way of writing files to blob storage. It no longer caches the Spark RDD, which could lead to unexpected performance issues.
  • Added an alternative ‘ready check’ (a SQL query verifying that the warehouse is ready to process the actual data-loading queries) in the Snowflake Loader, which doesn’t need the operate permission on the warehouse.
  • Added an experimental config option to adjust the output size of Parquet files in the streaming transformer.

Full changelog available here.

Upgrading to 5.4.0

If you are already using a recent version of RDB Loader (3.0.0 or higher), then upgrading to 5.4.0 is as simple as pulling the newest Docker images. No changes to your configuration files are needed.

docker pull snowplow/transformer-pubsub:5.4.0
docker pull snowplow/transformer-kinesis:5.4.0
docker pull snowplow/rdb-loader-redshift:5.4.0
docker pull snowplow/rdb-loader-snowflake:5.4.0
docker pull snowplow/rdb-loader-databricks:5.4.0

The Snowplow docs site has a full guide for running the RDB Loader.