RDB Loader 5.2.0 released

We’re pleased to announce the release of RDB Loader version 5.2.0.

This release brings Parquet support to Transformer Pubsub, along with several new features and improvements across the RDB Loader applications.

Scheduler for running ‘optimize’ command on Databricks Loader

Loader applications use a manifest table to keep track of the folders loaded so far. However, we’ve found that frequent manifest table updates result in a growing number of files backing this table in Databricks, which severely degrades loading performance. A similar problem affects the events table too.

The remedy for the issue is to run an OPTIMIZE command on the table, compacting all updates into a small number of files.

In order to make this process smoother for users, we’ve added a scheduler that runs the OPTIMIZE command regularly, according to the given CRON statement.

The OPTIMIZE scheduler can be configured like so:

  "schedules": {
    # Run the optimize command on the events table every day at 00:00 (JVM timezone)
    "optimizeEvents": "0 0 0 ? * *",
    # Run the optimize command on the manifest table every day at 05:00 (JVM timezone)
    "optimizeManifest": "0 0 5 ? * *"
  }

Databricks Loader uses the above values by default. To disable these schedulers completely, set them to null:

  "schedules": {
    "optimizeEvents": null,
    "optimizeManifest": null
  }

Note: This feature requires the collector_tstamp_date generated column in the events table. We recommend disabling the feature if your events table doesn’t have this column. If the feature is enabled and the collector_tstamp_date column doesn’t exist, you might see some errors in the application logs. However, those errors shouldn’t interfere with the normal functioning of the application.
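To make the schedule strings above easier to read, here is a minimal Python sketch that splits such an expression into named fields and reports the time of day for a once-a-day schedule. It assumes a six-field, Quartz-style layout with seconds first, which is consistent with the comments in the config example but is not spelled out in this announcement. The helper names parse_cron and daily_run_time are made up for illustration and are not part of RDB Loader.

```python
# Field order assumed here: a six-field, Quartz-style layout with seconds first.
CRON_FIELDS = ("second", "minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression: str) -> dict:
    """Split a six-field CRON expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def daily_run_time(expression: str) -> str:
    """Return HH:MM for a schedule that fires once a day, else raise ValueError."""
    fields = parse_cron(expression)
    fixed = (fields["second"], fields["minute"], fields["hour"])
    wildcards = (fields["day_of_month"], fields["month"], fields["day_of_week"])
    if not all(value.isdigit() for value in fixed):
        raise ValueError("second, minute and hour must be plain numbers")
    if not all(value in ("*", "?") for value in wildcards):
        raise ValueError("date fields must be wildcards for a daily schedule")
    return f"{int(fields['hour']):02d}:{int(fields['minute']):02d}"


if __name__ == "__main__":
    print(daily_run_time("0 0 0 ? * *"))  # default optimizeEvents schedule
    print(daily_run_time("0 0 5 ? * *"))  # default optimizeManifest schedule
```

The sketch only handles the simple "fixed time, every day" shape used by the defaults. Ranges, lists and step values would need a real CRON library.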

Parquet support in Transformer Pubsub

In this release, Transformer Pubsub comes with the ability to output in Parquet format. You need to add the following section to the Transformer Pubsub config to enable Parquet output:

"formats": {
  "fileFormat": "parquet"
}

You can find the configuration reference to prepare the configuration file and instructions to deploy the application in the docs.

New authorization method in Redshift Loader

In version 4.1.0, we introduced a new authorization method in Snowflake Loader and Databricks Loader. This release adds the same method to Redshift Loader.

This method lets the loader generate temporary credentials using STS and pass them to Redshift. This removes the need to pre-configure the warehouse with access permissions.

To start using the new authorization method, you must add a loadAuthMethod to the storage block in your config file:

"storage": {
  // other required fields go here

  "loadAuthMethod": {
    "type": "TempCreds"
    "roleArn": "arn:aws:iam::123456789:role/example_role_name"
  }
}

…where roleArn is a role with permission to read files from the S3 bucket. The loader must have permission to assume this role. More information about authorization options can be found in the docs.
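As a rough illustration of what "passing temporary credentials to Redshift" can look like, the Python sketch below takes a response shaped like the one returned by the STS AssumeRole API (a Credentials object with AccessKeyId, SecretAccessKey and SessionToken) and builds a Redshift COPY statement that uses the ACCESS_KEY_ID, SECRET_ACCESS_KEY and SESSION_TOKEN parameters. The exact statement the loader issues is not shown in this announcement, so treat this as an assumption. The TempCreds class, the function names, the table name and the bucket path are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TempCreds:
    """Hypothetical container for short-lived AWS credentials."""
    access_key_id: str
    secret_access_key: str
    session_token: str


def creds_from_assume_role(response: dict) -> TempCreds:
    """Extract credentials from a dict shaped like an STS AssumeRole response."""
    creds = response["Credentials"]
    return TempCreds(
        access_key_id=creds["AccessKeyId"],
        secret_access_key=creds["SecretAccessKey"],
        session_token=creds["SessionToken"],
    )


def copy_statement(table: str, s3_path: str, creds: TempCreds) -> str:
    """Build a COPY statement that authorizes with temporary credentials.

    Illustration only: values are interpolated without any escaping.
    """
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"ACCESS_KEY_ID '{creds.access_key_id}' "
        f"SECRET_ACCESS_KEY '{creds.secret_access_key}' "
        f"SESSION_TOKEN '{creds.session_token}'"
    )


if __name__ == "__main__":
    # Stand-in for a real AssumeRole call made with the configured roleArn.
    fake_response = {
        "Credentials": {
            "AccessKeyId": "EXAMPLE_KEY_ID",
            "SecretAccessKey": "EXAMPLE_SECRET",
            "SessionToken": "EXAMPLE_TOKEN",
        }
    }
    creds = creds_from_assume_role(fake_response)
    print(copy_statement("atomic.events", "s3://example-bucket/run=1/", creds))
```

Because the credentials are issued per load and expire on their own, nothing long-lived has to be stored in the warehouse, which is the benefit described above.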

Other improvements

  • We’ve introduced additional caching for the flattening operation used during shred transformation in the Transformer applications, which improves performance. Related GitHub issue.

  • We’ve changed the way the loader produces the minimum_age_of_loaded_data metric to make it more accurate. Related GitHub issue.

  • In previous versions of Snowflake Loader, if you were using Snowflake stages, you needed to provide the stage’s path in the config as well. In this version, the path is populated automatically, so you no longer have to provide it manually.

  • Sentry integration has been added to Transformer Kinesis and Transformer Pubsub.

Upgrading to 5.2.0

If you are already using a recent version of RDB Loader (3.0.0 or higher), upgrading to 5.2.0 is as simple as pulling the newest Docker images. No changes to your configuration files are needed.

docker pull snowplow/transformer-pubsub:5.2.0
docker pull snowplow/transformer-kinesis:5.2.0
docker pull snowplow/rdb-loader-redshift:5.2.0
docker pull snowplow/rdb-loader-snowflake:5.2.0
docker pull snowplow/rdb-loader-databricks:5.2.0

The Snowplow docs site has a full guide to running the RDB Loader.
