We’re very excited to release version 3.0.0 of RDB Loader, our family of applications for loading Snowplow data into a data warehouse.
This release adds support for a new destination. Alongside Redshift, RDB Loader users will now also be able to load data into Snowflake.
Previously, Snowplow users who wanted to load into Snowflake could use our dedicated Snowplow Snowflake Loader or Snowflake’s Snowpipe.
Main improvements
Compared with the previously available ways of loading into Snowflake:
- You can now load data without using EMR.
- It’s easier than ever to get started using the Snowplow Open Source Quick Start Terraform modules.
- Monitoring and observability: the loader can raise alarms if batches fail to load or if your data warehouse is unhealthy.
- Reliability: improved retry and timeout logic.
In future, we’ll be adding new destinations to this framework.
New wide row format and shredder name change
Loading into a destination with RDB Loader is a two-step process. First, the enriched Snowplow events (in TSV format) are transformed into a format that is easier to load. Then, the transformed data is loaded into the warehouse.
Historically, we've called the transformation process 'shredding', and the application responsible for it the shredder. The name comes from the fact that contexts (entities) and unstructured (self-describing) events are split off from the core 'atomic' event and loaded into separate Redshift tables. This continues to be the case in version 3.0.0. However, for loading into Snowflake, we've introduced a new format for the transformed data: wide row. Unlike shredding, the wide row format preserves the data as a single line per event, with one column for each type of context and self-describing event.
This has ultimately prompted us to rename the 'shredder' assets: they are now called snowplow-transformer-batch and snowplow-transformer-kinesis, depending on which flavour you want to use.
How to start loading into Snowflake with RDB Loader
Make sure you configure your transformer to output the new wide row format by specifying it in the config.hocon file:

```
"formats": {
  "transformationType": "widerow"
}
```
Then, for the loader part you’ll need to:
- set up the necessary Snowflake resources
- prepare configuration files for the loader
- deploy the loader app.
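As a sketch, the loader's config.hocon will include a storage section targeting Snowflake. The field names below, other than type, are illustrative assumptions; check the configuration reference for the exact schema:

```
"storage": {
  # Selects Snowflake as the storage target
  "type": "snowflake",
  # Assumed fields -- placeholder values, consult the reference config
  "account": "acme",
  "warehouse": "snowplow_wh",
  "database": "snowplow_db",
  "schema": "atomic"
}
```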
We are releasing several Terraform modules that you can use to create the required Snowflake resources and deploy the loader on EC2:
- terraform-snowflake-target module
- terraform-aws-snowflake-loader-setup module
- terraform-aws-snowflake-loader-ec2 module
You also have the option to create the resources manually and deploy the app from its Docker image or jar file, for example:

```
$ docker run snowplow/rdb-loader-snowflake:3.0.0 \
  --iglu-config $RESOLVER_BASE64 \
  --config $CONFIG_BASE64
```

Here, $RESOLVER_BASE64 and $CONFIG_BASE64 hold the base64-encoded contents of your Iglu resolver and loader configuration files.
Migrating from Snowplow Snowflake Loader
If you have been using the old Snowplow Snowflake Loader (snowplow-snowflake-loader) to load into Snowflake, you can migrate to rdb-loader-snowflake instead. In future, we will only support rdb-loader-snowflake.
You won't need to create the Snowflake resources from scratch. You can continue with your existing setup. Just point rdb-loader-snowflake at your existing events table and it will be able to load into it without issues. You can delete the DynamoDB table used by snowplow-snowflake-loader; it is not required by rdb-loader-snowflake.
One difference to keep in mind is that you do not need to configure credentials for rdb-loader-snowflake via its config.hocon file. Instead, you create an external stage, using the authentication method you prefer, and then provide the stage's name in the configuration file.
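For instance, here is a sketch of how the stage name might appear in config.hocon. The field name transformedStage is an assumption, and the stage name is a placeholder; verify both against the documentation:

```
"storage": {
  "type": "snowflake",
  # Hypothetical field holding the name of the external stage
  # created in Snowflake for the transformed data
  "transformedStage": "snowplow_stage"
}
```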
There's one more thing to note if you want to use the folder monitoring option in rdb-loader-snowflake. With this option enabled, the loader checks the transformed data archive against its own load manifest. This lets you discover problematic folders in the archive: folders that have not been loaded, or for which there was an issue during loading. When the check covers the full archive, all folders from before you switched to the new loader will be flagged as problematic. To avoid this, use the since option in the config.hocon file to limit the check to recent folders.
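As a sketch, the since option sits under the folder monitoring section of the config; the duration format shown is an assumption:

```
"monitoring": {
  "folders": {
    # Only consider folders newer than this when checking the archive
    "since": "14 days"
  }
}
```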
Notes for Redshift users
Redshift loading is affected by these changes in the following ways:
- The shredder artefacts now have new names.
- The config.hocon file for the transformer should specify the transformation type:

```
"formats": {
  "transformationType": "shred"
}
```

- The config.hocon file for the loader should specify the storage target type:

```
"storage": {
  "type": "redshift"
}
```

  This setting is optional, with "redshift" being the default, but take care not to override it by mistake.
- If you are using the folder monitoring options, the setting monitoring.folders.shredderOutput will have to be renamed to monitoring.folders.transformerOutput.
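For example, the rename amounts to the following in config.hocon (the bucket path is a placeholder):

```
"monitoring": {
  "folders": {
    # previously: "shredderOutput"
    "transformerOutput": "s3://my-bucket/transformed/"
  }
}
```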
Further resources
For more information about how to set up and run the RDB Loader applications, refer to the documentation.
For the full changes in 3.0.0, refer to the release notes: