We’re very excited to release version 3.0.0 of RDB Loader, our family of applications for loading Snowplow data into a data warehouse.
This release adds support for a new destination. Alongside Redshift, RDB Loader users will now also be able to load data into Snowflake.
Previously, Snowplow users who wanted to load into Snowflake could use our dedicated Snowplow Snowflake Loader or Snowflake’s Snowpipe.
Main improvements
Compared with the previously available ways of loading into Snowflake:
- You can now load data without using EMR.
- It’s easier than ever to get started using the Snowplow Open Source Quick Start Terraform modules.
- Monitoring and observability: the loader can raise alarms if batches fail to load or if your data warehouse is unhealthy.
- Reliability: improved retry and timeout logic.
In future, we’ll be adding new destinations to this framework.
New wide row format and shredder name change
Loading into a destination with RDB Loader is a two-step process. First, the enriched Snowplow events (in TSV format) are transformed into a format that is easier to load. Then, the transformed data is loaded into the warehouse.
Historically, we've called the transformation process 'shredding', and the application responsible for it the shredder. The name comes from the fact that contexts (entities) and unstructured (self-describing) events are split off from the core 'atomic' event and loaded into separate Redshift tables. This continues to be the case in version 3.0.0. However, for loading into Snowflake, we've introduced a new format for the transformed data: wide row. Unlike shredding, the wide row format preserves the data as a single line per event, with one column for each type of context and self-describing event.
This has ultimately prompted us to rename the 'shredder' assets: they are now called snowplow-transformer-batch and snowplow-transformer-kinesis, depending on which flavour you want to use.
How to start loading into Snowflake with RDB Loader
Make sure you configure your transformer to output the new wide row format by specifying it in the config.hocon file:

```
"formats": {
  "transformationType": "widerow"
}
```
Then, for the loader part you’ll need to:
- set up the necessary Snowflake resources
- prepare configuration files for the loader
- deploy the loader app.
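As a sketch, the loader's config.hocon will include a storage section targeting Snowflake. The field names below, other than type, are illustrative assumptions; check the configuration reference for the exact schema:

```
"storage": {
  # Selects Snowflake as the storage target
  "type": "snowflake",
  # Assumed fields -- placeholder values, consult the reference config
  "account": "acme",
  "warehouse": "snowplow_wh",
  "database": "snowplow_db",
  "schema": "atomic"
}
```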
We are releasing several Terraform modules that you can use to create the required Snowflake resources and deploy the loader on EC2:
- terraform-snowflake-target module
- terraform-aws-snowflake-loader-setup module
- terraform-aws-snowflake-loader-ec2 module
You also have the option to create the resources manually and deploy the app from its Docker image or jar file, for example:

```
$ docker run snowplow/rdb-loader-snowflake:3.0.0 \
  --iglu-config $RESOLVER_BASE64 \
  --config $CONFIG_BASE64
```

Here, $RESOLVER_BASE64 and $CONFIG_BASE64 hold the base64-encoded contents of your Iglu resolver and loader configuration files.
Migrating from Snowplow Snowflake Loader
If you have been using the old Snowplow Snowflake Loader (snowplow-snowflake-loader) to load into Snowflake, you can migrate to rdb-loader-snowflake instead. In future, we will only support rdb-loader-snowflake.
You won't need to create the Snowflake resources from scratch. You can continue with your existing setup. Just point rdb-loader-snowflake at your existing events table and it will be able to load into it without issues. You can delete the DynamoDB table used by snowplow-snowflake-loader; it is not required by rdb-loader-snowflake.
One difference to keep in mind is that you do not need to configure credentials for rdb-loader-snowflake via its config.hocon file. Instead, you create an external stage, using the authentication method you prefer, and then provide the stage's name in the configuration file.
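For instance, here is a sketch of how the stage name might appear in config.hocon. The field name transformedStage is an assumption, and the stage name is a placeholder; verify both against the documentation:

```
"storage": {
  "type": "snowflake",
  # Hypothetical field holding the name of the external stage
  # created in Snowflake for the transformed data
  "transformedStage": "snowplow_stage"
}
```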
There's one more thing to note if you want to use the folder monitoring option in rdb-loader-snowflake. With this option enabled, the loader checks the transformed data archive against its own load manifest. This lets you discover problematic folders in the archive: folders that have not been loaded, or for which there was an issue during loading. When the check covers the full archive, all folders from before you switched to the new loader will be flagged as problematic. To avoid this, use the since option in the config.hocon file to limit the check to recent folders.
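As a sketch, the since option sits under the folder monitoring section of the config; the duration format shown is an assumption:

```
"monitoring": {
  "folders": {
    # Only consider folders newer than this when checking the archive
    "since": "14 days"
  }
}
```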
Notes for Redshift users
Redshift loading is affected by these changes in the following ways:
- The shredder artefacts now have new names.
- The config.hocon file for the transformer should specify the transformation type:

```
"formats": {
  "transformationType": "shred"
}
```

- The config.hocon file for the loader should specify the storage target type:

```
"storage": {
  "type": "redshift"
}
```

  This setting is optional, with "redshift" being the default, but take care not to override it by mistake.
- If you are using the folder monitoring options, the setting monitoring.folders.shredderOutput will have to be renamed to monitoring.folders.transformerOutput.
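For example, the rename amounts to the following in config.hocon (the bucket path is a placeholder):

```
"monitoring": {
  "folders": {
    # previously: "shredderOutput"
    "transformerOutput": "s3://my-bucket/transformed/"
  }
}
```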
Further resources
For more information about how to set up and run the RDB Loader applications, refer to the documentation.
For the full changes in 3.0.0, refer to the release notes: