We’re pleased to announce the release of RDB Loader version 4.2.0.
RDB Loader is Snowplow’s unified framework for batch loading Snowplow events into your data warehouse. Currently it supports Redshift, Snowflake, and Databricks, and we have ambitious plans to increase the scope of this project even further over the coming months.
New authorization options for Snowflake and Databricks
Up until version 4.1.0, RDB Loader required that the warehouse had been pre-configured to have read-access to the data files in S3.
For Snowflake, this meant setting up an external stage with a storage integration.
For Databricks, it meant setting up a cluster to assume an AWS instance profile.
Starting with version 4.2.0, RDB Loader is able to generate temporary credentials using STS and pass these credentials to Snowflake/Databricks. This removes the need to pre-configure the warehouse with access permission.
To start using the new authorization method, you must add a loadAuthMethod to the storage block in your config file:
"storage": {
// other required fields go here
"loadAuthMethod": {
"type": "TempCreds"
"roleArn": "arn:aws:iam::123456789:role/example_role_name"
}
}
…where roleArn is the ARN of a role with permission to read files from the S3 bucket. The loader must have permission to assume this role.
Our GitHub repo has some examples of this configuration for Snowflake and for Databricks.
Note that for Snowflake loading, depending on your event volume and warehouse configuration, there may still be an advantage to setting up the storage integration, because the underlying COPY INTO statement is more efficient.
For Databricks loading, though, there should be no impact of changing to the new authorization method.
Specifying file format in the load statement in Snowflake Loader
Previously, Snowflake Loader relied on the file format defined on the Snowflake stage. In this release, the load statement itself sets the file format, based on the file format given in the SQS message.
Thanks to this feature, a single Snowflake Loader instance can load both Parquet and JSON formatted data without any change to the Snowflake stage.
No change is needed in the config to enable this feature.
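As a rough illustration of choosing the file format per load rather than per stage, the sketch below maps a format name to a Snowflake FILE_FORMAT clause. The function name and the way the format string is passed in are assumptions made for this example. The loader's own handling of the SQS message is not shown here.

```python
def file_format_clause(message_format: str) -> str:
    """Return a Snowflake FILE_FORMAT clause for the format named in a message.

    `message_format` stands in for whatever format value the SQS message carries;
    how that value is encoded in the message is an assumption of this sketch.
    """
    supported = {"JSON", "PARQUET"}
    key = message_format.strip().upper()
    if key not in supported:
        raise ValueError(f"Unsupported file format: {message_format!r}")
    return f"FILE_FORMAT = (TYPE = {key})"
```

Appending the returned clause to each COPY INTO statement means the stage no longer has to declare a format, so one stage can serve both kinds of data.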
Adjusting the path appended to Snowflake stage in Snowflake Loader
Although we’ve added a new authorization option to Snowflake Loader in this release, it is still possible to set up authorization via an external stage with a storage integration.
Previously, the Snowflake stage path needed to be exactly the path where the transformed run folders reside. If the stage pointed at a parent folder instead, loading didn’t work.
We’ve fixed this issue in this release. Even if the stage path is set to a parent directory of the transformed folder, loading now works correctly.
To use this feature, you need to update the transformedStage and folderMonitoringStage blocks:
"transformedStage": {
# The name of the stage
"name": "snowplow_stage"
# The S3 path used as stage location
"location": "s3://bucket/transformed/"
}
"folderMonitoringStage": {
# The name of the stage
"name": "snowplow_folders_stage"
# The S3 path used as stage location
"location": "s3://bucket/monitoring/"
}
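The path handling described in this section can be sketched as follows: given the stage location from the config and the full S3 path of a run folder, compute the relative path to append to the stage name. This is a simplified illustration rather than the loader's implementation. The function name and the example run folder name are hypothetical.

```python
def path_in_stage(stage_location: str, run_folder: str) -> str:
    """Return the part of `run_folder` that lies below `stage_location`.

    Both arguments are S3 URIs; trailing slashes are normalised first.
    """
    stage = stage_location.rstrip("/") + "/"
    folder = run_folder.rstrip("/") + "/"
    if not folder.startswith(stage):
        raise ValueError(f"{run_folder} is not inside stage location {stage_location}")
    return folder[len(stage):]


# Stage points at a parent directory of the transformed folder:
# the loader has to append "transformed/run=.../" to the stage name.
relative = path_in_stage("s3://bucket/", "s3://bucket/transformed/run=2022-01-01-00-00-00/")
```

When the stage location is exactly the transformed folder, the same function returns only the run folder name, which matches the behaviour that was already supported before this release.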
Retry on target initialization
The target initialization block is now wrapped in a retry block. If initialization throws an exception, the application no longer crashes; instead, initialization is retried according to the specified backoff strategy.
The possible values for the backoff strategy can be found in the configuration reference.
To enable this feature, initRetries must be added to the config file:
"initRetries": {
"backoff": "30 seconds"
"strategy": "EXPONENTIAL"
"attempts": 3,
"cumulativeBound": "1 hour"
},
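To show how settings like these could play out, here is a minimal Python sketch of retrying an initialization step with exponential backoff. It is written under assumptions, not taken from the loader: it treats attempts as the number of retries after the first try, doubles the delay each time, and stops scheduling retries once the summed delays would exceed the cumulative bound. Check the configuration reference for the loader's exact semantics. All function names are hypothetical.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def exponential_delays(backoff: float, attempts: int, cumulative_bound: float) -> List[float]:
    """Delays (in seconds) before each retry: backoff, 2*backoff, 4*backoff, ..."""
    delays: List[float] = []
    total = 0.0
    for i in range(attempts):
        delay = backoff * (2 ** i)
        if total + delay > cumulative_bound:
            break  # stay within the overall time budget
        delays.append(delay)
        total += delay
    return delays


def init_with_retries(init: Callable[[], T], delays: List[float],
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `init`, retrying after each delay; the final failure propagates."""
    for delay in delays:
        try:
            return init()
        except Exception:
            sleep(delay)
    return init()  # last attempt: any exception reaches the caller


# Values corresponding to the config above: 30 seconds, 3 attempts, 1 hour bound.
schedule = exponential_delays(30, 3, 3600)
```

Under these assumptions the schedule is 30, 60, and 120 seconds, well inside the one-hour bound, so a warehouse that is briefly unavailable at startup gets a few chances to respond before the loader gives up.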
Bug fix for streaming transformer on multiple instances
In the previous release of RDB Loader we announced that the streaming transformer can now scale to multiple instances, which was a really important requirement for high volume pipelines.
We got one little thing wrong though, and it led to some app crashes with error messages about lost Kinesis leases. This bug is now fixed in version 4.2.0, and we hope this unblocks your pipeline to scale to higher event volumes with the streaming transformer.
Upgrading to 4.2.0
If you are already using a recent version of RDB Loader (3.0.0 or higher), then upgrading to 4.2.0 is as simple as pulling the newest Docker images. There are no changes needed to your configuration files.
docker pull snowplow/transformer-kinesis:4.2.0
docker pull snowplow/rdb-loader-redshift:4.2.0
docker pull snowplow/rdb-loader-snowflake:4.2.0
docker pull snowplow/rdb-loader-databricks:4.2.0
The Snowplow docs site has a full guide to running the RDB Loader.