RDB Loader 5.3.1 released (with an important bug fix for Snowflake Loader)

We’re pleased to announce we’ve released RDB Loader version 5.3.1.

Version 5.3.0 introduced a bug in Snowflake Loader that prevents it from copying contexts and unstruct events to the events table. Version 5.3.1 fixes this problem. Thanks to mgkeen for reporting the issue. See the section below to find out how to recover the missing data.

Recovery Instructions

In the RDB Loader framework, the transformer prepares batches of events for the loader to ingest into Snowflake. Each batch of events resides in what we call a “run folder”: a folder on S3 or GCS.

Note: The process explained below creates new rows in the atomic events table, in which the previously missing contexts and unstruct events are present in their respective columns. However, it does not delete the wrongly loaded rows in which those columns are empty. As a result, some data will be duplicated. You may wish to take additional steps to address that.

  1. Upgrade both the transformer and the Snowflake Loader to 5.3.1.

  2. Find the affected run folders. If your loader and transformer were running the same version, you can find the affected run folders with the following query:

select base from atomic.manifest where processor_version = '5.3.0';

If you didn’t upgrade the transformer to 5.3.0 together with your loader, find the affected run folders by ingestion timestamp instead. Replace the example timestamps below with the period during which your loader was running 5.3.0:

select base from atomic.manifest where ingestion_tstamp >= '2023-01-06 15:20:54.069' and ingestion_tstamp <= '2023-01-24 17:45:58.069';

Save the output of this query to a file. You will need it in a later step.

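  3. Delete the rows of the affected run folders from the manifest table. Use the same filter you used in step 2; the example below assumes the processor_version filter: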

delete from atomic.manifest where processor_version = '5.3.0';

  4. Resend the “shredding complete” messages for those run folders to the loader’s SQS queue. You can use the helper script we created for this purpose. The script takes two arguments:
  • the path of the file where you saved your run folders in step 2, and
  • the SQS queue URL:

resend-shredding-complete.sh /path/to/run/folders https://sqs.eu-west-1.amazonaws.com/123456789/loader-sqs-queue.fifo

To avoid overloading the loader, you can split the list of run folders into smaller chunks and pass those chunks to the script one at a time.
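The chunking idea could be sketched roughly as follows. This is only an illustration, not part of RDB Loader: the function name resend_in_chunks and the CHUNK_SIZE, PAUSE_SECONDS and RESEND_CMD variables are made up for this example, and the pause between chunks is an assumption about what "not overloading" might mean in practice. It relies only on the two arguments of the helper script described above and on the standard split utility.

```shell
# Sketch: feed run folders to the helper script in small chunks.
# All names below (resend_in_chunks, CHUNK_SIZE, PAUSE_SECONDS, RESEND_CMD)
# are hypothetical and exist only for this example.
resend_in_chunks() {
  local folders_file="$1"   # file saved in step 2, one run folder per line
  local queue_url="$2"      # SQS queue URL of the loader
  local chunk_size="${CHUNK_SIZE:-10}"       # run folders per chunk (assumed default)
  local pause="${PAUSE_SECONDS:-60}"         # wait between chunks (assumed default)
  local resend_cmd="${RESEND_CMD:-resend-shredding-complete.sh}"

  local workdir
  workdir="$(mktemp -d)" || return 1

  # split writes chunk-aa, chunk-ab, ... each holding at most chunk_size lines
  split -l "$chunk_size" "$folders_file" "$workdir/chunk-" || return 1

  local chunk
  for chunk in "$workdir"/chunk-*; do
    # An empty input file produces no chunks; skip the unexpanded pattern
    [ -e "$chunk" ] || continue
    echo "Resending run folders listed in $chunk"
    # On failure, stop and leave the chunk files in place for inspection
    "$resend_cmd" "$chunk" "$queue_url" || return 1
    sleep "$pause"
  done

  rm -rf "$workdir"
}
```

After defining the function in your shell, you would call it with the same two arguments as the helper script, for example: resend_in_chunks /path/to/run/folders https://sqs.eu-west-1.amazonaws.com/123456789/loader-sqs-queue.fifo. Tune the chunk size and pause to what your loader can comfortably handle.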

Separately, as of this version the Databricks Loader uses VARCHAR instead of CHAR for standard fields when it creates the events table (GitHub issue).

Upgrading to 5.3.1

If you are already using a recent version of RDB Loader (3.0.0 or higher), then upgrading to 5.3.1 is as simple as pulling the newest Docker images. No changes to your configuration files are needed.

docker pull snowplow/transformer-pubsub:5.3.1
docker pull snowplow/transformer-kinesis:5.3.1
docker pull snowplow/rdb-loader-redshift:5.3.1
docker pull snowplow/rdb-loader-snowflake:5.3.1
docker pull snowplow/rdb-loader-databricks:5.3.1

The Snowplow docs site has a full guide to running the RDB Loader.