RDB Loader 5.0.0 (including GCP-supported Snowflake Loader)

We’re excited to announce the first GCP-supported applications in the RDB Loader family in version 5.0.0: Snowflake Loader and Transformer Pubsub!

Additionally, this release brings a few bug fixes to Databricks Loader and Transformer Kinesis.

GCP support on Snowflake Loader and Transformer Pubsub

From inception, RDB Loader applications were developed to run on AWS, and enabling loading from GCP has long been on our roadmap. In this release we pave the way by integrating GCP services into Snowflake Loader, making it possible to run it entirely on GCP. We have also developed Transformer Pubsub, the GCP counterpart of the transformer. With these additions, it is now possible to load Snowplow data from a GCP pipeline into Snowflake.

At the moment, Transformer Pubsub cannot output in Parquet format. Parquet output is also on our roadmap and will make Databricks Loader on GCP possible as well.

How to start loading into Snowflake on GCP

First, you will need to deploy Transformer Pubsub. A minimal configuration file for Transformer Pubsub looks like the following:

{
  # Name of the Pubsub subscription with the enriched events
  "input": {
    "subscription": "projects/project-id/subscriptions/subscription-id"
  }
  # Path to transformed archive
  "output": {
    "path": "gs://bucket/transformed/"
  }
  # Name of the Pubsub topic used to communicate with Loader
  "queue": {
    "topic": "projects/project-id/topics/topic-id"
  }
}

You can find the configuration reference for preparing the configuration file, as well as instructions for deploying the application, in the docs.
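
For illustration, deploying Transformer Pubsub as a Docker container might look like the sketch below. The image name snowplow/transformer-pubsub, the --config / --iglu-config flags (passed as base64-encoded strings), and the file names are assumptions based on how other apps in the RDB Loader family are typically run; refer to the docs for the exact invocation.

# Hypothetical invocation: the HOCON config and the Iglu resolver
# are passed as base64-encoded strings (file names are placeholders)
docker run snowplow/transformer-pubsub:5.0.0 \
  --config $(cat transformer.hocon | base64 -w 0) \
  --iglu-config $(cat resolver.json | base64 -w 0)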

Then, for the Snowflake Loader part you’ll need to:

  • set up the necessary Snowflake resources
  • prepare configuration files for the loader
  • deploy the Snowflake Loader app

The important bit in the Snowflake Loader config is that Pubsub should be used as the message queue:

  ...
  "messageQueue": {
    "type": "pubsub"
    "subscription": "projects/project-id/subscriptions/subscription-id"
  }
  ...

Full documentation for Snowflake Loader can be found here.
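
As a rough sketch, the loader can then be started the same way as the transformer above; again, the --config and --iglu-config flags and the file names are assumptions, so defer to the documentation for the exact command.

# Hypothetical invocation, mirroring the transformer command above
docker run snowplow/rdb-loader-snowflake:5.0.0 \
  --config $(cat loader.hocon | base64 -w 0) \
  --iglu-config $(cat resolver.json | base64 -w 0)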

Bug fixes on Databricks Loader and Transformer Kinesis

  • An issue was reported in Databricks Loader when loading a batch containing multiple Parquet files with different schemas, where an optional column exists only in some of the files. This issue is fixed in version 5.0.0. Thanks to drphrozen for reporting the issue and submitting a PR!

  • It was reported that Transformer Kinesis throws an exception when the Kinesis stream's shard count is increased. This issue is fixed in version 5.0.0. Thanks to sdbeans for reporting the issue!

Adding telemetry to loader apps and Transformer Pubsub

At Snowplow, we are trying to improve our products every day, and understanding what is popular is an important part of focusing our development effort in the right place. Therefore, we are adding telemetry to the loader apps and Transformer Pubsub. Essentially, it sends heartbeats with some minimal meta-information about the application.

You can help us by providing a userProvidedId in the config file:

"telemetry" {
  "userProvidedId": "myCompany"
}

Telemetry can be deactivated by putting the following section in the configuration file:

"telemetry": {
  "disable": true
}

More information about telemetry in RDB Loader project can be found here.

Upgrading to 5.0.0

If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 5.0.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.

docker pull snowplow/transformer-kinesis:5.0.0
docker pull snowplow/rdb-loader-redshift:5.0.0
docker pull snowplow/rdb-loader-snowflake:5.0.0
docker pull snowplow/rdb-loader-databricks:5.0.0

The Snowplow docs site has a full guide to running the RDB Loader.
