We are extremely excited about the addition of a new destination to the RDB Loader family in version 4.0.0! Our new Databricks Loader allows loading transformed data in wide row Parquet format into Delta Lake and Databricks.
Databricks pioneered the Lakehouse category, a new data architecture that combines the scale, flexibility and cost-efficiency of data lakes with the data management and ACID transactions of data warehouses.
Databricks is used by a wide variety of organizations for both advanced analytics and Machine Learning use cases. With this new integration, Business Intelligence and Data Science teams alike can leverage Snowplow to create rich, AI-ready behavioral data to power their BI use cases and build customer-centric ML models in Databricks.
Additionally, this release brings some improvements to Stream Transformer.
New Databricks Loader
In this release, we’ve created a loader for a new destination: Databricks. The Databricks Loader loads transformed data in wide row Parquet format into Databricks.
How to start loading into Databricks
Configure your transformer to output the new wide row Parquet format by specifying this in the config.hocon file:
"formats": {
"transformationType": "widerow",
"fileFormat": "parquet"
}
Then, for the loader part, you’ll need to:
- set up the necessary Databricks resources
- prepare the configuration files for the loader (a rough sketch follows this list)
- deploy the loader app.
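To give a rough idea of what the loader configuration looks like, here is a minimal sketch of a config.hocon for the Databricks Loader. The field names and values below are illustrative placeholders rather than a complete reference — please consult the documentation for the full set of options.

{
  # Databricks connection details (placeholder values)
  "storage": {
    "host": "abc-123.cloud.databricks.com",
    "password": ${DATABRICKS_ACCESS_TOKEN},
    "schema": "snowplow",
    "port": 443,
    "httpPath": "/sql/1.0/endpoints/1234567890abcdef"
  },
  # Queue on which the transformer announces completed batches
  "messageQueue": "rdb-loader-queue",
  "region": "eu-central-1"
}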
The Docker image can be run like this:
$ docker run snowplow/rdb-loader-databricks:4.0.0 \
--iglu-config $RESOLVER_BASE64 \
--config $CONFIG_BASE64
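Here, $RESOLVER_BASE64 and $CONFIG_BASE64 hold base64-encoded copies of your Iglu resolver and loader configuration files. One way to produce them (file names are placeholders, and the -w 0 flag is specific to GNU base64):

$ RESOLVER_BASE64=$(base64 -w 0 resolver.json)
$ CONFIG_BASE64=$(base64 -w 0 config.hocon)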
New wide row Parquet format
In the 3.0.0 release, we introduced a new wide row format for transformed data. In a nutshell, with the wide row format, all unstructured events and contexts are written to the same table. More information can be found in the announcement post for the RDB Loader 3.0.0 release.
Previously, the only file format available with wide row was JSON. In this release, we are adding Parquet as a new output file format for wide row transformed data.
The Parquet format will allow us to load Snowplow data into various destinations. The first of these destinations is Databricks.
New load_tstamp column
The atomic event contains several timestamp fields, such as collector_tstamp, dvce_created_tstamp, dvce_sent_tstamp and etl_tstamp. However, until now there was no field recording when the event was loaded into the destination. In this release, we are adding a load_tstamp column to all loaders.
If you already have a Redshift Loader or Snowflake Loader instance running, you don’t have to do anything other than bumping the version of the application. The load_tstamp column will be created automatically and the loader will start populating it.
There is one caveat with load_tstamp in Redshift: existing events in the table will be auto-assigned a timestamp when the column is first created. This is not the case for Snowflake and Databricks, where the load_tstamp of existing events will be null.
Improvements in Stream Transformer
The Stream Transformer is the streaming counterpart of the Batch Transformer. Unlike the Batch Transformer, it doesn’t use Apache Spark, so it can run on basic EC2 instances. It can’t yet scale horizontally, but we plan to address this in an upcoming release.
In the meantime, we have included some bug fixes and improvements for it in this release:
- Write shredding_complete.json to S3 (#867)
- Report metrics (#862)
- Fix updating total and bad number of events counter in global state (#823)
- Fix passing checkpoint action during creation of windowed records (#762)
Upgrading to 4.0.0
No config change is needed for existing users to upgrade from 3.x.x to 4.0.0; pulling the latest docker image is sufficient.
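For example, assuming your images follow the same naming pattern as the Databricks Loader shown above, upgrading is as simple as pulling the 4.0.0 tags:

$ docker pull snowplow/rdb-loader-redshift:4.0.0
$ docker pull snowplow/rdb-loader-snowflake:4.0.0
$ docker pull snowplow/rdb-loader-databricks:4.0.0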
Further resources
For more information about how to set up and run the RDB Loader applications, refer to the documentation.
For the full changes in 4.0.0, refer to the release notes: