We are extremely excited about the addition of a new destination to the RDB Loader family in version 4.0.0! Our new Databricks Loader allows loading transformed data in wide row Parquet format into Delta Lake and Databricks.
Databricks pioneered the Lakehouse category, a new data architecture that combines the scale, flexibility and cost-efficiency of data lakes with the data management and ACID transactions of data warehouses.
Databricks is used by a wide variety of organizations for both advanced analytics and Machine Learning use cases. With this new integration, Business Intelligence and Data Science teams alike can leverage Snowplow to create rich, AI-ready behavioral data to power their BI use cases and build customer-centric ML models in Databricks.
Additionally, this release brings some improvements to Stream Transformer.
New Databricks Loader
In this release, we’ve created a loader for a new destination: Databricks. The Databricks Loader loads transformed data in wide row Parquet format into Databricks.
How to start loading into Databricks
Configure your transformer to output the new wide row Parquet format by specifying this in the config.hocon file:
"formats": {
"transformationType": "widerow",
"fileFormat": "parquet"
}
Then, for the loader part, you’ll need to:
- set up the necessary Databricks resources
- prepare the configuration files for the loader (a rough sketch follows this list)
- deploy the loader app.
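To give a rough idea of what the loader configuration looks like, here is a minimal sketch of a config.hocon for the Databricks Loader. The field names and values below are illustrative placeholders rather than a complete reference — please consult the documentation for the full set of options.

{
  # Databricks connection details (placeholder values)
  "storage": {
    "host": "abc-123.cloud.databricks.com",
    "password": ${DATABRICKS_ACCESS_TOKEN},
    "schema": "snowplow",
    "port": 443,
    "httpPath": "/sql/1.0/endpoints/1234567890abcdef"
  },
  # Queue on which the transformer announces completed batches
  "messageQueue": "rdb-loader-queue",
  "region": "eu-central-1"
}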
The Docker image can be run like this:
$ docker run snowplow/rdb-loader-databricks:4.0.0 \
--iglu-config $RESOLVER_BASE64 \
--config $CONFIG_BASE64
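Here, $RESOLVER_BASE64 and $CONFIG_BASE64 hold base64-encoded copies of your Iglu resolver and loader configuration files. One way to produce them (file names are placeholders, and the -w 0 flag is specific to GNU base64):

$ RESOLVER_BASE64=$(base64 -w 0 resolver.json)
$ CONFIG_BASE64=$(base64 -w 0 config.hocon)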
New wide row Parquet format
In the 3.0.0 release, we introduced a new wide row format for transformed data. In a nutshell, with the wide row format, all unstructured events and contexts are written to the same table. More information can be found in the announcement post for the RDB Loader 3.0.0 release.
Previously, the only file format available with wide row was JSON. In this release, we are adding Parquet as a new output file format for wide row transformed data.
The Parquet format will allow us to load Snowplow data into various destinations. The first of these destinations is Databricks.
New load_tstamp column
The atomic event contains several timestamp fields, such as collector_tstamp, dvce_created_tstamp, dvce_sent_tstamp and etl_tstamp. However, until now there was no field recording when the event was loaded into the destination. In this release, we are adding a load_tstamp column to all loaders.
If you already have a Redshift Loader or Snowflake Loader instance running, you don’t have to do anything other than bumping the version of the application. The load_tstamp column will be created automatically and the loader will start populating it.
There is one caveat with load_tstamp in Redshift: existing events in the table will be auto-assigned a timestamp when the column is first created. This is not the case for Snowflake and Databricks, where the load_tstamp of existing events will be null.
Improvements in Stream Transformer
The Stream Transformer is the streaming counterpart of the Batch Transformer. Unlike the Batch Transformer, it doesn’t use Apache Spark, so it can run on basic EC2 instances. It can’t yet scale horizontally, but we plan to address this in an upcoming release.
In the meantime, we have included some bug fixes and improvements for it in this release:
- Write shredding_complete.json to S3 (#867)
- Report metrics (#862)
- Fix updating total and bad number of events counter in global state (#823)
- Fix passing checkpoint action during creation of windowed records (#762)
Upgrading to 4.0.0
No config change is needed for existing users to upgrade from 3.x.x to 4.0.0; pulling the latest docker image is sufficient.
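For example, assuming your images follow the same naming pattern as the Databricks Loader shown above, upgrading is as simple as pulling the 4.0.0 tags:

$ docker pull snowplow/rdb-loader-redshift:4.0.0
$ docker pull snowplow/rdb-loader-snowflake:4.0.0
$ docker pull snowplow/rdb-loader-databricks:4.0.0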
Further resources
For more information about how to set up and run the RDB Loader applications, refer to the documentation.
For the full changes in 4.0.0, refer to the release notes: