We’re pleased to announce the release of RDB Loader version 5.3.0.
This release makes the Databricks loader resilient to invalid schema evolution.
Databricks invalid schema evolution recovery
What is schema evolution?
One of Snowplow’s key features is the ability to define custom schemas and validate events against them. Over time, users often evolve the schemas, e.g. by adding new fields or changing existing fields. To accommodate these changes, RDB loader automatically adjusts the database tables in the warehouse accordingly.
There are two main types of schema changes:
- Breaking: The schema version has to be changed in a major way (1-2-3 → 2-0-0). In Databricks, each major schema version has its own column (..._1, ..._2, etc., for example contexts_com_snowplowanalytics_ad_click_1), as illustrated after this list.
- Non-breaking: The schema version can be changed in a minor way (1-2-3 → 1-3-0 or 1-2-3 → 1-2-4). Data is stored in the same database column.
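As a rough illustration of this column-per-major-version layout, a query against the wide events table might look like the sketch below (the ad_click entity, the snowplow.events table name, and the data are illustrative assumptions, not output from a real pipeline):

-- Each major schema version of an entity has its own column, suffixed
-- with the major version number.
SELECT
  event_id,
  contexts_com_snowplowanalytics_ad_click_1,  -- events validated against 1-*-* schemas
  contexts_com_snowplowanalytics_ad_click_2   -- events validated against 2-*-* schemas
FROM snowplow.events
LIMIT 10;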
How it used to work
In the past, the transformer would format the data according to the latest version of the schema it saw (for a given major version, e.g. 1-*-*). For example, if a batch contained events with schema versions 1-0-0, 1-0-1 and 1-0-2, the transformer would derive the schema of the Parquet file based on version 1-0-2. Then the loader would instruct Databricks to adjust the database column and load the file.
This logic relied on two assumptions:
- Old events compatible with new schemas. Events with older schema versions, e.g. 1-0-0 and 1-0-1, had to be valid against the newer ones, e.g. 1-0-2. Those that were not valid would result in failed events.
- Old columns compatible with new schemas. The corresponding Databricks column type had to expand correctly from one version to another using mergeSchema = true (sketched below). In reality, this only works for adding and removing fields. Other changes, such as altering the type of a field from integer to string, would fail. Loading would break with SQL errors and the whole batch would be stuck and hard to recover.
These assumptions were not always clear to the users, making the transformer and loader error-prone.
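For context, the old load path depended on Databricks merging the Parquet file’s schema into the table at load time, roughly as in the sketch below (the table name, S3 path and exact options are illustrative assumptions, not the loader’s actual statement):

-- With mergeSchema, adding or removing fields works, but a type change such
-- as integer -> string makes the load fail with a SQL error, leaving the
-- whole batch stuck.
COPY INTO snowplow.events
FROM 's3://transformed-output/run=2022-12-01-00-00-00/output=good/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');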
What happens now?
Transformer and loader are now more robust, and the data is easy to recover if the schema was not evolved correctly.
First, we support schema evolution that’s not strictly backwards compatible (although we still recommend against it since it can confuse downstream consumers of the data). This is done by merging multiple schemas so that both old and new events can coexist. For example, suppose we have these two schemas:
{
// 1-0-0
"properties": {
"a": {"type": "integer"}
}
}
{
// 1-0-1
"properties": {
"b": {"type": "integer"}
}
}
These would be merged into the following:
{
// merged
"properties": {
"a": {"type": "integer"},
"b": {"type": "integer"}
}
}
Second, the loader does not fail when it can’t modify the database column to store both old and new events. (As a reminder, an example would be changing the type of a field from integer to string.) Instead, it creates a temporary column for the new data as an exception. The users can then run SQL statements to resolve this situation as they see fit. For instance, consider these two schemas:
{
// 1-0-0
"properties": {
"a": {"type": "integer"}
}
}
{
// 1-0-1
"properties": {
"a": {"type": "string"}
}
}
Because 1-0-1 events cannot be loaded into the same column as 1-0-0 events, the data would be put in a separate column, e.g. contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999, where:

- 1_0_1 is the version of the offending schema;
- 9999999 is a hash code unique to the schema (i.e. it will change if the schema is overwritten with a different one).

If you create a new schema 1-0-2 that reverts the offending changes and is again compatible with 1-0-0, the data for events with that schema will be written to the original column as expected.
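To see which rows ended up in a recovery column before deciding how to reconcile them, a query along these lines could help (the table and column names, including the 9999999 hash suffix, are illustrative assumptions):

-- List recent events whose ad_click data landed in the recovery column
-- instead of the original column.
SELECT
  event_id,
  collector_tstamp,
  contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999 AS recovered_ad_click
FROM snowplow.events
WHERE contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999 IS NOT NULL
ORDER BY collector_tstamp DESC
LIMIT 100;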
Notes
- If no events arrive with an incorrectly evolved schema, the recovery column will not be created.
- It is still possible to break loading by overwriting version 1-0-0 of the schema. Please avoid doing that.
Other improvements
- In the Databricks Loader, we’ve started using a CREATE OR REPLACE TABLE statement instead of DROP and CREATE when creating the temporary folder monitoring table, since the DROP and CREATE combination sometimes caused problems (see the sketch after this list). Thanks very much to @dkbrkjni for his contributions! Related Github issue
In Redshift Loader, folder monitoring wasn’t working when the stage bucket’s region and Redshift cluster’s region were different. This problem is fixed in this version.
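To illustrate the Databricks change above, the difference between the two approaches looks roughly like this (the table name and column are illustrative assumptions, not the loader’s actual definition):

-- Previously: two separate statements, which sometimes caused problems
-- between the DROP and the CREATE.
DROP TABLE IF EXISTS temp_folder_monitoring;
CREATE TABLE temp_folder_monitoring (folder_name STRING);

-- Now: a single atomic statement.
CREATE OR REPLACE TABLE temp_folder_monitoring (folder_name STRING);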
Upgrading to 5.3.0
If you are already using a recent version of RDB Loader (3.0.0 or higher), then upgrading to 5.3.0 is as simple as pulling the newest Docker images. There are no changes needed to your configuration files.
docker pull snowplow/transformer-pubsub:5.3.0
docker pull snowplow/transformer-kinesis:5.3.0
docker pull snowplow/rdb-loader-redshift:5.3.0
docker pull snowplow/rdb-loader-snowflake:5.3.0
docker pull snowplow/rdb-loader-databricks:5.3.0
The Snowplow docs site has a full guide to running the RDB Loader.