Databricks Loader Building Up Lag

Hey Katie,

That is excellent news. I am surprised the event limit worked, given the low event volume reported in the loader message: 154 events would still end up in a single file. Perhaps that message was not representative of the rest of the batches.

One final question: do you know if there’s a point in the future when the Databricks loader will be able to load invalid data?

The current strategy for dealing with bad data in Snowplow is event recovery, which modifies the data and re-inserts it into the pipeline.
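
To make that concrete, here is a minimal sketch of what a recovery step does, written in Scala with circe. This is not the actual snowplow-event-recovery configuration format, and the `payload`/`app_id` paths are hypothetical:

```scala
import io.circe.Json
import io.circe.parser.parse

// Sketch of a recovery step: parse the bad row, fix the offending
// field, and emit JSON that can be re-sent to the pipeline.
// The "payload"/"app_id" paths are hypothetical, not a real bad row schema.
def recover(badRow: String): Either[io.circe.Error, Json] =
  parse(badRow).map { json =>
    json.hcursor
      .downField("payload")
      .downField("app_id")
      .set(Json.fromString("my-app")) // overwrite the invalid value
      .top
      .getOrElse(json)                // leave untouched if the path is absent
  }
```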

Bad data in Snowplow has many sources: the collector, enrich, and the transformer itself. It also comes in many different categories, each requiring a different adjustment. How should an event that failed validation be loaded? It is almost guaranteed that typecasting in the transformer would fail for it.
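
For example, here is a toy illustration (not loader code) of why loading an invalid event as-is breaks typecasting:

```scala
import scala.util.{Failure, Success, Try}

// The warehouse column expects a DOUBLE, but the failed event carries a
// string, so the cast the transformer has to perform can only throw.
val rawValue = "not-a-number"
Try(rawValue.toDouble) match {
  case Success(d) => println(s"loaded $d")
  case Failure(e) => println(s"typecast failed: ${e.getMessage}")
}
```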

A generalised recovery scenario is required for each bad row type. It is not an easy task, but we are working on it; recovery is one of the main focuses of our team this year.
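
Roughly, such a generalised recovery has to dispatch on the failure category; the type names below are a hypothetical sketch, not the real snowplow-badrows classes:

```scala
// Each bad row category calls for a different adjustment before
// the event can be re-inserted into the pipeline.
sealed trait BadRowKind
case object SchemaViolation    extends BadRowKind // event failed validation
case object EnrichmentFailure  extends BadRowKind // an enrichment errored
case object TransformerFailure extends BadRowKind // e.g. a typecast error

def adjustment(kind: BadRowKind): String = kind match {
  case SchemaViolation    => "fix the offending entity, then revalidate"
  case EnrichmentFailure  => "fix the input or enrichment config, then re-enrich"
  case TransformerFailure => "fix the value's type, then re-transform"
}
```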

Some progress has been made already: as of 5.3.0, the transformer and loader automatically recover data where schema evolution resulted in a type error.
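
For context, this is the kind of evolution that triggers it (a hypothetical, abbreviated pair of schema versions):

```scala
// Hypothetical breaking evolution: the same field changes type between
// schema versions, so rows from both cannot be cast into one column.
val v100 = """{ "properties": { "session_index": { "type": "integer" } } }"""
val v101 = """{ "properties": { "session_index": { "type": "string" } } }"""
// Before 5.3.0 such a batch failed the typecast at load time; since 5.3.0
// the transformer and loader recover these rows automatically.
```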

@stanch might be in a better position to answer this question.
