Databricks Loader Building Up Lag

Hey Katie,

That is excellent news. I am surprised the event limit worked, given the low event volume reported in the loader message: 154 events would still end up in a single file. Perhaps that message was not representative of the rest of the batches.

One final question: do you know if there’s a point in the future when the Databricks loader will be able to load invalid data?

The current strategy for dealing with bad data in Snowplow is event recovery, which modifies the data and re-inserts it into the pipeline.
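
To make that concrete, here is a minimal sketch of what a recovery step does, written in Scala with circe. This is not the actual snowplow-event-recovery configuration format, and the `payload`/`app_id` paths are hypothetical:

```scala
import io.circe.Json
import io.circe.parser.parse

// Sketch of a recovery step: parse the bad row, fix the offending
// field, and emit JSON that can be re-sent to the pipeline.
// The "payload"/"app_id" paths are hypothetical, not a real bad row schema.
def recover(badRow: String): Either[io.circe.Error, Json] =
  parse(badRow).map { json =>
    json.hcursor
      .downField("payload")
      .downField("app_id")
      .set(Json.fromString("my-app")) // overwrite the invalid value
      .top
      .getOrElse(json)                // leave untouched if the path is absent
  }
```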

Bad data in Snowplow has many sources: the collector, enrich, and the transformer itself. It also comes in many different categories, each requiring a different adjustment. How should an event that failed validation be loaded? It is almost guaranteed that typecasting in the transformer would fail for it.
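
For example, here is a toy illustration (not loader code) of why loading an invalid event as-is breaks typecasting:

```scala
import scala.util.{Failure, Success, Try}

// The warehouse column expects a DOUBLE, but the failed event carries a
// string, so the cast the transformer has to perform can only throw.
val rawValue = "not-a-number"
Try(rawValue.toDouble) match {
  case Success(d) => println(s"loaded $d")
  case Failure(e) => println(s"typecast failed: ${e.getMessage}")
}
```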

A generalised recovery scenario is required for each bad row type. It is not an easy task, but we are working on it; recovery is one of the main focuses of our team this year.
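
Roughly, such a generalised recovery has to dispatch on the failure category; the type names below are a hypothetical sketch, not the real snowplow-badrows classes:

```scala
// Each bad row category calls for a different adjustment before
// the event can be re-inserted into the pipeline.
sealed trait BadRowKind
case object SchemaViolation    extends BadRowKind // event failed validation
case object EnrichmentFailure  extends BadRowKind // an enrichment errored
case object TransformerFailure extends BadRowKind // e.g. a typecast error

def adjustment(kind: BadRowKind): String = kind match {
  case SchemaViolation    => "fix the offending entity, then revalidate"
  case EnrichmentFailure  => "fix the input or enrichment config, then re-enrich"
  case TransformerFailure => "fix the value's type, then re-transform"
}
```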

Some progress has been made already: as of 5.3.0, the transformer and loader automatically recover data where schema evolution resulted in a type error.
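
For context, this is the kind of evolution that triggers it (a hypothetical, abbreviated pair of schema versions):

```scala
// Hypothetical breaking evolution: the same field changes type between
// schema versions, so rows from both cannot be cast into one column.
val v100 = """{ "properties": { "session_index": { "type": "integer" } } }"""
val v101 = """{ "properties": { "session_index": { "type": "string" } } }"""
// Before 5.3.0 such a batch failed the typecast at load time; since 5.3.0
// the transformer and loader recover these rows automatically.
```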

@stanch might be in a better position to answer this question.
