Using shredded data for loading into Databricks in Parquet format

Hi Snowplow Team,

We have an old pipeline (EMR batch) that makes use of shredding, which splits events into multiple files (tables) that are then loaded into Redshift.
As we are migrating the pipeline from the EMR batch to Databricks and loading the data into a data lake, we chose to store it in wide row format.
As per the docs, shredded data can only be loaded into Redshift.

Is there any way I can use shredded data for loading into Databricks in Parquet format?
Or is there any efficient way to do that?

Do you have your enriched data stored on S3? The Databricks loader should be able to load from that data set.

However, depending on how much data you have, that may be quite slow. You may be better off:

  • querying the data in Redshift (and joining the shredded data)
  • unloading from Redshift to S3 in Parquet format (see the sketch below)
  • loading directly into Databricks

For historical data this could then be actioned as a one-off process rather than running the Databricks loader repeatedly for batches.
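
A minimal sketch of what that one-off backfill could look like, assuming the official `redshift_connector` Python package and a Databricks notebook on the other end. The cluster host, bucket, IAM role and table names are placeholders, not details from this thread:

```python
import redshift_connector  # AWS's Python connector for Redshift

# Placeholder connection details -- substitute your own cluster and credentials.
conn = redshift_connector.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    database="snowplow",
    user="loader",
    password="********",
)

# UNLOAD writes the query result to S3 as Parquet. The inner SELECT is where
# you would join your shredded tables back onto atomic.events if needed.
unload_sql = """
    UNLOAD ('SELECT * FROM atomic.events')
    TO 's3://my-backfill-bucket/snowplow/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    FORMAT AS PARQUET;
"""

with conn.cursor() as cursor:
    cursor.execute(unload_sql)
conn.commit()

# Then, from a Databricks notebook (where `spark` is predefined), the unloaded
# Parquet files can be registered as a Delta table:
#
# (spark.read.parquet("s3://my-backfill-bucket/snowplow/events/")
#       .write.format("delta")
#       .mode("overwrite")
#       .saveAsTable("snowplow.events_backfill"))
```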


Hello Mike,
Thank you for getting back on this.

Actually, I am not looking to migrate Redshift data to Databricks.
I’m looking for a configuration which shreds data before loading it into Databricks. Currently, it is done through the widerow format.

And as per the document, it says shredded data can currently only be loaded into Redshift.

I want to know if it is possible to load shredded data into Databricks.

Ah apologies - thanks for the clarification.

Not currently - and I’m not sure if the intention will be to support this in the RDB loader in the future (@istreeter?).

The main reason is that the shredded format tends to underperform compared to the ‘wide row’ format: the overhead of joins and broadcasting across multiple nodes that the shredded model requires can be avoided in the wide row format. In addition, we save a bit of storage space by not having to duplicate the join keys in the shredded tables.

If you do still want to use the shredded model for other reasons, I’d suggest loading using the current wide-row format and then shredding downstream using dbt or something similar to materialise the shredded tables.
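
For instance, a rough sketch of that downstream shredding in a Databricks notebook using PySpark (a dbt incremental model would do the same job in SQL). The context column name `contexts_com_acme_product_view_1` and the derived table name are invented for illustration; substitute whatever entities your schemas define:

```python
from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook `spark` already exists; getOrCreate() just keeps
# the snippet self-contained elsewhere.
spark = SparkSession.builder.getOrCreate()

events = spark.table("atomic.events")  # the widerow events table loaded by the RDB loader

# Each entity column in the widerow format is an array of structs. Exploding it
# and keeping the event key reproduces what a shredded table would have held.
product_views = (
    events
    .select(
        "event_id",
        "collector_tstamp",
        F.explode("contexts_com_acme_product_view_1").alias("ctx"),
    )
    .select("event_id", "collector_tstamp", "ctx.*")
)

product_views.write.format("delta").mode("overwrite") \
    .saveAsTable("derived.com_acme_product_view_1")
```

Materialising such tables on a schedule (or as dbt models) lets existing downstream queries keep their shape while the load itself stays widerow.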

Thank you, Mike, for the insight. If you know of any updates or future plans to support this, I would love to know more. :raised_hands:

Hi @Asmita_More,

I can confirm what Mike said – it is not on our roadmap to support shredded data in Databricks. But please don’t be disappointed! Hopefully we can persuade you that widerow is the superior table structure, once you get used to it.

I would be interested to hear why you would like to keep using the shredded style. We find that the widerow format works really well for analytic queries, mainly because you don’t need to do any joins. It loads quickly, queries quickly, and stores very efficiently, because delta/parquet is a column-oriented format.
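
As a small illustration of the “no joins” point, here is what a typical aggregation can look like against the widerow table. It reuses the invented `contexts_com_acme_product_view_1` column from the earlier sketch; in the shredded model the same query would need a join between the events table and the shredded context table:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # predefined in Databricks notebooks

# Daily counts per product, read straight off the event row -- no join needed,
# and the columnar format means only the referenced columns are scanned.
daily_views = (
    spark.table("atomic.events")
    .where(F.col("event_name") == "product_view")  # hypothetical event name
    .select(
        F.to_date("collector_tstamp").alias("view_date"),
        F.col("contexts_com_acme_product_view_1")[0]["product_id"].alias("product_id"),
    )
    .groupBy("view_date", "product_id")
    .count()
)

daily_views.show()
```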

If your experience is different, though, then I would appreciate hearing your thoughts.


Hello Mike, thank you for the confirmation. :raised_hands:

Currently, we are restructuring our existing Snowplow architecture, which has an EMR batch job and Redshift as the destination.

We are making changes to move away from Redshift to Databricks.

And currently, our underlying dbt models use the Redshift shredded data and will need changes, as everything will be in Databricks in the widerow format.

Therefore, we were checking if there is any option for using shredded data in Databricks.