We have an old pipeline (EMR batch) that makes use of Shredding. Which splits events into multiple files (tables) which are then loaded into Redshift.
As we are migrating the pipeline from the EMR batch to Databricks and loading the data into a data lake we chose to store it in Wide Row format.
As per docs, shredded data can only be used in redshift. Doc
Is there any way I can use shredded data for loading into Databricks using parquet format?
Or Is there any efficient way to do that?
Actually, I am not looking for migrating redshift data to databricks.
I’m looking for a configuration which shreds data before loading it into databricks. Currently, it is done through widerow format.
And as per the document it says. Shredded data can currently only be loaded into Redshift.
I want to know if it is possible to load shredded data in databricks?
Not currently - and I’m not sure if the intention will be to support this in the RDB loader in the future (@istreeter?).
The main reason being that the shredded format tends to underperform when compared to the ‘wide row’ format as the overhead of joins and broadcasting across multiple nodes which is required in the shredded model can be avoided in the wide row format. In addition we save a bit of storage space too by not having to duplicate the join keys in the shredded table.
If you do still want to use the shredded model for other reasons I’d suggest opting for loading using the current wide-row format and then shredding downstream using dbt or something similar to materialise the shredded tables.
I can confirm what Mike said – it is not on our roadmap to support shredded data in Databricks. But please don’t be disappointed! Hopefully we can persuade you that widerow is the superior table structure, once you get used to it.
I would be interested to hear why you would like to keep on using the shredded style? We find that the widerow format is working really well for analytic queries, mainly because you don’t need to do any joins. It loads very quickly, is queried quickly, and it stores very efficiently, because of how delta/parquet is column-oriented format.
If your experience is different, though, then I would appreciate hearing your thoughts.