Can you expand on this radical opposition to shredding? Why would it make data modelling very cumbersome and hard cap how efficient the models can be? First normal form has been considered a good idea for 50 years and is recommended by both Inmon and Kimball. The Snowflake engineers made it possible to skip, but did they ever stop to consider whether it was a good idea?
I don’t think I’d describe it as radical opposition - but I do have a habit of overstating the point, so it may have come across that way. There are advantages to shredded tables; I pointed out the disadvantages for context around the topic.
As far as I understand 1NF, the non-shredded atomic tables that Snowplow loads for Snowflake and BigQuery do not violate it. The atomic entity of Snowplow data is an event - and the data is always loaded to the table on a one-row-per-event basis. The way I see it, the shredded table structure represents a one-to-many mapping of events to their related contexts, not a one-to-many mapping of rows to their atomic values.
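To make that concrete, here’s roughly what reading a custom entity from the non-shredded layout looks like - the entity sits in a nested column on the one-row-per-event table, so there’s no join involved. The vendor, entity and field names below are just illustrative:

```sql
-- Non-shredded layout (Snowflake-style), illustrative names: the custom entity
-- lives in a nested VARIANT column on atomic.events, so it can be read in place.
select
  event_id,
  domain_sessionid,
  contexts_com_acme_product_1[0]:sku::varchar as product_sku
from atomic.events
where event_name = 'add_to_cart';
```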
You could take a harder-line stance on 1NF and disagree with me, but at that point, in my opinion, we’re discussing the details of a table design born of and primarily designed for transactional data. I’m personally OK with violating those principles as long as we land on principles that are well suited to behavioural data collection - such as having one row per event, making events self-describing, at-least-once delivery, and including rich metadata.
Again though, these are just my personal opinions for the purpose of an interesting discussion - I’m just a random engineer from an unrelated team, so my opinion doesn’t hold a lot of weight in terms of product decisions. And I’m happy to be disagreed with!
Why would it make data modelling very cumbersome and hard cap how efficient the models can be?
Specifically on this question: let’s say you have a custom web event and you want to model it alongside the web model. You need the page view ID, the data from the custom event’s shredded table, and the session ID from the atomic events table. That means two joins against the largest table in the database per custom table (before even considering entities/contexts, which are also one-to-many joins). Efficiency is hard-capped as a function of how many different tables you need to join. And that’s just the efficiency problem - the cognitive load involved in designing the model and avoiding errors is pretty high too (duplicates are allowed, so more joins means more points of failure), which I consider probably the more important issue broadly, since my guess is that more users would be affected by it.
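For illustration, here’s a minimal sketch of that modelling step against the shredded layout. Table and column names are illustrative, and I’ve left out the deduplication logic you’d also need, since duplicates are allowed:

```sql
-- Shredded layout (Redshift-style), illustrative names: pulling the custom
-- event's data together with the session id (from atomic.events) and the
-- page view id (from the web_page context's shredded table).
select
  ev.domain_sessionid,
  pv.id as page_view_id,
  ce.product_sku
from atomic.com_acme_add_to_cart_1 as ce                     -- custom event's shredded table
join atomic.events as ev                                     -- join 1: the largest table in the database
  on  ce.root_id = ev.event_id
  and ce.root_tstamp = ev.collector_tstamp
join atomic.com_snowplowanalytics_snowplow_web_page_1 as pv  -- join 2: another one-to-many context join
  on  pv.root_id = ev.event_id
  and pv.root_tstamp = ev.collector_tstamp;
```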
Exactly! It isn’t spelled out in the issue, but is this where Snowplow is heading? To deprecate shredding altogether?
That’s not the position, no. Perhaps my phrasing was misleading before, but all I see is an issue specifying the addition of a feature for Redshift. If the decision is made to deprecate any feature, that will be announced once the decision has actually been made.
If it were up to me personally, I’d probably want to deprecate shredding, yeah - as long as Redshift’s new functionality for handling objects and arrays is actually performant enough (I haven’t had a chance to try it). But tbh, all the principled stuff I mentioned above isn’t really why I think that; I’ll just always prefer to maintain less code, and it would mean we could deprecate a lot of code in the data models. I wouldn’t want to do it if it meant degrading the experience of users in important ways - at the moment I might just need some convincing on that point.
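For anyone unfamiliar, the Redshift functionality I mean is the SUPER type and PartiQL querying. A rough, untested sketch of what reading an entity could look like if it were loaded into a SUPER column rather than a shredded table (the column name here is hypothetical):

```sql
-- Hypothetical: if the loader put entities into a SUPER column on atomic.events,
-- Redshift's PartiQL syntax could unnest it without joining a shredded table.
select
  ev.event_id,
  ev.domain_sessionid,
  c.sku as product_sku
from atomic.events as ev,
     ev.contexts_com_acme_product_1 as c   -- unnest the SUPER array of entities
where ev.event_name = 'add_to_cart';
```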
But yeah, there’s a good reason I’m not the one making those decisions - our product team pays a lot more attention to this sort of thing - I’m just a random engineer with opinions which may or may not be way off the mark.