Any plans to introduce RDB shredding mode to Snowflake?

As we learned with the release of RDB Loader 3.0.0 and How Snowplow data is structured in Snowflake, the Snowflake implementation does not shred different events into separate tables and instead uses the wide-row format with VARIANT columns. This choice was motivated by customer request and the fact that it

  1. reduces the number of joins required.
  2. does not introduce any significant penalty, as VARIANT columns are equivalent to flattened tables in many cases.
  3. makes it relatively easy to extract single values from a VARIANT in Snowflake when you need to (see the sketch below).
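For example, pulling a single value out of a contexts column looks something like this (a rough sketch only; the column, field and event names are made up for illustration):

SELECT
  event_id,
  contexts_com_acme_product_1[0]:sku::VARCHAR AS sku
FROM atomic.events
WHERE event_name = 'product_view';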

At the moment "transformationType": "shred" is only compatible with Redshift (source), but are there any plans to make this an option for Snowflake as well?

It would seem that one benefit of shredding into tables, from a large-enterprise perspective, would be to enable column-level access control. If the data remains in a VARIANT, it would either be totally restricted or not restricted at all. If it were shredded into tables, only the sensitive columns would need to be restricted.

Did you not implement it so far to avoid additional complexity that no one had requested, or are there additional challenges with implementing it for Snowflake as well?

Presumably you would still run the same transform step as before; it is just that the final load step for Snowflake is not compatible with that format? The easiest workarounds for now would be to either

  1. Script flattening of tables inside the database
  2. Create your own custom Snowflake load step from the shredded S3 Redshift format
  3. Fork the existing RDB Loader to have this capability

Hi @medicinal-matt, sorry, but we do not have any plans to add the shredded format for Snowflake.

The reasons you enumerated are still the main reasons. And furthermore, we have heard from other Snowplow users that the “wide” style events table is well liked and easier to use. I find it interesting to hear your alternative point of view though!

Regarding column level access… our recommended way of implementing Snowplow is to use the PII pseudonymisation enrichment so that sensitive data like you’re describing never reach the warehouse at all. Instead, sensitive fields are hashed. Done this way, analysts can still perform many types of analytics on the hashed values, but without giving them access to the unhashed sensitive data.

Did you not implement it so far to avoid additional complexity that no one had requested, or are there additional challenges with implementing it for Snowflake as well?

To implement this feature would be challenging but possible – certainly no more challenging than what we’ve already done for Redshift. But currently we have so many other things on our roadmap that we are unlikely to spend any time on this feature, especially when we hear that most Snowplow-Snowflake users are happy with the current table.


Fair enough! It might be that we realise this is not a good idea in the end. Otherwise we still have these three workarounds.

I now notice there is even a roadmap ticket to do the opposite. I have raised this concern there as well.

@medicinal-matt it would be great if we could get you set up using column-level access and the wide-row table format together. It must be possible! Here is one idea…

You could start by creating an Iglu schema specifically for the sensitive data. Call it something like sensitive_data. Then design your tracking so that all sensitive fields get attached to the event using a custom context referencing your Iglu schema.

Done this way, the RDB Loader will load this data into a Snowflake column called something like contexts_com_acme_sensitive_data. You could put access restrictions just on this column, and allow global access to all other columns.
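As a rough sketch of what that restriction could look like in Snowflake (the role, schema and column names below are purely illustrative):

CREATE OR REPLACE MASKING POLICY snowplow.atomic.mask_sensitive_context AS (val VARIANT) RETURNS VARIANT ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN val  -- approved role sees the raw context
    ELSE NULL                                       -- everyone else sees nothing for this column
  END;

ALTER TABLE snowplow.atomic.events
  MODIFY COLUMN contexts_com_acme_sensitive_data_1
  SET MASKING POLICY snowplow.atomic.mask_sensitive_context;

All other columns would remain unrestricted, so analysts keep full access to the rest of the wide events table.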

I am convinced that there is a way to implement tracking to achieve the result you want. Snowplow tracking is so flexible that it must be possible, especially when you consider custom events, custom entities, and even tools like the javascript enrichment to further manipulate events server-side before loading to Snowflake.

Interesting conversation here! My 2c:

@istreeter in principle that’s a great approach, but IMO it falls down because it then requires the tracking setup to essentially reconstruct any default tracking that might contain PII - e.g. IP addresses or form tracking of certain fields.

However I don’t think that limitation means that the solution is to implement shredded tables for all DBs. @medicinal-matt for context, the main reason the shredded table structure exists in the first place is that Redshift historically didn’t deal well with arrays or nested data structures. I think the ticket you’ve linked and input to (thanks by the way, it’s important to have that kind of feedback logged on our issues!) is centred on doing the opposite of what you propose - removing shredded tables from Redshift altogether.

In my own opinion, the limitations of the shredded model are that it makes data models more complex, inefficient, and error-prone. Once tracking is sufficiently developed, it makes the data modeling step very cumbersome, and basically hard-caps how efficient models can be (and therefore how soon data can be used).

There’s a hint in there as to what I see as the actual solution to the specific situation you describe - in short, the model that we typically work to is that not very many people have access to any of the atomic data (especially for large organisations, as you refer to in the GH issue). Rather, most consumers of the data consume only the derived tables that have been produced by a data modeling process.

If you DO require giving certain users access to only certain fields for a given event or entity, and need a data structure like our shredded Redshift tables to do so, one way of doing so fairly easily is to use FLATTEN to produce a separate federated table.
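Something along these lines, for example (just a sketch; the derived schema, context column and field names are illustrative rather than from a real setup):

CREATE OR REPLACE TABLE derived.com_acme_sensitive_data AS
SELECT
  ev.event_id,
  ev.collector_tstamp,
  ctx.value:page_url::VARCHAR AS page_url
FROM atomic.events ev,
  LATERAL FLATTEN(input => ev.contexts_com_acme_sensitive_data_1) ctx;

You could then restrict this derived table to the roles that need it, while the wide events table stays broadly accessible with the sensitive column masked or excluded via a view.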

However I would hope that with some good data model design, that would only really be necessary during the process of developing a model which can abstract away the need for individuals to access those fields, and obscure the values where widespread access is needed.

Of course, that’s not to say that this solution is perfect - I am definitely biased having spent a lot of time dealing with the more difficult aspects of the federated table structure in Redshift :smiley: . Fine grained access control is hard, I’m sure our product could stand to evolve further to improve support here.


This shouldn’t necessarily be the case, depending on exactly what your use case is.

CREATE OR REPLACE MASKING POLICY snowplow.atomic.single_field AS (col VARIANT) RETURNS VARIANT

In Snowflake you can apply masking policies to VARIANTs as well as setting the return type to VARIANT, so if for some reason you want to redact a single field, or multiple fields within a single column, this is achievable. You can either do this via a normal masking policy or, if you have a more complicated masking requirement, you can call out to a UDF within the masking policy itself.
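For instance, a single-field redaction could look something like this - a sketch only, assuming the VARIANT holds a single JSON object, with the role and key names purely illustrative:

CREATE OR REPLACE MASKING POLICY snowplow.atomic.single_field AS (col VARIANT) RETURNS VARIANT ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN col      -- approved role sees everything
    ELSE OBJECT_DELETE(col::OBJECT, 'email')::VARIANT   -- everyone else gets the object minus one key
  END;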

Thanks for the input! The confirmation that it is not happening would have been enough for me, but thanks for the transparency on why, and for the additional data modelling solutions around this.

For more context on our setup, we’re thinking of restricting not just directly identifiable data, but also sensitive data that is tied to an indirectly identifiable user UUID - for example, restricting which pages a user has viewed. A nice aspect of the shredder is that it shreds all new events automatically, whereas now we would have to figure out how to do the automatic flattening ourselves. It can definitely be done, but there is no need to reinvent the wheel if it was soon going to be available.

Can you expand on this radical opposition to shredding? Why would it make data modelling very cumbersome and hard-cap how efficient the models can be? First normal form has been considered a good idea for 50 years and is recommended by both Inmon and Kimball. The Snowflake engineers made it possible to skip, but did they ever stop to consider whether it was a good idea?

You mean that it would lead to unnecessary UNIONs of similar tables, and joins to bring shared fields back in again?

Exactly! It isn’t spelled out in the issue, but is this where Snowplow is heading? To deprecate shredding altogether?

In Snowflake you can apply masking policies to VARIANTs as well as setting the return type to VARIANT, so if for some reason you want to redact a single field, or multiple fields within a single column, this is achievable. You can either do this via a normal masking policy or, if you have a more complicated masking requirement, you can call out to a UDF within the masking policy itself.

This is actually a good point if it scales, etc.! Let me explore this option further. It seems it is partly covered by Using JavaScript UDFs on JSON on this page.
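Something along those lines might look like the following - just a sketch on my part, with the function name and behaviour being my own assumptions rather than anything from that page:

CREATE OR REPLACE FUNCTION snowplow.atomic.redact_key(v VARIANT, k VARCHAR)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
AS
$$
  // Arguments are exposed to JavaScript in upper case (V, K).
  if (V === null || typeof V !== 'object') { return V; }
  var copy = JSON.parse(JSON.stringify(V));  // deep copy of the variant value
  delete copy[K];                            // drop the sensitive key if present
  return copy;
$$;

A masking policy could then call snowplow.atomic.redact_key(col, 'email') for roles that shouldn’t see that field.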

Can you expand on this radical opposition to shredding? Why would it make data modelling very cumbersome and hard-cap how efficient the models can be? First normal form has been considered a good idea for 50 years and is recommended by both Inmon and Kimball. The Snowflake engineers made it possible to skip, but did they ever stop to consider whether it was a good idea?

I don’t think I’d describe it as radical opposition - but I do have a habit of overstating the point, so it may have come across that way. There are advantages to shredded tables; I pointed out the disadvantages for context around the topic.

As far as I understand 1NF, the non-shredded atomic tables that Snowplow loads for Snowflake and BigQuery do not violate it. The atomic entity of Snowplow data is an event - and the data is always loaded to the table on a one-row-per-event basis. The way I see it, the shredded table structure represents a one-to-many mapping of events to their related context, not a one-to-many mapping of rows to their atomic values.

You could take a harder-line stance on 1NF and disagree with me, but at that point in my own opinion we are discussing the details of a table design borne of and primarily designed for transactional data. I personally am OK with violating those principles as long as we land on principles that are well suited to behavioural data collection - such as having one row per event, making events self-describing, at-least-once delivery, and including rich metadata.

Again though, these are just my personal opinions for the purpose of an interesting discussion - I’m just a random engineer from an unrelated team, my opinion doesn’t hold a lot of weight in terms of product decisions. And I’m happy to be disagreed with!

Why would it make data modelling very cumbersome and hard-cap how efficient the models can be?

Specifically to this question - let’s say you have a custom web event and you want to model it alongside the web model. You need the page view ID, the data from the custom event’s shredded table, and the session ID from the atomic events table. So you need two joins on the largest table in the database per custom table (without even considering entities/contexts, which are also one-to-many joins). Efficiency is hard-capped as a function of how many different tables you need to join. And that’s just the efficiency problem - the cognitive load involved in designing the model and avoiding error is pretty high too (duplicates are allowed, so more joins = more points of failure), which I consider to be probably the more important consideration broadly (since my guess is that more users would be affected by that).
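To make that concrete, the join pattern looks something like this in the shredded model (the custom event table and its column are illustrative):

SELECT
  ev.domain_sessionid,
  wp.id AS page_view_id,
  ce.button_label
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON wp.root_id = ev.event_id AND wp.root_tstamp = ev.collector_tstamp
JOIN atomic.com_acme_button_click_1 ce
  ON ce.root_id = ev.event_id AND ce.root_tstamp = ev.collector_tstamp;

That’s two joins against the events table before you have even started modelling, and every additional entity adds another one.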

Exactly! It isn’t spelled out in the issue, but is this where Snowplow is heading? To deprecate shredding altogether?

That’s not the position, no. Perhaps my phrasing was misleading before, but all I see is an issue to specify adding a feature to Redshift. If the decision is made to deprecate any feature, that’ll be announced once the decision is made.

If it were up to me personally, I’d probably want to deprecate shredding, yeah - as long as Redshift’s new functionality around handling objects and arrays is actually performant enough (I haven’t had a chance to try it). But tbh, all the principled stuff I mentioned above isn’t really why I think that; I will just always prefer to maintain less code - and it would mean we could deprecate a lot of code in the data models. I wouldn’t want to do it if it meant degrading the experience of users in important ways; I just might need some convincing about that at the moment.

But yeah, there’s good reason I’m not the one to make those decisions - our product team pays a lot more attention to these things - I’m just a random engineer with opinions which may or may not be way off the mark. :slight_smile:

To add a bit more to the answer from @Colm above, as I’d describe myself as militantly anti-shredding - but with a similar disclaimer that I don’t have any influence over the choice.

There are just fewer and fewer scenarios where shredding (in the database) provides benefits over the alternative.

  • From a loading perspective, the loader needs to wrap multiple table loads in a transaction and be able to gracefully roll back if a single table fails. A normal load has to at least temporarily lock all of these tables. This added complexity makes it significantly more difficult to support certain use cases (like streaming data into Redshift), which move the atomicity constraint from a batch of data to individual events.

  • Joins are really expensive and shredded data means that you are doing more of these joins more frequently. For certain operations you are potentially loading a large amount of data into memory (even for hash joins) and may be broadcasting / shuffling this across the network which is going to impact query speed.

  • Shredding doesn’t solve the 1NF problem and instead just puts on a normalisation wig and moustache. If you have an array (or a nested object) within an entity or context, the column type is going to end up as a varchar (in Redshift, specifically). You could in theory shred this array too, but then you introduce the logic of having to recurse through a data structure to normalise it, and create N tables to join back to the single event. Deeply nested objects might be easier to flatten, but the mutator then needs to ensure that it doesn’t violate the 127-byte limit for a column name when traversing the event definition.

  • Storing things as varchars (above) that shouldn’t be varchars (arrays, nested objects) makes them difficult to work with and potentially lossy. If you want to retrieve something from the varchar you’ll need to parse the JSON (or cast it to SUPER), which - at query time - is not a cheap operation. To add insult to injury, a varchar column is going to max out at 65535 bytes (in Redshift). If you have an array or a nested object which can’t be represented in those bytes, your options are to either truncate it - resulting in a situation where the JSON is no longer valid and can’t be parsed - or to not load the row at all.

  • Shredding data means that there is a significant storage overhead. Storing a (timestamp/root_tstamp, event_id/root_id) pair for every single table results in significant data duplication. For an event that is sent with four contexts you are storing (uncompressed) 8 bytes (timestamp) + 36 bytes (event_id) per pair, times 5 pairs (1 for the atomic table, 4 for the shredded tables), for a total of (8 + 36) * 5 = 220 bytes (assuming, in the best case, that each context only has one row), compared to just 44 bytes in the wide table model. Yes - Redshift can compress this data (somewhat), but you don’t want to compress the timestamp in any shredded tables (as it is a sort key and compression will degrade performance), and root_id has extremely high cardinality by design, so it isn’t going to compress particularly well across different events.

  • Shredding has a non-zero computational cost and requires additional CPU, memory and, importantly, time when compared to the wide-table counterpart, which increases the latency of data between enrichment and sinking into storage. The database (depending on which database) incurs additional resources having to maintain multiple sorted and distributed tables. There’s additional engineering time in building, testing, supporting and maintaining any component of the pipeline as well.

  • The shredded design isn’t particularly easy to reason about, even for people quite familiar with the Snowplow data model.

For example - how many rows does this join query return?

SELECT
   app_id
FROM
   atomic.events
   LEFT JOIN atomic.com_shredded
     ON events.collector_tstamp = com_shredded.root_tstamp
     AND events.event_id = com_shredded.root_id
WHERE
   event_id = 'example-123'

The answer here could be 1, or 2, or any positive integer above 1, depending on how many instances of that context were attached to the event when it was sent. That’s not particularly clear at query time, and I would argue it isn’t intuitive for an end user - nor for the query optimiser, which isn’t aware of this either (primary key and foreign key constraints can be used by the query optimiser, but in this case the key in the context table doesn’t fulfil this constraint).

  • The mutator needs to have more knowledge of the database it’s loading into. Want to use a property that is a reserved keyword like “table”, or one that starts with a number? That’s fine, but the mutator now needs to be aware of every SQL reserved keyword so that it can ensure it’s quoted appropriately when the table is created / altered.

The shredded model has worked really well in Redshift for a number of years, and I believe it was absolutely the right solution at the time and a dramatic improvement on its predecessor. But almost all (I think) Snowplow targets now support some notion of storing semi-structured data (Redshift - SUPER, Postgres - JSON(B), Snowflake - VARIANT, BigQuery - JSON), which makes the wide table model more of a reality across databases. This isn’t just beneficial from a usability and compute perspective, but allows standardisation of data models and makes it far easier for folks to migrate from one warehouse to another.
