I am trying to figure out if there’s a way to configure simpler column names for self-describing events and entities. It would simplify data modeling and querying for simpler use cases.
I am thinking of avoiding the prefixes as well as the version number, since they could be stored in a dedicated field.
The documentation says this:

> The column name is prefixed by `unstruct_event_` for self-describing events, and by `contexts_` for entities. *(In case you were wondering, those are the legacy terms for self-describing events and entities, respectively.)*
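To make the convention concrete, the mapping from a schema reference to a column name can be sketched roughly like this. This is a hedged illustration only: the exact rules live inside the loaders, and the helper function here is invented, not a Snowplow API.

```python
def column_name(kind: str, vendor: str, name: str, model: int) -> str:
    """Illustrative sketch of how a loader might derive a column name
    from an Iglu schema reference. `kind` is "event" or "entity"; the
    real RDB/Lake Loader implementation may differ in details."""
    prefix = "unstruct_event_" if kind == "event" else "contexts_"
    # Dots and dashes in the vendor/name become underscores, and only
    # the major schema version (the model) is kept in the column name.
    clean = lambda s: s.replace(".", "_").replace("-", "_").lower()
    return f"{prefix}{clean(vendor)}_{clean(name)}_{model}"

print(column_name("event", "com.acme", "link_click", 1))
# unstruct_event_com_acme_link_click_1
```

Note how the prefix alone is what tells an event column apart from an entity column, which is relevant to the collision question below.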
In theory this is achievable with code changes - but how would you differentiate between entities and events (which is what the `contexts_` / `unstruct_event_` prefixes do), and how would you disambiguate collisions with things like event names?
Generally we’d recommend doing this in the downstream data models built on the events table - we don’t see too many consumers querying events directly.
@mike For very simple use cases, those constraints may not apply upfront (much like structured events), but such options would help in simplifying or customising the raw events loaded to the targets. It’s just that there is a bit of a learning curve when a team adopts Snowplow.
Yes @mike, I agree on this part. This can be taken care of by a view or a layer with customisations.
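As a concrete sketch of such a layer, something like the following could generate a renaming view over the events table. The function, view, and column names are invented for illustration; this is not a Snowplow tool, just one way the aliasing could be done.

```python
def build_rename_view(view_name: str, table: str, renames: dict) -> str:
    """Emit a CREATE VIEW statement that aliases the verbose loader
    column names to friendlier ones, leaving the loader's standard
    naming untouched in the underlying table. Purely illustrative."""
    select_list = ",\n  ".join(
        f"{source} AS {alias}" for source, alias in renames.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n  {select_list}\nFROM {table};"
    )

ddl = build_rename_view(
    "analytics.events_simple",          # hypothetical view name
    "atomic.events",                     # standard Snowplow events table
    {"unstruct_event_com_acme_link_click_1": "link_click"},
)
print(ddl)
```

The point is that the renaming lives in one downstream object, so the loaders and packaged data models keep working against the standard column names.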
@mike Nonetheless, I feel having such customisations/options in the RDB and Lake Loaders may add some advantages. But again, it’s just an opinion.
I think there are definitely some advantages to customisation, but having this standard (applied globally across all events and entities) enables us to build robust data models, like the recently released unified model, that rely on these conventions being in place. If we allowed more flexible renaming in the destination target, we would need to add complexity to each data model (and in turn for each data warehouse) to account for variability in these column names.
Other changes - like storing the version as part of the column name rather than within the column - often allow us to query warehouses more efficiently. In BigQuery, for example, this means the user can scan fewer bytes (by selecting only the values they need); in Snowflake you can similarly reduce the computation required, by not needing as many columns and by not needing to reference a field inside the VARIANT. Many query engines either do not calculate statistics on properties within these columns, or provide only coarse, block-level statistics, so the query planner isn’t able to optimise the scans as much as is otherwise possible.
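A toy illustration of the bytes-scanned point - nothing to do with BigQuery or Snowflake internals, just a sketch of why reading a dedicated column is cheaper than digging a field out of a serialized JSON/VARIANT blob. The data shapes here are made up.

```python
import json

# A thousand fake event payloads, each carrying a version field.
rows = [
    {"event_id": i, "payload": {"version": "1-0-2", "value": i * 10}}
    for i in range(1000)
]

# Columnar layout: the engine can touch only the bytes of one column.
version_column = [r["payload"]["version"] for r in rows]
columnar_bytes = sum(len(v) for v in version_column)

# Blob layout: to get the same field, the whole serialized payload
# has to be read (and parsed) for every row.
blob_bytes = sum(len(json.dumps(r["payload"])) for r in rows)

print(columnar_bytes < blob_bytes)  # True: far fewer bytes touched
```

The same intuition applies to statistics: an engine that knows min/max per column (or per block) can skip whole chunks, which it cannot do for an opaque blob.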
Thank you @mike, I get your point.
On your second point, I am trying to work through each part you mentioned.
When you mention fewer bytes, is it about the low cardinality of the version columns? Or is there more to it that I am missing?