I am trying to figure out if there’s a way to configure simpler column names for self-describing events and entities. It would simplify data modeling and querying for simpler use cases.
I am thinking of avoiding the prefixes as well as the version number, since they could be stored in a dedicated field.
The documentation says this:

> The column name is prefixed by `unstruct_event_` for self-describing events, and by `contexts_` for entities. *(In case you were wondering, those are the legacy terms for self-describing events and entities, respectively.)*
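To make the convention concrete, the mapping from a schema reference to a column name can be sketched roughly like this. This is a hedged illustration only: the exact rules live inside the loaders, and the helper function here is invented, not a Snowplow API.

```python
def column_name(kind: str, vendor: str, name: str, model: int) -> str:
    """Illustrative sketch of how a loader might derive a column name
    from an Iglu schema reference. `kind` is "event" or "entity"; the
    real RDB/Lake Loader implementation may differ in details."""
    prefix = "unstruct_event_" if kind == "event" else "contexts_"
    # Dots and dashes in the vendor/name become underscores, and only
    # the major schema version (the model) is kept in the column name.
    clean = lambda s: s.replace(".", "_").replace("-", "_").lower()
    return f"{prefix}{clean(vendor)}_{clean(name)}_{model}"

print(column_name("event", "com.acme", "link_click", 1))
# unstruct_event_com_acme_link_click_1
```

Note how the prefix alone is what tells an event column apart from an entity column, which is relevant to the collision question below.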
In theory this is achievable with code changes - but how would you differentiate between entities and events (which is what the `contexts_` / `unstruct_event_` prefixes do), and how would you disambiguate collisions with things like event names?
Generally we’d recommend doing this in the downstream data models built on the events table - we don’t see too many consumers querying events directly.
@mike For very simple use cases, those constraints may not apply upfront (much like structured events), but such options would help in simplifying or customising the raw events loaded to the targets. It’s just that there is a bit of a learning curve when a team adopts Snowplow.
Yes @mike, I agree on this part. This can be taken care of by a view or a layer with customisations.
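As a concrete sketch of such a layer, something like the following could generate a renaming view over the events table. The function, view, and column names are invented for illustration; this is not a Snowplow tool, just one way the aliasing could be done.

```python
def build_rename_view(view_name: str, table: str, renames: dict) -> str:
    """Emit a CREATE VIEW statement that aliases the verbose loader
    column names to friendlier ones, leaving the loader's standard
    naming untouched in the underlying table. Purely illustrative."""
    select_list = ",\n  ".join(
        f"{source} AS {alias}" for source, alias in renames.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n  {select_list}\nFROM {table};"
    )

ddl = build_rename_view(
    "analytics.events_simple",          # hypothetical view name
    "atomic.events",                     # standard Snowplow events table
    {"unstruct_event_com_acme_link_click_1": "link_click"},
)
print(ddl)
```

The point is that the renaming lives in one downstream object, so the loaders and packaged data models keep working against the standard column names.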
@mike Nonetheless, I feel having such customisations/options in the RDB and Lake Loaders may add some advantages. But again, it’s just an opinion.
I think there are definitely some advantages to customisation, but having this standard (applied globally across all events and entities) enables us to build robust data models, like the recently released unified model, that rely on these conventions being in place. If we allowed more flexible renaming in the destination target, we would need to add complexity to each data model (and in turn for each data warehouse) to account for variability in these column names.
Other changes - like storing the version as part of the column name rather than within the column - often allow us to query warehouses more efficiently. In BigQuery, for example, this means the user can scan fewer bytes (by selecting only the values they need); in Snowflake you can similarly reduce the computation required, by not needing as many columns and by not needing to reference a field inside the VARIANT. Many query engines either do not calculate statistics on properties within these columns, or provide only coarse, block-level statistics, so the query planner isn’t able to optimise the scans as much as is otherwise possible.
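A toy illustration of the bytes-scanned point - nothing to do with BigQuery or Snowflake internals, just a sketch of why reading a dedicated column is cheaper than digging a field out of a serialized JSON/VARIANT blob. The data shapes here are made up.

```python
import json

# A thousand fake event payloads, each carrying a version field.
rows = [
    {"event_id": i, "payload": {"version": "1-0-2", "value": i * 10}}
    for i in range(1000)
]

# Columnar layout: the engine can touch only the bytes of one column.
version_column = [r["payload"]["version"] for r in rows]
columnar_bytes = sum(len(v) for v in version_column)

# Blob layout: to get the same field, the whole serialized payload
# has to be read (and parsed) for every row.
blob_bytes = sum(len(json.dumps(r["payload"])) for r in rows)

print(columnar_bytes < blob_bytes)  # True: far fewer bytes touched
```

The same intuition applies to statistics: an engine that knows min/max per column (or per block) can skip whole chunks, which it cannot do for an opaque blob.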
Thank you @mike, I get your point.
On your second point, I am trying to work through each part you mentioned.
When you mention fewer bytes, is it about the low cardinality of the version columns? Or is there more to it that I am missing?