Hi @Brandon_Kane,
This is a good question!
On the topic of cluster keys in general I find Snowflake’s docs to be quite helpful. The ideal cluster keys are ordered, and have the right cardinality to distribute the data relatively evenly - too high is expensive, too low is inefficient.
This makes things quite tricky when choosing one for Snowplow data - event_name
isn’t ordered and is likely too low cardinality. Timestamps are very high cardinality, and so typically aren’t great either. etl_tstamp
once was a good candidate, back when the enrich job was batch rather than streaming, since it naturally grouped the data nicely by the time of spark-enrich run. Unfortunately this is no longer the case.
Unfortunately there’s no good single candidate in the atomic data - any tests I’ve done so far (admittedly these tests have been limited) have revealed that no combination that I found performed any better than Snowflake’s built-in autoclustering.
Sorry to not be able to give you something more helpful on the speific question you have asked - however, taking a step back and looking at the why, perhaps I can be more helpful.
In general, no matter what database you use, the size of the events table eventually leads to some iteration of this problem. If you’re querying that table for your analysis, it’ll eventually be inefficient and more expensive than it needs to be, no matter how well you optimise. However, the events table is intended to be a master record of events, rather than directly consumed.
If you don’t already do this, I would strongly recommend creating a ‘data modeling’ process - a SQL job which aggregates the relevant data into derived tables, which are then what you query for analysis. (Eg aggregate to page view or session level). This job can run on a schedule, and can incrementally update derived tables (limiting on timestamp ensure good performance). The derived tables can have ideal cluster keys by, for example, truncating dates to day.
You can find some example models in this repo - although they’re not incremental and they’re quite old versions - it should give you the idea of the logic for web data. We are currently working on whipping some of our closed-source models into shape for an OS release, which will include incremental logic.
I hope that’s helpful?
Would love to hear if others have better suggestions on the topic of clustering!