Recommended cluster key for events table in Snowflake?

Brandon_Kane · July 20, 2020, 8:31pm

We have been using Snowplow for a while now and last year migrated from RedShift to Snowflake for our data warehouse, and couldn’t be happier with the move.

However now our events table is getting fairly large (1.5 billion records, growing at ~10 million/day) and we are looking at some of the Snowflake features to keep peak performance. One is the ability to set a cluster key on the table. We typically filter on things like the event name, and timestamps, so those are obvious choices.

Was curious what others are running for their event table cluster key and how they arrived at the list of columns to use.

Thanks in advance,

Brandon

Colm · July 21, 2020, 9:20am

Hi @Brandon_Kane,

This is a good question!

On the topic of cluster keys in general I find Snowflake’s docs to be quite helpful. The ideal cluster keys are ordered, and have the right cardinality to distribute the data relatively evenly - too high is expensive, too low is inefficient.

This makes things quite tricky when choosing one for Snowplow data - event_name isn’t ordered and is likely too low cardinality. Timestamps are very high cardinality, and so typically aren’t great either. etl_tstamp once was a good candidate, back when the enrich job was batch rather than streaming, since it naturally grouped the data nicely by the time of spark-enrich run. Unfortunately this is no longer the case.

Unfortunately there’s no good single candidate in the atomic data - any tests I’ve done so far (admittedly these tests have been limited) have revealed that no combination that I found performed any better than Snowflake’s built-in autoclustering.

Sorry to not be able to give you something more helpful on the speific question you have asked - however, taking a step back and looking at the why, perhaps I can be more helpful.

In general, no matter what database you use, the size of the events table eventually leads to some iteration of this problem. If you’re querying that table for your analysis, it’ll eventually be inefficient and more expensive than it needs to be, no matter how well you optimise. However, the events table is intended to be a master record of events, rather than directly consumed.

If you don’t already do this, I would strongly recommend creating a ‘data modeling’ process - a SQL job which aggregates the relevant data into derived tables, which are then what you query for analysis. (Eg aggregate to page view or session level). This job can run on a schedule, and can incrementally update derived tables (limiting on timestamp ensure good performance). The derived tables can have ideal cluster keys by, for example, truncating dates to day.

You can find some example models in this repo - although they’re not incremental and they’re quite old versions - it should give you the idea of the logic for web data. We are currently working on whipping some of our closed-source models into shape for an OS release, which will include incremental logic.

I hope that’s helpful?

Would love to hear if others have better suggestions on the topic of clustering!

Brandon_Kane · July 22, 2020, 2:10am

@Colm, thanks for the detailed reply! I had looked at that doc, one of the things that they suggest for the timestamp to reduce cardinality is using a function to truncate the timestamp to_date() so that reduces it to about 1000 entries for us as we have about 3 years of history in there.

We’re not hitting the events table directly for analytics, this current effort is to speed up the process that aggregates the event table into our models that are used for analytics. We are still on the batch load, so are loading data into events three times per day. We want to reduce the time that process takes so we can run it 6-8 times per day.

Our use case is a bit more specialized than the one in the repo, so we use a few different models. Was hoping someone else had gone through the process of tuning Snowflake for Snowplow and might be able to throw some specific pointers our way.

We will document the process as we go through this and I’ll post our findings back here in a few weeks once we have some results.

Brandon

Colm · July 22, 2020, 9:05am

Ah ok, that all makes sense.

Apologies I couldn’t offer any better insight on tuning Snowflake. Would be very interested to hear how you get on!

Topic		Replies	Views
Redshift Indexing	3	1122	October 1, 2019
How Snowplow data is structured in Snowflake Snowflake	5	4695	May 8, 2020
Duplicate events, using event_id as partition_key Troubleshooting	1	2750	October 20, 2017
Snoplow on Postgres - encoded fields, constraints, DISTKEYS and other doubts For engineers	2	779	June 3, 2019
Unstructured Events and the events table Data store sources	5	2082	March 7, 2019

Recommended cluster key for events table in Snowflake?

Related topics