Can I change the DISTKEY or SORTKEY of the atomic tables in Redshift?

leon · August 18, 2016, 9:34am

We received this question from one of our users:

Do people tend to modify the sortkey or distkey of the atomic tables in Redshift if they decide to use something other than collector_tsamp?

Because the answer might help other users, we’re cross-posting it to Discourse.

It’s possible but we do not recommend it, for the following reasons:

SORTKEY

1. You want joins between the different atomic tables to be fast

We recommend joining the atomic tables on both event_id = root_id and collector_tstamp = root_tstamp (this prevents issues with duplicates - more information). Updating the SORTKEY in atomic.events will slow down these joins.

2. It will increase the number of unsorted rows in atomic

Events are sorted on collector_tstamp when we load them into Redshift. In almost all cases, events in a later run will all have a collector_tstamp that is greater than those in all earlier runs. This prevents the table from becoming unsorted over time.

With the derived_tstamp, it’s possible that events in a later run will have a derived_timestamp smaller than an earlier arriving event (e.g. if that event was cached for a long time). The more unsorted rows there are, the more often a vacuum needs to be run.

DISTKEY

As standard all our table definitions have the distkey set to the event_id or root_id to ensure an even distribution.

Changing this (e.g. to domain_userid) will have two negative impacts:

1.Increasing data skew

Data skew will affect query performance. Amazon has an article on data skew including a query to find the skew value: Identifying Tables with Data Skew or Unsorted Rows. A skew value of 4.00 or higher is considered undesirable.

2. Slow joins

It will slow down joins between the events table and your other atomic tables because data will need to be shuffled between nodes.

dweitzenfeld · November 17, 2016, 9:32pm

What do you recommend for session-level tables? Keep in mind that I create some session-level fact PDTs in Looker, and I often join these to the underlying Session table on session_id… seems like I should use a distkey of session_id?

christophe · November 21, 2016, 7:58am

@dweitzenfeld The choice of DIST and SORTKEY will indeed be different for derived tables. For a sessions table, session_id might be a good candidate. However, if you need to join the sessions table with a users table, then the relevant user ID might be a better choice.

brucey31 · November 21, 2016, 7:11pm

In general

Use Dist keys for things that you are going to group on
Use Sort keys for things you are doing to use in where clauses

to optimise performance.

Topic		Replies	Views
Redshift Indexing	3	1122	October 1, 2019
Redshift maintenance best practices Storage targets	3	5103	April 26, 2017
Making Redshift Use Merge Joins Instead of Slower Hash Joins For engineers	8	2182	April 8, 2024
Unstructured Events and the events table Data store sources	5	2082	March 7, 2019
Passing values from atomic.events to a custom table Redshift	5	4322	May 11, 2017