We received this question from one of our users:
Do people tend to modify the sortkey or distkey of the atomic tables in Redshift if they decide to use something other than
collector_tsamp
?
Because the answer might help other users, we’re cross-posting it to Discourse.
It’s possible but we do not recommend it, for the following reasons:
SORTKEY
1. You want joins between the different atomic tables to be fast
We recommend joining the atomic tables on both event_id = root_id and collector_tstamp = root_tstamp
(this prevents issues with duplicates - more information). Updating the SORTKEY in atomic.events
will slow down these joins.
2. It will increase the number of unsorted rows in atomic
Events are sorted on collector_tstamp
when we load them into Redshift. In almost all cases, events in a later run will all have a collector_tstamp
that is greater than those in all earlier runs. This prevents the table from becoming unsorted over time.
With the derived_tstamp
, it’s possible that events in a later run will have a derived_timestamp
smaller than an earlier arriving event (e.g. if that event was cached for a long time). The more unsorted rows there are, the more often a vacuum needs to be run.
DISTKEY
As standard all our table definitions have the distkey set to the event_id
or root_id
to ensure an even distribution.
Changing this (e.g. to domain_userid
) will have two negative impacts:
1.Increasing data skew
Data skew will affect query performance. Amazon has an article on data skew including a query to find the skew value: Identifying Tables with Data Skew or Unsorted Rows. A skew value of 4.00 or higher is considered undesirable.
2. Slow joins
It will slow down joins between the events table and your other atomic tables because data will need to be shuffled between nodes.