We are loading our Snowplow data into Redshift using the rdb-loader-redshift container.
The problem we have is that queries are quite slow when incorporating even the most basic joins.
For example, we presently only have about 50M rows in the events and nl_basjes_yauaa_context_1 tables, but the following basic query takes over 30 seconds to complete. (Our actual queries with a few joins in them take over 10 minutes, even with this minimal data set.)
SELECT e.event_id, yauaa.device_class
FROM events e
LEFT JOIN nl_basjes_yauaa_context_1 yauaa ON e.event_id = yauaa.root_id
ORDER BY e.collector_tstamp DESC
LIMIT 100;
In looking at the EXPLAIN output, I noticed that Redshift is doing a “hash join” in this query, which seems to be pretty much solely responsible for the slowness.
I noticed that by default, the event_id
or root_id
is set as the Redshift distribution key and collector_tstamp
is set as the Redshift sort key. It seems like perhaps that setup should enable a merge join, and yet it’s not happening.
After discovering this stack overflow post, I tried running full vacuums on the two tables:
VACUUM FULL events TO 100 PERCENT;
VACUUM FULL nl_basjes_yauaa_context_1 TO 100 PERCENT;
But Redshift is still using a hash join, even immediately after running the vacuum, before inserting more data.
Perhaps based on the Redshift best practices, it would be possible if the merge key and sort key were the same, although that seems like it may have other undesirable consequences.
My question is: Is it possible to adjust the Redshift configuration in some way to use a merge join and speed up my queries? And if so, how could I configure docker and terraform to deploy that configuration?