A new compression option in Redshift allows you to make big storage savings, up to two-thirds in our tests, over the standard Snowplow setup. This guide shows how it works and how to apply it.
In late 2016 Facebook open sourced a compression algorithm known as Zstandard, which combines Lempel-Ziv and tANS (tabled asymmetric numeral systems) to achieve compression ratios better than many algorithms, with only a slight tradeoff in speed.
In January 2017 AWS added ZSTD support to Redshift, so you can now use this powerful compression algorithm to reduce the storage size of your data. One immense perk of ZSTD is that it can be applied across all supported data types. It gives exceptional performance on long varchars: perfectly suited for the large JSON strings that regularly appear in shredded tables.
We’ve had great success both experimentally and practically by applying ZSTD compression to multiple Snowplow tables. Experimentally, across datasets ranging between 10 million and 5 billion rows, we’ve achieved a mean compression ratio of ~3, meaning that the newly compressed table takes up approximately a third of the space on disk compared to the compression defaults in atomic.events 0.8.0.
In particular we’ve found:
Negligible impact on speed of queries for atomic.events (note this will be highly workload dependent so benchmarks here are less useful)
So far we haven’t experimented with the performance/space savings of compressing SORTKEYs. AWS recommends against compressing SORTKEYs, so use ENCODE RAW for collector_tstamp and root_tstamp.
ZSTD in almost all instances replaces LZO as the default compression method suggested by ANALYZE COMPRESSION.
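As a sketch, a table defined along these lines applies ZSTD to ordinary columns while leaving the SORTKEY uncompressed. The column names and sizes here are illustrative, not the full atomic.events DDL:

```sql
-- Simplified sketch, not the real atomic.events definition:
-- ZSTD on regular columns, ENCODE RAW on the SORTKEY column per AWS guidance.
CREATE TABLE atomic.events_example (
  event_id         CHAR(36)       ENCODE ZSTD,
  collector_tstamp TIMESTAMP      ENCODE RAW,   -- SORTKEY: leave uncompressed
  event_name       VARCHAR(128)   ENCODE ZSTD,
  page_url         VARCHAR(4096)  ENCODE ZSTD,  -- long varchars compress well with ZSTD
  contexts         VARCHAR(65535) ENCODE ZSTD   -- e.g. large JSON strings
)
SORTKEY (collector_tstamp);
```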
Things to take into account
Depending on your workload you may want to run ANALYZE COMPRESSION on your table, which will provide Redshift’s recommendations for the column encodings. One caveat of this approach is that you are limited to sampling 1 billion rows, so if possible choose a sample data set that contains representative variability within columns.
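For example, the sample size can be tuned with the COMPROWS option (the table name is a placeholder; COMPROWS is rows per slice, and the overall sample is capped at 1 billion rows):

```sql
-- Ask Redshift to recommend an encoding per column,
-- sampling up to 1,000,000 rows from each data slice.
ANALYZE COMPRESSION atomic.events COMPROWS 1000000;
```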
If you have the opportunity we also recommend benchmarking common queries/jobs on an identical sample of data for a) the 0.8.0 compression defaults and b) the newly compressed tables.
You cannot modify compression on existing columns in a table, so consider deep copying the data, particularly if a large region of your table is unsorted, as this will outperform a VACUUM. If performing a deep copy ensure you have sufficient disk space to complete the action: it’s difficult to know how much space is required but we opt for at least 50% of the cluster storage remaining. You may need to consider resizing the cluster and/or temporarily pausing your pipeline to complete this action.
If you decide to drop the original atomic.events table, ensure that you either DROP CASCADE or individually drop any dependencies that rely on this table, such as views or foreign key constraints from shredded or derived tables.
If you’ve dropped any foreign keys, ensure that you recreate them and that they reference the new atomic.events object. Although Redshift does not enforce these constraints, they are used by the query planner to generate optimised query plans.
As this is a new object you may also need to regrant permissions to users/groups, as well as ensure the table owner is identical to that of the original table(s), so that Snowplow can continue to load data.
1. Analyze compression on your existing table by using ANALYZE COMPRESSION.
2. If Redshift recommends using ZSTD for columns, consider benchmarking a sample of data (e.g., 3 months) with the original column compression and with ZSTD column compression.
3. If performance is better or equivalent, deep copy data from the original atomic.events table to a new atomic.events_new table. You may need to resize Redshift or free up additional disk space before performing this action.
4. Verify that the two tables contain identical data by comparing a total row count as well as a row count per day.
5. Drop the original table as well as any references. ALTER the new table to atomic.events. Ensure that this new table has an identical owner.
6. Sit back and enjoy that free disk space.
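The migration steps above can be sketched as follows. The column list is abbreviated to two hypothetical columns, and the owner name is an assumption; substitute your own DDL and loader user:

```sql
-- 1. Create the new table with the desired encodings (columns abbreviated).
CREATE TABLE atomic.events_new (
  event_id         CHAR(36)  ENCODE ZSTD,
  collector_tstamp TIMESTAMP ENCODE RAW
)
SORTKEY (collector_tstamp);

-- 2. Deep copy: rows are decompressed on read, recompressed on write,
--    and land fully sorted in the new table.
INSERT INTO atomic.events_new
SELECT * FROM atomic.events;

-- 3. Verify the tables match: total row count and rows per day.
SELECT COUNT(*) FROM atomic.events;
SELECT COUNT(*) FROM atomic.events_new;
SELECT TRUNC(collector_tstamp) AS day, COUNT(*)
FROM atomic.events_new GROUP BY 1 ORDER BY 1;

-- 4. Drop the original (CASCADE also removes dependent views and
--    foreign key constraints), rename the new table into place,
--    and restore the owner so loading can continue.
DROP TABLE atomic.events CASCADE;
ALTER TABLE atomic.events_new RENAME TO events;
ALTER TABLE atomic.events OWNER TO storageloader;  -- owner name is an assumption
```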
There are a few gotchas and edge cases involved in doing this. If you’d like help or advice on how to do it, feel free to post in the comments below or get in touch.
I had the problem that we have a lot of users and it’s complicated to look up which user has which permissions. But we have to restore the permissions if we create a new table to fix the encoding, so I wrote a SQL query for it; I thought it might also be interesting for others.
SELECT
  CASE WHEN tableowner = usename THEN 'ALTER TABLE ' || u.schemaname || '.' || objectname || ' OWNER TO ' || usename END AS owner
  ,CASE WHEN sel THEN 'GRANT SELECT ON TABLE ' || u.schemaname || '.' || objectname || ' TO ' || usename END AS selects
  ,CASE WHEN ins THEN 'GRANT INSERT ON TABLE ' || u.schemaname || '.' || objectname || ' TO ' || usename END AS inserts
  ,CASE WHEN upd THEN 'GRANT UPDATE ON TABLE ' || u.schemaname || '.' || objectname || ' TO ' || usename END AS updates
  ,CASE WHEN del THEN 'GRANT DELETE ON TABLE ' || u.schemaname || '.' || objectname || ' TO ' || usename END AS deletes
  ,CASE WHEN ref THEN 'GRANT REFERENCES ON TABLE ' || u.schemaname || '.' || objectname || ' TO ' || usename END AS refs
FROM (
  SELECT objs.schemaname, objs.objectname, objs.fullobj, usrs.usename
    ,HAS_TABLE_PRIVILEGE(usrs.usename, fullobj, 'select') AS sel
    ,HAS_TABLE_PRIVILEGE(usrs.usename, fullobj, 'insert') AS ins
    ,HAS_TABLE_PRIVILEGE(usrs.usename, fullobj, 'update') AS upd
    ,HAS_TABLE_PRIVILEGE(usrs.usename, fullobj, 'delete') AS del
    ,HAS_TABLE_PRIVILEGE(usrs.usename, fullobj, 'references') AS ref
  FROM (
    SELECT schemaname, 't' AS obj_type, tablename AS objectname, schemaname || '.' || tablename AS fullobj FROM pg_tables
    WHERE schemaname NOT IN ('pg_internal')
    UNION
    SELECT schemaname, 'v' AS obj_type, viewname AS objectname, schemaname || '.' || viewname AS fullobj FROM pg_views
    WHERE schemaname NOT IN ('pg_internal')
  ) AS objs
  ,(SELECT * FROM pg_user) AS usrs
  ORDER BY fullobj
) AS u
JOIN (SELECT tableowner, schemaname, tablename FROM pg_tables) AS t
  ON u.schemaname = t.schemaname AND tablename = objectname
WHERE (sel = true OR ins = true OR upd = true OR del = true OR ref = true)
  AND u.schemaname = '<enter schema name>'
  AND objectname = '<enter table name>';
a) I ran ANALYZE COMPRESSION on the existing table: this approach suggested ZSTD compression for all columns, including the SORT KEY column.
b) I created a new table using the existing table’s DDL and used the COPY command to get column compression encodings (COPY selects column compression encodings when loading data into an empty table): the COPY command suggested LZO for all columns, including the SORT KEY column.
Which approach is correct or more optimised?
If SORT KEY column compression is bad, will ZSTD on the SORT KEY column improve performance?
I’d recommend using ENCODE RAW on SORT KEYs rather than applying any compression. We’ve found that compressing the SORT KEY column reduces rather than improves performance: although the column will be smaller on disk, this translates to poorer query performance.
In order to select column compression encodings for an existing table, to save some space and improve performance, which approach is best: ANALYZE COMPRESSION on the table, or creating a new table from the existing table’s DDL and letting the COPY command select the column compression encodings?
I tend to create the new table from scratch (with manually specified encodings) and then do a deep copy from the existing table into the new table. Whichever method you opt for, the automatically recommended compressions tend to be good but don’t take into account your query access patterns, which is why I always opt for manually defining these encodings.
One thing to take into account when you are deep copying is that the data will be decompressed as part of this step (before being compressed again in the new table) so ensure that you have a reasonably large amount of disk space left before performing this operation. A deep copy will also ensure your data is completely sorted, which can be quite useful if you currently have unsorted regions in your existing table.
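As a short sketch of that approach: creating the new table with CREATE TABLE ... (LIKE ...) would inherit the old encodings, so instead define the new table explicitly with the encodings you want and insert into it. The columns here are hypothetical placeholders:

```sql
-- Define the new table explicitly so the encodings can differ from the original.
CREATE TABLE atomic.events_recompressed (
  event_id         CHAR(36)  ENCODE ZSTD,
  collector_tstamp TIMESTAMP ENCODE RAW
)
SORTKEY (collector_tstamp);

-- The deep copy rewrites every row: decompressed on read, recompressed on
-- write, and emitted fully sorted into the new table.
INSERT INTO atomic.events_recompressed
SELECT event_id, collector_tstamp FROM atomic.events;
```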