GDPR: Deleting customer data from Redshift [tutorial]

The introduction of GDPR will dramatically change the rights of data subjects and put the onus on data controllers with respect to data capture and processing. Amongst these changes is the right of a data subject to “be forgotten”. Broadly speaking, this means that an individual will be able to request that any data held on them be removed from all of the data stores that a company uses. This will pose a daunting challenge for companies without well-structured data. Luckily, the structure of Snowplow data makes complying relatively straightforward. The following tutorial outlines an approach to deleting customer data from Redshift.

Assumptions

  • A request has been made to delete all data belonging to a specific user. We’ll be using the user_id as the identifier in this tutorial, but the same concepts can be applied to other fields (e.g. domain_userid, user_ipaddress or any other field that can be used to identify someone) - see the sketch after this list.
  • The data controller has in-batch deduplication enabled (and has done since they started using Snowplow). This simply requires running Snowplow R76 or later.
  • The business runs a data model which is solely derived and recomputed from the atomic data daily. This means that in removing the customer data from the atomic data in Redshift, the modeled tables will also be cleared upon recomputation. Some further thought is required for incremental data models - that is out of scope for this tutorial.
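
If a deletion request arrives with a different identifier, it can help to first look up the other identifiers associated with the user. A minimal sketch, assuming the standard atomic.events columns, which lists the device identifiers seen for a known user_id:

-- Find device identifiers linked to the data subject's user_id
SELECT DISTINCT
  domain_userid
FROM atomic.events
WHERE user_id = 'Data Subject'
AND domain_userid IS NOT NULL;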

Deleting data from Amazon Redshift

The first thing that we need to do is create a table which contains all of the event_id and collector_tstamp combinations in atomic.events that belong to the data subject.

Whilst creating a table is not technically necessary, it’s the safer route. We’ll be using this table to remove all rows of data from every table in the atomic schema other than atomic.events. Once the rows are removed from the table which contains the user or device identifier (atomic.events), it is impossible to determine which rows in the other tables belong to the user who requested to be forgotten - which is why atomic.events is cleared last.

1. Identify data for deletion

CREATE SCHEMA IF NOT EXISTS scratch;

DROP TABLE IF EXISTS scratch.deletion_fact_table;
CREATE TABLE scratch.deletion_fact_table
  DISTKEY(event_id)
  SORTKEY(collector_tstamp)
AS (

  SELECT

    event_id,
    collector_tstamp

  FROM atomic.events

  WHERE user_id = 'Data Subject'

);

It’s always worth sanity checking this data before we go ahead and use it to delete from the database:

SELECT COUNT(*) FROM scratch.deletion_fact_table;

If this number makes sense then we’re good to continue!
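
To go one step further, a quick look at how the events are spread over time can help confirm we’ve captured the right data subject. A sketch using the deletion fact table created above:

-- Daily event counts for the data subject
SELECT
  DATE_TRUNC('day', collector_tstamp) AS day,
  COUNT(*) AS events
FROM scratch.deletion_fact_table
GROUP BY 1
ORDER BY 1;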

2. Find all tables that need to be cleared

We now need to identify all the tables that we need to delete from - this could be a lot of tables so it’s worth saving ourselves some time with a query:

SELECT DISTINCT
  table_name
FROM information_schema.columns
WHERE table_schema = 'atomic'
AND column_name = 'root_id';

3. Delete data from tables identified in the previous step

The following SQL uses the scratch.deletion_fact_table created in step 1 to delete the rows associated with the data subject from one of the tables identified in step 2:

DELETE FROM atomic.com_mycompany_myevent_1 AS m
USING scratch.deletion_fact_table AS d
WHERE m.root_id = d.event_id 
AND m.root_tstamp = d.collector_tstamp;

This step should be applied to all of the tables that don’t contain the user or device identifier being used (in this case all tables apart from atomic.events).
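
If there are a lot of these tables, one option is to generate the DELETE statements from information_schema rather than writing them by hand. This is a sketch - the generated statements mirror the one above and should be reviewed before being run:

-- Generate one DELETE statement per table found in step 2
SELECT
  'DELETE FROM atomic.' || table_name || ' AS m ' ||
  'USING scratch.deletion_fact_table AS d ' ||
  'WHERE m.root_id = d.event_id ' ||
  'AND m.root_tstamp = d.collector_tstamp;' AS delete_statement
FROM information_schema.columns
WHERE table_schema = 'atomic'
AND column_name = 'root_id';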

4. Delete from atomic.events

Once we’ve removed the rows from the tables which don’t contain the user identifiers we can delete from atomic.events:

DELETE FROM atomic.events AS a
USING scratch.deletion_fact_table AS d
WHERE a.event_id = d.event_id 
AND a.collector_tstamp = d.collector_tstamp;
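
A quick check afterwards that nothing identifiable is left behind - the count should come back as zero:

SELECT COUNT(*)
FROM atomic.events
WHERE user_id = 'Data Subject';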

5. Clean up!

At this point, if you don’t already run a regular vacuum process on your Redshift cluster, it’d be worth running a VACUUM statement across all of the tables in the database:

VACUUM;
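
If a full vacuum of the database is too heavy to run in one go, Redshift also lets you vacuum a single table, and a DELETE ONLY pass just reclaims the space freed by deleted rows without re-sorting. For example:

-- Reclaim space from deleted rows in a single table
VACUUM DELETE ONLY atomic.events;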

Before we drop the scratch.deletion_fact_table, we can run the following so we have a record of the exact time range for which data on the subject was deleted from the database:

SELECT
  MIN(collector_tstamp),
  MAX(collector_tstamp)
FROM scratch.deletion_fact_table;
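
Once that range has been noted, the scratch table itself can be dropped:

DROP TABLE scratch.deletion_fact_table;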

Amazon Redshift Snapshots

Redshift’s snapshots feature takes automatic and/or manual snapshots of your cluster, which means data you delete can persist for some time after deletion.

Automated snapshots are enabled by default when you create a cluster. They are taken at regular intervals and stored for a specified retention period. The default retention period is one day, but you can modify that, so check your settings to see how long the data will be kept. These snapshots are deleted at the end of the retention period.

At any point you can take a manual snapshot, which will never be automatically deleted by Redshift. Manual snapshots are retained even after you delete your cluster. Check to see if you have any manual snapshots that include the data subject’s data.

Snapshots are stored in S3 and there might be copies in different regions. Some copies might have been shared with third parties.

Snowflake

If you use Snowflake and would like to do the same thing, here’s our post on how to do that:
