The introduction of GDPR will dramatically change the rights of data subjects and put the onus on data controllers with respect to data capture and processing. Amongst these changes is the right of a data subject to “be forgotten”. Broadly speaking, this means that an individual will be able to request that any data held on them be removed from all the data stores that a company uses. This will pose a daunting challenge for companies without well-structured data. Luckily, the structure of Snowplow data makes complying relatively straightforward. The following tutorial outlines an approach to deleting customer data from Redshift.
Assumptions
- A request has been made to delete all data belonging to a specific user. We’ll be using the `user_id` as the identifier in this tutorial, but the same concepts can be applied to other fields (e.g. `domain_userid`, `user_ipaddress` or any other field that can be used to identify someone).
- The data controller has in-batch deduplication enabled (and has done since they started using Snowplow). This simply requires running Snowplow R76 or later.
- The business runs a data model which is solely derived and recomputed from the `atomic` data daily. This means that in removing the customer data from the `atomic` data in Redshift, the modeled tables will also be cleared upon recomputation. Some further thought is required for incremental data models - this is out of scope for this tutorial.
Deleting data from Amazon Redshift
The first thing that we need to do is create a table which contains all of the `event_id` and `collector_tstamp` combinations that have been collected in `atomic.events`.
Whilst creating a table is not technically necessary, it’s a safer route. We’ll be using this table to remove all rows of data from every table in the `atomic` schema other than `atomic.events`. Once the rows are removed from the table which contains the user or device identifier, it is impossible to determine which rows in the other tables belong to the user who requested to be forgotten.
1. Identify data for deletion
CREATE SCHEMA IF NOT EXISTS scratch;
DROP TABLE IF EXISTS scratch.deletion_fact_table;
CREATE TABLE scratch.deletion_fact_table
DISTKEY(event_id)
SORTKEY(collector_tstamp)
AS (
SELECT
event_id,
collector_tstamp
FROM atomic.events
WHERE user_id = 'Data Subject'
);
It’s always worth sanity checking this data before we go ahead and use it to delete from the database:
SELECT COUNT(*) FROM scratch.deletion_fact_table;
If this number makes sense then we’re good to continue!
2. Find all tables that need to be cleared
We now need to identify all the tables that we need to delete from - this could be a lot of tables, so it’s worth saving ourselves some time with a query:
SELECT
DISTINCT table_name
FROM information_schema.columns
WHERE table_schema = 'atomic'
AND column_name = 'root_id';
3. Delete data from tables identified in the previous step
The following SQL uses the `scratch.deletion_fact_table` created in step 1 to delete the rows associated with the data subject from one of the tables identified in step 2:
DELETE FROM atomic.com_mycompany_myevent_1 AS m
USING scratch.deletion_fact_table AS d
WHERE m.root_id = d.event_id
AND m.root_tstamp = d.collector_tstamp;
This step should be applied to all of the tables that don’t contain the user or device identifier being used (in this case, all tables apart from `atomic.events`).
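Rather than writing each DELETE statement by hand, the table list from step 2 can be used to generate them. The following is a minimal sketch, assuming every table found in step 2 follows the standard `root_id` / `root_tstamp` convention; the generated statements should be reviewed before being run one at a time:

```sql
-- Generate one DELETE statement per atomic table that joins back to
-- events via root_id / root_tstamp (i.e. all tables apart from atomic.events).
SELECT
  'DELETE FROM atomic.' || table_name
  || ' USING scratch.deletion_fact_table AS d'
  || ' WHERE root_id = d.event_id AND root_tstamp = d.collector_tstamp;'
FROM (
  SELECT DISTINCT table_name
  FROM information_schema.columns
  WHERE table_schema = 'atomic'
    AND column_name = 'root_id'
) AS t;
```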
4. Delete from atomic.events
Once we’ve removed the rows from the tables which don’t contain the user identifiers, we can delete from `atomic.events`:
DELETE FROM atomic.events AS a
USING scratch.deletion_fact_table AS d
WHERE a.event_id = d.event_id
AND a.collector_tstamp = d.collector_tstamp;
5. Clean up!
At this point, if you don’t run a regular vacuum process on your Redshift cluster, it’d be worth running a vacuum statement across all of the tables in the database:
VACUUM;
Before we drop the `scratch.deletion_fact_table`, we can run the following so we know the exact time range for which data on the subject was deleted from the database:
SELECT
MIN(collector_tstamp),
MAX(collector_tstamp)
FROM scratch.deletion_fact_table;
Amazon Redshift Snapshots
Redshift’s snapshots feature creates automatic and/or manual snapshots of your cluster, which means data you delete can persist in a snapshot for some time after deletion.
Automated snapshots are enabled by default when you create a cluster. They are taken at regular intervals and stored for a specified retention period. The default retention period is one day, but you can modify that, so check your settings to see for how long the data will be kept. These snapshots are deleted at the end of a retention period.
At any point you can take a manual snapshot, which will never be automatically deleted by Redshift. Manual snapshots are retained even after you delete your cluster. Check to see if you have any manual snapshots that include the data subject’s data.
Snapshots are stored in S3 and there might be copies in different regions. Some copies might have been shared with third parties.
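As a starting point for that check, manual snapshots can be listed with the AWS CLI. This is a sketch assuming the CLI is configured for the account that owns the cluster; `my-cluster` is a placeholder for your cluster identifier:

```shell
# List manual snapshots for a given cluster. These are never deleted
# automatically, so each one may still contain the data subject's data.
aws redshift describe-cluster-snapshots \
  --cluster-identifier my-cluster \
  --snapshot-type manual
```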
Snowflake
If you use Snowflake and would like to do the same thing, here’s our post on how to do that: