This tutorial is a follow-up to our guide to deleting customer data from Redshift. It is meant to help Snowplow users who use Snowflake as a storage target comply with the GDPR rules coming into effect later this year. Under GDPR, data subjects have the right to “be forgotten”. This means that an individual can request that any data held on them be removed from all the data stores that a company uses.
Assumptions
- A request has been made to delete all data belonging to a specific user. We’ll be using the user_id as the identifier in this tutorial, but the same concepts can be applied to other fields (e.g. domain_userid, user_ipaddress or any other fields that can be used to identify someone).
- The business runs a data model which is solely derived and recomputed from the atomic data daily. This means that in removing the customer data from the atomic data in Snowflake, the modeled tables will also be cleared upon recomputation. Some further thought is required for incremental data models; this is out of scope of this tutorial.
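As a rough illustration of the first assumption, the same kind of lookup can be keyed on a different identifier. The sketch below assumes the data subject is identified by a cookie ID in domain_userid; the value 'Data Subject Cookie ID' is a hypothetical placeholder, not a real identifier.

```sql
-- Sketch only: count the events tied to a different identifier column.
-- 'Data Subject Cookie ID' is a placeholder for the value supplied in the request.
SELECT
  COUNT(*) AS events_found
FROM
  atomic.events
WHERE
  domain_userid = 'Data Subject Cookie ID';
```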
Deleting data from Snowflake
Deleting customer data from Snowflake is much simpler than deleting it from Redshift because atomic data is contained within a single table.
1. Check what data will be deleted
Before actually deleting the data, it’s always worth doing a sanity check:
SELECT
COUNT(*),
MIN(collector_tstamp),
MAX(collector_tstamp)
FROM
atomic.events
WHERE
user_id = 'Data Subject';
If the results make sense then we’re good to continue!
2. Delete the events
We can go ahead and delete the data:
DELETE FROM
atomic.events
WHERE
user_id = 'Data Subject';
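If you would rather confirm the effect of the delete before making it permanent, one option is to run it inside an explicit transaction. This is a minimal sketch of that approach, assuming your session allows multi-statement transactions; it is not a required part of the procedure.

```sql
-- Sketch only: run the delete in a transaction so it can be checked first.
BEGIN;

DELETE FROM
  atomic.events
WHERE
  user_id = 'Data Subject';

-- Within the transaction, this should now return 0 for the data subject.
SELECT
  COUNT(*) AS remaining_events
FROM
  atomic.events
WHERE
  user_id = 'Data Subject';

-- Make the delete permanent; use ROLLBACK instead if the count looks wrong.
COMMIT;
```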
Time Travel and Fail-safe
Snowflake has two powerful features that allow deleted data to be queried and/or restored after it’s been removed from a table.
Time Travel
Time Travel enables accessing historical data (i.e. data that has been changed or deleted) at any point within a defined period.
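To make this concrete, the query below sketches how rows removed by the DELETE statement above could still be read through Time Travel. It assumes the delete ran less than an hour ago and that the one-hour offset falls inside the table's retention period; the offset is an arbitrary value chosen for illustration.

```sql
-- Sketch only: read the table as it was one hour ago (offset is in seconds).
-- If the delete happened within the last hour, the removed rows are still counted here.
SELECT
  COUNT(*) AS events_one_hour_ago
FROM
  atomic.events AT(OFFSET => -60 * 60)
WHERE
  user_id = 'Data Subject';
```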
Time Travel is automatically enabled for all Snowflake accounts with a standard retention period of 1 day (24 hours). The period can be configured depending on the edition:
- Standard Edition accounts can change the period to 0 (effectively disabling Time Travel)
- Enterprise Edition accounts can change the period to between 0 and 90 days
This means that, depending on your account settings, deleted data may still be accessible to you for up to 90 days after removing it from the atomic.events table. There is no way to set the data retention period for just these rows to a value different from the one for the rest of the atomic.events table.
So if you are using Time Travel with a longer window, make sure the data subject is aware that their data will only be fully deleted after some delay. According to GDPR, your obligation is to "erase personal data without undue delay".
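To find out which window actually applies, you can inspect the retention setting on the table and, if appropriate, shorten it. This is a sketch assuming your role has the privileges to view and alter atomic.events; the value of 1 day is only an example, and the change applies to the whole table, not just the deleted rows.

```sql
-- Sketch only: show the Time Travel retention period currently in effect for the table.
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE atomic.events;

-- Optionally shorten the retention period for the whole table (example value: 1 day).
ALTER TABLE atomic.events SET DATA_RETENTION_TIME_IN_DAYS = 1;
```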
Fail-safe
Separate and distinct from Time Travel, Fail-safe provides a (non-configurable) 7-day period during which historical data is recoverable by Snowflake.
This period starts immediately after the Time Travel retention period ends. In that period, you cannot access the data, but Snowflake can.
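Taken together, the two windows mean that deleted data can remain recoverable for the Time Travel retention period plus 7 days, which is up to 97 days with a 90-day retention period. The sketch below works out both dates for a hypothetical deletion timestamp and a retention period of 1 day; both values are assumptions to be replaced with your own.

```sql
-- Sketch only: estimate when deleted data leaves Time Travel and Fail-safe.
-- The timestamp and the 1-day retention period are illustrative assumptions.
WITH deletion AS (
  SELECT
    '2018-05-25 12:00:00'::TIMESTAMP_NTZ AS deleted_at,
    1 AS retention_days
)
SELECT
  deleted_at,
  DATEADD(day, retention_days, deleted_at)     AS time_travel_ends,
  DATEADD(day, retention_days + 7, deleted_at) AS fail_safe_ends
FROM
  deletion;
```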