GDPR: Delete user data from buckets (along pipeline)


We are currently facing the following issue. We want to enable personalized tracking in our project, which means that we will store a userId in our tracking data (of course, only if that user has given us explicit consent to do so). In order to be GDPR compliant, we need to guarantee the following two things:

  1. takeout (the user can get the data we have collected about him or her)
  2. deletion (delete all the data that is associated with this specific userId)

Regarding 1.) This should be straightforward in Redshift.
Regarding 2.) This should also be straightforward in Redshift (see GDPR: Deleting customer data from Redshift [tutorial]). However, we are storing this userId in the raw data along the pipeline, and deleting data from Redshift does not seem to be enough because the user data could easily be restored from the raw data.

Before the data is loaded into Redshift it passes the following buckets:
loader-target-bucket → enriched-bucket → shredded-bucket
(The first is the target bucket for the S3 loader; the last two are used in the shredding process.)

Currently we want to keep the enriched-bucket as a single point of truth regarding our raw data. Data from the other buckets (loader and shredded) could easily be removed anyway.

Is there an efficient way to crawl through that bucket and delete files based on a userId? Over time, with a rapidly increasing amount of data, this will become a terribly time-consuming and inefficient procedure. Is anyone else facing a similar issue?

Hi @mgloel,

So it wouldn’t be a matter of deleting individual files, rather deleting individual rows within files. This is indeed a fairly expensive process from an efficiency standpoint - unfortunately that constraint isn’t really avoidable. In my own opinion the best way to work with that limitation is to attempt to limit how often the job needs to be run insofar as you can tolerate within the law/your policy. (I am quite unfamiliar with the specifics of the laws/regulations here and obviously can’t make any specific recommendations - however I assume there’s some tolerance for performing this task in a way that is practicable).

As far as carrying out the task of removing data for these purposes, we have built a right to be forgotten application. It’s on the user to test it and ensure that it’s operating properly - even for our own customers we don’t run it as a service, because of the practicalities around providing those kinds of assurances (It’s not like say an issue with enrich where we can examine logs - getting involved in something like this necessitates access to data). Obviously, though, if it’s not operating as expected please file issues to let us know about it.

I hope that’s helpful.


So performing this on enriched data isn’t too tricky (as it’s structured / semi-structured).

Objects on S3 are immutable, so if you want to remove data you need to select the data you want to keep, write it to a new key (or overwrite the existing one) and remove the old object. This can be done with something like Athena or S3 Select reasonably easily - though you’ll need to be mindful of cost depending on how you are storing these enriched files (e.g., many small files or larger gzipped files).
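
As a rough illustration of that rewrite step (not the right to be forgotten application mentioned above), here is a minimal Python sketch. It assumes the enriched file is a gzipped, newline-delimited TSV and that you know which column holds the userId; the function name and the `user_id_col` parameter are made up for the example, and the whole file is held in memory for simplicity.

```python
import gzip
import io


def drop_user_rows(gz_bytes: bytes, user_id: str, user_id_col: int) -> bytes:
    """Return a new gzipped TSV payload without the rows that belong to user_id.

    user_id_col is the zero-based column index of the userId field
    (an assumption: check it against your own enriched format).
    """
    kept = []
    # newline="\n" means lines are split on "\n" only and returned untranslated.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="\n") as src:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Keep rows that are too short to contain the column,
            # and rows that belong to a different user.
            if len(fields) <= user_id_col or fields[user_id_col] != user_id:
                kept.append(line)
    return gzip.compress("".join(kept).encode("utf-8"))
```

You would download the object, pass its bytes through this function, upload the result under a new key and then delete the old key. For large files a streaming variant would be preferable.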

Deleting from raw is significantly more difficult and again depends on the format you are storing the data in. If it’s the serialised format (Thrift / ElephantBird) then what you ideally want to remove is bytes rather than rows (as there’s no concept of rows at this stage). If it’s gzipped this is still doable (though difficult) but if it’s LZO then S3 Select / Athena aren’t going to be of much assistance here.

Another expensive but useful governance process, which I’ve done before and which is a bit of a pain, is to maintain a manifest for each of your raw files. This involves a little bit of processing but allows you to generate a mapping from:

personal data (userid, IP address etc.) to an S3 key to a byte offset (assuming deserialised / decompressed; if LZO this will be a block offset), which makes the removal process easier.
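
To make the manifest idea concrete, here is a small Python sketch. It is a simplification: it assumes the payload has already been decompressed into newline-delimited, tab-separated records, which is not the case for the serialised Thrift format, where the record boundaries would have to come from the deserialiser. The function name and the `id_col` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# identifier -> list of (s3_key, byte_offset, byte_length)
Manifest = Dict[str, List[Tuple[str, int, int]]]


def build_manifest(s3_key: str, payload: bytes, id_col: int) -> Manifest:
    """Record where each identifier's records start within a decompressed file."""
    manifest = defaultdict(list)
    offset = 0
    for record in payload.splitlines(keepends=True):
        fields = record.rstrip(b"\r\n").split(b"\t")
        # Only index records that actually carry a non-empty identifier.
        if len(fields) > id_col and fields[id_col]:
            identifier = fields[id_col].decode("utf-8")
            manifest[identifier].append((s3_key, offset, len(record)))
        # Offsets are counted in bytes of the decompressed payload.
        offset += len(record)
    return dict(manifest)
```

With a manifest like this, a removal request becomes a lookup of the affected keys and offsets rather than a scan of every file.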

Then it’s worth considering what you want to do / have to do with that data, e.g., is it removing the row entirely, is it removing a certain field or redacting a value and leaving the rest intact, or is it anonymising / pseudonymising the data?


Thanks a lot @mike and @Colm for your quick and helpful replies.

We need to remove the rows entirely.

Actually, it seems a lot easier to consider our Redshift database as the single point of truth and remove the raw data from the shredded and enriched buckets after a certain period of time, e.g. one month.

On the other hand, this seems a little bit unreliable if the rdbloader fails without us noticing it. Furthermore, we would like to make sure we take snapshots frequently (at the moment we do them every 8 hours).
Would it be a bad practice to delete all the data from the shredded and enriched buckets after one month?

Not at all - it’s not uncommon in cost management to add lifecycle policies that either move keys to infrequent access or Glacier, or just delete them entirely. It is a bit of a pain if you ever need to reprocess data but if you aren’t doing that on a regular basis it’s not too bad.
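
For reference, an expiry rule like that can be set with boto3. This is a minimal sketch: the rule ID, prefix and bucket name are placeholders, and note that `put_bucket_lifecycle_configuration` replaces whatever lifecycle configuration the bucket already has.

```python
def expiry_lifecycle(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that deletes objects under prefix after days."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",  # placeholder rule name
                "Filter": {"Prefix": prefix},       # "" applies to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Apply the configuration; this overwrites any existing lifecycle rules."""
    import boto3  # imported here so the builder above works without AWS access

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

For example, `apply_lifecycle("enriched-bucket", expiry_lifecycle("", 30))` would expire everything in that bucket after roughly one month.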

It does make me wonder if it’s worth having a process (as part of enrich) that is capable of performing a lookup and adding a ‘do not process’ flag to events.
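
Purely as a thought experiment (nothing like this exists in enrich as far as this thread goes), such a lookup step could look roughly like the sketch below. The `do_not_process` field, the `user_id` key and the in-memory set of blocked ids are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Set


def flag_events(events: Iterable[dict], blocked_ids: Set[str]) -> Iterator[dict]:
    """Yield a copy of each event with a hypothetical 'do_not_process' flag."""
    for event in events:
        flagged = dict(event)  # copy so the input event is left unchanged
        # True when the event's user_id is in the lookup of users to exclude.
        flagged["do_not_process"] = event.get("user_id") in blocked_ids
        yield flagged
```

Downstream steps could then skip or drop any event where the flag is set.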
