Snowplow R100 Epidaurus released with PII pseudonymization support

knservis · February 28, 2018, 12:06pm

We are excited to announce the release of Snowplow R100 Epidaurus:

https://snowplowanalytics.com/blog/2018/02/27/snowplow-r100-epidaurus-released-with-pii-pseudonymization-support/

This streaming pipeline release adds support for pseudonymizing user PII (Personally Identifiable Information) through a new Snowplow enrichment.

We are initially adding this new PII Enrichment to the Snowplow streaming pipeline; extending this support to the batch pipeline will follow in due course.

This release is intended to help our users on their journey through GDPR:

mjensen · March 12, 2018, 8:34pm

@knservis no changes to bump version in emr config?

alex · March 12, 2018, 9:44pm

Hi @mjensen, no, this is a Stream Enrich release. Support for this enrichment in the batch pipeline (Spark Enrich) will arrive in a future Snowplow version.

mjensen · March 12, 2018, 11:11pm

@alex got it thanks

jrpeck1989 · March 13, 2018, 1:50pm

If this release introduces pseudonymisation using hashing, does anonymisation use two way encryption?

I would have thought it would be the other way around - pseudonymisation uses encryption (so that the original information can be re-extracted and used if necessary) and anonymisation would use one-way hashing algorithms like SHA-256 (where it’s impossible to get the original data back unless you already have it)

Have I misunderstood something along the way?

knservis · March 13, 2018, 2:56pm

Hi @jrpeck1989, both encryption and hashing substitute a value with an alias (a pseudonym). With hashing, you could either hash all possible values (if that is feasible) to find out what the original value was, or you could build a lookup table of the hashed values (as we are doing in a subsequent release, where we are also adding a salt; the lookup table will be secured). The point is that accidental and casual use of a data subject’s PII is averted, but with sufficient resources and internal knowledge it is not impossible to recover at least some information.

To me, true anonymisation would be to replace each PII value with a random value, or to downsample it sufficiently (e.g. 192.168.255.1 -> 192.168.x.x or “Jim Beam” -> “J B”), before that information hits any permanent storage, although I cannot imagine how you would be able to do that on a per data subject basis. At least that is my understanding of the two terms. I am happy to be told otherwise. What are your thoughts?
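To make the distinction in the post above concrete, here is a minimal Python sketch. It is not the enrichment's actual code: the function names, the choice of SHA-256, and the way the salt is prepended are all assumptions made for illustration. It shows that an unsalted hash can be reversed by enumerating candidate values, that a secret salt defeats that enumeration, and what downsampling an IP address looks like.

```python
import hashlib
from typing import Iterable, Optional


def pseudonymize(value: str, salt: str = "") -> str:
    """Return the hex SHA-256 digest of salt + value (illustrative scheme only)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def recover_by_enumeration(
    digest: str, candidates: Iterable[str], salt: str = ""
) -> Optional[str]:
    """Hash every candidate and return the one whose digest matches, if any."""
    for candidate in candidates:
        if pseudonymize(candidate, salt) == digest:
            return candidate
    return None


def downsample_ipv4(ip: str) -> str:
    """Keep the first two octets and mask the rest, e.g. 192.168.255.1 -> 192.168.x.x."""
    octets = ip.split(".")
    return ".".join(octets[:2] + ["x", "x"])


# Hypothetical values for illustration.
known_ips = ["10.0.0.1", "192.168.255.1", "172.16.0.5"]

unsalted = pseudonymize("192.168.255.1")
salted = pseudonymize("192.168.255.1", salt="hypothetical-secret")

print(recover_by_enumeration(unsalted, known_ips))  # the original IP is found
print(recover_by_enumeration(salted, known_ips))    # None without the salt
print(downsample_ipv4("192.168.255.1"))             # 192.168.x.x
```

The hashed value is still a stable alias for one data subject, which is why it counts as a pseudonym; the downsampled value no longer points to a single address.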

jrpeck1989 · March 13, 2018, 3:10pm

So when you collect the information, the hashing algo randomises the information, and the data is then permanently stored (S3 and Redshift) in its hashed form - is this correct?

Or are you saying there is somewhere the data is stored in its original format, it’s actually hashed during the enrichment process, and should you need it you can use it for your purposes?

knservis · March 13, 2018, 3:52pm

@jrpeck1989 As of R100 the value is just hashed. It is not randomised, meaning it is not substituted with a random value. Each value is replaced with its hash. The original value is not kept in the enrichment, but could possibly be retrieved from the raw logs if those logs are not discarded.

In a later release, there will be the option (which will need to be enabled) to keep the mapping of the original value to its hash, but that would be kept separate from the rest of the data as good practice would advise that this information which constitutes PII of the data subject, should only be used with due justification and when consent is given by the data subject. That feature will be in an upcoming release.
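As a rough illustration of the two behaviours described above (replacing each configured value with its hash, and optionally keeping a separate mapping from hash back to original), here is a small Python sketch. The function `pseudonymize_event`, the field names, and the use of a plain dict for the mapping are hypothetical; the real enrichment's configuration and storage may look quite different.

```python
import hashlib
from typing import Dict, Iterable, Optional


def pseudonymize_event(
    event: Dict[str, object],
    pii_fields: Iterable[str],
    mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, object]:
    """Return a copy of the event with each PII field replaced by its SHA-256 hash.

    If a mapping dict is supplied, hash -> original pairs are recorded in it.
    That dict stands in for a separately secured store kept apart from the data.
    """
    result = dict(event)
    for field in pii_fields:
        original = result.get(field)
        if original is None:
            continue  # nothing to hash for this field
        digest = hashlib.sha256(str(original).encode("utf-8")).hexdigest()
        result[field] = digest
        if mapping is not None:
            mapping[digest] = str(original)
    return result


# Hypothetical event and field names.
event = {"user_id": "jim@example.com", "page_url": "/pricing"}
secured_mapping: Dict[str, str] = {}

hashed = pseudonymize_event(event, ["user_id"], mapping=secured_mapping)
print(hashed["user_id"])                    # 64-character hex digest
print(secured_mapping[hashed["user_id"]])   # jim@example.com
```

Passing `mapping=None` mirrors the hash-only behaviour: nothing about the original value is retained by the function.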

Additionally, in a later release we will add the capability to easily scrub PII from preexisting data on S3. (Removing PII from Redshift can currently be done as shown in this tutorial: GDPR: Deleting customer data from Redshift [tutorial])

jrpeck1989 · March 13, 2018, 3:58pm

Thanks for this.

Apologies, I understand how hashing works, I was using ‘randomised’ as short-hand for “its hashed value” - I should have been clearer

So is the hashed value sent over in the payload from the tracker? Or the original value?

I’d like to be conceptually clear in my mind of the process

christophe · March 13, 2018, 4:06pm

Hey Jordan,

The original value is sent with the payload. The hashing happens in enrichment (so downstream of collecting).

knservis · March 13, 2018, 4:08pm

@jrpeck1989 No worries. I just wanted to make sure I did not mislead anyone. The original value is sent from the tracker base64-encoded, and hashing takes place in the first actual piece that contains any logic about the content (as opposed to handling its transmission). That is where decoding takes place, along with hashing of the sent values or of values that come from other enrichments (e.g. you could hash the location if you are using the GeoIP lookup enrichment).
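The order of operations described above (decode first, then hash both tracker-sent values and values added by other enrichments) could be sketched in Python as follows. This is only an illustration: the payload layout, the function names, and the `derived` argument standing in for the output of an enrichment such as a GeoIP lookup are assumptions, not the actual Stream Enrich code.

```python
import base64
import hashlib
import json
from typing import Dict, Iterable


def decode_tracker_payload(encoded: str) -> Dict[str, str]:
    """Decode a base64-encoded JSON payload, as a stand-in for what a tracker sends."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


def enrich(
    encoded: str, pii_fields: Iterable[str], derived: Dict[str, str]
) -> Dict[str, str]:
    """Decode the payload, merge fields derived by other enrichments, then hash PII."""
    event = decode_tracker_payload(encoded)
    event.update(derived)  # e.g. a location added by a GeoIP lookup
    for field in pii_fields:
        if field in event:
            event[field] = hashlib.sha256(event[field].encode("utf-8")).hexdigest()
    return event


# Hypothetical payload: the tracker sends the original value, base64-encoded.
payload = base64.b64encode(
    json.dumps({"email": "jim@example.com", "page": "/home"}).encode("utf-8")
).decode("ascii")

enriched = enrich(payload, ["email", "geo_city"], derived={"geo_city": "Athens"})
print(enriched)
```

Note that base64 is an encoding, not protection: the original value is readable until the hashing step runs.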

jrpeck1989 · March 13, 2018, 4:28pm

Got it!

Thanks for clarifying.

petervcook · March 30, 2018, 8:25pm

As a firm outside of the EU and a site not targeting users in the EU, are there ways to apply pseudonymization or other GDPR features only to users who are based in the EU?

I’m thinking something along the lines of…
…with the Geolocation enrichment, there is an approximation of the country a user is in, IF the visitor is in one of the 28 member states THEN apply certain rules.
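A sketch of the rule being suggested, in Python. Nothing here is an existing Snowplow feature: the function name, the `geo_country` field name, and the hard-coded set of country codes (the 28 member states as of 2018) are assumptions for illustration only.

```python
import hashlib
from typing import Dict, Iterable

# ISO 3166-1 alpha-2 codes of the 28 EU member states as of 2018.
EU_MEMBER_STATES_2018 = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR",
    "DE", "GR", "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL",
    "PL", "PT", "RO", "SK", "SI", "ES", "SE", "GB",
})


def apply_pii_rules(
    event: Dict[str, str],
    pii_fields: Iterable[str],
    country_field: str = "geo_country",
) -> Dict[str, str]:
    """Hash the PII fields only when the geolocated country is an EU member state."""
    result = dict(event)
    if result.get(country_field) not in EU_MEMBER_STATES_2018:
        return result  # non-EU or unknown country: left untouched in this sketch
    for field in pii_fields:
        if field in result:
            result[field] = hashlib.sha256(result[field].encode("utf-8")).hexdigest()
    return result


# Hypothetical events.
eu_event = apply_pii_rules({"user_id": "anna", "geo_country": "DE"}, ["user_id"])
us_event = apply_pii_rules({"user_id": "bob", "geo_country": "US"}, ["user_id"])
print(eu_event["user_id"])  # hashed
print(us_event["user_id"])  # bob
```

Because IP geolocation is only an approximation, a more cautious variant might also pseudonymize events whose country is missing rather than leaving them untouched as this sketch does.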

alex · March 31, 2018, 10:38am

Hi @petervcook,

Thanks for the great suggestion - that’s something we thought of as well, and added this ticket:

Please do add any thoughts to that ticket on how this should all work!
