Controlling the order enrichments are run

gareth · September 7, 2017, 11:19am

Hi

Is it possible to control the order in which the Snowplow EMR ETL runs the custom enrichments?

We have a custom JavaScript enrichment which flags IP addresses to indicate whether they’re from our office traffic or the real world. When I enable the IP anonymisation, this enrichment cannot tag the IPs correctly. I’m assuming this is because the IP address the JS enrichment sees is already anonymised, so the exact matching we use no longer works. We have some IP range matching that still does.
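To illustrate the failure mode described above, here is a minimal sketch in Python (not the actual JavaScript enrichment). It assumes anonymisation replaces the trailing octets with `x`; the office addresses, the prefix list, and all function names are hypothetical.

```python
# Illustrative only: these office addresses and prefixes are made up.
OFFICE_IPS = {"203.0.113.17", "203.0.113.42"}
OFFICE_PREFIXES = [("203", "0", "113")]  # leading octets of the office range


def anonymise(ip, octets=1):
    """Replace the last `octets` octets with 'x' (assumed anonymisation format)."""
    parts = ip.split(".")
    keep = len(parts) - octets
    return ".".join(parts[:keep] + ["x"] * octets)


def is_office_exact(ip):
    """Exact matching: only works if the full address is still present."""
    return ip in OFFICE_IPS


def is_office_prefix(ip):
    """Range matching on leading octets: survives masking of trailing octets."""
    parts = tuple(ip.split("."))
    return any(parts[:len(prefix)] == prefix for prefix in OFFICE_PREFIXES)


masked = anonymise("203.0.113.17")
print(masked, is_office_exact(masked), is_office_prefix(masked))
# prints: 203.0.113.x False True
```

Exact matching fails once the last octet is masked, while prefix matching keeps working as long as the masked octets fall outside the prefix.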

Thanks
Gareth

alex · September 7, 2017, 12:10pm

Hi @gareth - that’s a nice feature, the ticket for this is here:

Unfortunately it’s a lot of re-architecting to deliver this, so it’s not imminent.

gareth · September 7, 2017, 3:28pm

Hi @alex

Thanks, that’s a useful ticket. With GDPR on the horizon we’d like to scrub our Snowplow events of personal information like IP addresses so downstream processing escapes the regulations. However we do need to derive some information from the IP address before it’s anonymised. Unfortunately we won’t be able to do this with the current Snowplow tooling.

Thanks
Gareth

alex · September 7, 2017, 3:51pm

We’re doing a lot of work on GDPR - I’m not sure that IP scrubbing at the point the enriched events are written is early enough in the process for GDPR adherence, because those IP addresses will still exist in the raw collector logs in S3.

We have a ticket to add support for IP scrubbing in the event collector itself, although that would rule out your use case. The ticket is here:

@yali may be able to share more here.

gareth · September 7, 2017, 4:37pm

Yes that could be very useful. Do you have plans to be able to do enrichments like the geo IP lookup you currently have?

Our current plan is to delete the source data (Cloudfront logs at present) once they’ve been processed, and now also the output of the EMR ETL once we’ve done our first post-processing. The aim is to minimise the exposure of the data.

We’re very interested to see what you’re doing with respect to GDPR, partially because no one really knows what the best practice is yet. Nice to see others are thinking similar things.

yali · September 13, 2017, 8:32pm

Hi @gareth - some thoughts (more structured blog post to follow):

We want to make it easier to capture “consent” as an event. This should make it easy for anyone working with the data to query data on individual users directly to understand what is and is not permissible to do with the data. (So the consent lives as part of the data the consent governs.)
There may be opportunities to get users to self-identify, if that means that data controllers can more effectively guarantee their rights under GDPR.
We want to be able to pseudonymize any field that might contain personal data. This would include IP addresses, but it could also include cookie IDs and user-defined fields in specific self-describing events or custom contexts.

Pseudonymization is really powerful because it means you can collect the data to use for analytical purposes; you just can’t then tie it back to the user to e.g. personalize their user experience with it. So ideally, we’d have an enrichment that:

Let you specify which fields to pseudonymize
Had some logic to determine which events to run on. (So you only pseudonymize where you don’t have consent, for example.)
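The two requirements above can be sketched roughly as follows. This is a hedged illustration in Python, not Snowplow’s enrichment code: the event is modelled as a plain dict, the `has_consent` flag and the salt are hypothetical, and a keyed SHA-256 hash is just one possible way to pseudonymize a value.

```python
import hashlib
import hmac

# Hypothetical configuration: which fields to pseudonymize, and a secret salt.
SECRET_SALT = b"replace-with-a-secret"
FIELDS_TO_PSEUDONYMIZE = ("user_ipaddress", "domain_userid")


def pseudonymize_value(value):
    """Map a value to a stable token using a keyed SHA-256 hash."""
    return hmac.new(SECRET_SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_event(event, has_consent):
    """Return a copy of the event, hashing the configured fields unless consent exists."""
    out = dict(event)
    if has_consent:
        return out
    for field in FIELDS_TO_PSEUDONYMIZE:
        if out.get(field) is not None:
            out[field] = pseudonymize_value(out[field])
    return out


event = {
    "user_ipaddress": "203.0.113.17",
    "domain_userid": "abc123",
    "page_url": "https://example.com/",
}
scrubbed = pseudonymize_event(event, has_consent=False)
```

Because the same input always maps to the same token, counts and joins on the pseudonymized fields still work for analysis, while the original value is not stored in the output.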

Ideally this would happen upstream of writing out the collector logs, ensuring that where you don’t have consent you don’t have personally identifiable data. However, it’s pretty hard to deliver that level of functionality on the event without first processing it. So your suggestion of deleting the raw collector logs has its appeal. You’d also have to be careful with any bad rows.

I need to do some more thinking on how to meet GDPR obligations and keep some of the robustness that comes with being able to reprocess the event stream from scratch and recover bad events safely. Any ideas from the community appreciated!

gareth · September 25, 2017, 11:08am

Thanks @yali I’m looking forward to the blog post.

Consent is an interesting one for us as we integrate into other people’s websites, so it will require some negotiation with the host retailer on how to ask for consent.

We had thought about the bad rows too; they’re more difficult because you can’t know for sure what’s in the error message. The log lines could be processed with the standard log line pseudonymisation code.

The bad rows we’re planning on deleting after a short period of time to give us a chance to reprocess. The error logs are small volume (typically) and I believe there is scope for keeping this type of data temporarily within the GDPR.
