Running Snowplow in Minimal Mode for GDPR

Curious if anyone is running Snowplow in “minimal” mode in the JavaScript Tracker until the user consents for GDPR. We want to disable user tracking as much as possible until the user consents, but would still like to capture page views and other minimal data in the meantime. We were thinking we could disable cookies in the Snowplow JS config until they consent.

One element to consider here is how minimal you want to be. If the JavaScript tracker is sending directly to the collector (before or after consent), you’ll be recording and processing the IP address.

I don’t think there’s an easy way around this at the moment other than recording certain events - if you can - server side so that no user identifying information is captured.

Good point. The IP will still end up in the collector and go through ETL.
I wonder if I can edit the NGINX config to strip the last characters of the IP by default.

Hey @mjensen,

Are the IP anonymisation enrichment or the PII enrichment of any use here, or do you need to ensure that this data never hits raw logs?

These anonymise at the enrichment stage. The un-anonymised values are still available in the collector logs, but the strategy here is to use lifecycle rules to permanently delete raw logs after a week, keeping them only in case a failure requires reprocessing.

I’m aware this approach might not satisfy your needs if you need to ensure that the data is never collected at all - just wanted to share in case you weren’t aware.

@mjensen Interestingly there are a few ways to mask the last octets of your IP address in the combined log. Here is one nginx module that can also hash: https://github.com/masonicboom/ipscrub

What I thought was interesting was the YAGNI section from that: if you’re doing all that, is it useful? Would the approach that @Colm mentioned not be good enough?

I would be interested to know your requirements if you can share them.

yeah, i’m definitely looking at those as well

Requirements are really to be GDPR compliant.
So until the user consents to cookies, we are thinking about disabling all tracking, including Snowplow. I would hate to stop Snowplow completely, but the domain_userid cookie is a permanent cookie that shouldn’t be set by Snowplow until they consent. Session cookies are iffy.
The biggest problem is that we use Snowplow data for marketing attribution. For the EU, we will lose this ability unless we rely only on 3rd-party tracking pixels. But we have a lot of stuff written in-house that relies on Snowplow pageview data.

Using nginx to strip or remove parts of the header with an IP address will work, but it introduces another issue: you need to be able to do this selectively, i.e., only run the anonymising functionality where you don’t have consent rather than anonymising all events. Depending on how this is performed, it’s going to need some degree of inspection on the nginx side to determine what should be anonymised. This is why I think it’s probably just easier to record the page views with a server-side tracker.

Google Analytics gets around this (with anonymizeIp) by performing this masking at the load balancer level before any storage or analytics processing takes place.

If you’re on the Scala Stream Collector and trying to be GDPR compliant (while still collecting non-identifying analytics information), writing the raw event with its IP address to disk or a stream (S3 or Kinesis/Kafka/Pubsub) isn’t compliant.

I am not a lawyer, however that is not how I understand the spirit or the letter of GDPR. PII may be temporarily present in memory and in logs until you determine that it should not be there (because that user has withdrawn consent), in which case you remove it. In a world of dynamic IP addresses, you are not able to determine that by the IP alone and to me it sounds perfectly reasonable that the IP remains in a temporary log (or in memory) until you can make that determination. As long as it is temporary storage, be it disk or memory and only for as long as it is necessary it seems legitimate to me. As an aside, there are cases where the law allows you to keep PII even if the user has not given consent yet (see https://ico.org.uk/for-organisations/guide-to-the-general-data-protection-regulation-gdpr/lawful-basis-for-processing/contract/ )

I agree that in-memory storage is one thing but writing out personal data (which may include but isn’t restricted to PII) to a more permanent storage mechanism like disk means that you are now storing data without seeking any consent from the end user and therefore no opportunity to opt out. IP address is considered a personal identifier under GDPR (Recital 30) so it falls under this remit.

As you’ve mentioned, there are exceptions to this if you need the information for legal purposes or a “legitimate business use case”, but these are legally far more difficult to make if the information is being collected for analytics purposes, and it remains unclear whether some of these exceptions will remain acceptable under ePrivacy.

TL;DR: Minimal mode for the tracker seems like a good idea to explore to me. The server would need to know the consent status of the tracker and hash the IP if no consent is given, and there may be other issues.

Hey @mike,

Yes, I get the strictly-no-PII argument. However (and this is by no means advice), it seems to me that unprocessed logs that are temporarily stored until a determination is made do not constitute such a violation; it is rather a consequence of a system design that requires raw data to be stored somewhere temporarily until it is processed. As long as, at the first opportunity during processing, you throw away the PII of any data subject who has requested it (and obviously the raw logs as well), that respects the data subject’s wishes in good faith.

As for the IP of a user, while that is understandably considered PII, when we are talking about consent, you cannot know whether the data subject that has consented still has the same IP, and only when you have identified the event as pertaining to a non-consenting user can you be sure that you should throw it away (as a consenting user may now be using that IP).

In any case, these are all technicalities that some (hopefully technically competent) judge will rule on at some point, but my reading of GDPR is that it is intended, understandably, to give back some level of control to the data subject when it comes to PII, and that processors should make good-faith efforts to grant them that. It is not meant to harm companies by making them unable to analyse their own operational data or contact their customers.

Given that, I believe that dropping all PII, and not just the IP, at the point where it is known that the event pertains to a non-consenting data subject is the right thing to do. The only way I can imagine that being done at the collector would be to have a great monolithic system there that knows everything. In Snowplow, the collector knows nothing about the data except that it seems legitimate.

The tracker may know that the user has not consented in that platform used (e.g. browser A) but it would not know that for all platforms (e.g. browser B, phone, IoT, what have you) so it seems incomplete, and the determination would still need to be made at the processor end as to whether this is a non-consenting user, rather than the client end.

Still, a given processor may have only one service delivery platform and/or decide by default not to collect almost any data until the user has explicitly consented, in which case the client-side approach probably makes sense.

At this point @alex or @yali may have a much better idea about what is possible and any pitfalls.

Thanks everyone, very helpful thread :)

I updated the nginx config:

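# First part: the IPv4 address without its last octet, or the first two groups of an IPv6 address.
map $remote_addr $ip_anonym1 {
 default 0.0.0;
 "~(?P<ip>(\d+)\.(\d+)\.(\d+))\.\d+" $ip;
 "~(?P<ip>[^:]+:[^:]+):" $ip;
}

# Second part: the suffix that replaces the dropped portion (".0" for IPv4, "::" for IPv6).
map $remote_addr $ip_anonym2 {
 default .0;
 "~(?P<ip>(\d+)\.(\d+)\.(\d+))\.\d+" .0;
 "~(?P<ip>[^:]+:[^:]+):" ::;
}

# Concatenate both parts into the anonymised address.
map $ip_anonym1$ip_anonym2 $ip_anonymized {
 default 0.0.0.0;
 "~(?P<ip>.*)" $ip;
}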

    # Note: $http_x_forwarded_for is the incoming request header and is logged unmasked.
    log_format  main  '$ip_anonymized - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    include       conf.d/*.conf;

and then proxy from nginx to the collector, which works fine so far:

[root@ip-X-X-X-X elasticbeanstalk]# more 00_application.conf 
location / {
    proxy_pass          http://127.0.0.1:5000;
    proxy_http_version  1.1;

    proxy_set_header    Connection          $connection_upgrade;
    proxy_set_header    Upgrade             $http_upgrade;
    proxy_set_header    Host                $host;
    proxy_set_header    X-Real-IP           $ip_anonymized;
    proxy_set_header    X-Forwarded-For     $ip_anonymized;
}
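
For anyone who wants to sanity-check what the maps above are meant to produce, here is a rough TypeScript sketch of the same masking logic. The function name `anonymizeIp` is hypothetical and not part of nginx or Snowplow, and unlike the nginx patterns the regexes here are anchored to the start of the string, so edge cases such as IPv4-mapped IPv6 addresses may come out differently.

```typescript
// Hypothetical helper that approximates the nginx maps above.
// It is an illustration only, not part of nginx or Snowplow.
function anonymizeIp(addr: string): string {
  // IPv4: keep the first three octets and zero the last one.
  const v4 = /^(\d+\.\d+\.\d+)\.\d+$/.exec(addr);
  if (v4) {
    return `${v4[1]}.0`;
  }
  // IPv6: keep the first two groups and drop everything after them.
  const v6 = /^([^:]+:[^:]+):/.exec(addr);
  if (v6) {
    return `${v6[1]}::`;
  }
  // Anything unrecognised falls back to the all-zero address,
  // matching the "default" branches of the maps.
  return "0.0.0.0";
}
```

Calling `anonymizeIp` on an IPv4 address should return the same address with the last octet replaced by 0, which is the value the proxy block passes on in X-Real-IP and X-Forwarded-For.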

I think this depends on what determination is being made. If somebody were to use consent as their legal basis of storing personal data and you were then to store or process this data prior to obtaining consent - that’s a violation (Recital 42). If a user isn’t provided with an opportunity to consent to the processing (e.g., default opt in or no opt in option) this isn’t considered to meet the conditions for consent (Recital 32). It’s worth noting that GDPR here has a broad definition of “processing” (Article 4) which includes the process of data collection.

Absolutely and this is the approach some other vendors take (i.e., Google anonymises IP before collection and Adobe shortly after) - but ideally this should happen before the data is ever processed which would mean at or before the collector.

So we still need to figure out whether, and how, we can run Snowplow in minimal mode for GDPR until the user consents. I would hate to disable Snowplow completely. One option is to call the pixel tracker instead of the JavaScript tracker, so that Snowplow sets no cookies for the user. I would prefer to be able to disable cookies in the JS tracker as much as possible instead, though. Looking through the JS tracker docs, I can’t see a way to disable cookies (domain_userid, session ID, etc.).

In light of the latest ruling of the European Court of Justice I’m wondering if there is a way to disable the setting of all cookies/localstorage by Snowplow?

Is Snowplow even working if there is no 1st party cookie or will the data end up in bad rows?

Hi @volderette
The good news is that the Snowplow JavaScript Tracker is capable of running without cookies/local storage and the data will still go to the good index. The JavaScript Tracker will allow you to set stateStorageStrategy to none on initialisation. This will prevent both cookies and local storage from being used.

This does have a couple of consequences however:

  • You will no longer have a domain_userid so if you are relying on this to track users then it will no longer work.
  • The session cookie will not be stored, meaning each page view will appear as a new session.
  • Events will not be cached in Local Storage. This means events that fail to send (due to connectivity issues) may be lost and any batching of events to reduce requests will not occur, likely leading to more requests being sent from the browser.

One option would be to initially load the tracker with stateStorageStrategy set to none, then once the user has given active consent, load another instance of the tracker, this time with stateStorageStrategy set to cookieAndLocalStorage. This way you could have two trackers and wouldn’t lose any of the initial tracking before a user consents. You can decide which tracker to direct the event to when calling the track... methods by naming them (see https://github.com/snowplow/snowplow/wiki/1-General-parameters-for-the-Javascript-tracker#25-managing-multiple-trackers).
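
A minimal sketch of that two-tracker setup, written in TypeScript. It assumes the tag-style API from the wiki page linked above (`newTracker` plus the `method:trackerName` suffix); the tracker names `anon` and `full`, the collector host, and the app ID are placeholders, and the `SnowplowFn` type is only my guess at the shape of the global `snowplow` function. Please check the method names against the tracker version you run.

```typescript
// Assumed shape of the global `snowplow` function created by the tracker tag.
type SnowplowFn = (method: string, ...args: unknown[]) => void;

// Placeholder values, not real endpoints.
const COLLECTOR = "collector.example.com";
const APP_ID = "my-site";

// Pre-consent tracker: no cookies, no local storage.
function initAnonymousTracker(sp: SnowplowFn): void {
  sp("newTracker", "anon", COLLECTOR, {
    appId: APP_ID,
    stateStorageStrategy: "none",
  });
}

// Post-consent tracker: call this only once the user has actively consented.
function initConsentedTracker(sp: SnowplowFn): void {
  sp("newTracker", "full", COLLECTOR, {
    appId: APP_ID,
    stateStorageStrategy: "cookieAndLocalStorage",
  });
}

// Route the page view to exactly one of the two named trackers.
function trackPageView(sp: SnowplowFn, hasConsent: boolean): void {
  sp(hasConsent ? "trackPageView:full" : "trackPageView:anon");
}
```

On a real page you would pass `window.snowplow` as `sp` and call `initConsentedTracker` from your consent banner’s callback. Note that this only controls state the JavaScript tracker itself stores in the browser.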

To add to what @PaulBoocock said, the collector also sets a cookie (network_userid). This cookie was historically used as a 3rd-party cookie, but has now become the best choice for a 1st-party server-side cookie.

Currently, you cannot make the setting of this cookie conditional. You either have to disable it altogether (meaning you will only be able to rely on client-side cookies even after consent is given) or – if enabled – the collector will always respond with a Set-Cookie header to requests made by the tracker.

So, if you do not want to set any cookies before obtaining consent, I think you will have to suspend the loading of the Snowplow tracker until consent is given.

To capture page views without PII pre-consent, you might consider having a second collector with the server-side cookie disabled, and choosing which collector to send data to based on whether the user has consented.

Thank you so much @PaulBoocock and @dilyan! I understand the greatest blocker right now would be the collector cookie. We will need to evaluate our options. In the end it is probably best to have the cookie consent banner block all tracking scripts unless consent is given.

Hello, I have been looking at GDPR compliance on Android. Thanks for adding some clarity to this. As it’s been a year is this still accurate?

In addition, some of the other libraries we use have interfaces or public constructors that allow us to subclass, which means we can use the library as is with a no-op subclass and then, once we get permission, instantiate the actual library.

Are there any plans for something like this or more comprehensive consent management?

Yes, we extended our native mobile trackers adding the GDPR context feature in order to track each data event with a clear explanation of their purpose. Now, the mobile trackers are fully aligned with the same GDPR feature already available in the web tracker.

At the moment we don’t have anything planned on this topic. However, the mobile trackers are quite flexible and can be reconfigured at runtime within the app. We haven’t introduced anything specific to handle the two phases, before and after GDPR consent, but you can start with an initial tracker setup that tracks anonymous data about user behaviour and, once GDPR consent has been granted, reconfigure the tracker for more detailed tracking, e.g. setting the IP address and a custom userID, and enabling session tracking and the mobile context.
