Last July we posted an RFC: making the Snowplow pipeline real-time end-to-end and deprecating support for batch processing modules.
Since publishing that post and consulting with a number of members of our community, we have ratified that decision. We will be deprecating support for the Clojure Collector and Spark Enrich effective from February 28. Our strong recommendation therefore is for users of our batch technology on AWS to migrate to using the Scala Stream collector instead of the Cloudfront or Clojure collectors, and Stream Enrich rather than Spark Enrich. Read on to learn more about:
- Important context for anyone running the Clojure Collector
- Additional benefits of upgrading to the streaming architecture
- The steps required to upgrade
1. Important context for anyone running the Clojure Collector
AWS is deprecating support for Tomcat 8 on Elastic Beanstalk on March 1st 2020
The Clojure collector uses Tomcat 8 on Elastic Beanstalk, support for which AWS is deprecating on March 1st. As a result, users of the Clojure Collector should plan to migrate ahead of that date.
The Scala Stream collector does not use Elastic Beanstalk or Tomcat 8 under the hood, so upgrading resolves this issue.
Google Chrome is being updated to treat cookies set without the SameSite attribute as first party rather than third party
On February 4th, Google intends to update Chrome so that cookies set without a SameSite attribute will be treated as first-party-only rather than third party. This means that a cookie set without the attribute will no longer function as a reliable third party cookie.
The Clojure collector does not support setting the SameSite attribute. As a result, anyone running the Clojure collector and using it for third party tracking of users across multiple domains will find that the network_userid no longer persists across page views and domains.
The Scala Stream collector, by contrast, does support setting that attribute, so upgrading will address this issue.
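To illustrate, the Scala Stream collector is configured via a HOCON file whose cookie section controls the network_userid cookie. The field names and values below are an illustrative sketch based on recent collector versions, not a definitive reference; check the configuration documentation for the version you deploy:

```hocon
collector {
  cookie {
    enabled = true
    name = sp
    expiration = "365 days"
    # Domains on which the collector may set its first party cookie;
    # the collector picks the one matching the request's Origin
    domains = ["example.com", "example.co.uk"]
    # SameSite=None keeps the cookie usable in a third party context
    # under Chrome's new default; Secure is required alongside it
    sameSite = "None"
    secure = true
  }
}
```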
2. Additional benefits for anyone upgrading to the new architecture
The Scala Stream collector supports a number of other features that are of significant benefit to companies tracking visitors on the web:
- The ability to reliably track visitors using Safari via a first party, server-set cookie. ITP 2.1 means that cookies set client-side (e.g. the Snowplow domain_userid) are expired after a maximum of only 7 days. The Scala Stream collector can set the server-set network_userid cookie on multiple domains, based on which domain a user is being tracked on, so it can be used as a first party cookie on more than one domain. It can therefore be used to perform reliable first party tracking of visitors using Safari.
- The ability to set custom collector paths. This prevents ad blockers from blocking Snowplow tracking.
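As a rough sketch of the custom-paths feature: the collector's configuration can map a custom request path onto a standard tracker endpoint, so trackers can post to a path that ad-blocker filter lists are less likely to match. The path names below are hypothetical, and the exact config key may vary by collector version:

```hocon
collector {
  paths {
    # Map a custom, unrecognisable path to the standard
    # POST endpoint that trackers normally send events to
    "/com.acme/t" = "/com.snowplowanalytics.snowplow/tp2"
  }
}
```

The tracker on the website would then need to be initialised with the same custom path so its requests reach the remapped endpoint.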
The streaming architecture is more robust and the data available at lower latency
With the streaming architecture it is possible to access the data at much lower latency (e.g. query the data in Elasticsearch within seconds, and in Redshift within minutes). Streaming components like Stream Enrich can be configured to scale automatically, making the pipeline robust in handling dramatic traffic spikes, for example.
We’re moving the streaming architecture forward
We are already channeling our efforts, previously split across batch and streaming, into pushing our now-focused streaming architecture forward. We will be releasing more features, more often, and have an exciting roadmap planned for 2020.
3. The steps required to perform the upgrade
We will be posting these in a new thread shortly, and linking to them from here.