Web traffic driven campaign tracking with Snowplow [tutorial]

Campaign tracking with Snowplow

In one of our blog posts a few years back, we raised awareness of the complexity web analysts face when trying to answer questions like:

  • Which sites and marketing campaigns are driving visitors to your website?
  • How valuable are those visitors?
  • What should you be doing to drive up the number of high-quality users?

We pointed out the importance of examining both the page URL and the referer URL. For this purpose, we introduced the corresponding enrichments: the Referer Parser Enrichment and the Campaign Attribution Enrichment.

In this tutorial, we introduce Snowplow's practical approach to addressing this problem for web-driven traffic. If you are interested in tracking mobile-driven campaigns, please refer to the separate two-part tutorial on that topic.

What is a referer?

When you load a web page in your browser, the browser makes an HTTP request to a web server to deliver that page. That request includes a header field that identifies the address of the web page that linked to the resource being requested: this is called the HTTP referer.

Web analytics programs typically read the HTTP referer header or JavaScript’s document.referrer, and use that page referer data as one of the inputs to infer where a visitor has come from.

Note that we normally use the original HTTP misspelling of “referer” as opposed to “referrer”.

Here’s an example of an HTTP request:

GET https://www.properweb.ca/ HTTP/1.1
Host: www.properweb.ca
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: https://www.google.ca/
Accept-Encoding: gzip, deflate, sdch, br
Accept-Language: en-US,en;q=0.8,ru;q=0.6
Cookie: PHPSESSID=dlq83b1gj0hqomib74ibkpovi7; sc_is_visitor_unique=rx10185912.1470435733.4A1DC2BF0E334F4118E9003B3BE41D25.

The Referer header identifies the page that linked to the resource requested from the Host. In the example above, we can see the request came from Google. On its own, this is of limited help: we all know what Google is, but what if the request came from another, less well-known source?

We want to know more about it, and the Referer Parser Enrichment can help us.

Before going into the details of the enrichment, I would like to remind you that Snowplow is able to extract the value of the Referer header field. It is stored in the page_referrer column of the atomic.events table. Moreover, the referer URL is further broken down into its component parts, which populate the following columns of the atomic.events table:
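As a rough illustration of where this value comes from, the sketch below pulls the Referer header out of a raw request like the one above. The function name `extract_referer` and the shortened request text are assumptions made for this example; this is not how Snowplow itself reads the header.

```python
# A shortened copy of the example request above (CRLF line endings, as in HTTP/1.1).
RAW_REQUEST = (
    "GET https://www.properweb.ca/ HTTP/1.1\r\n"
    "Host: www.properweb.ca\r\n"
    "Connection: keep-alive\r\n"
    "Referer: https://www.google.ca/\r\n"
    "Accept-Language: en-US,en;q=0.8,ru;q=0.6\r\n"
    "\r\n"
)


def extract_referer(raw_request):
    """Return the Referer header value, or None if the header is absent."""
    lines = raw_request.split("\r\n")
    # lines[0] is the request line; header fields follow until the first empty line.
    for line in lines[1:]:
        if not line:
            break
        name, _, value = line.partition(":")
        # Header field names are case-insensitive.
        if name.strip().lower() == "referer":
            return value.strip()
    return None


print(extract_referer(RAW_REQUEST))
```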

  • refr_urlscheme - the protocol (ex. http, ftp)
  • refr_urlhost - the host of the web server (domain name)
  • refr_urlport - port of the server to obtain the resource (ex. 80)
  • refr_urlpath - path to the document
  • refr_urlquery - querystring of the referer URL
  • refr_urlfragment - identifier of the page section following # in the URL of the document
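The breakdown above can be approximated with Python's standard library. This is only a sketch: the function name `split_referer` is hypothetical, and filling in the default port when the URL carries none is an assumption of this example rather than a statement about the Snowplow pipeline.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: fall back to the scheme's well-known port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_referer(page_referrer):
    """Split a referer URL into fields named after the atomic.events columns."""
    parts = urlsplit(page_referrer)
    return {
        "refr_urlscheme": parts.scheme or None,
        "refr_urlhost": parts.hostname,
        "refr_urlport": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "refr_urlpath": parts.path or None,
        "refr_urlquery": parts.query or None,
        "refr_urlfragment": parts.fragment or None,
    }


for column, value in split_referer("https://www.google.ca/search?q=domain+names#top").items():
    print(column, "=", value)
```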

Referer Parser Enrichment

If the Referer Parser Enrichment is enabled, the referer is further examined and compared against the referers.yml database. The database contains four sections, each representing a medium:

  • unknown - for when we know the source, but not the medium
  • email - for webmail providers
  • social - for social media services
  • search - for search engines

Additionally, the referer's domain name (the value of refr_urlhost in the atomic.events table) is used to determine whether this is an internal referer, that is, whether the request came from within your own network. This is done by comparing it with the domain names listed in the internalDomains parameter of referer_parser.json.

As a result, the referer dimension is widened by populating additional columns of atomic.events, as outlined below.

  • refr_medium - Type of referer (ex. ‘search’, ‘internal’)
  • refr_source - Name of referer if recognised
  • refr_term - Keywords if source is a search engine

NOTE: Since Google started encrypting search terms, it is no longer possible to infer them from the referer URL. Google strips the search query from the “q=” parameter (q=search+query) in its referer string.
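The lookup described in this section can be sketched as follows. Everything here is a simplification under stated assumptions: `KNOWN_REFERERS` is a tiny hypothetical stand-in for referers.yml (the real database is far larger and is not matched by exact host alone), `classify_referer` and the domain blog.properweb.ca are invented for illustration, and falling back to "unknown" for unrecognised hosts is an assumption of this sketch.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical, heavily shortened stand-in for referers.yml:
# host -> (medium, source name, querystring parameters that carry search terms)
KNOWN_REFERERS = {
    "www.google.ca": ("search", "Google", ["q"]),
    "www.bing.com": ("search", "Bing", ["q"]),
    "twitter.com": ("social", "Twitter", []),
    "mail.yahoo.com": ("email", "Yahoo! Mail", []),
}


def classify_referer(page_referrer, internal_domains=()):
    """Return refr_medium, refr_source and refr_term for a referer URL."""
    parts = urlsplit(page_referrer)
    host = parts.hostname
    # Internal referers are decided by the internalDomains list alone.
    if host in internal_domains:
        return {"refr_medium": "internal", "refr_source": None, "refr_term": None}
    entry = KNOWN_REFERERS.get(host)
    if entry is None:
        # Assumption of this sketch: unrecognised hosts get the "unknown" medium.
        return {"refr_medium": "unknown", "refr_source": None, "refr_term": None}
    medium, source, term_params = entry
    query = parse_qs(parts.query)  # blank values such as "q=" are dropped
    term = next((query[name][0] for name in term_params if name in query), None)
    return {"refr_medium": medium, "refr_source": source, "refr_term": term}


print(classify_referer("https://www.bing.com/search?q=domain+names"))
print(classify_referer("https://www.google.ca/"))
print(classify_referer("https://blog.properweb.ca/post", ["blog.properweb.ca"]))
```

The second call shows the situation from the note above: the source is recognised as a search engine, but no term can be recovered because the referer carries no query.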

Enabling referer enrichment

The Referer Parser is one of the so-called configurable enrichments provided by Snowplow, and it is easy to use. First, we need to prepare the enrichment configuration file, referer_parser.json.

{
	"schema": "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/1-0-0",
	"data": {
		"name": "referer_parser",
		"vendor": "com.snowplowanalytics.snowplow",
		"enabled": true,
		"parameters": {
			"internalDomains": []
		}
	}
}

To distinguish internal referers, add the domain names of your own network that link to the resource. Whenever the refr_urlhost value matches a domain name from the internalDomains list, the refr_medium column of the atomic.events table will be populated with “internal”.

	"parameters": {
		"internalDomains": [

Add the configuration file to your “enrichments” folder (or whatever name you chose) and run EmrEtlRunner with the --enrichments parameter:

$ ./snowplow-emr-etl-runner --config config.yml --resolver resolver.json --enrichments enrichments

That is all there is to it. The enrichment process will take care of widening the dimensions of your events.
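If you maintain the list of internal domains elsewhere, the contents of referer_parser.json can be generated rather than edited by hand. The sketch below assumes two hypothetical domains (blog.properweb.ca and shop.properweb.ca) and a hypothetical helper name, `build_referer_parser_config`; the schema and field names are the ones shown in the configuration above.

```python
import json


def build_referer_parser_config(internal_domains):
    """Build the referer_parser.json contents for the given internal domains."""
    return {
        "schema": "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/1-0-0",
        "data": {
            "name": "referer_parser",
            "vendor": "com.snowplowanalytics.snowplow",
            "enabled": True,
            "parameters": {"internalDomains": list(internal_domains)},
        },
    }


# Hypothetical domain names, used only for illustration.
config = build_referer_parser_config(["blog.properweb.ca", "shop.properweb.ca"])
print(json.dumps(config, indent=4))
```

The printed JSON can then be saved as referer_parser.json in the enrichments folder.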

Campaign Attribution Enrichment

Page referers are a technical solution to identifying where traffic comes from. In addition, digital marketers may want to label incoming traffic so that they can identify which marketing campaigns that traffic should be attributed to. This is typically done by adding a querystring to the landing page URL.

In other words, you have to build the link leading to your resource.

To give an example, let’s imagine that I am marketing the website www.properweb.ca. I run a campaign on AdWords called “September sale”. In my AdWords ad, I include a link (that I hope viewers of the ad will click) to my domain names webpage. However, instead of just including the standard link in my ad, i.e.

<a href="https://www.properweb.ca/domain-names/">domain names discount</a>

I add a query parameter onto the end of my link labelling the campaign:

<a href="https://www.properweb.ca/domain-names/?utm_campaign=September%20sale">www.properweb.ca/domain-names/</a>

Adding the query parameter does not change the experience of the user clicking on the ad. Then, on the landing page (in this case, the www.properweb.ca/domain-names/ web page) the web analytics JavaScript tag will pass the querystring to Snowplow, which can then infer that the traffic should be attributed to the “September sale”.

Different web analytics programs look for different query parameters when assigning traffic to marketing campaigns. We follow the same naming convention used by Google Analytics, which makes for an easy transition from it. The parameters are summarised below:

  • utm_medium - The advertising or marketing medium, for example, cpc, banner, email newsletter.
  • utm_source - Identifies the advertiser, site, publication, etc. that is sending traffic to your resource.
  • utm_term - Identifies the search terms that triggered the ad being displayed in the search results.
  • utm_content - Used to differentiate similar content, or links within the same ad. For example, if you have two call-to-action links within the same email message, you can use utm_content and set different values for each so you can tell which version is more effective.
  • utm_campaign - The individual campaign name, slogan, promo code, etc. for a product.
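Building tagged links by hand is error-prone, so it can help to generate them. The following is a minimal sketch using only the standard library; the helper name `tag_url` and the medium and source values are assumptions chosen for the example.

```python
from urllib.parse import quote, urlencode, urlsplit, urlunsplit


def tag_url(landing_url, **utm):
    """Append utm_* parameters to a landing page URL, keeping any existing query."""
    parts = urlsplit(landing_url)
    params = {f"utm_{name}": value for name, value in utm.items()}
    # quote_via=quote encodes spaces as %20, matching the link shown earlier.
    new_query = urlencode(params, quote_via=quote)
    query = "&".join(q for q in (parts.query, new_query) if q)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


print(tag_url(
    "https://www.properweb.ca/domain-names/",
    campaign="September sale",
    medium="cpc",
    source="google",
))
```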

Additionally, we introduced mkt_clickid, which serves as a tracking parameter identifying the marketing network. The enrichment automatically knows about Google (corresponding to the “gclid” querystring parameter), Microsoft (“msclkid”), and DoubleClick (“dclid”). However, you can add your own identifier (key), giving the name of your desired network as its value.

Enabling campaign attribution enrichment

As with the referer parser enrichment, we have to add a campaign_attribution.json configuration file to the directory holding all our configurable enrichments. By doing so, you enable the Campaign Attribution Enrichment.

Below is an example:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/campaign_attribution/jsonschema/1-0-1",
    "data": {
        "name": "campaign_attribution",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": true,
        "parameters": {
            "mapping": "static",
            "fields": {
                "mktMedium": ["utm_medium", "medium"],
                "mktSource": ["utm_source", "source"],
                "mktTerm": ["utm_term", "legacy_term"],
                "mktContent": ["utm_content"],
                "mktCampaign": ["utm_campaign", "cid", "legacy_campaign"],
                "mktClickId": {
                    "customclid": "My Network"
                }
            }
        }
    }
}

Note that the parameter actually included in the page URL can have an arbitrary name. That is, you can combine the campaign attribution parameters used by different analytics platforms or campaign management tools.

Thus, in the example above, the marketing campaign can be inferred from any of three parameters in the page querystring: utm_campaign, cid, or legacy_campaign. If more than one is encountered, the first one takes precedence.

Therefore, campaign_attribution.json can be viewed as a mapping between the parameters submitted in the querystring and the corresponding columns in the atomic.events table. Specifically, the relationship is as follows:

  • mkt_medium ← mktMedium
  • mkt_source ← mktSource
  • mkt_term ← mktTerm
  • mkt_content ← mktContent
  • mkt_campaign ← mktCampaign
  • mkt_clickid (key) & mkt_network (value) ← mktClickId
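The mapping can be illustrated with a short sketch that applies the example configuration to a landing page URL. It rests on several assumptions: "the first one takes precedence" is read as the first name in the configured list, mkt_clickid is taken to hold the value of the click ID parameter while mkt_network holds the network name, and the function name `attribute_campaign` is hypothetical. It is an approximation of the behaviour described here, not the enrichment's actual code.

```python
from urllib.parse import parse_qs, urlsplit

# The "fields" section of the example campaign_attribution.json, minus mktClickId.
FIELDS = {
    "mktMedium": ["utm_medium", "medium"],
    "mktSource": ["utm_source", "source"],
    "mktTerm": ["utm_term", "legacy_term"],
    "mktContent": ["utm_content"],
    "mktCampaign": ["utm_campaign", "cid", "legacy_campaign"],
}
# Click IDs named in the text plus the custom one from the example config.
CLICK_IDS = {
    "gclid": "Google",
    "msclkid": "Microsoft",
    "dclid": "DoubleClick",
    "customclid": "My Network",
}
COLUMNS = {
    "mktMedium": "mkt_medium",
    "mktSource": "mkt_source",
    "mktTerm": "mkt_term",
    "mktContent": "mkt_content",
    "mktCampaign": "mkt_campaign",
}


def attribute_campaign(page_url):
    """Derive the mkt_ columns from the querystring of a landing page URL."""
    query = parse_qs(urlsplit(page_url).query)
    row = {}
    for field, names in FIELDS.items():
        # Assumption: the first configured name present in the querystring wins.
        row[COLUMNS[field]] = next((query[n][0] for n in names if n in query), None)
    click_id, network = next(
        ((query[key][0], name) for key, name in CLICK_IDS.items() if key in query),
        (None, None),
    )
    row["mkt_clickid"] = click_id
    row["mkt_network"] = network
    return row


url = "https://www.properweb.ca/domain-names/?cid=autumn&utm_campaign=September%20sale&gclid=abc123"
print(attribute_campaign(url))
```

In this example both cid and utm_campaign are present; utm_campaign is listed first in the configuration, so "September sale" is the value that ends up in mkt_campaign.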

Further reading


We expose both page_url and page_referrer. The data in the mkt_ columns reflects tagged (paid) campaigns as opposed to organic/search activity, and answers the question of which marketing campaigns the traffic should be attributed to. The data in the refr_ columns, on the other hand, indicates where the traffic comes from. Combining the analysis of data in both the mkt_ and refr_ columns has several benefits:

  1. It leads to more intelligent and robust inferences about where your traffic comes from.
  2. It identifies surprising results related to the placement of your paid campaigns, which may have significant implications for your overall marketing strategy.
  3. It makes it possible to identify and manage errors that are invariably introduced in the data.

Having said that, it is up to you how to combine the mkt_ and refr_ fields. This differs from, for example, the Google Analytics approach, which combines them directly, setting the value of the medium, source, term, etc. based on the utm_ parameters if available, and on the referer data if not.
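For comparison, a Google Analytics style fallback can be reproduced on top of the Snowplow columns with a few lines. This is only a sketch of one possible combination rule; the function name `combine_ga_style` and the sample events are hypothetical, and an event is represented here as a plain dictionary keyed by atomic.events column names.

```python
def combine_ga_style(event):
    """Prefer the mkt_ value for each dimension and fall back to the refr_ value."""
    combined = {}
    for dimension in ("medium", "source", "term"):
        combined[dimension] = event.get(f"mkt_{dimension}") or event.get(f"refr_{dimension}")
    return combined


# Hypothetical events for illustration.
tagged_visit = {
    "mkt_medium": "cpc",
    "mkt_source": "google",
    "refr_medium": "search",
    "refr_source": "Google",
}
organic_visit = {"refr_medium": "search", "refr_source": "Bing", "refr_term": "domain names"}

print(combine_ga_style(tagged_visit))
print(combine_ga_style(organic_visit))
```

Keeping the two sets of columns separate in atomic.events means a rule like this is a choice you make at analysis time, and you can change it later without reprocessing the data.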
