Snowplow-web 0.12.1 dbt package released

We are very happy to give you an early holiday gift in the form of the release of the snowplow-web v0.12.1 dbt package. This release adds a new variable to ensure deterministic behaviour on page views when there are stray page pings; see below for a more detailed explanation of what this means.

Due to a bug in this version, you should ensure you have at least version 0.12.2 installed to use the feature below.

Features

Add option to limit page view metrics to a session (Close #96)

Under The Hood

  • Use new Snowflake exclude syntax (Close #36)
  • Add action for generating docs for pages (Close #5)

Upgrading

To upgrade, simply bump the snowplow-web version in your packages.yml file. If you wish to keep the existing non-deterministic behaviour for page view processing, set the snowplow__limit_page_views_to_session variable to false in your dbt_project.yml.

Stray Page Pings

Suppose a user interacts with your webpage, stops interacting for long enough to end their session (perhaps they are browsing another site in a different tab), and then interacts with your page again. This can lead to what we refer to as Stray Page Pings. These page ping events have the same page view id as the earlier page pings, but a new session id. Because our dbt packages reprocess everything at a session level, the way we dealt with these pings was non-deterministic, i.e. the outcome could change depending on whether both of those sessions were included in the same run of the package. Note this only impacted the few aggregate measures in the page views table, such as absolute time and max scroll depth.
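To make the definition concrete, here is a minimal Python sketch that flags stray page pings in a small list of events. It is only an illustration, not the package's SQL: the event dictionaries, their field names (event_name, page_view_id, session_id) and the find_stray_pings helper are all hypothetical.

```python
def find_stray_pings(events):
    """Return page pings recorded in a different session from their page view event."""
    # Session in which each page view event itself was recorded.
    origin_session = {
        e["page_view_id"]: e["session_id"]
        for e in events
        if e["event_name"] == "page_view"
    }
    return [
        e
        for e in events
        if e["event_name"] == "page_ping"
        and e["page_view_id"] in origin_session
        and e["session_id"] != origin_session[e["page_view_id"]]
    ]


# Hypothetical events: the last ping shares pv1's page view id but has a new session id.
events = [
    {"event_name": "page_view", "page_view_id": "pv1", "session_id": "s1"},
    {"event_name": "page_ping", "page_view_id": "pv1", "session_id": "s1"},
    {"event_name": "page_ping", "page_view_id": "pv1", "session_id": "s2"},
]

stray = find_stray_pings(events)
print([(e["page_view_id"], e["session_id"]) for e in stray])  # [('pv1', 's2')]
```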

Consider a really simple example with two sessions and two page views, where the user returns to the same page after the first session has ended, so the second session contains some page pings that belong to the first page view:
[Image: stray_page_pings-Data setup]

State 1: Both sessions in the same run

In the old version of the package, how these pings were treated depended on whether both sessions were processed in the same run. When they were, the page view calculations (most noticeably absolute_time_in_s) would be made from both sessions, which, depending on the gap between them, could span days:
[Image: stray_page_pings-Both in same run (old)]

State 2: Sessions in different runs

When the sessions are processed in two different runs, there is no matching page view event in the second session, so the time is based only on the first session. In effect, the page pings in the second session are thrown away, as there is no page view event to associate them with in that run:
[Image: new]

Ensuring state 2

There was no way, outside of running all your data in a single run every time, to guarantee which of these two states your data would end up in. The new version of the package calculates the time and scroll depth on a page view id AND session id level to ensure you always end up in state 2 no matter what.
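The difference between the two states can be sketched with a toy model in Python. This is a simplified illustration under stated assumptions, not the package's actual SQL: the event fields, the timestamps, and the page_view_metrics and process helpers are hypothetical, and absolute time is modelled simply as the gap between the page view event and the last related event seen in the run.

```python
# Hypothetical events (tstamp in seconds). The ping at t=4000 is a stray ping:
# it carries pv1's page view id but belongs to session s2.
EVENTS = [
    {"event_name": "page_view", "page_view_id": "pv1", "session_id": "s1", "tstamp": 0},
    {"event_name": "page_ping", "page_view_id": "pv1", "session_id": "s1", "tstamp": 10},
    {"event_name": "page_ping", "page_view_id": "pv1", "session_id": "s1", "tstamp": 20},
    {"event_name": "page_ping", "page_view_id": "pv1", "session_id": "s2", "tstamp": 4000},
    {"event_name": "page_view", "page_view_id": "pv2", "session_id": "s2", "tstamp": 4010},
    {"event_name": "page_ping", "page_view_id": "pv2", "session_id": "s2", "tstamp": 4020},
]


def page_view_metrics(events, limit_to_session):
    """Toy absolute time per page view, computed only from the events in one run."""

    def key(e):
        # New behaviour groups on page view id AND session id; old on page view id only.
        return (e["page_view_id"], e["session_id"] if limit_to_session else None)

    latest = {}
    for e in events:
        latest[key(e)] = max(latest.get(key(e), e["tstamp"]), e["tstamp"])

    # Only page views whose page_view event is in this run produce a row.
    return {
        e["page_view_id"]: latest[key(e)] - e["tstamp"]
        for e in events
        if e["event_name"] == "page_view"
    }


def process(runs, limit_to_session):
    """Process sessions run by run, upserting results into a page views 'table'."""
    table = {}
    for sessions in runs:
        batch = [e for e in EVENTS if e["session_id"] in sessions]
        table.update(page_view_metrics(batch, limit_to_session))
    return table


same_run = [{"s1", "s2"}]
separate_runs = [{"s1"}, {"s2"}]

print(process(same_run, limit_to_session=False))       # {'pv1': 4000, 'pv2': 10}
print(process(separate_runs, limit_to_session=False))  # {'pv1': 20, 'pv2': 10}
print(process(same_run, limit_to_session=True))        # {'pv1': 20, 'pv2': 10}
print(process(separate_runs, limit_to_session=True))   # {'pv1': 20, 'pv2': 10}
```

In this toy model, grouping on page view id alone gives a result that depends on how sessions are batched into runs, while grouping on both ids gives the state 2 result either way.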

If you wish to disable this check, you can set the snowplow__limit_page_views_to_session variable to false in your project variables under snowplow_web, but note that this leaves your results non-deterministic:

# dbt_project.yml
...
vars:
    snowplow_web:
        snowplow__limit_page_views_to_session: false

We are working to identify a way to make the other outcome (including page pings from all sessions) deterministic as well, but this version does not provide it. Let us know if you would be interested in contributing to this!
