Snowplow-web 0.1.0 dbt package released

We are very excited to announce the release of snowplow-web v0.1.0 dbt package. This package replicates the standard, sql-runner based web model in dbt.

This package only supports Redshift. BigQuery and Snowflake support will follow in later releases.

What the package brings

  • Transforms and aggregates raw web event data collected from the Snowplow JavaScript tracker into a set of derived tables: page views, sessions and users.
  • Processes events incrementally.
  • Is designed in a modular manner, allowing you to easily integrate your own custom SQL into the incremental framework provided by the package.
  • A custom incremental materialisation, to reduce table scans during the upsert stage.
  • A full suite of tests to ensure data integrity.
  • A comprehensive data dictionary.

Modules

This package consists of a series of modules, each producing a table which serves as the input to the next module.

The ‘standard’ modules are:

  • Base: Performs the incremental logic, outputting the table snowplow_web_base_events_this_run which contains a de-duped data set of all events required for the current run of the model.
  • Page Views: Aggregates event level data to a page view level, page_view_id.
  • Sessions: Aggregates page view level data to a session level, domain_sessionid.
  • Users: Aggregates session level data to a users level, domain_userid.

Incremental framework

The majority of the incremental logic sits within the base module and performs the ‘heavy-lifting’ for you. The logic is as follows:

  • Identify new or late arriving events since the last run of the package.
  • Identify the domain_sessionid associated with these new/late events.
  • Reprocess all events associated with the domain_sessionid. This ensures when aggregating to a session level we have all the events associated with the session.

This de-duped dataset is then written to an events_this_run table, containing all the required events for the given run of the web model.

Customizations

The events_this_run table removes complexity when adding your own customisations. You can now write drop and recompute style SQL using the events_this_run as a source, without having to worry about which events to select.

Furthermore this reduces cost and improves performance. Since the events_this_run table is shared between the standard modules and your customizations, we negate the need to query the raw events table multiple times. For more information on writing custom SQL, please refer to the docs. An example dbt project demonstrating customisations can also be found within the repo

More information

Checkout the web data model section of Snowplow Docs for more information on the models structure.

Checkout the snowplow-web package docs for a quickstart guide as well as an explanation of operating and configuring the package.

5 Likes