dbt_snowplow_web: Adding context data

Hi!

I’ve been wondering for quite some time what the best / recommended way is to get context data into the page_views, sessions and users models in the dbt snowplow web package.

I know there are examples in the repo, but I’m still stuck with a lot of problems.

What I’m doing right now:

I have a user_context. Starting from the base_events_this_run table, I’m building my own context_this_run table, and from there I’m building incremental context_pv, context_session and context_user tables. These then get joined to the respective models (page_views, sessions and users).
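To illustrate, my context_pv model is roughly this (simplified sketch; my_custom_context and the column list are just placeholders for the real context fields, and the incremental filter is omitted):

-- context_pv: one row per page view, built from my own context_this_run table
{{
  config(
    materialized='incremental',
    unique_key='page_view_id'
  )
}}

select
  page_view_id,
  collector_tstamp,
  my_custom_context
from {{ ref('context_this_run') }}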

This is easy for the page_views table, but for the sessions and users tables I have to replicate a lot of the logic Snowplow has already built just to be able to join the contexts and still do partition pruning. For example, I won’t have a user’s first-session start_tstamp in my models… Also, sessions can start earlier than what is in my this_run tables, so I won’t be able to merge them into the sessions table.

Right now I’m also materializing views that join the default models with my context models, but querying them becomes quite expensive. On the other hand, it is a real pain, and quite expensive, to re-run everything whenever the user context gains new fields.

What would you recommend? What is the best way to solve this? I was thinking about replacing all of the package’s default _this_run tables along the way and adding my contexts to each of them.

Thanks for your input!

Best
Andreas

Assuming you aren’t on Redshift, the base_events_this_run table in each package should contain any contexts from your events data. (If you are on Redshift/Postgres, this currently takes a lot of work; we are looking at ways to support it in the future but don’t have anything locked down at the moment.)
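For example, on Snowflake a custom context arrives as a VARIANT array column, so a hypothetical com.mycompany/user_context entity could be unpacked straight from base_events_this_run along the lines of the sketch below (the context column and field names are made up for illustration):

select
  page_view_id,
  domain_sessionid,
  domain_userid,
  -- first (usually only) entry of the custom context array
  contexts_com_mycompany_user_context_1[0]:user_tier::varchar as user_tier
from {{ ref('snowplow_web_base_events_this_run') }}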

If the logic for uniquely defining the context at a page view, session, and user level is pretty simple, I would suggest disabling the _derived models in the package and simply overwriting them with models that use the relevant _this_run table and join on your unique-per-level context (from the base_events_this_run table). This way you keep the benefits of all the logic and processing done for each of the derived tables, and upgrading remains easy in the future, while still adding your context information. These are what we call custom incremental models.

For page views and sessions, because we always re-process the whole session, the code would look something like this (pseudo-type code, not run):

-- my_custom_page_views_derived
{{
  config(
    materialized='incremental',
    unique_key='page_view_id',
    upsert_date_key='start_tstamp',
    sort='start_tstamp',
    dist='page_view_id',
    partition_by = snowplow_utils.get_value_by_target_type(bigquery_val = {
      "field": "start_tstamp",
      "data_type": "timestamp"
    }, databricks_val='start_tstamp_date'),
    cluster_by=snowplow_web.web_cluster_by_fields_page_views(),
    tags=["derived"],
    sql_header=snowplow_utils.set_query_tag(var('snowplow__query_tag', 'snowplow_dbt')),
    tblproperties={
      'delta.autoOptimize.optimizeWrite' : 'true',
      'delta.autoOptimize.autoCompact' : 'true'
    },
    snowplow_optimize = true
  )
}}


select
  a.*,
  b.my_custom_context
  {% if target.type in ['databricks', 'spark'] -%}
  , DATE(start_tstamp) as start_tstamp_date
  {%- endif %}
from {{ ref('snowplow_web_page_views_this_run') }} a
left join (
  -- YOUR LOGIC HERE: must return exactly one row per page_view_id
  select distinct page_view_id, my_custom_context from {{ ref('snowplow_web_base_events_this_run') }}
) b on a.page_view_id = b.page_view_id
where {{ snowplow_utils.is_run_with_new_events('snowplow_web') }} -- returns false if the run doesn't contain new events
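The sessions version would follow the same pattern, just keyed on domain_sessionid instead. A rough sketch (again pseudo-code, not run; the config block is omitted and would mirror the package’s snowplow_web_sessions config):

-- my_custom_sessions_derived (config as per the package's sessions model)
select
  a.*,
  b.my_custom_context
from {{ ref('snowplow_web_sessions_this_run') }} a
left join (
  -- your logic here: must return exactly one row per domain_sessionid
  select domain_sessionid, max(my_custom_context) as my_custom_context
  from {{ ref('snowplow_web_base_events_this_run') }}
  group by 1
) b on a.domain_sessionid = b.domain_sessionid
where {{ snowplow_utils.is_run_with_new_events('snowplow_web') }}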

Users you have to be a little more careful with, as we don’t re-process whole users each run, but the approach should be the same; you just need to be more specific with your logic to get a unique value per domain_userid, e.g. you may need to query the derived sessions table instead of the sessions_this_run table. It depends on whether you just want the latest value or some aggregation.
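For instance, a “latest value per user” version could look something like this (again just a sketch, not run; the qualify syntax assumes Snowflake/BigQuery/Databricks, and my_custom_sessions_derived stands for whatever you have called your custom sessions model):

-- my_custom_users_derived (config as per the package's users model)
select
  a.*,
  b.my_custom_context
from {{ ref('snowplow_web_users_this_run') }} a
left join (
  -- one row per domain_userid: take the value from that user's most recent session
  select domain_userid, my_custom_context
  from {{ ref('my_custom_sessions_derived') }}
  qualify row_number() over (partition by domain_userid order by start_tstamp desc) = 1
) b on a.domain_userid = b.domain_userid
where {{ snowplow_utils.is_run_with_new_events('snowplow_web') }}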

The main thing to make 100% sure of is that whatever logic you use returns a unique record per key for that table.


This approach isn’t 100% optimised, because you are querying some of the this_run tables more than once. Adding the code directly to e.g. the snowplow_web_sessions_this_run logic would be the most efficient, but then you would have to keep making changes any time we change the logic of those tables, whereas just changing the derived table model is much simpler and FAR less likely to need ongoing changes from you with new releases.

Thank you @Ryan,

how do you usually approach updates of the custom contexts that are joined to the incremental sessions, page_views and users tables? We did it a couple of times by reprocessing all of the Snowplow models, but it is way too expensive. I’m wondering if there are some best practices for doing this without having to re-process tens of terabytes of data?

In terms of historic changes to the records? Unfortunately, to extract those contexts you are going to need to scan the events table regardless of whether you full-refresh the package or just do a nice update/merge statement to join them into existing records. If you have terabytes of event data, you need to scan it to get the context out (though you can reduce the cost by selecting only the required columns) to join back to pv/sessions/users.

One thing to be careful of: in some warehouses, sub-queries in an UPDATE statement aren’t efficient (the subquery runs per row), so it’s best to use a MERGE or build a smaller temp table first.
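For example, on Snowflake a one-off backfill of a new context field into the derived page views table could look roughly like this (schema, table and column names are hypothetical; it assumes the new column already exists on the derived table, e.g. added by your custom model, and that you can bound the events scan by timestamp):

-- backfill user_tier into the derived page views table without a full refresh
merge into analytics_derived.snowplow_web_page_views as pv
using (
    -- scan only the columns (and time range) you actually need from the events table
    select
        contexts_com_snowplowanalytics_snowplow_web_page_1[0]:id::varchar as page_view_id,
        max(contexts_com_mycompany_user_context_1[0]:user_tier::varchar)  as user_tier
    from atomic.events
    where collector_tstamp >= '2023-01-01'  -- limit the scan where possible
    group by 1
) as ctx
on pv.page_view_id = ctx.page_view_id
when matched then update set pv.user_tier = ctx.user_tier;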
