Assuming you aren't on Redshift, the base_events_this_run table in each package should contain any contexts from your events data. (If you are on Redshift/Postgres, this currently takes a lot of work to do; we are looking at ways to support it in the future but don't have anything locked down at the moment.)
If the logic for uniquely defining the context at a page view, session, and user level is pretty simple, I would suggest that you disable the _derived models in the package and overwrite them with a model that uses the relevant _this_run table and joins on your unique-per-level context (from the base_events_this_run table). This way you keep the benefits of all the logic and processing done on each of the derived tables, and upgrading remains easy in the future, but you can also add your context information. These are what we call custom incremental models.
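To disable the existing derived model, you can switch it off in your dbt_project.yml; a minimal sketch, assuming the standard snowplow_web layout (the exact model paths may differ between package versions):

# dbt_project.yml
models:
  snowplow_web:
    page_views:
      snowplow_web_page_views:
        +enabled: false

Your custom model then takes its place as the table that downstream consumers query.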
For page views and sessions, because we always re-process the whole session, the code would look something like this (pseudo-code, not tested):
-- my_custom_page_views_derived
{{
    config(
        materialized='incremental',
        unique_key='page_view_id',
        upsert_date_key='start_tstamp',
        sort='start_tstamp',
        dist='page_view_id',
        partition_by = snowplow_utils.get_value_by_target_type(bigquery_val = {
            "field": "start_tstamp",
            "data_type": "timestamp"
        }, databricks_val='start_tstamp_date'),
        cluster_by=snowplow_web.web_cluster_by_fields_page_views(),
        tags=["derived"],
        sql_header=snowplow_utils.set_query_tag(var('snowplow__query_tag', 'snowplow_dbt')),
        tblproperties={
            'delta.autoOptimize.optimizeWrite' : 'true',
            'delta.autoOptimize.autoCompact' : 'true'
        },
        snowplow_optimize = true
    )
}}
select
    a.*,
    b.my_custom_context
    {% if target.type in ['databricks', 'spark'] -%}
    , DATE(start_tstamp) as start_tstamp_date
    {%- endif %}
from {{ ref('snowplow_web_page_views_this_run') }} a
left join (
    -- YOUR LOGIC HERE
    select distinct page_view_id, my_custom_context from {{ ref('snowplow_web_base_events_this_run') }}
) b on a.page_view_id = b.page_view_id
where {{ snowplow_utils.is_run_with_new_events('snowplow_web') }} -- returns false if the run doesn't contain new events
Users you have to be a little more careful with, as we don't re-process whole users each run. The approach is the same, but you need to be more specific with your logic to get a unique value per domain_userid, e.g. you may need to query the derived sessions table instead of the sessions_this_run table. It depends on whether you just want the latest value or some aggregation.
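For example, a custom users model would mirror the page views example above (swapping in snowplow_web_users_this_run and unique_key='domain_userid'), with a join that deduplicates per user. A minimal sketch of taking the latest context value per user from the derived sessions table (again pseudo-code, not tested; my_custom_context is a placeholder):

-- inside my_custom_users_derived
from {{ ref('snowplow_web_users_this_run') }} a
left join (
    select domain_userid, my_custom_context
    from (
        select
            domain_userid,
            my_custom_context,
            -- keep only the most recent session's value per user
            row_number() over (partition by domain_userid order by start_tstamp desc) as rn
        from {{ ref('snowplow_web_sessions') }} -- derived table, not _this_run
        where my_custom_context is not null
    ) s
    where s.rn = 1
) b on a.domain_userid = b.domain_userid
where {{ snowplow_utils.is_run_with_new_events('snowplow_web') }}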
The main thing to make 100% sure of is that whatever logic you use returns a unique record per key for that table.
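To guard against accidental fan-out from the join, you could add a standard dbt uniqueness test on the key of your custom model; a quick sketch (the yml path is hypothetical):

# models/custom/custom_models.yml
version: 2

models:
  - name: my_custom_page_views_derived
    columns:
      - name: page_view_id
        tests:
          - unique
          - not_null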
This approach isn't 100% optimised, because you are querying some of the this_run tables more than once; adding the code directly to e.g. the snowplow_sessions_this_run logic would be the most efficient. But that would mean re-applying your changes any time we changed the logic of those tables, whereas overriding just the derived table model is much simpler and FAR less likely to need ongoing changes from you with new releases.