I have an unstructured event that occurs multiple times after a page view.
What I would like is a column in my page_views table counting those events.
My concern is the case where events arrive after the dbt models have already run, e.g.:
1. a page view occurs
2. the dbt models run
3. more events for that page view arrive
This could be handled by creating a separate table and joining the two tables in a view, but that would make querying the data slower and more costly.
What you could do in this case is leverage the model's incremental logic to make sure you fully process all events that occur within the same session, using a custom page_view model and disabling the original one. Let me try to break it down into the steps as I envision they could work:
I would start by creating a custom page_view model that first calculates your unstructured-event count logic in a CTE referencing the base_events_this_run table. If new events arrive within the same session after the first run of the model, this count will increase during subsequent incremental runs.
Then, in a subsequent CTE, left join the results to the page_views_this_run table (which is what the original page_view incremental model takes as its base). Each time an incremental run happens, the unstructured event count can simply be replaced for the relevant page views thanks to the full session reprocessing logic, so there is no need to build in additions or refer back to the original table; you can take the new count value calculated in the first CTE as is.
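A minimal sketch of such a custom model, assuming the `this_run` models are named as in recent snowplow_web versions and that your event rows carry a `page_view_id` to join on (the event name and count column below are placeholders):

```sql
{{
  config(
    materialized='incremental',
    unique_key='page_view_id',
    upsert_date_key='collector_tstamp',
    tags=['snowplow_web_incremental']
  )
}}

with event_counts as (
  -- count the unstructured events per page view from this run's events
  select
    ev.page_view_id,                         -- assumed join key; adapt to how your event links to the page view
    count(*) as my_event_count               -- placeholder column name
  from {{ ref('snowplow_web_base_events_this_run') }} as ev
  where ev.event_name = 'my_unstruct_event'  -- placeholder event name
  group by 1
)

select
  pv.*,
  coalesce(ec.my_event_count, 0) as my_event_count
from {{ ref('snowplow_web_page_views_this_run') }} as pv
left join event_counts as ec
  on pv.page_view_id = ec.page_view_id
```

Because full sessions are reprocessed on each run, the upsert replaces the whole row, count included, so no additive logic is needed.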
After this, you should disable the out-of-the-box page_view incremental model in your dbt_project.yml file to avoid unnecessary runs and duplicate tables.
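Disabling can be done in dbt_project.yml; the exact model path depends on your snowplow_web package version, so treat this as a sketch:

```yml
models:
  snowplow_web:
    page_views:
      snowplow_web_page_views:
        +enabled: false
```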
To leverage the snowplow_web incremental logic, please make sure you tag your custom model with snowplow_web_incremental, either inside the model or in your dbt_project.yml file.
+tags: snowplow_web_incremental # Adds the tag to all models in the 'snowplow_web_custom_modules' directory
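In context, that line would sit in dbt_project.yml under your own project's models block, e.g. (the project name here is a placeholder):

```yml
models:
  my_project:
    snowplow_web_custom_modules:
      +tags: snowplow_web_incremental
```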
Assuming you only need that count within the page_views table, these changes should be enough and should not disrupt the downstream models.
I hope this works, please let me know if anything is unclear or there is a different intention.
Hi Agnes, thank you so much, that works perfectly. I didn't realize that was how it worked. Can you please also explain this?
I did almost what you said, except I made a model for the events and then joined that model into the new page_view model. It does work, but I'm using a parameter in the model, upsert_date_key='collector_tstamp'. What does that do? Is it just used to limit the query to the partition?
I am glad you were able to progress with this and it works for you overall!
The upsert_date_key is a required field for the Snowplow incremental logic to work. It is used to calculate the upper and lower limits for the incremental runs in order to limit the length of table scans on the destination table (for cost reduction and improved performance).
Based on that column, the lower limit is calculated by subtracting snowplow__upsert_lookback_days (30 days by default) from the earliest value of the column among this run's rows; the latest value becomes the upper limit. Then, depending on the warehouse, these boundaries are used in the merge or delete/insert operation to limit the table scan on the destination table during the upsert.
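As a rough illustration (not the package's exact generated SQL; table names are placeholders and the date function is warehouse-specific), the delete/insert variant is bounded roughly like this:

```sql
-- assuming upsert_date_key = 'collector_tstamp' and a 30-day lookback
delete from derived.page_views
where page_view_id in (select page_view_id from new_page_views)
  -- restrict the scan to the lookback window instead of the whole table
  and collector_tstamp >= (
    select dateadd('day', -30, min(collector_tstamp)) from new_page_views
  )
  and collector_tstamp <= (
    select max(collector_tstamp) from new_page_views
  );

insert into derived.page_views
select * from new_page_views;
```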