When creating custom incremental models, e.g. an events table that contains some structured events, backfilling is quite costly. The snowplow_web_base_sessions_lifecycle_manifest in particular creates a huge query, in our case around 2.5 TB. Usually we can do around 10 days of backfilling each run, and sometimes we have to backfill 1-2 years of data, so recreating the lifecycle manifest each time is quite an expensive operation. Are there any pointers on how to limit the scan of the table? I've tried setting the session_lookback_days window shorter, but that doesn't seem to have any impact.
So this table really only does three things: it scans the main events table within the date range of your run, it checks against quarantined sessions, and it scans itself (again filtered on date).
If you have 2.5TB worth of event data in that date range, this is unfortunately just the nature of event data; there isn't anything you or we can do to optimise the scan beyond filtering on the date (which, depending on your warehouse, will be the sort key/(micro)partition column) and selecting only the columns we need. The same goes for the self-scan.
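To make those three scans concrete, the shape of the query is roughly as follows. This is only an illustrative sketch, not the actual package SQL: the source()/ref() names follow the usual snowplow_web conventions but may differ in your project, and the hard-coded limits are placeholders standing in for the run limits the package calculates.

{# placeholder run limits, purely for illustration #}
{% set lower_limit = '2023-01-01' %}
{% set upper_limit = '2023-01-11' %}

with new_sessions as (
    -- scan 1: the atomic events table, filtered on the partition/sort column
    select
        domain_sessionid as session_id,
        min(collector_tstamp) as start_tstamp,
        max(collector_tstamp) as end_tstamp
    from {{ source('atomic', 'events') }}
    where collector_tstamp between timestamp '{{ lower_limit }}' and timestamp '{{ upper_limit }}'
    group by 1
),

not_quarantined as (
    -- scan 2: the check against the (small) quarantined sessions table
    select ns.*
    from new_sessions as ns
    left join {{ ref('snowplow_web_base_quarantined_sessions') }} as qs
        on ns.session_id = qs.session_id
    where qs.session_id is null
)

-- scan 3: the self-scan of the manifest, also filtered on its date column
select nq.session_id, nq.start_tstamp, nq.end_tstamp
from not_quarantined as nq
left join {{ this }} as prev
    on nq.session_id = prev.session_id
    and prev.start_tstamp >= timestamp '{{ lower_limit }}'

On a partitioned warehouse each of those inputs should prune down to the run's date range, which is why the events table is the only part that can legitimately get big.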
If the bulk of the scan is instead coming from the quarantine filter, this would be a little unexpected.
When you’ve added the new models, are you doing a full refresh or are you just running it and allowing the package to back-fill for you?
Thanks a lot for the additional info @Ryan. We are not doing a full refresh when adding new models and use the Snowplow backfill logic instead.
For a normal run the snowplow_web_base_sessions_lifecycle_manifest will be around 25 GB, but for every single backfill loop it is around 2.5 TB. Our events table is less than 10 TB spanning 1.5 years, though, and snowplow__backfill_limit_days is set to 10 days. The quarantine table is a couple of MB, so this shouldn't be the problem.
Is there a way to simulate a backfill, so I can check what the limit variables in the lifecycle manifest will be set to?
You could just create a dummy model that selects a few columns from the _base_events_this_run table with the appropriate tags and config, and then a dbt run should generate the relevant ranges in your target folder. It might also populate on dbt compile, but I'm not 100% sure, as it won't yet be in the manifest table…
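Something along these lines should do as the dummy model (a minimal sketch following the usual snowplow_web custom-module conventions; the exact config keys and tag may vary between package versions, so check them against the docs for your version):

{{
  config(
    materialized='incremental',
    unique_key='event_id',
    upsert_date_key='collector_tstamp',
    tags=["snowplow_web_incremental"]
  )
}}

-- select a handful of columns from the base events-this-run table so the
-- package treats this like any other custom incremental model
select
    event_id,
    domain_sessionid,
    collector_tstamp
from {{ ref('snowplow_web_base_events_this_run') }}
where {{ snowplow_utils.is_run_with_new_events('snowplow_web') }}

After a dbt run that includes this model, the compiled SQL under target/run should show the concrete date limits that were used.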
I tried running a dummy model and there is something I don’t quite understand:
snowplow_web_base_sessions_lifecycle_manifest has a macro that checks for new events inside two of the CTEs. In all my runs this macro resolves to false, which in turn seems to cause the 2 TB query. If I remove it from both CTEs and check the query size, it goes down to around 25 GB.
I’ve tried to understand what the macro does, but I’m not quite sure as it’s fairly complex. Can you shed some light on how it works and if it’s intended to resolve to False?
I tried with a backfill run as well as a normal run and the query is always around 2TB.
This is the macro in question:
and {{ is_run_with_new_events }} --don't reprocess sessions that have already been processed.
And the compiled code:
and cast(False as boolean) --don't reprocess sessions that have already been processed.
So that value comes from a macro whose purpose is basically to avoid reprocessing data it doesn't need to, unless there are new events for that model as part of that run. In the base sessions manifest table it checks whether the upper limit of the run is above the max session start date.
I think that macro will always evaluate to false under compile, because there's an if execute block in there; basically, we don't try to check for new events until the models are actually being run. If you do a run and look in your target/run folder you should see cases where it is true.
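Very roughly, the macro boils down to something like the sketch below. This is not the actual snowplow_utils source (the real macro also works through the package's incremental manifest), just the gist of what I described: nothing is queried outside execute, and at run time it checks whether the run's upper limit extends past the latest session start already processed.

{% macro is_run_with_new_events_sketch(relation, upper_limit) %}
    {% if execute %}
        {# at run time: does the run's upper limit go beyond the latest
           session start already in the manifest? (coalesce handles an
           empty table, where there are trivially new events) #}
        {% set result = run_query(
            "select coalesce(max(start_tstamp) < timestamp '" ~ upper_limit ~ "', true) from " ~ relation
        ) %}
        {{ return(result.columns[0].values()[0]) }}
    {% else %}
        {# during compile we never hit the database, so this defaults to false #}
        {{ return(false) }}
    {% endif %}
{% endmacro %}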
The 2TB still feels alarming though: that model queries the events table filtered on collector (and derived, if specified) timestamp, which is the partition column, and then itself filtered on start_tstamp, which is also the partition column. Even the macro only calls a subquery on the same table, getting the max start_tstamp…
What warehouse are you using, and how are you determining this 2TB query? Did you fully delete the line with the macro call ({% set is_run_with_new_events = snowplow_utils.is_run_with_new_events('snowplow_web') %}), or just replace where we use the value (and {{ is_run_with_new_events }})?
Thanks for the additional info! It seems like the false is there in both the compiled and the run models. This is from the dbt log when I did a backfill run. 0 rows merged, but 2.1 TiB:
172 OK created sql incremental model snowplow.snowplow_web_base_sessions_lifecycle_manifest [MERGE (0.0 rows, 2.1 TiB processed) in 10.54s]
For my test I just copied the compiled SQL and removed the false from both CTE where clauses. We are using BigQuery as our warehouse.
Just to double check, what version of the package are you using? And if it’s >= 0.14.0 have you done the dispatch of the macros in your project yaml? Also what are you using to define a “backfill run”, as it seems odd that 0 rows were merged in that case…
If those answers don't help, I've got a few other things you can try to help us pinpoint the exact issue, but let's hold off on that until I'm certain it isn't something else.
Sorry for the late reply! We are on version 0.14.1 and have the following dispatch configuration:
# To take advantage of the optimization Snowplow applies to the incremental materialization, the search order has to be updated
dispatch:
  - macro_namespace: dbt
    search_order: ["snowplow_utils", "dbt"]
As for the backfill, I'm not really sure why it shows 0 rows in that test case.
I never found out what the problem with the backfilling and the 2TB scans was, but I had to do a full refresh anyway, because we had to update a lot of the models with new context information. Doing the full refresh somehow solved the 2TB backfills. A backfill run after the refresh only uses 50-70 GB for the scan of the lifecycle table, which is in the ballpark of what I was expecting.