Rebuilding custom model in dbt not working

I made some changes to a custom model in the snowplow web dbt package, so I wanted to replay the events from the beginning so that the recently added columns get populated.

According to the documentation on GitHub, this can be done with:

dbt run --select +snowplow_web_custom_incremental_model --full-refresh --vars 'models_to_remove: snowplow_web_custom_incremental_model'

Which I did; the model was removed from the manifest table, but I get this error log in dbt Cloud:

10:12:53  Running with dbt=1.1.1
10:12:53  Partial parse save file not found. Starting full parse.
10:12:57  Found 52 models, 137 tests, 8 snapshots, 0 analyses, 551 macros, 4 operations, 1 seed file, 34 sources, 0 exposures, 0 metrics
10:13:09  Running 2 on-run-start hooks
10:13:09  1 of 2 START hook: xxx.on-run-start.0 ................................. [RUN]
10:13:09  1 of 2 OK hook: xxx.on-run-start.0 .................................... [OK in 0.00s]
10:13:09  2 of 2 START hook: snowplow_web.on-run-start.0 ................................. [RUN]
10:13:09  2 of 2 OK hook: snowplow_web.on-run-start.0 .................................... [OK in 0.00s]
10:13:09  Concurrency: 1 threads (target='prod')
10:13:09  1 of 7 START incremental model dbt_snowplow_manifest.snowplow_web_base_quarantined_sessions  [RUN]
10:13:12  1 of 7 OK created incremental model dbt_snowplow_manifest.snowplow_web_base_quarantined_sessions  [INSERT 0 0 in 3.29s]
10:13:12  2 of 7 START incremental model dbt_snowplow_manifest.snowplow_web_incremental_manifest  [RUN]
10:13:16  2 of 7 OK created incremental model dbt_snowplow_manifest.snowplow_web_incremental_manifest  [INSERT 0 0 in 3.19s]
10:13:16  3 of 7 START table model dbt_scratch.snowplow_web_base_new_event_limits ........ [RUN]
10:13:17  10:13:17 + Snowplow: New Snowplow incremental model. Backfilling
10:13:17  10:13:17 + Snowplow: Processing data between '2020-01-27 00:00:00' and '2020-02-26 00:00:00'
10:13:19  3 of 7 OK created table model dbt_scratch.snowplow_web_base_new_event_limits ... [SELECT in 3.90s]
10:13:19  4 of 7 START snowplow_incremental model dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest  [RUN]
10:13:26  4 of 7 ERROR creating snowplow_incremental model dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest  [ERROR in 6.11s]
10:13:26  5 of 7 SKIP relation dbt_scratch.snowplow_web_base_sessions_this_run ........... [SKIP]
10:13:26  6 of 7 SKIP relation dbt_scratch.snowplow_web_base_events_this_run ............. [SKIP]
10:13:26  7 of 7 SKIP relation ......... [SKIP]
10:13:27  Running 2 on-run-end hooks
10:13:27  1 of 2 START hook: xxx.on-run-end.0 ................................... [RUN]
10:13:27  1 of 2 OK hook: xxx.on-run-end.0 ...................................... [OK in 0.00s]
10:13:27  2 of 2 START hook: snowplow_web.on-run-end.0 ................................... [RUN]
10:13:27  2 of 2 OK hook: snowplow_web.on-run-end.0 ...................................... [OK in 0.00s]
10:13:27  Finished running 2 incremental models, 3 table models, 2 snowplow_incremental models, 4 hooks in 29.57s.
10:13:27  Completed with 1 error and 0 warnings:
10:13:27  Database Error in model snowplow_web_base_sessions_lifecycle_manifest (models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql)
10:13:27    relation "dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest" does not exist
10:13:27    compiled SQL at target/run/snowplow_web/models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql
10:13:27  Done. PASS=3 WARN=0 ERROR=1 SKIP=3 TOTAL=7

I tried again with the command

dbt run --select +snowplow_web_custom_incremental_model --full-refresh

but no luck either. What am I missing? Why does it tell me the relation is missing even though it's there, and normal dbt runs work just fine?

I am grateful for any help!

Hey Henry, welcome to the Snowplow discourse community!

I'm not sure why this is happening, but can you confirm that the dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest table exists in your database? And what happens if you run the compiled SQL found in target/run/snowplow_web/models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql directly against your database?
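For example, a query along these lines (assuming the default dbt_snowplow_manifest schema) should confirm the relation is reachable from the same database and search path that dbt uses:

-- quick existence/reachability check; schema name assumes the package defaults
select count(*)
from dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest;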

Hopefully the answer to both of these things helps us solve the issue you’re facing!
Let me know,

Hi Emiel,
thanks for the reply! Yes, the table exists:

[2022-07-20 12:37:10] completed in 95 ms
> SELECT t.*
  FROM dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest t
  LIMIT 501
[2022-07-20 12:37:19] 500 rows retrieved starting from 1 in 9 s 202 ms (execution: 9 s 141 ms, fetching: 61 ms)

I can't seem to access the target folder in dbt Cloud in the prod environment, but I can see the compiled SQL in the debug log of the run:

2022-07-19 10:13:24.136924 (Thread-12): 10:13:24  On model.snowplow_web.snowplow_web_base_sessions_lifecycle_manifest: /* {"app": "dbt", "dbt_version": "1.1.1", "profile_name": "user", "target_name": "prod", "node_id": "model.snowplow_web.snowplow_web_base_sessions_lifecycle_manifest"} */


  create table "XXXXX"."dbt_snowplow_manifest"."snowplow_web_base_sessions_lifecycle_manifest"
    diststyle key distkey (session_id)
      compound sortkey(start_tstamp)
  as (

-- Known edge cases:
-- 1: Rare case with multiple domain_userid per session.

with new_events_session_ids as (
  select
    e.domain_sessionid as session_id,
    max(e.domain_userid) as domain_userid, -- Edge case 1: Arbitrary selection to avoid window function like first_value.
    min(e.collector_tstamp) as start_tstamp,
    max(e.collector_tstamp) as end_tstamp

  from "XXXXX"."snowplow_atomic"."events" e

  where
    e.domain_sessionid is not null
    and not exists (select 1 from "XXXXX"."dbt_snowplow_manifest"."snowplow_web_base_quarantined_sessions" as a where a.session_id = e.domain_sessionid) -- don't continue processing v.long sessions
    and e.dvce_sent_tstamp <= dateadd(day, 3, e.dvce_created_tstamp) -- don't process data that's too late
    and e.collector_tstamp >= cast('2020-01-27 00:00:00' as timestamp without time zone)
    and e.collector_tstamp <= cast('2020-02-26 00:00:00' as timestamp without time zone)
    and app_id in ('com.camparound','com.snowplow') --filter on app_id if provided
    and cast(False as boolean) --don't reprocess sessions that have already been processed.

  group by 1
)

, previous_sessions as (
  select *

  from "XXXXX"."dbt_snowplow_manifest"."snowplow_web_base_sessions_lifecycle_manifest"

  where start_tstamp >= cast('2018-01-27 00:00:00' as timestamp without time zone)
    and cast(False as boolean) --don't reprocess sessions that have already been processed.
)

, session_lifecycle as (
  select
    ns.session_id,
    coalesce(self.domain_userid, ns.domain_userid) as domain_userid, -- Edge case 1: Take previous value to keep domain_userid consistent. Not deterministic but performant
    least(ns.start_tstamp, coalesce(self.start_tstamp, ns.start_tstamp)) as start_tstamp,
    greatest(ns.end_tstamp, coalesce(self.end_tstamp, ns.end_tstamp)) as end_tstamp -- BQ 1 NULL will return null hence coalesce
  from new_events_session_ids ns
  left join previous_sessions as self
    on ns.session_id = self.session_id

  where
    self.session_id is null -- process all new sessions
    or self.end_tstamp < dateadd(day, 3, self.start_tstamp) --stop updating sessions exceeding 3 days
)

select
  sl.session_id,
  sl.domain_userid,
  sl.start_tstamp,
  least(dateadd(day, 3, sl.start_tstamp), sl.end_tstamp) as end_tstamp -- limit session length to max_session_days

from session_lifecycle sl
  )
2022-07-19 10:13:25.935680 (Thread-12): 10:13:25  Postgres adapter: Postgres error: relation "dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest" does not exist


Hi Henry,

Apologies for the delay in getting back to you. I hope the issue has been resolved by now, but let me try to help if not.

The reason you are getting the error message is most likely the --full-refresh part of the dbt command. For safety reasons, the manifest tables are not allowed to be dropped unless the variable snowplow__allow_refresh is set to true. If you look at the compiled code you will see a create table SQL script (generated because of the --full-refresh flag), but as the manifest table still exists in your warehouse, the query execution fails.
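For reference, if you ever do want dbt to drop and rebuild the manifest tables, the refresh has to be allowed explicitly through that variable; a sketch of what that could look like, passed inline:

dbt run --select +snowplow_web_custom_incremental_model --full-refresh --vars 'snowplow__allow_refresh: true'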

Now that your custom model has been removed from the incremental manifest table, and given this behaviour, I would suggest manually dropping the custom table, which you would need to reprocess anyway.
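A minimal sketch of that drop, assuming the custom model lives in your dbt_scratch schema under its model name (adjust schema and table name to your setup):

-- hypothetical schema and table name; match them to your project
drop table if exists dbt_scratch.snowplow_web_custom_incremental_model;

Then run the following instead: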

dbt run --select +snowplow_web_custom_incremental_model --vars 'snowplow__start_date: your_backfill_start_date'

As you run this you should see the new model reintroduced to the incremental manifest table and the backfill starting (Snowplow: New Snowplow incremental model. Backfilling). After this you can run the usual dbt run --selector snowplow_web command, and from then onwards syncing will happen until every model is aligned (Snowplow incremental models out of sync. Syncing).
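If you would rather not pass the start date on the command line each time, the same variable can also be set in your dbt_project.yml; a sketch, with a placeholder date for your backfill start:

vars:
  snowplow_web:
    snowplow__start_date: '2020-01-27'  # placeholder; use your actual backfill start date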

I hope this helps. On our side we will have to update the documentation; thank you for letting us know about this issue, and apologies for the inconvenience.

Have a nice weekend,