dbt snowplow_web 0.16.0 package fails on run (Redshift)

Hi there :wave: I’ve been banging my head against the wall trying to get the snowplow_web 0.16.0 package to work in dbt. First, I tried implementing it in our existing dbt project. With no luck there, I created a new project in my local dev environment with only the snowplow_web package (and the associated snowplow_utils and dbt_utils packages), but I still can’t figure out what I’m doing wrong.

Here’s what I’ve got going on:

dbt_project.yml

version: '1.6.3'

vars:
  snowplow_web:
    snowplow__start_date: '2023-08-01'
    snowplow__enable_load_tstamp: false

packages.yml

packages:
  - package: snowplow/snowplow_web
    version: 0.16.0

Here’s what I get when I run the project (CLI):

17:18:47  Running with dbt=1.6.3
17:18:47  Registered adapter: redshift=1.6.1
17:18:47  Found 18 models, 103 tests, 3 seeds, 2 operations, 8 sources, 0 exposures, 0 metrics, 690 macros, 0 groups, 0 semantic models
17:18:47
17:18:53
17:18:53  Running 1 on-run-start hook
17:18:53  1 of 1 START hook: snowplow_web.on-run-start.0 ................................. [RUN]
17:18:53  1 of 1 OK hook: snowplow_web.on-run-start.0 .................................... [OK in 0.00s]
17:18:53
17:18:53  Concurrency: 1 threads (target='dev')
17:18:53
17:18:53  1 of 18 START sql incremental model dbt_rob_snowplow_manifest.snowplow_web_base_quarantined_sessions  [RUN]
17:18:55  1 of 18 OK created sql incremental model dbt_rob_snowplow_manifest.snowplow_web_base_quarantined_sessions  [SUCCESS in 2.15s]
17:18:55  2 of 18 START sql incremental model dbt_rob_snowplow_manifest.snowplow_web_incremental_manifest  [RUN]
17:18:57  2 of 18 OK created sql incremental model dbt_rob_snowplow_manifest.snowplow_web_incremental_manifest  [SUCCESS in 1.89s]
17:18:57  3 of 18 START sql table model dbt_rob_scratch.snowplow_web_base_new_event_limits  [RUN]
17:18:57  13:18:57 + Snowplow: No data in manifest. Processing data from start_date
17:18:58  13:18:58 + Snowplow: Processing data between '2023-08-01 00:00:00' and '2023-08-31 00:00:00' (snowplow_web)
17:18:59  3 of 18 OK created sql table model dbt_rob_scratch.snowplow_web_base_new_event_limits  [SUCCESS in 2.42s]
17:18:59  4 of 18 START sql incremental model dbt_rob_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest  [RUN]
17:19:01  4 of 18 ERROR creating sql incremental model dbt_rob_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest  [ERROR in 1.86s]
17:19:01  5 of 18 SKIP relation dbt_rob_scratch.snowplow_web_base_sessions_this_run ...... [SKIP]
17:19:01  6 of 18 SKIP relation dbt_rob_scratch.snowplow_web_base_events_this_run ........ [SKIP]
17:19:01  7 of 18 SKIP relation dbt_rob_scratch.snowplow_web_pv_engaged_time ............. [SKIP]
17:19:01  8 of 18 SKIP relation dbt_rob_scratch.snowplow_web_pv_scroll_depth ............. [SKIP]
17:19:01  9 of 18 SKIP relation dbt_rob_scratch.snowplow_web_sessions_this_run ........... [SKIP]
17:19:01  10 of 18 SKIP relation dbt_rob_derived.snowplow_web_user_mapping ............... [SKIP]
17:19:01  11 of 18 SKIP relation dbt_rob_scratch.snowplow_web_page_views_this_run ........ [SKIP]
17:19:01  12 of 18 SKIP relation dbt_rob_derived.snowplow_web_sessions ................... [SKIP]
17:19:01  13 of 18 SKIP relation dbt_rob_derived.snowplow_web_page_views ................. [SKIP]
17:19:01  14 of 18 SKIP relation dbt_rob_scratch.snowplow_web_users_sessions_this_run .... [SKIP]
17:19:01  15 of 18 SKIP relation dbt_rob_scratch.snowplow_web_users_aggs ................. [SKIP]
17:19:01  16 of 18 SKIP relation dbt_rob_scratch.snowplow_web_users_lasts ................ [SKIP]
17:19:01  17 of 18 SKIP relation dbt_rob_scratch.snowplow_web_users_this_run ............. [SKIP]
17:19:01  18 of 18 SKIP relation dbt_rob_derived.snowplow_web_users ...................... [SKIP]
17:19:02
17:19:02  Running 1 on-run-end hook
17:19:02  1 of 1 START hook: snowplow_web.on-run-end.0 ................................... [RUN]
17:19:02  1 of 1 OK hook: snowplow_web.on-run-end.0 ...................................... [OK in 0.00s]
17:19:02
17:19:02
17:19:02  Finished running 7 incremental models, 11 table models, 2 hooks in 0 hours 0 minutes and 14.21 seconds (14.21s).
17:19:02
17:19:02  Completed with 1 error and 0 warnings:
17:19:02
17:19:02    Database Error in model snowplow_web_base_sessions_lifecycle_manifest (models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql)
  8001
  compiled Code at target/run/snowplow_web/models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql
17:19:02
17:19:02  Done. PASS=3 WARN=0 ERROR=1 SKIP=14 TOTAL=18

And here’s a bit more context from dbt.log

13:19:00.530087 [debug] [Thread-1  ]: On model.snowplow_web.snowplow_web_base_sessions_lifecycle_manifest: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "snowplow", "target_name": "dev", "node_id": "model.snowplow_web.snowplow_web_base_sessions_lifecycle_manifest"} */
  CREATE TABLE "postmarkbi"."dbt_rob_snowplow_manifest"."snowplow_web_base_sessions_lifecycle_manifest__dbt_tmp" diststyle KEY distkey (
	session_identifier) compound sortkey (
	start_tstamp)
AS (
	WITH new_events_session_ids_init AS (
		SELECT
			COALESCE(e.domain_sessionid,
				NULL) AS session_identifier,
			max(COALESCE(e.domain_userid,
					NULL)) AS user_identifier,
			-- Edge case 1: Arbitary selection to avoid window function like first_value.
			min(collector_tstamp) AS start_tstamp,
			max(collector_tstamp) AS end_tstamp
		FROM
			"postmarkbi"."snowcatcloud"."events" e
		WHERE
			dvce_sent_tstamp <= dateadd (day,
				3,
				dvce_created_tstamp)
			-- don't process data that's too late
			AND collector_tstamp >= cast('2023-08-01 00:00:00' AS TIMESTAMP)
			AND collector_tstamp <= cast('2023-08-31 00:00:00' AS TIMESTAMP)
			AND TRUE
			AND TRUE --don't reprocess sessions that have already been processed.
		GROUP BY
			1),
		new_events_session_ids AS (
			SELECT
				*
			FROM
				new_events_session_ids_init e
			WHERE
				session_identifier IS NOT NULL
				AND NOT EXISTS (
					SELECT
						1
					FROM
						"postmarkbi"."dbt_rob_snowplow_manifest"."snowplow_web_base_quarantined_sessions" AS a
					WHERE
						a.session_identifier = e.session_identifier) -- don't continue processing v.long sessions
),
				session_lifecycle AS (
					SELECT
						*
					FROM
						new_events_session_ids
)
				SELECT
					sl.session_identifier,
					sl.user_identifier,
					sl.start_tstamp,
					least(dateadd (day,
							3,
							sl.start_tstamp),
						sl.end_tstamp) AS end_tstamp -- limit session length to max_session_days
				FROM
					session_lifecycle sl
);
13:19:01.460295 [debug] [Thread-1  ]: Redshift adapter: Redshift error: 8001
13:19:01.461518 [debug] [Thread-1  ]: On model.snowplow_web.snowplow_web_base_sessions_lifecycle_manifest: ROLLBACK
13:19:01.573067 [debug] [Thread-1  ]: Timing info for model.snowplow_web.snowplow_web_base_sessions_lifecycle_manifest (execute): 13:19:00.413343 => 13:19:01.572301
13:19:01.574273 [debug] [Thread-1  ]: On model.snowplow_web.snowplow_web_base_sessions_lifecycle_manifest: Close
13:19:01.579748 [debug] [Thread-1  ]: Database Error in model snowplow_web_base_sessions_lifecycle_manifest (models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql)
  8001
  compiled Code at target/run/snowplow_web/models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql
13:19:01.581264 [error] [Thread-1  ]: 4 of 18 ERROR creating sql incremental model dbt_rob_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest  [ERROR in 1.86s]

I’ve tried v0.15 and v0.14 of the package as well, and ran into similar errors there. I haven’t been able to get this to run successfully even once, and it’s in its own isolated (fresh) dbt project. What am I missing/doing wrong? :sweat_smile:

To be honest, I don’t recognise this error outside of data loads, but this is a query. Given that it’s not a particularly verbose error message, I’d consider reaching out to AWS support to ask what an 8001 actually means.

Another option that may give you additional details is to run that query directly in Redshift and see if it returns a more descriptive error message; unfortunately the Redshift Python adapter is pretty poor at surfacing error details.

Here’s the detail I get when running the query directly in Redshift:

Query 1 ERROR: ERROR: 8001
DETAIL: The combined length of columns processed in the SQL statement exceeded the query-processing limit of 65535 characters

Isn’t it strange that I’m getting this error (or any error) running this package in a clean dbt project with no other packages (aside from snowplow_web’s dependencies) or models?

Yes, it is odd, especially since you’ve tried older versions of the package as well. I assume you’ve done a full refresh for each version of the package you’ve tried?
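(For reference, a full refresh of just the package’s models can be run with something like the following. The `snowplow__allow_refresh` var guards the Snowplow manifest tables against accidental drops; check the package docs for the exact var name in your version.)

```
dbt run --select snowplow_web --full-refresh --vars '{snowplow__allow_refresh: true}'
```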

I wonder if something weird has happened with the loader, do any of your atomic fields have a varchar(max) length (65535)?
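(A quick way to check is to query Redshift’s column catalog for the declared lengths of the character columns in your atomic events table. The schema and table names below match the log output above; adjust them for your warehouse.)

```sql
-- List character columns in the atomic events table, widest first.
select column_name,
       character_maximum_length
from svv_columns
where table_schema = 'snowcatcloud'
  and table_name   = 'events'
  and character_maximum_length is not null
order by character_maximum_length desc;
```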

@robk I’ve got a theory about this, can you confirm what type the domain_userid field is in your events table in redshift please?

hey @Ryan, domain_userid is type character varying(128)

Not sure if this is related to your theory, but I was able to trace the issue back to the base_create_snowplow_quarantined_sessions.sql macro in snowplow_utils. That macro references another macro, snowplow_utils.type_max_string.

That last macro was setting the session_id field in the snowplow_web_base_quarantined_sessions table to character varying(65535). By updating the type_max_string macro to return a length below the max (I arbitrarily chose character varying(6000)), I was able to get the package to run successfully.
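(For anyone hitting the same thing: one way to apply that workaround without editing the installed package is to shadow the macro in your own project and point dbt’s dispatch at it. The macro name and `default__` prefix below are assumptions based on dbt’s dispatch convention; check the snowplow_utils source for the exact signature before relying on this.)

```sql
-- macros/type_max_string.sql in your own project.
-- Returns a shorter max-string type so the combined column width
-- stays under Redshift's 65535-character query-processing limit.
{% macro default__type_max_string() %}
    varchar(6000)
{% endmacro %}
```

Then, in dbt_project.yml, add a `dispatch` entry for `macro_namespace: snowplow_utils` with your own project first in `search_order`, so your version is picked up instead of the package’s.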

Oh nice, thanks! I’d guess the issue is in the subquery we use that table in. We’re working on a few other fixes for a new version, so we’ll add that in there.