Snowplow_web DBT: column "collector_tstamp" does not exist

Hello All,

I am failing to complete the job for snowplow_web_sessions. The error appears to be that a column does not exist when trying to do an insert from a tmp table into “atomic_derived”.“snowplow_web_sessions”. However in our debugging we have discovered that this column (along with 4 others, SHOULD NOT exist) in the origin table but are being added.

In order to fix the issue, we have removed the following 5 columns from “atomic_derived”.“snowplow_web_sessions”:
collector_tstamp
derived_tstamp
dvce_created_tstamp
page_view_in_session_index
page_views_in_session

The run immediately after will succeed, but subsequent runs will continue to fail because this columns appear to be regenerated by Snowplow_web dbt package.

My team and I have hit a wall with this issue, and I am looking for more information or assistance on how to fix. Any help is appreciated. Additional information below.

Thank you!

17:54:31  19 of 24 ERROR creating snowplow_incremental model atomic_derived.snowplow_web_sessions  [ERROR in 1.27s]
17:54:31  20 of 24 SKIP relation atomic_scratch.snowplow_web_users_sessions_this_run ..... [SKIP]
17:54:31  21 of 24 SKIP relation atomic_scratch.snowplow_web_users_aggs .................. [SKIP]
17:54:31  22 of 24 SKIP relation atomic_scratch.snowplow_web_users_lasts ................. [SKIP]
17:54:31  23 of 24 SKIP relation atomic_scratch.snowplow_web_users_this_run .............. [SKIP]
17:54:31  24 of 24 SKIP relation atomic_derived.snowplow_web_users ....................... [SKIP]
17:54:31
17:54:31  Running 1 on-run-end hook
17:54:31  1 of 1 START hook: snowplow_web.on-run-end.0 ................................... [RUN]
17:54:31  1 of 1 OK hook: snowplow_web.on-run-end.0 ...................................... [DROP TABLE in 0.08s]
17:54:31
17:54:31
17:54:31  Finished running 3 incremental models, 17 table models, 4 snowplow_incremental models, 2 hooks in 17.47s.
17:54:32
17:54:32  Completed with 1 error and 0 warnings:
17:54:32
17:54:32  Database Error in model snowplow_web_sessions (models/sessions/snowplow_web_sessions.sql)
17:54:32    column "collector_tstamp" does not exist
17:54:32    LINE 29: ...system_name_version", "operating_system_version", "collector...
17:54:32                                                                  ^
17:54:32    HINT:  There is a column named "collector_tstamp" in table "snowplow_web_sessions", but it cannot be referenced from this part of the query.
17:54:32    compiled SQL at target/run/snowplow_web/models/sessions/snowplow_web_sessions.sql

Error comes from …/dbt_packages/snowplow_web/models/sessions/snowplow_web_sessions.sql

    insert into "enriched"."atomic_derived"."snowplow_web_sessions" ("app_id", "domain_sessionid", "domain_sessionidx", "start_tstamp", "end_tstamp", "model_tstamp", "user_id", "domain_userid", "stitched_user_id", "network_userid", "page_views", "engaged_time_in_s", "absolute_time_in_s", "first_page_title", "first_page_url", "first_page_urlscheme", "first_page_urlhost", "first_page_urlpath", "first_page_urlquery", "first_page_urlfragment", "last_page_title", "last_page_url", "last_page_urlscheme", "last_page_urlhost", "last_page_urlpath", "last_page_urlquery", "last_page_urlfragment", "referrer", "refr_urlscheme", "refr_urlhost", "refr_urlpath", "refr_urlquery", "refr_urlfragment", "refr_medium", "refr_source", "refr_term", "mkt_medium", "mkt_source", "mkt_term", "mkt_content", "mkt_campaign", "mkt_clickid", "mkt_network", "geo_country", "geo_region", "geo_region_name", "geo_city", "geo_zipcode", "geo_latitude", "geo_longitude", "geo_timezone", "user_ipaddress", "useragent", "br_renderengine", "br_lang", "os_timezone", "category", "primary_impact", "reason", "spider_or_robot", "useragent_family", "useragent_major", "useragent_minor", "useragent_patch", "useragent_version", "os_family", "os_major", "os_minor", "os_patch", "os_patch_minor", "os_version", "device_family", "device_class", "agent_class", "agent_name", "agent_name_version", "agent_name_version_major", "agent_version", "agent_version_major", "device_brand", "device_name", "device_version", "layout_engine_class", "layout_engine_name", "layout_engine_name_version", "layout_engine_name_version_major", "layout_engine_version", "layout_engine_version_major", "operating_system_class", "operating_system_name", "operating_system_name_version", "operating_system_version", "collector_tstamp", "derived_tstamp", "dvce_created_tstamp", "page_view_in_session_index", "page_views_in_session")
    (
        select "app_id", "domain_sessionid", "domain_sessionidx", "start_tstamp", "end_tstamp", "model_tstamp", "user_id", "domain_userid", "stitched_user_id", "network_userid", "page_views", "engaged_time_in_s", "absolute_time_in_s", "first_page_title", "first_page_url", "first_page_urlscheme", "first_page_urlhost", "first_page_urlpath", "first_page_urlquery", "first_page_urlfragment", "last_page_title", "last_page_url", "last_page_urlscheme", "last_page_urlhost", "last_page_urlpath", "last_page_urlquery", "last_page_urlfragment", "referrer", "refr_urlscheme", "refr_urlhost", "refr_urlpath", "refr_urlquery", "refr_urlfragment", "refr_medium", "refr_source", "refr_term", "mkt_medium", "mkt_source", "mkt_term", "mkt_content", "mkt_campaign", "mkt_clickid", "mkt_network", "geo_country", "geo_region", "geo_region_name", "geo_city", "geo_zipcode", "geo_latitude", "geo_longitude", "geo_timezone", "user_ipaddress", "useragent", "br_renderengine", "br_lang", "os_timezone", "category", "primary_impact", "reason", "spider_or_robot", "useragent_family", "useragent_major", "useragent_minor", "useragent_patch", "useragent_version", "os_family", "os_major", "os_minor", "os_patch", "os_patch_minor", "os_version", "device_family", "device_class", "agent_class", "agent_name", "agent_name_version", "agent_name_version_major", "agent_version", "agent_version_major", "device_brand", "device_name", "device_version", "layout_engine_class", "layout_engine_name", "layout_engine_name_version", "layout_engine_name_version_major", "layout_engine_version", "layout_engine_version_major", "operating_system_class", "operating_system_name", "operating_system_name_version", "operating_system_version", "collector_tstamp", "derived_tstamp", "dvce_created_tstamp", "page_view_in_session_index", "page_views_in_session"
        from "snowplow_web_sessions__dbt_tmp105430605981"
    );

The tmp table seems to be generated by the follow:

snowplow_web_sessions_aggs
snowplow_web_page_views_this_run
snowplow_web_sessions_lasts

From the following joins:

	from "enriched"."atomic_scratch"."snowplow_web_sessions_aggs" as b
	
	inner join "enriched"."atomic_scratch"."snowplow_web_page_views_this_run" as a
	on a.domain_sessionid = b.domain_sessionid
	and a.start_tstamp = b.start_tstamp
	and a.page_view_in_session_index = 1
	
	inner join "enriched"."atomic_scratch"."snowplow_web_sessions_lasts" c
	on b.domain_sessionid = c.domain_sessionid

We are stumped as to why the dbt snowplow package seems to deliberately and these columns back but then complain when they are there or should not be there.

Thank you again!

Hey @wambam,

Sorry about the delay on a response to this! We were chatting internally and are all slightly confused as to what is going on with your setup and what could be causing issues. Just before we dive any further, what led your team to believe that the mentioned columns should not exist in the origin table?
I’m not sure why that would be the case as all of those columns should be tracked/generated by the trackers and by the dbt package. Are you not seeing the collector_tstamp, derived_tstamp and dvce_created_tstamp columns in your atomic.events table?

Looking forward to hearing from you so I can help you to fix this issue!
Thanks,
Emiel

Hello @Emiel,

We discovered the idea that those columns might not be required by walking back up from the “snowplow_web_sessions.sql” model. This lead to the tmp table: “snowplow_web_sessions__dbt_tmp140544927942” that was created from “enriched”.“atomic_scratch”.“snowplow_web_sessions_this_run”.

We then referenced - dbt Docs and saw that the “collector_tstamp” was not present. We tried dropping it and rerunning, this produced a similar error but for the other fields: derived_tstamp, etc. This lead us to dropping the five columns:
collector_tstamp
derived_tstamp
dvce_created_tstamp
page_view_in_session_index
page_views_in_session

This produced a successful run, but then failed again on subsequent runs.

Yes I am seeing collector_tstamp , derived_tstamp and dvce_created_tstamp columns in your atomic.events table.

Thank you for your assistance in this!

Hey @wambam,

Thanks for the extra context, I think I misunderstood the original post. I’ve been looking at the dbt SQL in snowplow_web_sessions_this_run and I can’t see where those columns would be generated in the snowplow_web_sessions_this_run table, even in previous versions of the dbt packages. Those fields seem to be corresponding to fields in the snowplow_web_page_views_this_run and snowplow_web_page_views models. Are you generating/leveraging custom models? What version of the web package are you running and on what warehouse? Is it Postgres? What happens when you drop and re-run the sessions tables completely? I’m quite confused as to how you had those columns being produced in the sessions table.

Thanks,
Emiel

Hey there!

We’re currently running the following packages for dbt-postgres==1.1.1:

packages:
  - package: dbt-labs/snowplow
    version: 0.14.0
  - package: snowplow/snowplow_web
    version: 0.7.0

Nothing custom. This is all out-of-the-box.

However, we did find this in the logs where you can see that it is specifically inserting those fields:

	  -- run the delete+insert statement
	  delete from "enriched"."atomic_derived"."snowplow_web_sessions"
	    where (domain_sessionid) in (
	        select (domain_sessionid)
	        from "snowplow_web_sessions__dbt_tmp141647564936"
	    )
	     and start_tstamp between (select lower_limit from vars) and (select upper_limit from vars) ;
	
	    insert into "enriched"."atomic_derived"."snowplow_web_sessions" ("app_id", "domain_sessionid", "domain_sessionidx", "start_tstamp", "end_tstamp", "model_tstamp", "user_id", "domain_userid", "stitched_user_id", "network_userid", "page_views", "engaged_time_in_s", "absolute_time_in_s", "first_page_title", "first_page_url", "first_page_urlscheme", "first_page_urlhost", "first_page_urlpath", "first_page_urlquery", "first_page_urlfragment", "last_page_title", "last_page_url", "last_page_urlscheme", "last_page_urlhost", "last_page_urlpath", "last_page_urlquery", "last_page_urlfragment", "referrer", "refr_urlscheme", "refr_urlhost", "refr_urlpath", "refr_urlquery", "refr_urlfragment", "refr_medium", "refr_source", "refr_term", "mkt_medium", "mkt_source", "mkt_term", "mkt_content", "mkt_campaign", "mkt_clickid", "mkt_network", "geo_country", "geo_region", "geo_region_name", "geo_city", "geo_zipcode", "geo_latitude", "geo_longitude", "geo_timezone", "user_ipaddress", "useragent", "br_renderengine", "br_lang", "os_timezone", "category", "primary_impact", "reason", "spider_or_robot", "useragent_family", "useragent_major", "useragent_minor", "useragent_patch", "useragent_version", "os_family", "os_major", "os_minor", "os_patch", "os_patch_minor", "os_version", "device_family", "device_class", "agent_class", "agent_name", "agent_name_version", "agent_name_version_major", "agent_version", "agent_version_major", "device_brand", "device_name", "device_version", "layout_engine_class", "layout_engine_name", "layout_engine_name_version", "layout_engine_name_version_major", "layout_engine_version", "layout_engine_version_major", "operating_system_class", "operating_system_name", "operating_system_name_version", "operating_system_version", "collector_tstamp", "derived_tstamp", "dvce_created_tstamp", "page_view_in_session_index", "page_views_in_session")
	    (
	        select "app_id", "domain_sessionid", "domain_sessionidx", "start_tstamp", "end_tstamp", "model_tstamp", "user_id", "domain_userid", "stitched_user_id", "network_userid", "page_views", "engaged_time_in_s", "absolute_time_in_s", "first_page_title", "first_page_url", "first_page_urlscheme", "first_page_urlhost", "first_page_urlpath", "first_page_urlquery", "first_page_urlfragment", "last_page_title", "last_page_url", "last_page_urlscheme", "last_page_urlhost", "last_page_urlpath", "last_page_urlquery", "last_page_urlfragment", "referrer", "refr_urlscheme", "refr_urlhost", "refr_urlpath", "refr_urlquery", "refr_urlfragment", "refr_medium", "refr_source", "refr_term", "mkt_medium", "mkt_source", "mkt_term", "mkt_content", "mkt_campaign", "mkt_clickid", "mkt_network", "geo_country", "geo_region", "geo_region_name", "geo_city", "geo_zipcode", "geo_latitude", "geo_longitude", "geo_timezone", "user_ipaddress", "useragent", "br_renderengine", "br_lang", "os_timezone", "category", "primary_impact", "reason", "spider_or_robot", "useragent_family", "useragent_major", "useragent_minor", "useragent_patch", "useragent_version", "os_family", "os_major", "os_minor", "os_patch", "os_patch_minor", "os_version", "device_family", "device_class", "agent_class", "agent_name", "agent_name_version", "agent_name_version_major", "agent_version", "agent_version_major", "device_brand", "device_name", "device_version", "layout_engine_class", "layout_engine_name", "layout_engine_name_version", "layout_engine_name_version_major", "layout_engine_version", "layout_engine_version_major", "operating_system_class", "operating_system_name", "operating_system_name_version", "operating_system_version", "collector_tstamp", "derived_tstamp", "dvce_created_tstamp", "page_view_in_session_index", "page_views_in_session"
	        from "snowplow_web_sessions__dbt_tmp141647564936"
	    );```

Hey @Katie62442,

Thanks for the clarification! Can I ask why you’re using dbt-lab’s Snowplow package together with our own? They’re not very compatible and both try to achieve the same things so I’m curious why you chose to use both, maybe there’s something missing in our package that dbt’s own package has that we should leverage! Can you share the variables you have set in your dbt_project.yml as well please?

Yes, thanks for providing the logs! The code that we’ve written identifies all of the columns in the original relation (so snowplow_web_sessions_this_run in this case), takes all of those columns and then tries to insert those into the new relation (so snowplow_web_sessions). Can you see the collector_tstamp etc columns in your snowplow_web_sessions_this_run table? If you do, I would say you should drop that table and see what happens when you re-run but it’s a drop + recompute table so I don’t think that would help. What’s the compiled/logged SQL for the snowplow_web_sessions_this_run table? If you look at the GitHub code for v0.7.0 then you can see that those columns shouldn’t exist in the snowplow_web_sessions_this_run sql so that would be very worrying if that were the case.

As you can see in the utils package where we define the custom snowplow_incremental materialization here we use the expand_target_column_types and get_columns_in_relation functionality from dbt’s adapter object which takes the column types from a specific relation and copies them over to another relation (used for inserting into a table). So looking at the code again I’m not sure how the columns were added to the snowplow_web_sessions column originally, but it seems whenever they are erroneously there the incremental logic will keep throwing errors. You also said that after dropping the columns from the snowplow_web_sessions table you found that the next run was successful but subsequent runs failed for the same reason, which must mean that the package seems to be inserting those columns into the table, but honestly I’m failing to see where in the code that’s coming from. Have you tried changing your schema value in your profiles.yml and re-running the whole thing fresh? Maybe throwing in a dbt clean && dbt deps beforehand for good measure to flush out the package would help as well.

I hope some of these strategies might help bring more clarification and if you could let me know about your dbt_project.yml configuration as well as why you’re also leveraging the dbt-labs Snowplow package that would be great, thanks so much!

4 Likes

Hello @Emiel ! Thanks for all the help!

I can’t really say why I was using the two packages, best I can tell now is that I think I must have looked at two different getting started docs. and I thought they were related when they really were not.

Here is our dbt_project.yml file (It is pretty stock):

name: 'fpdm_snowplow_web'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'fpdm_snowplow_web'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
# models:
#   snowplow_web

vars:
  snowplow_web:
    snowplow__enable_iab: false
    snowplow__enable_ua: true
    snowplow__enable_yauaa: true
    snowplow__start_date: '2022-06-12'
  'snowplow:context:performance_timing': true
  'snowplow:context:useragent': true

“Can you see the collector_tstamp etc columns in your snowplow_web_sessions_this_run table?” - No I cannot. So that is good!

I have tried adding dbt clean && dbt deps and I am working on testing the new schema value in profiles.yml, but so far I am still seeing the same issue.

Thanks for the assist, I am slowly integrating thought the issue!

Adding this on the bottom since it is so long:
“What’s the compiled/logged SQL for the snowplow_web_sessions_this_run table?”

21:05:43.915150 [debug] [Thread-3  ]: On model.snowplow_web.snowplow_web_sessions_this_run: /* {"app": "dbt", "dbt_version": "1.1.0", "profile_name": "fpdm_snowplow_web", "target_name": "dev", "node_id": "model.snowplow_web.snowplow_web_sessions_this_run"} */

    


  create  table "enriched"."atomic_scratch"."snowplow_web_sessions_this_run__dbt_tmp"
  as (
    


select
  -- app id
  a.app_id,

  -- session fields
  a.domain_sessionid,
  a.domain_sessionidx,

  b.start_tstamp,
  b.end_tstamp,
  
    (current_timestamp at time zone 'utc')::
    timestamp without time zone

 as model_tstamp,

  -- user fields
  a.user_id,
  a.domain_userid,

  
    -- updated with mapping as part of post hook on derived sessions table
    cast(a.domain_userid as 
    varchar( 255 )
) as stitched_user_id,
  
  
  a.network_userid,

  -- engagement fields
  b.page_views,
  b.engaged_time_in_s,
  
        (
        (
        (
        ((b.end_tstamp)::date - (b.start_tstamp)::date)
     * 24 + date_part('hour', (b.end_tstamp)::timestamp) - date_part('hour', (b.start_tstamp)::timestamp))
     * 60 + date_part('minute', (b.end_tstamp)::timestamp) - date_part('minute', (b.start_tstamp)::timestamp))
     * 60 + floor(date_part('second', (b.end_tstamp)::timestamp)) - floor(date_part('second', (b.start_tstamp)::timestamp)))
     as absolute_time_in_s,

  -- first page fields
  a.page_title as first_page_title,

  a.page_url as first_page_url,

  a.page_urlscheme as first_page_urlscheme,
  a.page_urlhost as first_page_urlhost,
  a.page_urlpath as first_page_urlpath,
  a.page_urlquery as first_page_urlquery,
  a.page_urlfragment as first_page_urlfragment,

  c.last_page_title,

  c.last_page_url,

  c.last_page_urlscheme,
  c.last_page_urlhost,
  c.last_page_urlpath,
  c.last_page_urlquery,
  c.last_page_urlfragment,

  -- referrer fields
  a.page_referrer as referrer,

  a.refr_urlscheme,
  a.refr_urlhost,
  a.refr_urlpath,
  a.refr_urlquery,
  a.refr_urlfragment,

  a.refr_medium,
  a.refr_source,
  a.refr_term,

  -- marketing fields
  a.mkt_medium,
  a.mkt_source,
  a.mkt_term,
  a.mkt_content,
  a.mkt_campaign,
  a.mkt_clickid,
  a.mkt_network,

  -- geo fields
  a.geo_country,
  a.geo_region,
  a.geo_region_name,
  a.geo_city,
  a.geo_zipcode,
  a.geo_latitude,
  a.geo_longitude,
  a.geo_timezone,

  -- ip address
  a.user_ipaddress,

  -- user agent
  a.useragent,

  a.br_renderengine,
  a.br_lang,

  a.os_timezone,

  -- optional fields, only populated if enabled.

  -- iab enrichment fields
  a.category,
  a.primary_impact,
  a.reason,
  a.spider_or_robot,

  -- ua parser enrichment fields
  a.useragent_family,
  a.useragent_major,
  a.useragent_minor,
  a.useragent_patch,
  a.useragent_version,
  a.os_family,
  a.os_major,
  a.os_minor,
  a.os_patch,
  a.os_patch_minor,
  a.os_version,
  a.device_family,

  -- yauaa enrichment fields
  a.device_class,
  a.agent_class,
  a.agent_name,
  a.agent_name_version,
  a.agent_name_version_major,
  a.agent_version,
  a.agent_version_major,
  a.device_brand,
  a.device_name,
  a.device_version,
  a.layout_engine_class,
  a.layout_engine_name,
  a.layout_engine_name_version,
  a.layout_engine_name_version_major,
  a.layout_engine_version,
  a.layout_engine_version_major,
  a.operating_system_class,
  a.operating_system_name,
  a.operating_system_name_version,
  a.operating_system_version

from "enriched"."atomic_scratch"."snowplow_web_sessions_aggs" as b

inner join "enriched"."atomic_scratch"."snowplow_web_page_views_this_run" as a
on a.domain_sessionid = b.domain_sessionid
and a.start_tstamp = b.start_tstamp
and a.page_view_in_session_index = 1

inner join "enriched"."atomic_scratch"."snowplow_web_sessions_lasts" c
on b.domain_sessionid = c.domain_sessionid
  );
21:05:43.994766 [debug] [Thread-3  ]: SQL status: SELECT 4 in 0.08 seconds

Hey @wambam,

Thanks for sharing all of this info. Yes so it looks like in your logged sql for snowplow_web_sessions_this_run we’re not seeing any collector_tstamp either which is great. Can you also share the logged SQL around your snowplow_web_sessions, I’m expecting to see the collector_tstamp in the snowplow_web_sessions__dbt_tmpXYZ table for this error to keep occurring. Also have you since removed the dbt-labs/snowplow package from your packages.yml before running dbt clean && dbt deps?

I’m curious to hear whether having the dbt-labs package interferes in some way with our package.

Thanks,
Emiel

Sorry for the late return, however I wanted to comment for any other users that find this. We were not able to identify or correct this issue with advice listed, however we ended up making a custom model that would just remove these columns when dbt runs before our other models complete.

Thanks!

1 Like