DBT package gets stuck

I’m using the dbt snowplow_web package. Unfortunately it gets stuck at model 9 and doesn’t continue (I let it run the whole night but it never finishes - CPU on the RDS machine is at 50% so I guess it hangs somewhere?!)

Using it on AWS / RDS / Postgres, with snowplow__enable_load_tstamp: false

Output looks like this:

(.venv) moon:dbt-snowplow-web mo$ dbt run --selector snowplow_web
07:58:22  Running with dbt=1.6.6
07:58:22  Registered adapter: postgres=1.6.6
07:58:22  Found 18 models, 103 tests, 3 seeds, 2 operations, 8 sources, 0 exposures, 0 metrics, 645 macros, 0 groups, 0 semantic models
07:58:25  Running 1 on-run-start hook
07:58:25  1 of 1 START hook: snowplow_web.on-run-start.0 ................................. [RUN]
07:58:25  1 of 1 OK hook: snowplow_web.on-run-start.0 .................................... [OK in 0.00s]
07:58:25  Concurrency: 1 threads (target='dev')
07:58:25  1 of 18 START sql incremental model skdbt_snowplow_manifest.snowplow_web_base_quarantined_sessions  [RUN]
07:58:25  1 of 18 OK created sql incremental model skdbt_snowplow_manifest.snowplow_web_base_quarantined_sessions  [SELECT 0 in 0.41s]
07:58:25  2 of 18 START sql incremental model skdbt_snowplow_manifest.snowplow_web_incremental_manifest  [RUN]
07:58:26  2 of 18 OK created sql incremental model skdbt_snowplow_manifest.snowplow_web_incremental_manifest  [SELECT 0 in 0.37s]
07:58:26  3 of 18 START sql table model skdbt_scratch.snowplow_web_base_new_event_limits . [RUN]
07:58:26  08:58:26 + Snowplow: No data in manifest. Processing data from start_date
07:58:26  08:58:26 + Snowplow: Processing data between '2023-11-01 00:00:00' and '2023-11-08 07:58:26' (snowplow_web)
07:58:26  3 of 18 OK created sql table model skdbt_scratch.snowplow_web_base_new_event_limits  [SELECT 1 in 0.52s]
07:58:26  4 of 18 START sql incremental model skdbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest  [RUN]
07:58:29  4 of 18 OK created sql incremental model skdbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest  [SELECT 46529 in 2.57s]
07:58:29  5 of 18 START sql table model skdbt_scratch.snowplow_web_base_sessions_this_run  [RUN]
07:58:30  5 of 18 OK created sql table model skdbt_scratch.snowplow_web_base_sessions_this_run  [SELECT 46529 in 0.73s]
07:58:30  6 of 18 START sql table model skdbt_scratch.snowplow_web_base_events_this_run .. [RUN]
07:58:42  6 of 18 OK created sql table model skdbt_scratch.snowplow_web_base_events_this_run  [SELECT 71184 in 12.33s]
07:58:42  7 of 18 START sql table model skdbt_scratch.snowplow_web_pv_engaged_time ....... [RUN]
07:58:43  7 of 18 OK created sql table model skdbt_scratch.snowplow_web_pv_engaged_time .. [SELECT 0 in 1.20s]
07:58:43  8 of 18 START sql table model skdbt_scratch.snowplow_web_pv_scroll_depth ....... [RUN]
07:58:45  8 of 18 OK created sql table model skdbt_scratch.snowplow_web_pv_scroll_depth .. [SELECT 65540 in 1.72s]
07:58:45  9 of 18 START sql table model skdbt_scratch.snowplow_web_sessions_this_run ..... [RUN]
^C08:10:56  CANCEL query model.snowplow_web.snowplow_web_sessions_this_run ................. [CANCEL]
08:10:56  9 of 18 ERROR creating sql table model skdbt_scratch.snowplow_web_sessions_this_run  [ERROR in 731.34s]

If I use the generated command from target/run/snowplow_web/models/sessions/scratch/default/snowplow_web_sessions_this_run.sql and execute it directly in my postgres terminal, it does work within a minute or so!

What could be the problem? And how can I skip step 9 so I can do that step manually and then let the dbt run command continue?

Also: I tried to do dbt compile --select snowplow_web to get all the commands it would execute, so I could do it manually in my postgres terminal. Unfortunately here also, I get an error:

(.venv) moon:dbt-snowplow-web mo$ dbt compile --select snowplow_web
09:26:55  Running with dbt=1.6.6
09:26:55  Registered adapter: postgres=1.6.6
09:26:55  Found 18 models, 103 tests, 3 seeds, 2 operations, 8 sources, 0 exposures, 0 metrics, 645 macros, 0 groups, 0 semantic models
09:26:56  Concurrency: 1 threads (target='dev')
09:26:56  Snowplow Warning: "snowplow"."skdbt_snowplow_manifest"."snowplow_web_incremental_manifest" does not exist. This is expected if you are compiling a fresh installation of the dbt-snowplow-* packages.
09:26:56  10:26:56 + Snowplow: No data in manifest. Processing data from start_date
09:26:56  Snowplow Warning: "snowplow"."skdbt_scratch"."snowplow_web_base_new_event_limits" does not exist. This is expected if you are compiling a fresh installation of the dbt-snowplow-* packages.
09:26:57  Snowplow Warning: "snowplow"."skdbt_scratch"."snowplow_web_base_new_event_limits" does not exist. This is expected if you are compiling a fresh installation of the dbt-snowplow-* packages.
09:26:57  Snowplow Warning: "snowplow"."skdbt_scratch"."snowplow_web_base_sessions_this_run" does not exist. This is expected if you are compiling a fresh installation of the dbt-snowplow-* packages.
09:26:58  Encountered an error:
Runtime Error
  Database Error in model snowplow_web_base_events_this_run (models/base/scratch/default/snowplow_web_base_events_this_run.sql)
    relation "skdbt_scratch.snowplow_web_base_sessions_this_run" does not exist
    LINE 288:             inner join "snowplow"."skdbt_scratch"."snowplow_...

Update: I did a run with the snowplow__start_date: set to yesterday. So it only had to process one day. This works (takes still rather long though - I dont have too much data in my database). I took the commands from that run (dbt.log) and tried to use it again with my original start_data (2023-10-20) - by executing it step by step in my postgres terminal. This worked. However, when I run dbt run --select snowplow_web again, it gets stuck again (even though it should only process the newest events incrementally, if I understand correctly). I also put keepalives_idle: 10 in my profiles.yaml as suggested here - but it doesn’t change anything. When I check the activity in my pgAdmin Dashboard I get this:

Hi @Moritz_Bauer , what specs are you running your postgres instance on? It’s a little outside my experience (I just use the warehouse, scaling was the DBAs problem :sweat_smile:), but it may be that there are resource constraints on your warehouse and the way dbt manages transactions and temp storage.

In particular the sessions table (for postgres/redshift) can have quite a few CTEs, while they are more efficient that subqueries, could maybe lead to a lot of I/O overhead where it has to store and fetch them from memory a lot. Again, not my area, but given the message in your pgadmin it does seem like there is overflow from in-memory to storage and back again. I/O will also explain the constant but not fully utilised CPU.

Just to speculate even more wildly, I believe postgres is row-based rather than columnar like most modern cloud warehouses, so while we try to be efficient in our queries there may be some things we’ve missed that would be postgres specific. In addition I’m not sure if the Snowplow events and shredded event/entity tables on postgres are created with index/partitions that may be beneficial.

As to why it works when you run it manually, I’m not too sure, but I would guess dbt does something slightly different or the connection is treated differently.

What version of the package are you using? We made a large change in version 0.16.0 that moves all the entity joins up front which should have optimised that model.

Hi @Ryan

thanks a lot for your input! I was initially using the micro instance on AWS with 1 GB of RAM, 10 GB of disk space and 2 vCPU. I then upgraded to the biggest instance available (db.t4g.2xlarge) with 8 vCPU and 32 GB of memory + upgraded to 20 GB of disk space. The first queries ran through faster of course but still no luck when it came to the scratch.snowplow_web_sessions_this_run model. It was also hanging there.

So it must be something with postgres + dbt. I guess there’s a reason why the documentation warns against using postgres in production… I’ll have a look into how to migrate to a different database, hopefully that will fix it.

In the mean time I did find a workaround on how to get it running. Here’s my procedure:

  1. Seed the initial data with dbt seed --select snowplow_web --full-refresh
  2. Run the package with snowplow__start_date: set to today, so only the smallest amount of data has to be processed (this takes some time but does work)
  3. Copy the dbt.log file so you have all the commands that you need for when you execute it later directly in the terminal - also take note of all the models present in snowplow_manifest.snowplow_web_incremental_manifest - you’ll need them in step 10
  4. Delete the tables created by dbt
  5. Put snowplow__start_date: to your preferred date
  6. Let the model run until it hangs, kill command with CTRL-C
  7. Go to logfile and copy the last command, execute directly in your terminal
  8. Run dbt again but exclude the models that have already been executed: dbt run --exclude snowplow_web_base_quarantined_sessions snowplow_web_incremental_manifest snowplow_web_base_new_event_limits snowplow_web_base_sessions_lifecycle_manifest snowplow_web_base_sessions_this_run snowplow_web_base_events_this_run snowplow_web_pv_engaged_time snowplow_web_pv_scroll_depth snowplow_web_sessions_this_run
  9. If it still hangs at another model (which it did for me - at model: snowplow_web_page_views_this_run) repeat the process, use the last commands from the dbt.log and execute directly in your terminal
  10. At the very end be sure to add all necessary rows with it’s models and timestamps to snowplow_manifest.snowplow_web_incremental_manifest (if you don’t do that, dbt will start from scratch the next time you run it and hang again)

After this you can run dbt normally and it works. Just be sure to do it every day so the data doesn’t get too big and it hangs again.

Thats my sketchy workaround for now :slight_smile:

I think you’re right with the I/O thing but I have no idea how that could be fixed?! I guess I’ll just upgrade to a proper database soon and hope that fixes it.

I’m running the latest version of the snowplow_web package (0.16.2) and dbt version 1.6.6.

I’m glad you managed to get it working at least!

If it does work with single day volumes, you could always change the snowplow__backfill_limit_days variable down to 1 as well, and each run would only process ~1 day of data, you’d just have to do multiple runs to keep it going. You might also need to decrease the lookback_window_hours to reduce the total window in a run (you can see how this all comes together here: https://docs.snowplow.io/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-advanced-usage/dbt-incremental-logic/#state-4-standard-run).

In general the fact that running the same commands manually works but when they run through dbt it hangs is what’s tripping me up here. The page views and sessions this run tables are the most complex queries, but I don’t see why they would hang when triggered through dbt but not through the warehouse direct.

This won’t change that fact, but it turns out that postgres has a setting called work_mem that caps a sessions’s working memory before spilling to disk - by default this is 4mb, which might be a little small given the number of columns we have (especially if you have any of the enrichments turned on as well). This means even with the higher RAM it actually wouldn’t have been able to utilise it. You might be able to improve the performance by playing around with this value, but we’re now very far beyond my expertise and changing the core workings of your warehouse should be done at your own risk.

1 Like