Data Modeling - events_staged table is empty

Hi Snowplowers,
Our pipeline is deployed on GCP and uses BigQuery as the database.

I successfully ran data-models-master and was able to generate the page_views/sessions/users derived tables.

However, while playing with the code logic I deleted the ‘derived’ and ‘scratch’ datasets and then reran the data modeling, but it did not generate any data in the new derived tables.

I double-checked the events_staged table in ‘scratch’ and it was empty. I believe this is why no data was processed in this run.

I tried modifying the start_date in the 01-base-main playbook (YAML) but did not have any luck.

Is there a right way to get all historical data processed into the events_staged table again?
Thank you

Hey @kuangmichael07,

It’s a complicated model but I’ll give you the easiest unblocking solution first, then I’ll do my best to add explanations that might help you understand the relevant pieces.

The easiest unblock

All of the standard playbook directories have XX-destroy playbooks like this one. Run them all, then run the model from the top again. This will essentially tear everything down and start again.
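
For reference, a SQL Runner destroy playbook looks roughly like the sketch below. The target settings, step names and file path are placeholders (the real ones live in the XX-destroy playbooks shipped with your version of the model), but it should give you an idea of what "tear everything down" means in practice:

```yaml
# Illustrative sketch of a destroy playbook for BigQuery - step names,
# file paths and variable names are assumptions; check the XX-destroy
# playbooks in your copy of data-models for the real ones.
:targets:
  - :name:    "BigQuery"
    :type:    bigquery
    :project: my-gcp-project        # hypothetical project id
:variables:
  :scratch_schema: scratch
  :output_schema:  derived
:steps:
  - :name: 99-destroy
    :queries:
      - :name: 99-destroy
        :file: standard/00_setup/99-destroy.sql   # drops the module's scratch, derived and manifest tables
        :template: true
```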

The general advice I have for modifications or additions to the model is that it supports ‘plugin’ customisation. You can find a guide to that here. Of course you’re free to change whatever you like in the standard model, but do so with the understanding that this is akin to modifying the source code of a tracker - once you’ve forked the logic it becomes hard for us to offer much support if you hit issues.

Most use cases can be done without forking though. For example, you can configure the model to skip the update to the derived.page_views table and instead use your own custom module to create a derived.page_views_custom table.
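
As a sketch of what that looks like in practice, the module playbooks expose variables that control whether the derived table gets updated. The names below are my recollection of the standard V1 web model (skip_derived and stage_next in particular), so double-check them against the page views playbook in your copy before relying on them:

```yaml
# Sketch of the :variables: block of the page views module playbook.
# Variable names are assumptions based on the standard V1 web model -
# verify against your own page-views playbook.
:variables:
  :scratch_schema: scratch
  :output_schema:  derived
  :skip_derived:   true   # skip the step that updates derived.page_views
  :stage_next:     true   # keep staged data available for a custom module to consume
```

Your custom module then reads the staged data and writes derived.page_views_custom itself.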


Now for a couple of explanations of the details:

It sounds like the tables aren’t updating because, despite you deleting the derived tables, the manifests probably remained the same. The manifests determine what data is processed into the base module and through the rest of the model.

I double-checked the events_staged table in ‘scratch’ and it was empty. I believe this is why no data was processed in this run.

While you’re probably correct, note that the data in scratch.events_staged normally gets dropped at the end of the model run. It only sticks around if you have set cleanup_mode to “debug” or “trace” in the base module’s playbook.
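
For example, the relevant part of the base playbook’s variables would look something like this (value names as in the standard model; the default mode drops the scratch tables):

```yaml
# Sketch of the base playbook variable that controls cleanup.
# "debug" or "trace" keeps scratch.events_staged (and friends) around
# after the run so you can inspect them; the default drops them.
:variables:
  :cleanup_mode: debug
```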

Hi @Colm ,
Thank you for the quick reply.
I think I might have misunderstood the usage of the start_date option.
For example, our application has been hooked up to Snowplow since Sep 1st this year, and I want to run the data modeling over all historical data. Should I set it to 2021-09-01? And should the result then cover all days from Sep 1st to now?

BTW, I selected all metadata from datamodel_meta and saw that quite a lot of rows were processed, but no derived result is showing up.

Thank you

Yes, start_date determines what date the model starts from on its first run. If it’s not the first run, it’ll use the manifests to determine where it should start. If you ran the model already and then changed the start date, this will do nothing on the next run of the model; the start date is only used when the manifest is empty.
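
So for your case, setting it in the 01-base-main playbook along these lines (date format assumed to be YYYY-MM-DD) will make a genuinely fresh run start from Sep 1st:

```yaml
# Sketch of the start_date setting in the 01-base-main playbook.
# Only takes effect when the manifests are empty, i.e. on a fresh run
# (for example after running all the XX-destroy playbooks).
:variables:
  :start_date: "2021-09-01"
```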

BTW, I selected all metadata from datamodel_meta and saw that quite a lot of rows were processed, but no derived result is showing up.

Well, that’s not surprising - by your own account you ran the model and then subsequently dropped the derived tables.

Like I explained, you just need to run the destroy playbooks and re-run the model.