Our team has greatly enjoyed experimenting with the data modeling project using the open-source version of SQL-Runner.
A problem we found recently: when a new “Custom” model needs to be derived, we want it to aggregate over all of the historical data (the atomic.events table) rather than only the “events_staged” data.
After this initial run, the new model should then be able to switch over to the “events_staged” source table.
What we are doing right now is maintaining two versions of the code: one for the initial run (executed manually), and the job version which is scheduled automatically as part of the “Standard” flow.
Is there a smarter way to adapt the workflow automatically?
Thank you for sharing your thoughts, and I wish you all a good vacation!
Glad you have enjoyed the SQL-Runner data models. Backfilling newly introduced custom modules is an interesting problem. I have just created an issue outlining a potential path for us to make the workflow a little easier. This solution would allow all previously processed events to be served to events_staged without tearing everything down. In addition, it would mitigate the need to maintain two versions of the model and remove potential duplication of tables.
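To make the idea concrete, here is a toy end-to-end sketch using SQLite. All table and column names (events, events_staged, manifest, processed_tstamp) are illustrative placeholders, not the model's actual identifiers — the point is only the shape of the backfill query:

```python
import sqlite3

# Toy schema: `events` holds all history; `events_staged` normally holds only
# the latest batch. The backfill re-stages everything the standard flow has
# already processed, so a new custom module can aggregate over it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT, collector_tstamp TEXT);
    CREATE TABLE events_staged (event_id TEXT, collector_tstamp TEXT);
    CREATE TABLE manifest (processed_tstamp TEXT);

    INSERT INTO events VALUES
        ('e1', '2021-01-01'), ('e2', '2021-06-01'), ('e3', '2021-12-01');
    -- the standard flow has processed history up to 2021-06-01
    INSERT INTO manifest VALUES ('2021-06-01');

    -- the backfill: serve all previously processed events to events_staged
    INSERT INTO events_staged
    SELECT * FROM events
    WHERE collector_tstamp <= (SELECT MAX(processed_tstamp) FROM manifest);
""")

staged = [r[0] for r in conn.execute(
    "SELECT event_id FROM events_staged ORDER BY event_id")]
print(staged)  # e1 and e2 are re-staged; e3 has not been processed yet
```

After such a backfill, the new custom module reads its history from events_staged exactly as it would on a normal scheduled run, which is what removes the need for a second version of the model code.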
The harder part is automating a job to handle this. Ideally, when backfilling you want to run only the modules that are strictly required. Given that the custom module could consume from any number of the _staged tables (e.g. it might require data from both page_view_staged and events_staged), it is tricky to create a generalised workflow that covers every case automatically.
You could either:
Manually create a bespoke job outlining the required playbook for each backfill. More manual work but less costly.
Add the new module to your standard job, switch the backfill toggle and replay all events through all modules. Less work but more costly.
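For option 1, the bespoke job can be as small as a one-off playbook that runs only the backfill steps. A minimal sketch following the standard SQL-Runner playbook layout — the target block, file names, and step names below are all placeholders:

```yaml
# One-off playbook to backfill a new custom module (illustrative only)
:targets:
  - :name: "My warehouse"
    :type: redshift        # assumption: substitute your warehouse type
    :host: ADD HERE
    :database: ADD HERE
    :port: 5439
    :username: ADD HERE
    :password: ADD HERE
:variables:
  :scratch_schema: scratch
:steps:
  - :name: 01-backfill-custom-module
    :queries:
      - :name: 01-backfill-custom-module
        # hypothetical SQL file that aggregates from atomic.events
        :file: custom/01-backfill-custom-module.sql
        :template: true
```

Once this has been run manually, the scheduled “Standard” job picks the module up with its regular events_staged-based SQL, so the one-off playbook can simply be retired.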
There are several fixes and improvements I want to make to these models in the new year, so I will add this functionality to the list of things to explore.
As an aside, we have an equivalent dbt version of the web model, snowplow-web. This handles the backfilling of new modules automatically by making use of dbt’s graph object.
Thank you so much for the reply.
I am also curious how you run the data modeling job in GCP.
Currently we are using Dataproc to host the SQL-Runner data modeling code, and Cloud Composer to schedule an hourly job that executes the bash command.
This feels like a fairly naive combination, and we need to copy the code from GCS to Dataproc after every code merge.
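For reference, the hourly task boils down to a bash command along these lines — the bucket, paths, and playbook name are placeholders for our actual layout:

```shell
#!/usr/bin/env bash
# Illustrative only: sync the latest model code from GCS onto the node,
# then run the standard playbook with SQL-Runner.
set -euo pipefail

# copy the model code from GCS (this is the manual step after each merge)
gsutil -m rsync -r gs://my-bucket/sql-runner-models /opt/sql-runner-models

# run the scheduled "Standard" flow
sql-runner \
  -playbook /opt/sql-runner-models/playbooks/standard.yml \
  -sqlroot /opt/sql-runner-models/sql
```

Is there a more idiomatic setup for this on GCP?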