We are working on utilizing the dbt Snowplow package. The data dictionary we are looking at is here: dictionary
We use kafka for our stream and we write into Snowflake with the Kafka connector which we use everywhere else we move data. All this works really well.
However, once our data gets into Snowflake, it doesn’t match the web model that dbt expects. We’d love some clarification on a few fields specified in the data dictionary linked above that don’t exist in our system.
What are the above fields referring to? Our data is completely flat in one table, so we don’t have separate tables for webpage_id, yauaa, and performance contexts.
These fields only really exist in the shredded model for enriched data - which doesn’t really exist in the Snowflake single table, context-as-a-column schema. For other databases like Redshift where there is a core table (events) and context / event tables these fields (specifically root_id and root_tstamp) act as a join key between the core event and the contexts associated with it.
As Mike suggests, this looks like you’re looking at the Redshift/Postgres code in the web model. You should try to follow the Snowflake specific path through the model.
However, there’s likely still some differences between a direct load and data loaded with the Snowplow RDB Loader (for Snowflake), particularly around schema usage and how our loader builds those tables so it can migrate them as schemas evolve.
You might find this interesting reading too on how we recommend loading to Snowflake:
For the columns and datatypes the base reference is here - but as it mentions this table will be mutated according to your custom events and entities. The documentation for these columns can be found in dbt docs (or as a direct reference here).