We just started moving out DBT/Snowplow project to Databricks. Here is a sample of our profile config for connecting:
test:
catalog: null
host: <redacted but databricks sql endpoint url>
http_path: <redacted but databricks http path from the sql endpoint>
schema: enriched_sandbox
threads: 4
token: <a pat token>
type: databricks
What I am finding is that “enriched_sandox_derived”, “enriched_sandox_scratch”, etc. are all being created as schemas under the catalogue parallel to “enriched_sandox” when ideally they should be tables under the “enriched_sandox” schema. This appears to be a Snowplow issue as running the demo, dbt-databricks does not produce the same hierarchy issues. The only difference I can find between the two workflows, is that the demo does not contain snowplow_web or snowplow_utils models.
Any advice is appreciated, but I thought I would bring this to attention.
The Snowplow DBT packages actually set up the schemas in the package project yaml file to append those suffixes to your profile schema, that’s why you’re seeing the behaviour you get and is expected.
You can overwrite the schemas the package writes to in your dbt_project.yml file by using the following:
@Ryan Thanks, but I don’t think that had the effect I wanted.
“By default all scratch/staging tables will be created in the <target.schema>_scratch schema, the derived tables ( e.g. snowplow_web_page_views , snowplow_web_sessions , snowplow_web_users ) will be created in <target.schema>_derived and all manifest tables in <target.schema>_snowplow_manifest”
I didn’t really want these tables to just have a different appended name, I wanted them to be under an existing schema/database in Databricks.
@wambam Ah okay sorry, I see what the issue is now - if you set those all values to null (the keyword null, not an empty string otherwise you’ll get a trailing underscore) in the project yaml you should get it all writing to your pure enriched_sandbox schema. I have tested this on a UC but I don’t believe that should make a difference here as that’s more about finding the data in the first place.
Could you please try that and let me know if that works for you, if it does I’ll update the docs to make it more obvious how to achieve this for anyone in the future