Hello,
I followed the documentation from here: Quick start guide | Snowplow Documentation. I selected Azure as the platform. I’m done with most of the steps, and I can see the data being saved in Blob Storage whenever any API is called, but it’s not saved in Databricks.
In Databricks, a hive_metastore table is created.
While setting up the pipeline, it asked me to enable one of the storage options, as below. I selected “lake_enabled” as true. I guess I need to add more values for the lake_enabled option, but there’s nothing about this in the documentation or in the GitHub repo from which I downloaded the code.
Hi @niraj, it sounds like you have the Lake Loader working correctly if you can see data landing in blob storage - have you followed these steps to surface that data in Databricks? Quick start guide | Snowplow Documentation
Hello, thank you for replying.
I selected “Account Key” as the auth option for testing.
When I run the commands below, I’m able to see the data from blob storage.
The secrets are in the notebook, as the Databricks documentation suggests for testing purposes. I didn’t try the other auth options as I don’t have access to those resources.
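Roughly, the read looks like this (a sketch only — the account, container and folder names are placeholders):

storage_account_name = "<storage-account>"
container_name = "<container>"

# Account Key auth; the key sits in the notebook / a secret scope for testing only
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    dbutils.secrets.get(scope="snowplow", key="storage-account-key"),  # or a literal string while testing
)

# The Lake Loader writes a Delta table into the container (the folder name may differ)
df = spark.read.format("delta").load(
    f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/events"
)
display(df)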
Unfortunately this looks like something you might need to escalate internally with whoever owns your Databricks installation. The error doesn’t appear related to the data being loaded by the Lake Loader so much as your credentials to access that data.
Ok, but how will the Terraform pipeline know when to send the events to the data lake? The Snowplow documentation doesn’t say anything about the data lake credentials or where to put them in the terraform.tfvars file. There are only config variables for Snowflake but nothing for the data lake. The only related field present in it is “lake_enabled” and nothing else.
The documentation says to take note of “port”, “https”, “http_path” and “access_token” from the cluster but doesn’t say where to use them while creating the pipeline.
Hey @niraj, so the pipeline in this context doesn’t know that Databricks exists! It is just preparing Delta files in an abfs volume → this happens constantly as long as data is flowing into the pipeline, and new data should be committed roughly every 5 minutes (which is the default windowing period used by the Lake Loader).
> The documentation says to take note of “port”, “https”, “http_path” and “access_token” from the cluster but doesn’t say where to use them while creating the pipeline.
In the instructions you are talking about there is a big disclaimer that we have “Azure specific instructions” - on Azure the pipeline does not connect to Databricks, we load to the volume and then you connect Databricks up to that volume (that’s the interface between the pipeline and the warehouse).
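Connecting Databricks to that volume is then a matter of registering an external table over the Delta location the loader writes to. A minimal sketch (the catalog/schema names and the path below are placeholders, and Databricks still needs credentials for the storage account):

# Register the loader's Delta output as an external table (assumes Databricks
# already has access to the storage account, e.g. via a service principal)
spark.sql("CREATE SCHEMA IF NOT EXISTS hive_metastore.snowplow")
spark.sql("""
    CREATE TABLE IF NOT EXISTS hive_metastore.snowplow.events
    USING DELTA
    LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/events'
""")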
Ok, so it means that in the case of Azure, the pipeline will send events to Azure Blob Storage and Databricks will read from the blob storage, right?
All you have to do in Databricks is mount the blob storage where the Snowplow data from your Azure pipeline is landing, and then you can create the schema/table using the mount point. At least that’s how I have approached it.
So just run a Python script that looks something like this:
# Mount the storage container that the Lake Loader writes to.
# Assumes container_name, storage_account_name, mount_point and configs are defined above.
dbutils.fs.mount(
    source=f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
    mount_point=mount_point,   # e.g. "/mnt/snowplow-lake"
    extra_configs=configs      # auth settings, e.g. OAuth with a service principal
)
Just make sure you have a service principal that has access to the landing zone of your Azure pipeline, store the client ID and secret using Databricks secrets, and then, once you mount the blob storage, you can easily create your table/schema as you were trying to do.
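For the auth piece, the configs would look something like this (a sketch only — the secret scope, key names and tenant ID are placeholders):

# Service principal (OAuth) settings for the mount; scope/key names and IDs are placeholders
client_id = dbutils.secrets.get(scope="snowplow", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="snowplow", key="sp-client-secret")
tenant_id = "<directory-id>"

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Pass configs into the dbutils.fs.mount(...) call above, then create the table
# over the mount point, e.g. LOCATION '/mnt/snowplow-lake/events'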