Hello,
I followed the documentation from here: Quick start guide | Snowplow Documentation. I selected Azure as the platform. I’m done with most of the steps, and I can see the data being saved in Blob Storage whenever any API is called, but it’s not saved in Databricks.
In Databricks, a hive_metastore table is created.
While setting up the pipeline, it asked me to enable one of the storage options, as below. I selected “lake_enabled” as true. I guess I need to add more values for the lake_enabled option, but there’s nothing about this in the documentation or in the GitHub repo from which I downloaded the code.
Hi @niraj, it sounds like you have the Lake Loader working correctly if you can see data landing in blob storage - have you followed these steps to surface that data in Databricks? Quick start guide | Snowplow Documentation
Hello, thank you for replying.
I selected “Account Key” as the auth option for testing.
When I run the commands below, I’m able to see the data from blob storage.
The secrets are in the notebook, as the Databricks documentation suggests for testing purposes. I didn’t try the other auth options as I don’t have access to those resources.
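Roughly, the read looks like this (a sketch only — the account, container and folder names are placeholders):

storage_account_name = "<storage-account>"
container_name = "<container>"

# Account Key auth; the key sits in the notebook / a secret scope for testing only
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    dbutils.secrets.get(scope="snowplow", key="storage-account-key"),  # or a literal string while testing
)

# The Lake Loader writes a Delta table into the container (the folder name may differ)
df = spark.read.format("delta").load(
    f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/events"
)
display(df)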
Unfortunately this looks like something you might need to escalate internally with whoever owns your Databricks installation. The error doesn’t appear related to the data being loaded by the Lake Loader so much as your credentials to access that data.
Ok, but how will the Terraform pipeline know when to send the events to the data lake? The Snowplow documentation doesn’t say anything about the data lake credentials or where to put them in the terraform.tfvars file. There are only config variables for Snowflake but nothing for the data lake. The only related field present in it is “lake_enabled” and nothing else.
The documentation says to take note of “port”, “https”, “http_path” and “access_token” from the cluster but doesn’t say where to use them while creating the pipeline.
Hey @niraj, so the pipeline in this context doesn’t know that Databricks exists! It is just preparing Delta files in an abfs volume → this happens constantly as long as data is flowing into the pipeline, and new data should be committed roughly every 5 minutes (which is the default windowing period used by the Lake Loader).
> The documentation says to take note of “port”, “https”, “http_path” and “access_token” from the cluster but doesn’t say where to use them while creating the pipeline.
In the instructions you are talking about there is a big disclaimer that we have “Azure specific instructions” - on Azure the pipeline does not connect to Databricks, we load to the volume and then you connect Databricks up to that volume (that’s the interface between the pipeline and the warehouse).
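Connecting Databricks to that volume is then a matter of registering an external table over the Delta location the loader writes to. A minimal sketch (the catalog/schema names and the path below are placeholders, and Databricks still needs credentials for the storage account):

# Register the loader's Delta output as an external table (assumes Databricks
# already has access to the storage account, e.g. via a service principal)
spark.sql("CREATE SCHEMA IF NOT EXISTS hive_metastore.snowplow")
spark.sql("""
    CREATE TABLE IF NOT EXISTS hive_metastore.snowplow.events
    USING DELTA
    LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/events'
""")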
Ok, so it means that in the case of Azure, the pipeline will send events to Azure Blob Storage and Databricks will read from the blob storage, right?
All you have to do in Databricks is mount the blob storage where the Snowplow data from your Azure pipeline is landing, and then you can create the schema/table using the mount point. At least that’s how I have approached it.
So just run a Python script that looks something like this:
# Mount the storage container that the Lake Loader writes to.
# Assumes container_name, storage_account_name, mount_point and configs are defined above.
dbutils.fs.mount(
    source=f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
    mount_point=mount_point,   # e.g. "/mnt/snowplow-lake"
    extra_configs=configs      # auth settings, e.g. OAuth with a service principal
)
Just make sure you have a service principal that has access to the landing zone of your Azure pipeline, store the client ID and secret using Databricks secrets, and then, once you mount the blob storage, you can easily create your table/schema as you were trying to do.
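For the auth piece, the configs would look something like this (a sketch only — the secret scope, key names and tenant ID are placeholders):

# Service principal (OAuth) settings for the mount; scope/key names and IDs are placeholders
client_id = dbutils.secrets.get(scope="snowplow", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="snowplow", key="sp-client-secret")
tenant_id = "<directory-id>"

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Pass configs into the dbutils.fs.mount(...) call above, then create the table
# over the mount point, e.g. LOCATION '/mnt/snowplow-lake/events'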