Deploying Snowplow across AWS and Azure

Hi everyone, and Happy New Year first of all.
I have 2+ years of experience setting up and managing the Snowplow open source data pipeline, where we used AWS EKS to set up our collector, enricher, stream transformer and RDB Loader with Redshift.
So the design looked like this:

                              ---->  Elastic Search Loader
Collector ----> Enricher ----> 
                              ---->  Stream-transformer (before real time it was EMR) ----> RDB Loader Redshift

I have joined a new company that already has a data lakehouse in Azure, which they can't switch to AWS. So they decided to implement Snowplow on a Snowplow-managed AWS account and then move the data to Azure Data/Delta Lake.

The following is the design that they have proposed, but I’m confused about two things here.

  1. I think there is a step missing here: the transformation step, where the raw event is shredded into event + contexts that are then loaded by the RDB Loader.
  2. What would be the best practice for moving data from AWS to Azure? And regarding Databricks: should it be in the Azure or the AWS environment?


Hi @ahid_002! Always great to hear from repeat users of Snowplow :)

To your questions:

  1. I suppose the “RDB Loader” block on the diagram implies both components — the RDB Transformer and the RDB Loader itself (more specifically, the Databricks Loader). Note that for your transformer you need to select the “wide row” Parquet format.
  2. RDB Loader will work fine with Databricks hosted on Azure.
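
To make the difference between the shredded layout (what you loaded into Redshift) and the wide row layout (what the Databricks Loader expects) a bit more concrete, here is a rough Python sketch. It is only an illustration, not the transformer's actual code: the helper names (`schema_to_name`, `shred`, `widerow`), the sample event and the `com.acme` schema are all made up, and the column naming is simplified compared to what the real transformer produces.

```python
import re


def schema_to_name(iglu_uri):
    """Simplified naming: 'iglu:com.acme/user/jsonschema/1-0-0' -> 'com_acme_user_1'."""
    vendor, name, _fmt, version = iglu_uri.split(":", 1)[1].split("/")
    model = version.split("-")[0]  # only the major (model) version is kept
    return re.sub(r"[^a-z0-9]+", "_", (vendor + "_" + name).lower()) + "_" + model


def shred(event):
    """Shredded layout: one atomic row plus one row per context in its own table."""
    atomic = {k: v for k, v in event.items() if k != "contexts"}
    tables = {"events": [atomic]}
    for ctx in event["contexts"]:
        # root_id links the context row back to the parent event
        row = dict(ctx["data"], root_id=event["event_id"])
        tables.setdefault(schema_to_name(ctx["schema"]), []).append(row)
    return tables


def widerow(event):
    """Wide row layout: a single row, each context schema becomes one array column."""
    row = {k: v for k, v in event.items() if k != "contexts"}
    for ctx in event["contexts"]:
        column = "contexts_" + schema_to_name(ctx["schema"])
        row.setdefault(column, []).append(ctx["data"])
    return row


# Hypothetical enriched event with a single custom context attached
event = {
    "event_id": "e1",
    "app_id": "shop",
    "contexts": [
        {"schema": "iglu:com.acme/user/jsonschema/1-0-0", "data": {"id": "u42"}},
    ],
}

print(shred(event))
print(widerow(event))
```

So with wide row there are no separate context tables to join: everything for an event lands in one row, and Parquet handles the nested columns.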