Hi everyone, Happy New Year first of all.
I have 2+ years of experience setting up and managing the Snowplow open-source data pipeline, where we used AWS EKS to run our collector, enricher, stream transformer and RDB loader with Redshift.
So the design looked like this:

Collector ----> Enricher ----> Elasticsearch Loader
                    |
                    +----> Stream Transformer (previously EMR, before we went real-time) ----> RDB Loader ----> Redshift
I have joined a new company that already has a data lakehouse in Azure which they can't switch to AWS. So they decided to implement Snowplow on a Snowplow-managed AWS account and then move the data into the Azure Data Lake / Delta Lake.
Following is the design they have proposed, but I'm confused about two things here:
- I think a step is missing here, namely the transformation step, where the raw event is shredded into event + contexts that are then loaded by the RDB loader.
- What would be the best practice for moving the data from AWS to Azure? And regarding Databricks: should the Databricks workspace live in the Azure or the AWS environment? (See the sketch after this list for the kind of flow I have in mind.)
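To make the second question concrete, here is a minimal sketch of one option I was picturing: an Azure Databricks notebook reading the transformer's output directly from S3 and appending it into a Delta table in ADLS. Everything here is an assumption on my part (wide-row Parquet output, the bucket name, the secret scope, the storage paths), so treat it as an illustration of the question, not a recommendation:

```python
# Rough sketch only. Assumes a Databricks notebook (so `spark` and `dbutils` already exist),
# that the Snowplow transformer writes wide-row Parquet to a hypothetical S3 bucket,
# and that AWS keys are stored in a hypothetical Databricks secret scope "aws".
access_key = dbutils.secrets.get(scope="aws", key="access-key")
secret_key = dbutils.secrets.get(scope="aws", key="secret-key")

# Point the s3a filesystem at the AWS credentials for this cross-cloud read
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)

# Read the transformed events straight from S3 (cross-cloud, so expect egress cost and latency)
events = spark.read.parquet("s3a://acme-snowplow-transformed/widerow/")  # hypothetical path

# Append into the Azure lakehouse as a Delta table
# (assumes the cluster is already configured with access to the storage account)
(events.write
    .format("delta")
    .mode("append")
    .save("abfss://lake@acmelakehouse.dfs.core.windows.net/snowplow/events"))  # hypothetical path
```

This is just to frame the trade-off I'm asking about: whether that read should happen from an Azure Databricks workspace next to the Delta Lake (as sketched), or from Databricks on AWS with a separate copy step to move the data across clouds.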