We are using the Snowplow streaming pipeline, which currently loads data into Redshift every 10 minutes, and we have built our data marts in Redshift itself.
We now want to migrate to an S3 data lake for the Snowplow events data. I am reading the transformed data from the S3 folder where it gets flushed after transformation, before the RDB Loader loads it into Redshift.
The problem is that the transformer generates a very large number of files, and reading millions of small files for processing through EMR takes a long time. Also, the data is gzip-compressed (.gz) and we want it in Parquet.
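As an interim workaround (not a real fix for the transformer output itself), the small .gz files could be compacted into fewer, larger ones before the EMR job reads them. A minimal local sketch using only the Python standard library; the file names and layout here are hypothetical:

```python
import gzip
import os
import tempfile

def compact_gz_files(input_paths, output_path):
    """Merge many small gzipped text files into one larger gzipped file."""
    with gzip.open(output_path, "wt", encoding="utf-8") as out:
        for path in input_paths:
            with gzip.open(path, "rt", encoding="utf-8") as src:
                for line in src:
                    out.write(line)

# Demo with temporary files standing in for transformer output parts.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, f"part-{i}.gz")
    with gzip.open(p, "wt", encoding="utf-8") as f:
        f.write(f"event-{i}\n")
    paths.append(p)

merged = os.path.join(tmpdir, "compacted.gz")
compact_gz_files(paths, merged)
with gzip.open(merged, "rt", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # ['event-0', 'event-1', 'event-2']
```

In practice this would run against S3 objects (e.g. via `boto3` or `s3-dist-cp`) rather than local files, but it illustrates the compaction step.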
Is it possible to configure the transformer to output fewer, larger files?
Can the transformer output Parquet directly?
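For context on what I am hoping for: from my reading of the RDB Loader docs, the streaming transformer has a `windowing` setting (which should govern how often output is flushed, and therefore file size) and a `formats` section where wide-row Parquet can be selected. Something along these lines is what I have in mind; the keys below are taken from my understanding of the docs and are unverified for version 5.3.2:

```hocon
{
  # Assumption: a larger window means fewer, larger output files per flush.
  "windowing": "30 minutes"

  "formats": {
    # Wide-row transformation, which I believe is required for Parquet output.
    "transformationType": "widerow"
    # Assumption: emit Parquet instead of gzipped output.
    "fileFormat": "parquet"
  }
}
```

If this is roughly correct, does the Parquet wide-row output remain compatible with the rest of our setup, or is it intended only for specific loader targets?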
Current snowplow versions:
Stream Enrich: 3.7.0
S3 Loader: 2.2.6
Elasticsearch Loader: 2.0.9
RDB Loader: 5.3.2