We are using the Snowplow streaming pipeline, which currently loads data into Redshift every 10 minutes, and we have built our data marts in Redshift itself.
We now want to migrate to an S3 data lake for the Snowplow events data. I am reading the transformed data from the S3 folder where it gets flushed after transformation, before the RDB Loader loads it into Redshift.
The problem is that the transformer generates a very large number of files, and reading millions of small files for processing through EMR takes a long time. Also, the data is gzip-compressed (.gz) and we want it in Parquet.
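As an interim workaround (not a real fix for the transformer output itself), the small .gz files could be compacted into fewer, larger ones before the EMR job reads them. A minimal local sketch using only the Python standard library; the file names and layout here are hypothetical:

```python
import gzip
import os
import tempfile

def compact_gz_files(input_paths, output_path):
    """Merge many small gzipped text files into one larger gzipped file."""
    with gzip.open(output_path, "wt", encoding="utf-8") as out:
        for path in input_paths:
            with gzip.open(path, "rt", encoding="utf-8") as src:
                for line in src:
                    out.write(line)

# Demo with temporary files standing in for transformer output parts.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, f"part-{i}.gz")
    with gzip.open(p, "wt", encoding="utf-8") as f:
        f.write(f"event-{i}\n")
    paths.append(p)

merged = os.path.join(tmpdir, "compacted.gz")
compact_gz_files(paths, merged)
with gzip.open(merged, "rt", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # ['event-0', 'event-1', 'event-2']
```

In practice this would run against S3 objects (e.g. via `boto3` or `s3-dist-cp`) rather than local files, but it illustrates the compaction step.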
Is it possible to configure the transformer to output fewer, larger files?
Can the transformer output Parquet directly?
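For context on what I am hoping for: from my reading of the RDB Loader docs, the streaming transformer has a `windowing` setting (which should govern how often output is flushed, and therefore file size) and a `formats` section where wide-row Parquet can be selected. Something along these lines is what I have in mind; the keys below are taken from my understanding of the docs and are unverified for version 5.3.2:

```hocon
{
  # Assumption: a larger window means fewer, larger output files per flush.
  "windowing": "30 minutes"

  "formats": {
    # Wide-row transformation, which I believe is required for Parquet output.
    "transformationType": "widerow"
    # Assumption: emit Parquet instead of gzipped output.
    "fileFormat": "parquet"
  }
}
```

If this is roughly correct, does the Parquet wide-row output remain compatible with the rest of our setup, or is it intended only for specific loader targets?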
Current snowplow versions:
Stream Enrich: 3.7.0
S3 Loader: 2.2.6
Elasticsearch Loader: 2.0.9
RDB Loader: 5.3.2