Hi all,
For an AWS real-time events pipeline we are using the snowplow-s3-loader to write events to S3 as .gz files. These files are later processed by two EMR jobs: one loads them into Redshift, and the other converts them to .parquet format for use with AWS Athena.
My question is about the configuration that creates the files in S3:
- In general our pipeline processes about 10 million events per 24 hrs (i.e. not many events; the total size in parquet is less than 1 GB per 24 hrs).
- The “shredder EMR” (the process that loads events into Redshift) is triggered back to back: while one EMR job is running, new events accumulate, and when it finishes (after about 30 mins) the next run processes everything that accumulated during those 30 mins, and so on…
This keeps Redshift relatively “fresh” (i.e. event data is available in Redshift for our BI team after about 30 mins).
- I have noticed that each partition (a new partition is created every 30 mins) contains about 70 files, each smaller than 1 MB (typically 700-800 KB). I am wondering whether this is “bad practice”: having many small files in S3 and running an EMR job that spends most of its time accessing S3 rather than processing the data.
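To put the numbers above in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes events arrive at a steady rate over the day and takes 750 KB as the midpoint of the file sizes I see; the variable names are just for illustration.

```python
# Figures taken from the description above.
EVENTS_PER_DAY = 10_000_000
PARTITION_MINUTES = 30
FILES_PER_PARTITION = 70
AVG_FILE_KB = 750  # assumption: midpoint of the 700-800 KB range

partitions_per_day = 24 * 60 // PARTITION_MINUTES
events_per_partition = EVENTS_PER_DAY / partitions_per_day
events_per_file = events_per_partition / FILES_PER_PARTITION
partition_mb = FILES_PER_PARTITION * AVG_FILE_KB / 1024
files_per_day = FILES_PER_PARTITION * partitions_per_day

print(f"partitions per day:   {partitions_per_day}")
print(f"events per partition: {events_per_partition:,.0f}")
print(f"events per file:      {events_per_file:,.0f}")
print(f"gzipped MB/partition: {partition_mb:.1f}")
print(f"files per day:        {files_per_day}")
```

Under these assumptions each file holds only about 3,000 events, far below the recordLimit of 100000 in the config that follows.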
Stream config from my S3 loader:
# Events are accumulated in a buffer before being sent to S3.
# The buffer is emptied whenever:
# - the combined size of the stored records exceeds byteLimit or
# - the number of stored records exceeds recordLimit or
# - the time in milliseconds since it was last emptied exceeds timeLimit
buffer {
  byteLimit = 104857600 # 100 MB
  recordLimit = 100000
  timeLimit = 600000 # 10 minutes
}
}
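To reason about which of the three limits actually empties the buffer, here is a minimal sketch. The assumptions are mine and may not match how the loader behaves internally: a steady event rate, a guessed average event size, and the incoming stream being split evenly across a number of independent buffers. The function name `first_limit_hit` and its `avg_event_bytes` and `buffers` parameters are hypothetical; only the three limit values come from the config above.

```python
def first_limit_hit(events_per_sec, avg_event_bytes, buffers=1,
                    byte_limit=104_857_600, record_limit=100_000,
                    time_limit_ms=600_000):
    """Return (limit_name, seconds) for the limit a buffer reaches first.

    Assumes a steady rate split evenly across `buffers` independent buffers.
    """
    rate = events_per_sec / buffers  # events per second seen by one buffer
    seconds_to = {
        "byteLimit": byte_limit / (rate * avg_event_bytes),
        "recordLimit": record_limit / rate,
        "timeLimit": time_limit_ms / 1000,
    }
    name = min(seconds_to, key=seconds_to.get)
    return name, seconds_to[name]


if __name__ == "__main__":
    rate = 10_000_000 / 86_400  # about 116 events per second overall
    # avg_event_bytes=1000 is a guess, not a measured value.
    for n in (1, 4, 16):
        print(n, first_limit_hit(rate, avg_event_bytes=1000, buffers=n))
```

If timeLimit turns out to be the limit that fires first, then the number of files per partition would be driven mostly by how many buffers flush in parallel rather than by byteLimit or recordLimit.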
I believe that, given our event volume and velocity, we are not using the s3-loader and the EMR to their full potential, and that we have a lot of “room” to optimize this process.
Also, a more general question: what are the “best practices” for loading events into S3 (file size, partitions, etc.), and what are the “best practices” for running the EMR that shreds the events and loads them into Redshift, as well as for landing this event data in a data lake (i.e. raw enriched event data in parquet, available for query via AWS Athena)?