We've just run some experiments processing small data volumes with the Snowplow batch pipeline and found that processing 500MB of CloudFront logs took:
- 37m for the enrich step; and
- 1h23m for the shred step
on a single m4.4xlarge node (16 vCPU, 64GB RAM). The shred step has global event deduplication enabled, and the DynamoDB requests were throttled slightly.
Our Spark configuration is:
| Classification | Property | Value |
|---|---|---|
| spark | maximizeResourceAllocation | false |
| spark-defaults | spark.yarn.driver.memoryOverhead | 1440m |
| spark-defaults | spark.executor.cores | 4 |
| spark-defaults | spark.yarn.executor.memoryOverhead | 1440m |
| spark-defaults | spark.executor.instances | 3 |
| spark-defaults | spark.default.parallelism | 24 |
| spark-defaults | spark.driver.cores | 4 |
| spark-defaults | spark.driver.memory | 12896m |
| spark-defaults | spark.executor.memory | 12896m |
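In case it's useful, here's a rough Python sketch of the arithmetic we believe produces these values for a single m4.4xlarge (split the node into four equal 4-core containers, one for the driver and three for executors, then carve ~10% of each container's memory off as the YARN memory overhead). The 57,344 MB YARN memory figure and the 32 MB rounding step are assumptions on our part, not something we've verified against the spreadsheet:

```python
import math

# Rough reconstruction of the Spark sizing for one m4.4xlarge core node
# (16 vCPU, 64 GB RAM). YARN_MEMORY_MB is assumed to be the EMR default
# yarn.nodemanager.resource.memory-mb for this instance type.
VCPUS = 16
YARN_MEMORY_MB = 57_344
CORES_PER_CONTAINER = 4          # chosen so driver + executors fill all 16 vCPUs

# One container is reserved for the driver, the rest become executors.
containers = VCPUS // CORES_PER_CONTAINER            # 4
executor_instances = containers - 1                   # 3

# Split YARN memory evenly across containers, then take ~10% of each
# container as off-heap memoryOverhead (rounded up to a multiple of 32 MB).
container_mb = YARN_MEMORY_MB // containers           # 14336
overhead_mb = math.ceil(container_mb * 0.10 / 32) * 32    # 1440
heap_mb = container_mb - overhead_mb                   # 12896 (driver gets the same split)

# Two tasks per executor core as the usual rule of thumb for parallelism.
parallelism = executor_instances * CORES_PER_CONTAINER * 2    # 24

print(f"spark.executor.instances           = {executor_instances}")
print(f"spark.executor.cores               = {CORES_PER_CONTAINER}")
print(f"spark.executor.memory              = {heap_mb}m")
print(f"spark.yarn.executor.memoryOverhead = {overhead_mb}m")
print(f"spark.default.parallelism          = {parallelism}")
```

Running this reproduces the figures in the table above, so if we've scaled anything wrongly we'd be glad to hear it.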
Does anyone have an idea whether this is a 'normal' amount of time for Snowplow to process this much data? We generally found the shred step to take longer than the enrich step, so optimistically it could have been 40 minutes quicker. Really we're just interested in ballpark figures.
Our configuration comes from the spreadsheet here.
Thanks
Gareth