Expected Snowplow performance

We just ran a few experiments processing small data volumes with the Snowplow batch pipeline and found that processing 500MB of CloudFront logs took:

  • 37m for the enrich step; and
  • 1h23m for the shred step

using an m4.4xlarge node (16 vCPUs and 64GB RAM). The shred step had global event deduplication enabled, and the DynamoDB requests were throttled slightly.

Our Spark configuration is:

Classification   Property                             Value
spark            maximizeResourceAllocation           false
spark-defaults   spark.yarn.driver.memoryOverhead     1440m
spark-defaults   spark.executor.cores                 4
spark-defaults   spark.yarn.executor.memoryOverhead   1440m
spark-defaults   spark.executor.instances             3
spark-defaults   spark.default.parallelism            24
spark-defaults   spark.driver.cores                   4
spark-defaults   spark.driver.memory                  12896m
spark-defaults   spark.executor.memory                12896m
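For anyone wanting to check the sizing, the numbers above appear to pack the node deliberately. A minimal arithmetic sketch (assuming EMR's default YARN allocation of 57344 MB for an m4.4xlarge, which is worth verifying on your own cluster):

```python
# Sanity-check the Spark memory and parallelism sizing above.
# Assumption: yarn.nodemanager.resource.memory-mb defaults to 57344 MB
# on an EMR m4.4xlarge (i.e. ~56 of the 64 GB is offered to containers).

executor_memory_mb = 12896
overhead_mb = 1440
container_mb = executor_memory_mb + overhead_mb   # footprint per YARN container

containers = 3 + 1                                # 3 executors + a same-sized driver
total_mb = containers * container_mb              # total memory requested on the node

parallelism = 2 * (3 * 4)                         # 2 tasks per executor core

print(container_mb)   # 14336
print(total_mb)       # 57344 -> exactly fills the assumed YARN allocation
print(parallelism)    # 24 -> matches spark.default.parallelism
```

So the four containers consume the whole YARN allocation, and spark.default.parallelism is the usual 2x-total-cores heuristic.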

Does anyone have an idea whether this is a 'normal' amount of time for Snowplow to process this much data? We generally found the shred step to take longer than the enrich step, so optimistically the run could have been about 40 minutes quicker. Really we're just after ballpark figures.

Our configuration comes from the spreadsheet here.


That shred step seems unusually slow. If you turn off event deduplication and run with the same dataset, how long does the shredding take?

@mike it takes 15 mins to run the shredding on the 500MB data set with the above cluster and config but no global event deduplication.