Expected Snowplow performance

We just ran a few experiments processing small data volumes with the Snowplow batch pipeline and found that processing 500MB of CloudFront logs took:

  • 37m for the enrich step; and
  • 1h23m for the shred step

using an m4.4xlarge node (16 vCPUs and 64GB RAM). The shred step had global event deduplication enabled, and the DynamoDB requests were throttled slightly.

Our Spark configuration is:

Classification   Property                             Value
spark            maximizeResourceAllocation           false
spark-defaults   spark.yarn.driver.memoryOverhead     1440m
spark-defaults   spark.executor.cores                 4
spark-defaults   spark.yarn.executor.memoryOverhead   1440m
spark-defaults   spark.executor.instances             3
spark-defaults   spark.default.parallelism            24
spark-defaults   spark.driver.cores                   4
spark-defaults   spark.driver.memory                  12896m
spark-defaults   spark.executor.memory                12896m
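For anyone wanting to check the sizing, the numbers above appear to pack the node deliberately. A minimal arithmetic sketch (assuming EMR's default YARN allocation of 57344 MB for an m4.4xlarge, which is worth verifying on your own cluster):

```python
# Sanity-check the Spark memory and parallelism sizing above.
# Assumption: yarn.nodemanager.resource.memory-mb defaults to 57344 MB
# on an EMR m4.4xlarge (i.e. ~56 of the 64 GB is offered to containers).

executor_memory_mb = 12896
overhead_mb = 1440
container_mb = executor_memory_mb + overhead_mb   # footprint per YARN container

containers = 3 + 1                                # 3 executors + a same-sized driver
total_mb = containers * container_mb              # total memory requested on the node

parallelism = 2 * (3 * 4)                         # 2 tasks per executor core

print(container_mb)   # 14336
print(total_mb)       # 57344 -> exactly fills the assumed YARN allocation
print(parallelism)    # 24 -> matches spark.default.parallelism
```

So the four containers consume the whole YARN allocation, and spark.default.parallelism is the usual 2x-total-cores heuristic.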

Does anyone have an idea whether this is a 'normal' amount of time for Snowplow to process this much data? We generally found the shred step to take longer than the enrich step, so optimistically the run could have been about 40 minutes quicker. Really we're just after ballpark figures.

Our configuration comes from the spreadsheet here.


That shred step seems unusually slow. If you turn off event deduplication and run with the same dataset, how long does the shredding take?

@mike it takes 15 mins to run the shredding on the 500MB data set with the above cluster and config but no global event deduplication.