DynamoDB based duplicate event removal throughput

gareth · February 14, 2018, 4:12pm

Hi

When running with cross-run duplicate event removal, we’ve empirically observed that only write capacity is used. The snowplow-r88 release notes on this feature also discuss only the write throughput capacity.

We’ve written our own script to manage this throughput. Is it sufficient to turn up only the write capacity if we wish to avoid throttling?

Thanks
Gareth

anton · February 19, 2018, 8:18am

Hey @gareth,

Yes, you’re right: deduplication in the shred job uses only write throughput, so it should be enough to tune only the write capacity. Read capacity can stay at a very low value, such as 5 units or so.
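
For anyone scripting this, here is a minimal sketch of raising write capacity while leaving read capacity low. It assumes a boto3 DynamoDB client and its `update_table` call; the helper names `build_throughput` and `set_capacity`, the table name and the capacity figures are hypothetical placeholders, not values from the Snowplow docs.

```python
def build_throughput(write_units, read_units=5):
    """Build the ProvisionedThroughput argument for DynamoDB update_table.

    Both fields have to be supplied even when only one of them changes.
    """
    if write_units < 1 or read_units < 1:
        raise ValueError("capacity units must be >= 1")
    return {"ReadCapacityUnits": read_units, "WriteCapacityUnits": write_units}


def set_capacity(client, table_name, write_units, read_units=5):
    """Apply new provisioned capacity through a boto3 DynamoDB client."""
    return client.update_table(
        TableName=table_name,
        ProvisionedThroughput=build_throughput(write_units, read_units),
    )


# Example usage (hypothetical table name and figures):
#   import boto3
#   set_capacity(boto3.client("dynamodb"), "my-duplicates-table", write_units=500)
```

Bear in mind that DynamoDB limits how often provisioned capacity can be decreased per day, so a script that scales up before the shred job and back down afterwards should not run more often than that limit allows.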

gareth · February 20, 2018, 8:37am

Great, thanks.

alex · February 20, 2018, 11:38pm

It’s an interesting feature of DynamoDB conditional writes that they count against write throughput only (not read), whether or not the condition is met (i.e. regardless of whether the operation ends up being a read followed by a write, or a read only).
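
To illustrate what a conditional write looks like, here is a minimal sketch using the boto3 DynamoDB client’s `put_item` with a `ConditionExpression`. The table layout (an `eventId` key plus a `fingerprint` attribute) and the helper name `put_if_absent` are assumptions for the example, not necessarily what the shred job does internally.

```python
def put_if_absent(client, table_name, event_id, fingerprint):
    """Write the item only if no item with the same primary key exists.

    Returns (written, consumed_units). consumed_units is None when the
    condition fails, because the failure surfaces as an exception here;
    the attempt still counts against write capacity.
    """
    try:
        resp = client.put_item(
            TableName=table_name,
            Item={
                "eventId": {"S": event_id},
                "fingerprint": {"S": fingerprint},
            },
            # Fails if an item with this primary key is already stored.
            ConditionExpression="attribute_not_exists(eventId)",
            # Ask DynamoDB to report the capacity consumed by this call.
            ReturnConsumedCapacity="TOTAL",
        )
    except client.exceptions.ConditionalCheckFailedException:
        return False, None  # duplicate: condition not met, nothing written
    return True, resp.get("ConsumedCapacity", {}).get("CapacityUnits")


# Example usage (hypothetical table name):
#   import boto3
#   put_if_absent(boto3.client("dynamodb"), "my-duplicates-table", "e1", "f1")
```

The first call for a given key should return `True`; a repeat with the same key should return `False`, and in both cases only write capacity is charged.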
