Deduping events at collector/enricher level in stream


I am using Snowplow collectors to write to a Kinesis sink and enrichers to a Kinesis pipeline. I see that for various reasons (network issues, latencies, not wanting to lose any data), the collectors/enrichers retry writing events to the shards, thereby producing exact duplicate events in the S3 folders and Redshift tables (same event_transaction id, collector timestamp and record insert timestamp as well).

We know it's from the collector end and not the client end after various checks. We also see the same event appear 3 times most of the time, and that's the number of retries the collectors are configured for by default.

Is there a place and a way to remove these duplicates in stream, or in the enrichers, before writing to S3? We don’t want to handle it at the table level as it is going to be expensive at our end. Any suggestions would be really helpful.

Vinuthna G

@Vinuthna_Gaddipati, indeed, some events can be duplicated in the pipeline itself because of its at-least-once processing logic, which ensures no event is lost.

Loading data into Redshift means you have shredding in place, and that is where deduplication takes place. In-batch deduplication should be done for you starting from the R86 release, but to combat cross-batch natural duplicates you would need to enable it yourself, as per the R88 release. It does come at extra AWS cost because DynamoDB is added to the pipeline (which is why it is an optional feature).
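
To illustrate the difference between the two kinds of deduplication, here is a minimal Python sketch. It is not the actual shredding code. It assumes that a natural duplicate is identified by the pair `event_id` and `event_fingerprint`, and it uses a hypothetical in-memory `SeenStore` class as a stand-in for the DynamoDB table; all function and class names are made up for this example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Event = Dict[str, str]
Key = Tuple[str, str]


def dedupe_key(event: Event) -> Key:
    # Assumption: natural duplicates share both event_id and event_fingerprint.
    return (event["event_id"], event["event_fingerprint"])


def dedupe_in_batch(events: Iterable[Event]) -> List[Event]:
    """Keep the first copy of each event within a single batch."""
    seen: Set[Key] = set()
    unique: List[Event] = []
    for event in events:
        key = dedupe_key(event)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


class SeenStore:
    """Hypothetical in-memory stand-in for a persistent store such as DynamoDB."""

    def __init__(self) -> None:
        self._keys: Set[Key] = set()

    def put_if_absent(self, key: Key) -> bool:
        # Returns True only the first time a key is recorded,
        # mimicking a conditional write.
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def dedupe_cross_batch(events: Iterable[Event], store: SeenStore) -> List[Event]:
    """Drop in-batch duplicates, then drop events already seen in earlier batches."""
    return [e for e in dedupe_in_batch(events) if store.put_if_absent(dedupe_key(e))]


store = SeenStore()
a = {"event_id": "e1", "event_fingerprint": "f1"}
b = {"event_id": "e2", "event_fingerprint": "f2"}
c = {"event_id": "e3", "event_fingerprint": "f3"}

batch_1 = dedupe_cross_batch([a, dict(a), dict(a), b], store)  # three copies of a
batch_2 = dedupe_cross_batch([dict(a), c], store)              # a arrives again later
```

In this sketch the three retried copies of `a` collapse to one within the first batch, and the late copy of `a` in the second batch is dropped only because the store remembers keys across batches. That persistent state is what the cross-batch option adds, and it is where the extra cost comes from.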