Deduping events at collector/enricher level in stream

Vinuthna_Gaddipati · August 19, 2019, 10:54pm

Hi,

I am using Snowplow collectors to write to a Kinesis sink, and enrichers to write to a Kinesis pipeline. I see that for various reasons (network issues, latencies, safeguards against data loss), the collectors/enrichers retry writing events to the shards, thereby producing exact duplicate events in the S3 folders and Redshift tables (same event_transaction id, collector timestamp and record insert timestamp as well).

After various checks, we know it's coming from the collector end and not the client end. We also see that the same event usually appears 3 times, and that's the number of retries the collectors are configured for by default.

Is there a place and a way to remove these duplicates in stream, or in the enrichers, before writing to S3? We don’t want to handle it at the table level, as that is going to be expensive at our end. Any suggestions would be really helpful.

Thanks,
Vinuthna G

ihor · August 19, 2019, 11:33pm

@Vinuthna_Gaddipati, indeed, some events can be duplicated in the pipeline itself because of the at-least-once processing logic, which ensures no event is lost.

Loading data into Redshift means you have shredding in place, and that’s where deduplication takes place. In-batch deduplication has been done for you since the R86 release, but to combat cross-batch natural duplicates you need to enable cross-batch deduplication yourself, as per the R88 release. It comes at extra AWS cost because it adds DynamoDB to the pipeline, which is why it is an optional feature.
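
To make the difference between the two modes concrete, here is a rough Python sketch of the idea. It is not the shredder's actual code. It assumes that natural duplicates share both `event_id` and `event_fingerprint`, and `SeenStore`, `dedupe_in_batch` and `dedupe_cross_batch` are hypothetical names, with `SeenStore` being an in-memory stand-in for the DynamoDB table that remembers events across batches.

```python
from typing import Dict, Iterable, List, Set, Tuple

Event = Dict[str, str]
Key = Tuple[str, str]


def dedupe_key(event: Event) -> Key:
    # Assumption: natural duplicates share both of these fields.
    return (event["event_id"], event["event_fingerprint"])


def dedupe_in_batch(events: Iterable[Event]) -> List[Event]:
    """Keep the first occurrence of each key within a single batch."""
    seen: Set[Key] = set()
    unique: List[Event] = []
    for event in events:
        key = dedupe_key(event)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


class SeenStore:
    """Hypothetical in-memory stand-in for an external key store."""

    def __init__(self) -> None:
        self._keys: Set[Key] = set()

    def put_if_absent(self, key: Key) -> bool:
        # Returns True only the first time a key is recorded.
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def dedupe_cross_batch(events: Iterable[Event], store: SeenStore) -> List[Event]:
    """Drop duplicates within the batch, then any seen in earlier batches."""
    return [
        event
        for event in dedupe_in_batch(events)
        if store.put_if_absent(dedupe_key(event))
    ]
```

With in-batch deduplication only, an event retried three times in one batch collapses to one row, but a copy landing in a later batch still gets loaded. The shared store is what catches that second case, and it is the part that adds cost.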
