[IMPORTANT ALERT] R101 bug may result in duplicated data in the real-time pipeline


Thanks to @asgergb, we have discovered an issue that introduces duplicates at the enrichment stage of the real-time pipeline.

Snowplow R101 Neapolis introduced sharing of the same Kinesis sink across multiple
Amazon Kinesis Client Library RecordProcessors. As a result, the same Kinesis sink is
flushed as many times as there are RecordProcessors, leading to duplicated events whenever
more than one RecordProcessor runs on the same Stream Enrich instance.
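
The mechanism can be illustrated with a toy model. This is not the actual Snowplow code, just a hypothetical sketch showing how N flushes of one shared buffer yield N copies of each event downstream:

```python
# Toy model of the R101 bug (hypothetical, not Snowplow source code):
# multiple RecordProcessors on one instance share a single sink, so the
# shared buffer is re-emitted once per processor.

class SharedSink:
    def __init__(self):
        self.buffer = []   # events buffered for sending
        self.emitted = []  # what actually reaches the enriched stream

    def store(self, event):
        self.buffer.append(event)

    def flush(self):
        # In this toy model the buffer is re-emitted on every flush,
        # mimicking the effect of several processors flushing one sink.
        self.emitted.extend(self.buffer)

sink = SharedSink()
sink.store("event-1")

# Two RecordProcessors on the same Stream Enrich instance each
# trigger a flush of the same shared sink:
for _ in range(2):
    sink.flush()

print(sink.emitted)  # ['event-1', 'event-1']
```

With one RecordProcessor per instance the loop runs once and no duplicate is produced, which is why deployments with one instance per shard are unaffected.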

Who is affected

You are affected by this issue if:

  • you’re running stream-enrich version 0.15.0 or 0.16.0 (shipped with R101 and R103 respectively), and
  • you’re using Kinesis, and
  • you’re running fewer stream-enrich instances than there are shards in the raw stream
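
The three conditions above can be expressed as a small check. The helper name and argument shapes below are hypothetical, chosen only to make the checklist concrete:

```python
def is_affected(version: str, using_kinesis: bool,
                n_instances: int, n_shards: int) -> bool:
    """Hypothetical helper: True if a deployment matches the affected
    configuration (buggy stream-enrich version, Kinesis, and fewer
    instances than raw-stream shards)."""
    return (
        version in {"0.15.0", "0.16.0"}
        and using_kinesis
        and n_instances < n_shards
    )

# Two instances serving a four-shard raw stream on 0.15.0 is affected:
print(is_affected("0.15.0", True, 2, 4))  # True
# One instance per shard is not:
print(is_affected("0.16.0", True, 4, 4))  # False
```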

How to recover

If you only consume your enriched data through a target with an implicit deduplication process (e.g. Elasticsearch storing your enriched events) and/or a target with an explicit deduplication process (e.g. the shred job used to load data into Redshift), you won’t be affected by this issue: the duplicates will be removed implicitly or explicitly.

However, if you’re processing your enriched events directly from S3 you will need to run a deduplication process that discards duplicates (events with the same event id).
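
A minimal sketch of such a deduplication pass, keeping the first occurrence of each event id. The event shape (dicts with an `event_id` field) is an assumption for illustration:

```python
def dedupe(events):
    """Discard events whose event id has already been seen,
    keeping the first occurrence of each."""
    seen = set()
    out = []
    for event in events:
        if event["event_id"] not in seen:
            seen.add(event["event_id"])
            out.append(event)
    return out

# Example input containing a duplicate produced by the bug:
events = [
    {"event_id": "a", "payload": 1},
    {"event_id": "a", "payload": 1},  # duplicate of the first event
    {"event_id": "b", "payload": 2},
]
print([e["event_id"] for e in dedupe(events)])  # ['a', 'b']
```

In practice you would run the equivalent logic in whatever framework processes your S3 files (e.g. a Spark job grouping on event id), but the principle is the same.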

How to avoid this issue

If you are running stream-enrich 0.15.0 or 0.16.0, make sure you run the same number of stream-enrich instances as there are shards in the raw stream.

This issue will be addressed in the upcoming R105.