Summary
Thanks to @asgergb, we have discovered an issue that introduces duplicates at the enrichment stage of the real-time pipeline.
Snowplow R101 Neapolis introduced sharing the same Kinesis sink across multiple Amazon Kinesis Client Library RecordProcessors. As a result, the same Kinesis sink was flushed as many times as there were RecordProcessors, which led to duplicated events whenever more than one RecordProcessor was running on the same Stream Enrich instance.
Who is affected
You are affected by this issue if:
- you’re using stream-enrich version 0.15.0 or 0.16.0 (R101 and R103 respectively), and
- you’re using Kinesis, and
- you’re running fewer stream-enrich instances than there are shards in the raw stream
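The three conditions above can be sketched as a small check. This is only an illustration: the function name and parameters are hypothetical, and you would supply the instance and shard counts from your own deployment.

```python
# Versions named in this notice as carrying the issue.
AFFECTED_VERSIONS = {"0.15.0", "0.16.0"}


def is_affected(version: str, uses_kinesis: bool, instances: int, shards: int) -> bool:
    """Return True when all three conditions from this notice hold.

    `instances` is the number of running stream-enrich instances and
    `shards` is the number of shards in the raw stream (both supplied
    by the caller; this helper is hypothetical, not part of Snowplow).
    """
    return version in AFFECTED_VERSIONS and uses_kinesis and instances < shards


if __name__ == "__main__":
    # Two instances reading a four-shard raw stream on 0.15.0: affected.
    print(is_affected("0.15.0", True, instances=2, shards=4))
    # One instance per shard: not affected.
    print(is_affected("0.15.0", True, instances=4, shards=4))
```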
How to recover
If you only consume data from a target with an implicit deduplication process (e.g. Elasticsearch storing your enriched events) and/or a target with an explicit deduplication process (e.g. the shred job that loads data into Redshift), you won’t be affected by this issue, as the duplicates will be removed implicitly or explicitly.
However, if you’re processing your enriched events directly from S3 you will need to run a deduplication process that discards duplicates (events with the same event id).
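As a rough illustration of such a deduplication pass, the sketch below keeps the first occurrence of each event id in a stream of enriched-event lines. It assumes the enriched events are tab-separated and that the event id sits at field index 6; verify that position against your own data before relying on it. The function name is hypothetical, and the in-memory set only suits volumes that fit in memory.

```python
from typing import Iterable, Iterator

# Assumption: event_id is the 7th tab-separated field of an enriched event.
# Check this against your own enriched data before use.
EVENT_ID_INDEX = 6


def dedupe_enriched(
    lines: Iterable[str], event_id_index: int = EVENT_ID_INDEX
) -> Iterator[str]:
    """Yield each line whose event id has not been seen before.

    Lines too short to contain an event id are passed through unchanged
    so that nothing is silently dropped.
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= event_id_index:
            yield line  # no event id to compare; keep the line as-is
            continue
        event_id = fields[event_id_index]
        if event_id in seen:
            continue  # duplicate of an event already emitted
        seen.add(event_id)
        yield line


if __name__ == "__main__":
    import sys

    # Example: python dedupe.py < enriched.tsv > deduplicated.tsv
    for kept in dedupe_enriched(sys.stdin):
        sys.stdout.write(kept)
```

For larger volumes, the same idea (group by event id, keep one row per group) would need to run in a batch framework rather than in a single process.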
How to avoid this issue
If you are running stream-enrich 0.15.0 or 0.16.0, make sure you run as many stream-enrich instances as there are shards in the raw stream, so that each instance runs only a single RecordProcessor.
This issue will be addressed in the upcoming R105.