[IMPORTANT ALERT] R101 bug may result in duplicated data in the real-time pipeline

Summary

Thanks to @asgergb, we have discovered an issue that introduces duplicates at the enrichment stage of the real-time pipeline.

Snowplow R101 Neapolis introduced sharing a single Kinesis sink across multiple
Amazon Kinesis Client Library RecordProcessors. As a result, the same Kinesis sink is
flushed once per RecordProcessor, which leads to duplicated events whenever
more than one RecordProcessor is running on the same Stream Enrich instance.

Who is affected

You are affected by this issue if:

  • you’re using stream-enrich version 0.15.0 or 0.16.0 (R101 and R103 respectively) and
  • you’re using Kinesis and
  • you’re running fewer stream-enrich instances than there are shards in the raw stream

How to recover

If you only consume data from a target with an implicit deduplication process (e.g. Elasticsearch storing your enriched events) and/or a target with an explicit deduplication process (e.g. the shred job that runs before data is loaded into Redshift), you won’t be affected by this issue, as the duplicates will be removed implicitly or explicitly.

However, if you’re processing your enriched events directly from S3, you will need to run a deduplication process that discards duplicates (events with the same event ID).
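
As a rough illustration of such a deduplication pass, here is a minimal Python sketch. It assumes the enriched events are tab-separated lines and that the event ID is the seventh field (index 6); check that assumption against your own data before relying on it. The function name `deduplicate` and the constant `EVENT_ID_INDEX` are made up for this example and are not part of Snowplow.

```python
from typing import Iterable, Iterator

# Assumption: in the enriched event TSV, event_id is the 7th field (index 6).
EVENT_ID_INDEX = 6


def deduplicate(lines: Iterable[str], event_id_index: int = EVENT_ID_INDEX) -> Iterator[str]:
    """Yield each enriched event line once, keeping the first line seen per event ID."""
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= event_id_index:
            # Line is too short to contain an event ID: pass it through untouched.
            yield line
            continue
        event_id = fields[event_id_index]
        if event_id in seen:
            # Same event ID already emitted: discard this copy.
            continue
        seen.add(event_id)
        yield line
```

For example, `sys.stdout.writelines(deduplicate(sys.stdin))` would filter a file piped through standard input. The sketch keeps every event ID it has seen in memory, so it only suits volumes that fit on one machine; larger datasets would need a distributed job instead.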

How to avoid this issue

If you are running stream-enrich 0.15.0 or 0.16.0, you can avoid this issue by running the same number of stream-enrich instances as there are shards in the raw stream.
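
To check whether a deployment meets that condition, something like the following Python sketch may help. It assumes you pass in a boto3 Kinesis client (e.g. `boto3.client("kinesis")`) and already know how many stream-enrich instances you run. The function name `is_at_risk` and the stream name in the usage note are hypothetical.

```python
def is_at_risk(kinesis_client, raw_stream_name: str, enrich_instances: int) -> bool:
    """Return True if there are fewer stream-enrich instances than open shards.

    kinesis_client is expected to behave like boto3.client("kinesis").
    """
    summary = kinesis_client.describe_stream_summary(StreamName=raw_stream_name)
    open_shards = summary["StreamDescriptionSummary"]["OpenShardCount"]
    # Fewer instances than shards means at least one instance handles several
    # shards, and therefore runs several RecordProcessors.
    return enrich_instances < open_shards
```

Calling `is_at_risk(boto3.client("kinesis"), "my-raw-stream", 2)` against a raw stream with four open shards would return `True`, meaning you would need to scale out to four instances.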

This issue will be addressed in the upcoming R105.
