Setup: AWS infrastructure. After validation, every event goes to two places: one copy lands in Elasticsearch, where it is queried via Kibana; another copy goes to S3, where a batch-processing job writes the events into Redshift.
Application: Browser website
There are situations where an event with a given event_id appears in Kibana but not in Redshift; instead, Redshift has another event with exactly the same attributes but a different event_id.
All the timestamps (dvce_created_tstamp, dvce_sent_tstamp, collector_tstamp) in Kibana are offset by +2 hours compared to what is seen in S3 and Redshift.
Redshift has groups of events, let’s say 7 per group, all of which have the same attributes and the same dvce_created_tstamp but different dvce_sent_tstamps and therefore different collector_tstamps. In other words, an event that is created once is sent multiple times and collected multiple times, yet all the copies have different event_ids. For these near-duplicates, Kibana has:
for some users the last duplicate with the same event_id
for others the last duplicate with a different event_id. This different event_id is nowhere to be found in Redshift
and for other users, an event that was sent and collected later than the last duplicate seen in Redshift, again with a different event_id that appears nowhere in Redshift.
Although Kibana and Redshift (and S3) receive events from the same validation pipeline, Redshift (and S3) contains many near-duplicates while Kibana does not, and Kibana has some events that Redshift (and S3) does not.
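To illustrate what I mean by “same attributes, different event_id”, here is a minimal Python sketch of how such groups could be detected in an export of events. The field names and the choice of “volatile” fields are assumptions for illustration, not our actual pipeline code:

```python
import hashlib
from collections import defaultdict

# Fields we expect to legitimately differ when the tracker re-sends an event.
VOLATILE = ("event_id", "dvce_sent_tstamp", "collector_tstamp")

def payload_fingerprint(event, exclude=VOLATILE):
    """Hash every attribute except those expected to change between re-sends."""
    body = "|".join(f"{k}={event[k]}" for k in sorted(event) if k not in exclude)
    return hashlib.md5(body.encode("utf-8")).hexdigest()

def find_duplicate_groups(events):
    """Return {fingerprint: [event_ids]} for payloads that occur more than once."""
    groups = defaultdict(list)
    for ev in events:
        groups[payload_fingerprint(ev)].append(ev["event_id"])
    return {fp: ids for fp, ids in groups.items() if len(ids) > 1}
```

Running this over a day’s extract is how the 7-member groups described above show up: one fingerprint, many event_ids.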
RDB Shredder is being used, albeit an older version (0.15.0) that could be updated, but event_fingerprint was never enabled; only user_fingerprint is. Could this be a valid cause of faulty deduplication?
And the derived_tstamp also has an offset of 2 hrs.
Hi @Hasan_Shaukat, the lack of an event_fingerprint can indeed break deduplication.
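For reference, enabling it means adding the event fingerprint enrichment to your enrichments directory, along these lines (the schema version and excluded parameters shown here are from memory and may differ for your pipeline version, so do double-check against the enrichment docs):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/event_fingerprint_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "event_fingerprint_config",
    "enabled": true,
    "parameters": {
      "excludeParameters": ["eid", "stm"],
      "hashAlgorithm": "MD5"
    }
  }
}
```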
You can read up on why and how duplicates are created in detail here. To give you some brief context, though: when talking about duplicates, it is important to distinguish between what we call natural and synthetic duplicates, because each group is dealt with in a different way.
Natural duplicates are most frequently a byproduct of the tracker re-sending events when it has failed to receive confirmation that they reached the collector. This is done to minimise the risk of data loss. The result can be events with the same event_id and the same payload but different collector_tstamps. The solution is to identify these duplicates and keep only one of them.
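A minimal sketch of that natural-dedup step, assuming events as Python dicts (the field names and MD5 hashing are illustrative; the real deduplication happens inside RDB Shredder):

```python
import hashlib

# Fields that legitimately differ between re-sends of the same event.
VOLATILE = ("event_id", "dvce_sent_tstamp", "collector_tstamp")

def fingerprint(event):
    """Hash the payload, ignoring fields that vary between re-sends."""
    body = "|".join(f"{k}={v}" for k, v in sorted(event.items()) if k not in VOLATILE)
    return hashlib.md5(body.encode("utf-8")).hexdigest()

def dedupe_natural(events):
    """For events sharing both event_id and payload, keep only the earliest-collected copy."""
    kept = {}
    for ev in events:
        key = (ev["event_id"], fingerprint(ev))
        if key not in kept or ev["collector_tstamp"] < kept[key]["collector_tstamp"]:
            kept[key] = ev
    return list(kept.values())
```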
Synthetic duplicates are events that have the same event_id but different payloads. In other words, these are not duplicate events (the payload is different) but rather collisions in the UUID used for the event_id field. The solution for these duplicates is usually to assign them a new, unique event_id.
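And a corresponding sketch for synthetic duplicates, under the same illustrative assumptions: keep the first payload seen under an event_id and mint a fresh UUID for any different payload that reuses it.

```python
import hashlib
import uuid

# Fields that legitimately differ between re-sends of the same event.
VOLATILE = ("event_id", "dvce_sent_tstamp", "collector_tstamp")

def fingerprint(event):
    body = "|".join(f"{k}={v}" for k, v in sorted(event.items()) if k not in VOLATILE)
    return hashlib.md5(body.encode("utf-8")).hexdigest()

def dedupe_synthetic(events):
    """If one event_id covers several distinct payloads, mint fresh UUIDs
    for every payload other than the first one seen under that id."""
    first_fp = {}  # event_id -> fingerprint of the first payload seen with that id
    out = []
    for ev in events:
        fp = fingerprint(ev)
        canonical = first_fp.setdefault(ev["event_id"], fp)
        if fp != canonical:
            ev = {**ev, "event_id": str(uuid.uuid4())}  # UUID collision: relabel
        out.append(ev)
    return out
```

This is why, in your data, the relabelled event_id in Kibana or Redshift would be “nowhere to be found” on the other side: it was generated during deduplication, not by the tracker.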
You can have both types of duplicates in the same batch of events (i.e. the same EMR run) or across batches. Also, the mechanisms that create duplicates can work in tandem, so a user can end up with both kinds for the same event. For example, if a bot sends 50,000 events with the same event_id (synthetic duplicates), a portion of them might get sent more than once, creating natural duplicates.
You’ll notice I’ve been talking about ‘the same payload’ or ‘different payloads’. The way to compare payloads is via the event_fingerprint, which is what makes it so significant. Without it, deduplication is not reliable and can produce exactly the symptoms you are describing.
The events that seem to be duplicates but carry different event_ids are likely synthetic duplicates that were assigned new event IDs. The ones that seem to have different timestamps are likely natural duplicates of which only one was kept. And some might have been both synthetic and natural at once.