We noticed an issue where quite a few events have almost identical content - the only difference is in dvce_sent_tstamp (and consequently collector_tstamp).
Such events are caught by Snowplow's in-batch deduplication mechanism as synthetic duplicates and assigned new event_id values. These events look normal in the atomic.events table, yet they bring no value and are unwanted on our side.
I’m not sure I fully understand the issue, so let’s start from the beginning to be sure we’re not confusing definitions - sorry for the lengthy response.
Identical events with different collector_tstamp and dvce_sent_tstamp values are usually created due to the absence of exactly-once delivery in the pipeline: the tracker can send an event, fail to get a response from the collector, and re-send it, even though the collector actually received the original event.
It also happens due to third-party software installed on the user’s machine, such as anti-virus or adult-content filters. These work by duplicating the user's HTTP requests: the first request hits the content filter’s webserver, which checks whether the request is “safe” (e.g. not trying to download malware); if it is, the server allows the filter installed on the client machine to re-send the HTTP request to its original destination, which in our case is your collector.
This can also be caused by web scrapers, or by slightly different mechanisms, but the result is basically the same: the same event hits the collector more than once. This is what we call natural duplicates - they are the same event. Synthetic duplicates are also the result of third-party software, but they have different user-set payloads, so they can be different events.
We deal with natural and synthetic duplicates in different ways:
Natural (the ones you’re describing): the Shred job (since R76) simply filters them out. This happens only inside a single batch (that’s why it’s called in-batch deduplication); if a later job encounters a duplicate of an event from a previous batch, it doesn’t know it’s a duplicate, so it will appear in Redshift as well. Since R88 this can be avoided by enabling cross-batch natural deduplication.
Synthetic: since R86 they get assigned a new event_id and have a duplicate context attached. I believe this is the process you’re referring to?
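To make the two mechanisms concrete, here is a rough Python sketch of the in-batch logic as I understand it. This is illustrative only - the function name, dict shape, and field names are my own, not the actual Shred job code:

```python
import uuid

def dedupe_batch(events):
    """In-batch deduplication sketch. Each event is a dict with
    'event_id' and 'event_fingerprint' keys (illustrative only)."""
    seen = set()       # (event_id, event_fingerprint) pairs already kept
    seen_ids = set()   # event_ids already kept
    result = []
    for event in events:
        key = (event['event_id'], event['event_fingerprint'])
        if key in seen:
            # Natural duplicate: same id AND same fingerprint -> drop it
            continue
        if event['event_id'] in seen_ids:
            # Synthetic duplicate: same id, different payload -> keep it,
            # but under a fresh event_id; a duplicate context recording
            # the original id would be attached here
            event = dict(event,
                         original_event_id=event['event_id'],
                         event_id=str(uuid.uuid4()))
        seen.add(key)
        seen_ids.add(event['event_id'])
        result.append(event)
    return result
```

So within one batch, naturals disappear entirely, while synthetics survive with a new id plus a pointer back to the original.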
So, what I don’t understand is why events where only two timestamps differ (which are therefore natural duplicates) remain in atomic.events with a changed event_id (which means they were processed by synthetic deduplication), as you described.
After reading your post once again, I now think your problem is that duplicates remain across batches. I’m still not sure whether the synthetic or the natural ones bother you, but you can fix the latter with the cross-batch natural deduplication introduced in R88, mentioned above. Bear in mind that it has its costs, both in terms of money and complexity. Synthetic cross-batch deduplication isn’t yet available, but from what I know about the shape of the end-result data, the amount of synthetic duplicates across batches is negligible.
You actually described the situation perfectly in the first response.
Currently we are only interested in in-batch deduplication, and the problem is observed in events that have the same etl_tstamp values.
I’m not a Snowplow documentation expert, so I wasn’t sure whether a difference in dvce_sent_tstamp alone meant that events are synthetic duplicates. According to your comment they are natural duplicates (which makes a lot of sense).
Unfortunately, this confuses matters further. The original_event_id in the duplicate context is the same for the problematic events, which means that synthetic deduplication was applied nevertheless…
So when exactly does an event become a synthetic duplicate rather than a natural one? I can only see a timestamp difference. Could this be a bug in Snowplow?
Natural duplicates - events where both the event_ids and the event_fingerprints are identical. These are in fact the same event.
Synthetic duplicates - events where only the event_ids are the same.
event_fingerprint is a property added by the event fingerprint enrichment (I forgot to mention - it is strictly required for deduplication to work correctly). It’s a hash of all client-set fields, which allows us to determine whether payloads are identical.
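Conceptually, the fingerprint is just a stable hash over the client-set fields, excluding the ones that legitimately differ between re-sends. This is a simplified sketch, not the actual enrichment code - the hashing scheme, field names, and excluded fields here are my own illustration:

```python
import hashlib
import json

def event_fingerprint(event, excluded=('eid', 'stm')):
    """Sketch of a fingerprint: a hash over all client-set fields except
    those that differ between re-sends of the same event (here the event
    id 'eid' and device-sent timestamp 'stm'; names are illustrative)."""
    payload = {k: v for k, v in event.items() if k not in excluded}
    return hashlib.md5(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
```

With a scheme like this, two re-sends of the same event get the same fingerprint even though their timestamps differ, which is exactly what makes them detectable as natural duplicates.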
Without that enrichment, synthetic deduplication deals with all duplicates, because the Shred job basically doesn’t know whether event_fingerprint was properly generated by the enrichment or inserted by the Shred job itself. In the latter case it’s a random UUID, which is always unique, so all natural duplicates become synthetic, because their fingerprints never match.
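A small sketch of that effect (again illustrative, not the real code - the classification function and field names are made up):

```python
import uuid

def classify(a, b):
    """Classify two events that share an event_id (sketch)."""
    if a['event_fingerprint'] == b['event_fingerprint']:
        return 'natural'    # identical payloads -> filtered out
    return 'synthetic'      # different payloads -> new event_id + context

# With the enrichment enabled, true re-sends share a fingerprint,
# so they are classified as natural and dropped:
a = {'event_id': 'e1', 'event_fingerprint': 'abc123'}
b = {'event_id': 'e1', 'event_fingerprint': 'abc123'}

# Without it, a random UUID is inserted per event, so even identical
# payloads never match and every duplicate looks synthetic:
c = {'event_id': 'e1', 'event_fingerprint': str(uuid.uuid4())}
d = {'event_id': 'e1', 'event_fingerprint': str(uuid.uuid4())}
```

That would explain exactly the symptom you describe: natural duplicates surviving in atomic.events with new event_ids and a duplicate context attached.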
I agree that having such an important part of the pipeline as deduplication depend on something that can be easily missed doesn’t feel entirely right. To “justify” it I can say that the event fingerprint enrichment is enabled by default, and I would also recommend enabling the other enrichments that are enabled in that directory.
I opened a discussion about it. Feel free to add your ideas if you have any.