I am reading the store data and sending the events with different event IDs and different event fingerprints, but the ETL timestamp could be the same. Would these be treated as duplicate events?
@ScalaEnthu, there is nothing wrong with having the same ETL timestamp - it simply indicates the events were processed in the pipeline at the same time (same batch). Different event IDs and payloads (event fingerprints) mean they are different events.
When talking about duplicates, it is important to distinguish between what we call natural and synthetic duplicates. They have different causes.
Natural duplicates are most frequently a byproduct of the tracker re-sending events when it has failed to receive confirmation that they have reached the collector. This is done to minimise the risk of data loss. The result could be events with the same event_id and the same payload (event_fingerprint) but with a different collector_tstamp.
A similar result could take place in the real-time pipeline itself due to its at-least-once processing semantics. Again, the "at-least-once" processing is deployed to eliminate data loss.
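To make that concrete, here is a minimal Scala sketch of removing natural duplicates. The `Event` case class is hypothetical, though its field names mirror the atomic `event_id`, `event_fingerprint` and `collector_tstamp` fields: events sharing both ID and fingerprint are collapsed down to their earliest occurrence.

```scala
import java.time.Instant

// Hypothetical, simplified view of an enriched event; the field names
// mirror Snowplow's atomic fields, but this class is purely illustrative.
final case class Event(
  eventId: String,          // event_id
  eventFingerprint: String, // event_fingerprint (hash of the payload)
  collectorTstamp: Instant  // collector_tstamp
)

// Natural duplicates share both event_id and event_fingerprint; only the
// collector timestamp differs. Grouping on that pair and keeping the
// earliest occurrence removes them.
def dedupeNatural(events: Seq[Event]): Seq[Event] =
  events
    .groupBy(e => (e.eventId, e.eventFingerprint))
    .values
    .map(_.minBy(_.collectorTstamp.toEpochMilli))
    .toSeq
```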
Synthetic duplicates are events that have the same event_id but different payloads. In other words, these are not duplicate events (the payload - the event_fingerprint - differs), but rather collisions in the UUID used for the event_id field.
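Synthetic duplicates cannot be collapsed the same way, since the payloads genuinely differ. One common remedy, similar in spirit to Snowplow's SQL deduplication approach, is to keep every distinct payload but mint a fresh UUID for all but one of the colliding rows, so that event_id becomes unique again. A sketch, reusing the hypothetical `Event` class from above:

```scala
import java.util.UUID

// Within each event_id group, distinctBy first collapses any exact
// (natural) duplicates; the remaining rows are genuinely different events
// colliding on one UUID, so all but the first get a new event_id.
def resolveSynthetic(events: Seq[Event]): Seq[Event] =
  events
    .groupBy(_.eventId)
    .values
    .flatMap { group =>
      val distinct = group.distinctBy(_.eventFingerprint)
      distinct.head +: distinct.tail.map(e =>
        e.copy(eventId = UUID.randomUUID().toString)
      )
    }
    .toSeq
```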
Thus, in summary, there are two reasons for duplicated events:

- The client-side environment causes events to be sent with the same ID
- Events are sometimes duplicated within the Snowplow pipeline itself
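Putting this together, and to answer the original question directly: whether two events are duplicates depends only on event_id and event_fingerprint; etl_tstamp plays no part. A small classification sketch (again reusing the hypothetical `Event` class):

```scala
// Possible relations between a pair of events, per the definitions above.
sealed trait Relation
case object NaturalDuplicate   extends Relation // same id, same payload
case object SyntheticDuplicate extends Relation // same id, different payload
case object Distinct           extends Relation // different ids

def classify(a: Event, b: Event): Relation =
  if (a.eventId != b.eventId) Distinct
  else if (a.eventFingerprint == b.eventFingerprint) NaturalDuplicate
  else SyntheticDuplicate
```

With different event IDs and different fingerprints, as in your case, `classify` would return `Distinct` regardless of the ETL timestamp, so nothing would be deduplicated.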