Event's id duplicates a lot of times

RockstarAlex · September 15, 2022, 3:32pm

Hello everyone!

We use snowplow for data analytics and we have launched it using terraform, which means that all of the parts are in docker containers. And all of the data is in the PostgreSQL DB.

It has been working for almost a year and recently during data analysis in Events Table we have found that we have a lot of duplicates. The most interesting thing is that these duplicate events have the same EventID and Device Created Tstamp while Collector Tstamp and Etl tstamp are different for different events. And there are quite a lot of them.

We went through configuration and googling but did not find anything useful. Also, despite there are some topics in discourse that already have discussed duplicating, none of them describe appropriate solution.

Did anyone face this problem, or may be someone knows how to handle it? If you need more information that will help describing the topic you may just ask me in comments.

Colm · September 15, 2022, 4:04pm

Hi @RockstarAlex,

This is normal behaviour, and alone doesn’t signify that anything is necessarily wrong - although if you have an unusual amount of duplicates that might signify a problem in tracking (normally we use a metre stick of <1% of event IDs duplicated is acceptable).

The likely explanation for the ones you describe is simply a poor connection from the user’s device. The trackers are loss-averse, so if they send data, and it reaches the collector, but the connection hangs or cuts out before the collector can respond to the tracker, then the tracker will treat this as a failed send, and will attempt it again. This would be fairly common if the device is on the edge of a Wifi network’s range, or is switching between different networks a lot, for example.

There are other possible causes of duplicates such as bot traffic, and browser plugins misbehaving. Here’s a detailed blog explaining the subject, in case that’s useful.

Our various supported data models handle duplicates for you out of the box, so any modeling you do off the back of those will take care of it for you.

If you’re querying atomic data directly, then we usually advise to join on both event ID and collector tstamp to avoid cartesian joins.

If you need a more nuanced strategy, the logic we use in the models is to:

Keep the earliest of any duplicates that don’t share a collector_tstamp
Discard both if they do share a collector tstamp

Typically these strategies won’t skew analysis unless you have an outsized number of duplicates.

I hope that’s helpful!

Edit: I’ve checked the blog post I linked to and it looks like something went wrong with a giant image on there (we recently migrated websites) - I’ve flagged it, if it’s not fixed by the time you see it then my apologies - please bear with us!

Topic		Replies	Views
Duplicate event ids Tracking SDKs	7	4187	October 7, 2019
Bulk Postgres Loader Deduplication	1	922	April 10, 2020
Duplicate "event_id" fields? Tracking SDKs	9	2134	January 5, 2018
De-deduplicating events in Hadoop and Redshift [tutorial] For data modelers & consumers	9	6649	June 23, 2017
Unwanted deduplicated events are getting through Troubleshooting	7	2449	September 29, 2017

Event's id duplicates a lot of times

Related topics