We use Snowplow for data analytics, deployed with Terraform, so all of the components run in Docker containers and all of the data ends up in a PostgreSQL database.
It has been working for almost a year, but recently, while analysing the events table, we found a lot of duplicates. The most interesting thing is that these duplicate events share the same event_id and dvce_created_tstamp, while the collector_tstamp and etl_tstamp differ between the copies. And there are quite a lot of them.
We went through the configuration and did some googling but did not find anything useful. There are some topics on Discourse that already discuss duplication, but none of them describe an appropriate solution.
Has anyone faced this problem, or does someone know how to handle it? If you need more information to help describe the issue, just ask me in the comments.
This is normal behaviour, and alone doesn't signify that anything is necessarily wrong, although an unusual number of duplicates might point to a problem in tracking (as a yardstick, we consider <1% of event IDs duplicated acceptable).
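If you want to check where you stand against that yardstick, here is a minimal sketch against the standard atomic.events table (the schema and table names are the Snowplow defaults; adjust them to your deployment):

```sql
-- Percentage of event_ids that occur more than once in atomic.events.
-- A figure comfortably under 1% is usually nothing to worry about.
WITH counts AS (
    SELECT event_id, COUNT(*) AS n
    FROM atomic.events
    GROUP BY event_id
)
SELECT
    100.0 * COUNT(*) FILTER (WHERE n > 1) / COUNT(*)
        AS pct_event_ids_duplicated
FROM counts;
```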
The likely explanation for the ones you describe is simply a poor connection from the user's device. The trackers are loss-averse: if they send data and it reaches the collector, but the connection hangs or cuts out before the collector can respond, the tracker treats this as a failed send and tries again. This is fairly common when a device is on the edge of a Wi-Fi network's range, or is switching between networks a lot, for example.
Our various supported data models handle duplicates out of the box, so any modeling you do on top of those will take care of it for you.
If you’re querying atomic data directly, we usually advise joining on both event_id and collector_tstamp to avoid cartesian joins.
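For example, when joining the events table to a shredded context table, keying on both columns keeps the join one-to-one (a sketch only: atomic.my_context here is a hypothetical table, though root_id/root_tstamp follow the usual shredded-table convention):

```sql
-- Joining on event_id alone would fan out wherever an event_id is
-- duplicated; adding collector_tstamp restores a one-to-one match.
SELECT
    e.event_id,
    e.collector_tstamp,
    c.*
FROM atomic.events AS e
JOIN atomic.my_context AS c
  ON  c.root_id     = e.event_id
  AND c.root_tstamp = e.collector_tstamp;
```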
If you need a more nuanced strategy, the logic we use in the models is to (see the SQL sketch after this list):
Keep the earliest of any duplicates that don't share a collector_tstamp
Discard all copies if they do share a collector_tstamp
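A minimal SQL sketch of that logic (assuming the standard atomic.events columns; this mirrors the approach rather than quoting the models verbatim):

```sql
WITH distinct_tstamps AS (
    -- Copies that share both event_id and collector_tstamp are
    -- indistinguishable, so drop every such copy outright.
    SELECT *
    FROM (
        SELECT
            *,
            COUNT(*) OVER (
                PARTITION BY event_id, collector_tstamp
            ) AS n_same_tstamp
        FROM atomic.events
    ) t
    WHERE n_same_tstamp = 1
),
ranked AS (
    -- Among the remaining copies, keep the earliest per event_id.
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY collector_tstamp
        ) AS rn
    FROM distinct_tstamps
)
SELECT *
FROM ranked
WHERE rn = 1;
```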
Typically these strategies won’t skew analysis unless you have an outsized number of duplicates.
I hope that’s helpful!
Edit: I’ve checked the blog post I linked to and it looks like something went wrong with a giant image on there (we recently migrated websites). I’ve flagged it; if it’s not fixed by the time you see it then my apologies - please bear with us!