A question we often get is how to deal with duplicates. It’s possible – but rare – for 2 or more events to have the same event ID. This can, in some cases, cause problems in Redshift when an inner or left join becomes a cartesian product.
We have written about this topic a couple of times before, but below is an overview of our latest thinking on duplicates.
1. What is causing events to be duplicated?
Duplicates are possible because the event ID is set in the tracker rather than in the pipeline. We do it this way because it lets us spot unusual events (such as duplicates), but until R76 we didn’t prevent duplicates from being loaded into Redshift.
There are 2 main mechanisms responsible for creating duplicates (for a more detailed explanation, read the first 2 sections of this blog post).
The first reason is that we have designed the pipeline to minimise the risk of data loss. For example:
- Trackers cache each event until confirmation has been received from the collector. If the collector logs the event but the confirmation never reaches the tracker, the tracker will send the event again, and the same event will appear twice in Redshift.
These events are true duplicates: the actual tracker payload that was sent to the collector is identical (fields like etl_tstamp that are set later in the pipeline can still differ, because the events can arrive at different times).
The second reason is that some clients (often bots or spiders with a broken random number generator) fail to generate unique event IDs, so distinct events end up sharing the same ID. These synthetic duplicates (which are not duplicates in the true sense of the word) are less common, but because all events from such a user or bot will have the same event ID, it’s possible to see several thousand events (or more) that all share one event ID.
2. De-duplication process in Hadoop
We released Snowplow R76 with a first version of the de-duplication process that runs on Hadoop. This process runs before events are shredded and stops between 95 and 99% of duplicates from being loaded into Redshift.
The current implementation partitions events on event ID and event fingerprint and deletes all but the first event in each partition. In other words, it de-duplicates all natural duplicates that are in the same batch.
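The actual job runs on Hadoop rather than in SQL, but its in-batch logic can be sketched in SQL terms (events_in_batch is a hypothetical table standing in for one batch of events; the ordering column is an assumption):

```sql
-- Conceptual sketch only: the real implementation runs on Hadoop, not SQL.
-- "events_in_batch" is a hypothetical table representing a single batch.
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id, event_fingerprint  -- natural duplicates share both
      ORDER BY etl_tstamp                       -- any stable ordering will do
    ) AS row_n
  FROM events_in_batch
) AS numbered
WHERE row_n = 1;  -- keep the first event in each partition, drop the rest
```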
It does not:
- de-duplicate across batches
- de-duplicate synthetic duplicates
We are working on 2 more releases that will cover these 2 cases – both will be released later in 2016.
3. How to deal with existing duplicates in Redshift?
The de-duplication process that runs on Hadoop stops most new duplicates from being loaded into Redshift, but it doesn’t remove existing duplicates.
Below are a couple of solutions to deal with existing duplicates.
3.1 Join on both event ID and collector timestamp
Most duplicates have the same event ID but a different collector timestamp (this is true for all kinds of duplicates). The easiest way to get rid of cartesian products is therefore to join tables on both event_id = root_id and collector_tstamp = root_tstamp.
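As a sketch, this is what such a join could look like (atomic.com_example_context_1 is a hypothetical context table name; substitute your own):

```sql
-- Joining a context table to atomic.events on both keys avoids the
-- cartesian product that a join on event ID alone can produce.
-- "atomic.com_example_context_1" is a hypothetical context table name.
SELECT ev.event_id, ev.collector_tstamp, ctx.root_id
FROM atomic.events AS ev
INNER JOIN atomic.com_example_context_1 AS ctx
  ON  ev.event_id = ctx.root_id
  AND ev.collector_tstamp = ctx.root_tstamp;
```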
To count the number of duplicates in each table:
SELECT n, COUNT(*) FROM (SELECT event_id, COUNT(*) AS n FROM ... GROUP BY 1) AS counts GROUP BY 1 ORDER BY 1
This can be used to assess which tables are affected (not all tables have duplicates). It’s also recommended to compare the results before and after updating the join to assess whether this is a good enough solution or not.
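For example, running the query above against atomic.events (any other table can be substituted to check it too):

```sql
-- For each multiplicity n, count how many event IDs occur n times.
-- Any row with n > 1 indicates duplicates in the table.
SELECT
  n,          -- number of rows sharing a single event ID
  COUNT(*)    -- number of event IDs with that multiplicity
FROM (
  SELECT event_id, COUNT(*) AS n
  FROM atomic.events
  GROUP BY 1
) AS counts
GROUP BY 1
ORDER BY 1;
```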
Related post: Dealing with duplicates when joining contexts to atomic.events
3.2 De-duplicate as part of the data modeling process
Loading atomic.events into Redshift is almost never the last step in the pipeline. Most Snowplow users have a data modeling process which transforms and aggregates events into a second set of tables, which are then made available to end users in the business.
If the number of events that are modelled with each run is small enough, it’s possible to deduplicate events before tables are joined. For example, one approach would be to partition on event ID (and/or event fingerprint) and use ROW_NUMBER (documentation) to pick the first event within each partition.
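A minimal sketch of that approach, assuming the run processes a rolling one-day window and writes to a scratch schema (both the WHERE clause and the scratch schema are assumptions; adapt them to your own data modeling process):

```sql
-- Deduplicate this run's events before joining them to other tables.
-- "scratch" and the one-day window are hypothetical; adjust as needed.
CREATE TABLE scratch.events_deduped AS
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id        -- and/or event_fingerprint
      ORDER BY collector_tstamp
    ) AS row_n
  FROM atomic.events
  WHERE collector_tstamp > DATEADD(day, -1, GETDATE())  -- this run's window
) AS numbered
WHERE row_n = 1;  -- keep one row per event ID
```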
3.3 Remove duplicates using SQL
Another solution is to remove duplicates from atomic.events and related tables altogether. We released a set of SQL queries that do this. They can be run once, or after each run (although doing so can be slow).
Here’s how to run them:
- create a duplicates schema
- create a duplicates.events table, identical in structure to atomic.events
The next couple of steps depend on whether all events were enriched with the event fingerprint enrichment or not. If the event fingerprint is never NULL, then it’s sufficient to use 01-events.sql.
If not, the following steps will need to be run twice: once using 01-events.sql, and a second time using 02-events-without-fingerprint.sql. The first will deduplicate all events with a fingerprint; the second will deduplicate the remaining events without using the fingerprint (which is a lot slower – consider resizing the Redshift cluster to speed things up).
The steps in both cases are the same:
- steps (a) to (c) deal with natural duplicates: move all but the first to duplicates.events
- step (d) is an optional step that moves all remaining duplicates to duplicates.events (this removes duplicates with a different fingerprint)
- if running both 01-events.sql and 02-events-without-fingerprint.sql, don’t run step (d) until the second time
De-duplicating unstructured event and context tables requires a bit more work. We published an example to help with this process. Here’s how to use the example:
- create a table in duplicates identical to the one in atomic
- find and replace example_unstruct with the exact table name in 03-example-unstruct.sql
- add in all missing fields, including in the GROUP BY (relevant line)
- run the queries
This will have to be repeated for all tables that have a significant number of duplicates (more than 1% of rows).
Let us know if you have any questions!