Duplicated events (same event id) - continuing over time

In the pipeline I’m running as a demo, duplicate event IDs are consistently coming through into storage, and the duplicated lines are completely identical. Not all of the duplicates appear immediately; they keep accumulating over time in my storage location (BigQuery). Since all these records have the same etl timestamp, collector timestamp, device timestamp, event ID, etc., I’m assuming I’ve misconfigured something in my pipeline.

Has anyone come across this before? (Just checking whether it’s an obvious issue before I look into each step separately.)

My pipeline looks like:

JS Tracker -> Scala Collector (single instance) -> Beam Enrich -> dockerized BigQuery Loader (single instance)

R

UPDATE:
A possible cause is that the acknowledgement deadline on the subscription for the ‘good’ Pub/Sub topic was too high. Testing this now.
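
For anyone checking the same thing, this is roughly how I’m inspecting and changing the deadline, as a sketch using the Pub/Sub Python client; the project/subscription names and the 60-second value are placeholders, not recommendations:

```python
# Sketch: inspect and update the ack deadline on the subscription the loader
# reads from. Project/subscription names and the new value are placeholders.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "enriched-good-sub")

# Current deadline.
sub = subscriber.get_subscription(request={"subscription": sub_path})
print("ack deadline:", sub.ack_deadline_seconds, "seconds")

# Apply a new deadline (only the ack_deadline_seconds field is changed).
new_sub = pubsub_v1.types.Subscription(name=sub_path, ack_deadline_seconds=60)
subscriber.update_subscription(
    request={"subscription": new_sub, "update_mask": {"paths": ["ack_deadline_seconds"]}}
)
```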

UPDATE 2:
I think this was it…

> UPDATE:
> A possible cause is that the acknowledgement deadline on the subscription for the ‘good’ Pub/Sub topic was too high. Testing this now.

Indeed, that seems like a likely candidate. Glad you found the issue.

In case people find this post in future and that’s not the issue for them, I might as well post what I was mid-way through typing:

The system works under at-least-once semantics, so some duplicate events are to be expected.

Normally you would expect a small proportion of events (usually less than 1%) to be duplicated because of the trackers’ cache/retry mechanisms or similar client-side factors. These would have different collector_tstamps, however.

In your case, it’s likely the loader: the etl_tstamp is generated at the enrich stage, so if the duplicates were produced before that point, we would see different etl_tstamps.
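
If it helps anyone narrow this down, a query along these lines (a rough sketch using the BigQuery Python client; the table name is a placeholder) shows whether duplicated event_ids differ in collector_tstamp (client/collector-side duplication), differ only in etl_tstamp (introduced before the loader), or are completely identical (pointing at the loader):

```python
# Sketch: for each duplicated event_id, count distinct timestamps to see at
# which stage the copies diverge. The table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  event_id,
  COUNT(*) AS copies,
  COUNT(DISTINCT collector_tstamp) AS distinct_collector_tstamps,
  COUNT(DISTINCT etl_tstamp) AS distinct_etl_tstamps
FROM `my-project.snowplow.events`  -- placeholder table
GROUP BY event_id
HAVING COUNT(*) > 1
ORDER BY copies DESC
"""

for row in client.query(sql).result():
    # Identical timestamps across copies point at the loader; differing
    # collector_tstamps point at tracker/collector-level retries.
    print(row.event_id, row.copies,
          row.distinct_collector_tstamps, row.distinct_etl_tstamps)
```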

The loader will retry on failure. Sometimes a row is inserted successfully but the insert doesn’t return a successful acknowledgment, so the retry inserts it again. I’ve seen this happen when a BigQuery quota is reached (the insert returns a quota-exceeded failure even though the row didn’t actually fail to insert), and when the loader is operating too close to or above its max memory (in which case weird behaviour would be observed all round).
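
Purely as an illustration of that failure mode (this is not how the BigQuery Loader is implemented internally), here’s a sketch with the BigQuery Python client: if the success of a streaming insert never makes it back to the caller and the insert is re-run, the same event lands as a second identical row, unless a deterministic insert ID is supplied for BigQuery’s best-effort de-duplication. The table name and the use of event_id as the insert ID are assumptions for the sketch.

```python
# Illustration only: a naive per-event streaming insert. If this call is
# retried or the message is redelivered after the row already landed, the
# event is inserted again unless a deterministic insertId is provided.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.snowplow.events"  # placeholder

def insert_event(event: dict) -> None:
    errors = client.insert_rows_json(
        table_id,
        [event],
        # Using the event_id as the streaming insertId lets BigQuery
        # best-effort de-duplicate a retried insert of the same event
        # instead of storing an extra identical row.
        row_ids=[event["event_id"]],
    )
    if errors:
        raise RuntimeError(f"insert failed: {errors}")
```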


I’ve cut out the rest of what I was going to say since you’ve found the likely culprit. I hope your demo pipeline produces fruitful results!
