In the pipeline I’m running as a demo, duplicate event IDs are consistently coming through into storage (the rows are completely identical). The duplicates don’t all appear immediately; they keep accumulating over time in my storage target (BigQuery). Since these records share the same etl timestamp, collector timestamp, device timestamp, event ID, etc., I’m assuming I’ve misconfigured something in my pipeline.
Has anyone come across this before? (Just checking whether it’s something obvious before I look into each step separately.)
—UPDATE:
A possible cause is that the acknowledgement deadline on the subscription to the ‘good’ Pub/Sub topic was too high. Testing this now.
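For anyone checking the same thing, this is roughly how I’m inspecting and lowering the deadline with the Pub/Sub Python client (v2). The project and subscription names below are placeholders for my own, and 60 seconds is just an example value:

```python
from google.cloud import pubsub_v1
from google.protobuf import field_mask_pb2

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project/subscription names; the ack deadline lives on the
# loader's subscription, not on the topic itself.
subscription_path = subscriber.subscription_path("my-project", "bq-loader-good-sub")

# Inspect the current acknowledgement deadline.
subscription = subscriber.get_subscription(request={"subscription": subscription_path})
print(f"current ack_deadline_seconds: {subscription.ack_deadline_seconds}")

# Lower it (example value only) and update just that one field.
updated = pubsub_v1.types.Subscription(
    name=subscription_path, ack_deadline_seconds=60
)
update_mask = field_mask_pb2.FieldMask(paths=["ack_deadline_seconds"])
subscriber.update_subscription(
    request={"subscription": updated, "update_mask": update_mask}
)
```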
Indeed that seems like a likely candidate. Glad you found the issue.
In case people find this post in future and that’s not the issue for them, I might as well post what I was mid-way through typing:
The pipeline works under at-least-once semantics, so some duplicate events are expected.
Normally you’d see a small proportion of events (usually less than 1%) duplicated because of the trackers’ cache/retry mechanisms or similar client-side factors. Those duplicates would have different collector_tstamps, however.
In your case, it’s likely the loader: the etl_tstamp is generated at the enrich stage, so if the duplicates had been produced before then, we would see different etl_tstamps.
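A quick way to check which case you’re in is to look at whether the duplicated event_ids also share their timestamps. This is a minimal sketch using the BigQuery Python client; the `my-project.snowplow.events` table name is a placeholder for your own events table:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table name; point this at your own events table.
QUERY = """
SELECT
  event_id,
  COUNT(*) AS copies,
  COUNT(DISTINCT collector_tstamp) AS distinct_collector_tstamps,
  COUNT(DISTINCT etl_tstamp) AS distinct_etl_tstamps
FROM `my-project.snowplow.events`
GROUP BY event_id
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 100
"""

for row in client.query(QUERY).result():
    print(row.event_id, row.copies,
          row.distinct_collector_tstamps, row.distinct_etl_tstamps)
```

If the duplicates show more than one collector_tstamp, they most likely came from the tracker side; if every copy shares both timestamps, the duplication happened after enrich, which points at the loader.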
The loader retries on failure. Sometimes a row is inserted successfully but the insert doesn’t return a successful acknowledgement, so the loader retries it and writes a duplicate. I’ve seen this happen when a BigQuery quota is reached (the insert returns a quota-exceeded error even though the row was actually written), and when the loader is running too close to or above its memory limit (in which case you’d see odd behaviour all round).
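If you want to clear out the duplicates already sitting in BigQuery while you fix the root cause, something like the sketch below works. The project/dataset/table names are placeholders, and it assumes (as in your case) that the duplicated rows are completely identical, so keeping any one copy per event_id is safe:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; this writes a deduplicated copy rather than modifying
# the original table in place.
DEDUP_SQL = """
CREATE OR REPLACE TABLE `my-project.snowplow.events_deduped` AS
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY collector_tstamp) AS row_num
  FROM `my-project.snowplow.events`
)
WHERE row_num = 1
"""

client.query(DEDUP_SQL).result()
```

Note that a plain CREATE OR REPLACE like this won’t preserve any partitioning or clustering on the original table, so adjust the DDL to your schema before running it for real.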
I’ve cut out the rest of what I was going to say since you’ve found the likely culprit. I hope your demo pipeline produces fruitful results!