Thanks Colm for your quick response. Any clue why we could receive more events than we posted?
Possibly network latency? “Network latency can cause events to be resent, resulting in duplicates. You can reduce this by optimizing your network settings or moving your Snowplow collector and enrichment closer to your data sources.”
I’ve checked, and all of the events have a different event_id. But I read that if one is not provided, enrichment adds it.
I’m ruling out the tracker, since we are using a tool to post a lot of events.
If you’re not using a tracker then that specific point about network latency wouldn’t apply, since the trackers’ retry mechanism is what’s responsible there.
My gut tells me it’s something to do with how you’re sending through the data. Some questions to help debug:
What tool are you using, and what’s the setup there?
Are you getting a lot of failure responses from the collector?
How many events have arrived in the PubSub raw stream for the period of time in question? (I think you should be able to find this out via the GCP console UI for PubSub.)
Indeed if you haven’t set an event ID before sending the data, one will be generated for you in enrich.
And just to check some assumptions: when you horizontally scaled out, you deployed more instances within the same job rather than a separate job, and there’s only one subscription on the topic, correct?
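On the event ID point above: if the IDs are generated in enrich, then distinct event_id values on their own don't rule out the same payload having been sent twice. A rough way to check is to compare events on their content instead. The snippet below is only a sketch under assumptions: it assumes the events have been exported as Python dicts, the `fingerprint` and `find_duplicates` helpers are made up for illustration, and the set of ignored fields is a guess at what gets assigned downstream of the sender, so adjust it to your data.

```python
import hashlib
import json
from collections import defaultdict

# Assumption: these fields are assigned after the event leaves the sender,
# so they differ even when the same payload is posted twice.
PIPELINE_ASSIGNED = {"event_id", "collector_tstamp", "etl_tstamp"}


def fingerprint(event: dict) -> str:
    """Hash the event content, ignoring pipeline-assigned fields (hypothetical helper)."""
    payload = {k: v for k, v in event.items() if k not in PIPELINE_ASSIGNED}
    # sort_keys gives a stable serialisation regardless of key order
    canonical = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(events: list) -> dict:
    """Return {fingerprint: [event_id, ...]} for payloads seen more than once."""
    groups = defaultdict(list)
    for event in events:
        groups[fingerprint(event)].append(event.get("event_id"))
    return {fp: ids for fp, ids in groups.items() if len(ids) > 1}
```

If `find_duplicates` returns groups with several event IDs each, that would point towards the sending side posting the same payload more than once, rather than something downstream multiplying events.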