GCP enrichment scale up considerations

Hi everyone,
We have been using Snowplow for a while in GCP, and we already have 6 instances (with autoscaling) of snowplow/snowplow-enrich-pubsub:3.0.1.

We are doing some stress testing: we submitted 5000 events, but we are receiving more events than we sent.

Should the enricher scale vertically or horizontally? Should I configure anything? I assume Pub/Sub takes care of distributing messages to the consumers so that the same message is not consumed twice.

Thank you

Horizontal scaling suits Enrich best. In Enrich itself you don’t need to configure anything for this to work; you just need to configure the infrastructure itself to scale out.

You are correct in that the PubSub subscriber is responsible for managing the distribution of messages without reusing the same ones (along with how we ack messages in the codebase).

Hope that helps!

Thanks Colm for your quick response. Any clue why we could be receiving more events than we posted?

Could it be network latency? “Network latency can cause events to be resent, resulting in duplicates. You can reduce this by optimizing your network settings or moving your Snowplow collector and enrichment closer to your data sources.”

I’ve checked, and all of the events have a different event_id. But I read that if one is not provided, the enrichment adds it.

I’m ruling out the tracker as the cause, since we are using a tool to post a lot of events.

If you’re not using a tracker then that specific point about network latency wouldn’t apply, since the trackers’ retry mechanism is what’s responsible there.

My gut tells me it’s something to do with how you’re sending through the data. Some questions to help debug:

  • What tool are you using, and what’s the setup there?
  • Are you getting a lot of failure responses from the collector?
  • How many events have arrived in the pubsub raw stream for the period of time in question? (I think you should be able to find this out via the GCP console UI for PubSub)

Indeed if you haven’t set an event ID before sending the data, one will be generated for you in enrich.
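To illustrate why distinct event IDs don't rule out duplicates here, this is a minimal sketch of the fallback described above. It is a toy model, not Enrich's actual code: `with_event_id` is a hypothetical helper, and I'm assuming your tool sends tracker-protocol style payloads where `eid` carries the event ID.

```python
import uuid


def with_event_id(payload: dict) -> dict:
    """Return a copy of the payload, adding an 'eid' only if none was sent.

    Hypothetical stand-in for the "generate an ID if missing" behaviour.
    """
    enriched = dict(payload)
    if not enriched.get("eid"):
        enriched["eid"] = str(uuid.uuid4())
    return enriched


# The same payload arriving twice WITHOUT an eid gets two different IDs,
# so the copies look like two unrelated events downstream.
raw = {"e": "pv", "url": "https://example.com/"}
first, second = with_event_id(raw), with_event_id(raw)
ids_differ_without_eid = first["eid"] != second["eid"]

# If the sender sets eid itself, both copies keep the same ID and the
# duplicate becomes visible.
tagged = {**raw, "eid": str(uuid.uuid4())}
ids_match_with_eid = with_event_id(tagged)["eid"] == with_event_id(tagged)["eid"]
```

So if your tool can set an event ID on each event before posting, counting distinct IDs in the output tells you whether the extra rows are resends of the same events.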

And just to check some assumptions - when you’ve horizontally scaled out, you have deployed more instances within the same job, rather than a separate job - and there’s only one subscription on the topic - correct?
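The reason I ask about subscriptions: each subscription on a topic receives its own copy of every message, while instances sharing one subscription split the messages between them. A rough toy model of that, with made-up subscription and worker names and a simple round-robin standing in for however Pub/Sub actually balances load:

```python
from collections import Counter
from itertools import cycle


def deliver(messages, subscriptions):
    """Count how many messages each worker processes.

    subscriptions maps a subscription name to its list of workers.
    Every subscription sees every message; within one subscription
    each message goes to exactly one worker (round-robin here, which
    is a simplification).
    """
    processed = Counter()
    for workers in subscriptions.values():
        rotation = cycle(workers)
        for _ in messages:
            processed[next(rotation)] += 1
    return processed


messages = range(5000)

# Three Enrich instances on ONE subscription: the work is shared.
one_sub = deliver(messages, {"enrich-sub": ["enrich-1", "enrich-2", "enrich-3"]})

# An instance accidentally attached to a SECOND subscription: it gets
# a full copy of the stream, so the total doubles.
two_subs = deliver(
    messages,
    {"enrich-sub": ["enrich-1", "enrich-2"], "second-sub": ["enrich-3"]},
)

total_one = sum(one_sub.values())
total_two = sum(two_subs.values())
```

In this model `total_one` equals the number of messages sent and `total_two` is twice that, which is the kind of inflation I'd want to rule out first.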

Yes, there’s only one subscription on the topic.
Where can I see the job configuration? In which container can I find that?

That’s not a facet of Enrich, it’s a facet of the infrastructure that it’s running on. How did you deploy the app?

After searching a bit, it seems it is normal to have duplicated events.

https://snowplow.io/blog/dealing-with-duplicate-event-ids/

Is it possible this is the reason? Also, is there a deduplication feature?

Thank you
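On the deduplication question, here is a minimal sketch of one way to spot duplicates yourself when the event IDs were generated after the fact and so can't be trusted. It is only an illustration under my own assumptions, not a description of any built-in Snowplow feature: `fingerprint`, `deduplicate` and the `IGNORED_FIELDS` choice are hypothetical, and the field names assume tracker-protocol style payloads.

```python
import hashlib
import json

# Fields assumed to differ between copies of the same event
# (hypothetical choice: the event ID and the sent timestamp).
IGNORED_FIELDS = {"eid", "stm"}


def fingerprint(event: dict) -> str:
    """Hash the event's remaining fields in a stable key order."""
    stable = {k: v for k, v in event.items() if k not in IGNORED_FIELDS}
    canonical = json.dumps(stable, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(events):
    """Keep the first event seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for event in events:
        fp = fingerprint(event)
        if fp not in seen:
            seen.add(fp)
            unique.append(event)
    return unique


events = [
    {"eid": "a1", "e": "pv", "url": "https://example.com/", "dtm": "1700000000000"},
    # Same content as the first event, but with a different generated ID.
    {"eid": "b2", "e": "pv", "url": "https://example.com/", "dtm": "1700000000000"},
    {"eid": "c3", "e": "pv", "url": "https://example.com/pricing", "dtm": "1700000000500"},
]
unique = deduplicate(events)
```

One caveat: if the stress tool posts payloads that are byte-for-byte identical on purpose, this would collapse them too, so it helps to include something unique per event (a counter, or an explicit event ID) in the test data.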