GCP enrichment scale up considerations

FabianEpic · March 31, 2023, 1:48pm

Hi everyone,
We have been using Snowplow for a while in GCP, and we already have 6 instances (with autoscaling) of snowplow/snowplow-enrich-pubsub:3.0.1

We are doing some stress tests: we submitted 5000 events, but we are receiving more events than that.

Should the enricher scale vertically or horizontally? Should I configure anything? I assume Pub/Sub takes care of distributing messages among the consumers so that the same offset is not consumed twice.

Thank you

Colm · March 31, 2023, 2:37pm

Horizontal scaling suits enrich best. In Enrich itself you don’t need to configure anything for this to work, you just need to configure the infrastructure itself to scale out.

You are correct in that the PubSub subscriber is responsible for managing the distribution of messages without reusing the same ones (along with how we ack messages in the codebase).

Hope that helps!

FabianEpic · March 31, 2023, 3:38pm

Thanks Colm for your quick response. Any clue why we could be receiving more events than we posted?

Probably Network latency? “Network latency can cause events to be resent, resulting in duplicates. You can reduce this by optimizing your network settings or moving your Snowplow collector and enrichment closer to your data sources.”

I’ve checked, and all of the events have a different event_id. But I read that if one is not provided, enrichment adds it.

I’m ruling out the tracker as the cause, since we are using a tool to post a lot of events.

Colm · March 31, 2023, 4:38pm

If you’re not using a tracker then that specific point about network latency wouldn’t apply, since the trackers’ retry mechanism is what’s responsible there.

My gut tells me it’s something to do with how you’re sending through the data. Some questions to help debug:

What tool are you using, and what’s the setup there?
Are you getting a lot of failure responses from the collector?
How many events have arrived in the pubsub raw stream for the period of time in question? (I think you should be able to find this out via the GCP console UI for PubSub)

Indeed if you haven’t set an event ID before sending the data, one will be generated for you in enrich.
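One consequence worth spelling out: if the sending tool leaves the event ID unset, a payload that is sent twice would come out with two different generated IDs, so distinct event_id values alone cannot rule out duplicates. Below is a minimal Python sketch of a content-based check, assuming the enriched events have been loaded as dictionaries. `ASSIGNED_FIELDS`, `processed_at`, `user`, and `action` are hypothetical placeholders for illustration, not actual Snowplow field names.

```python
import hashlib
import json
from collections import Counter

# Fields assumed to be assigned by the pipeline rather than by the sender.
# Only event_id comes from the discussion above; the rest is a placeholder.
ASSIGNED_FIELDS = {"event_id", "processed_at"}


def content_fingerprint(event: dict) -> str:
    """Hash every field the sender controls, ignoring pipeline-assigned ones."""
    payload = {k: v for k, v in event.items() if k not in ASSIGNED_FIELDS}
    canonical = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def count_resent_payloads(events: list[dict]) -> int:
    """Return how many events repeat a payload that was already seen."""
    counts = Counter(content_fingerprint(e) for e in events)
    return sum(n - 1 for n in counts.values())


# Three events with unique IDs, but the first two carry the same payload.
events = [
    {"event_id": "a1", "user": "u1", "action": "click", "processed_at": 1},
    {"event_id": "b2", "user": "u1", "action": "click", "processed_at": 2},
    {"event_id": "c3", "user": "u2", "action": "view", "processed_at": 3},
]
extra = count_resent_payloads(events)
```

This only helps if each event produced by the stress tool carries something unique in its payload, such as a sequence number. Otherwise identical payloads may be intentional rather than resends.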

And just to check some assumptions - when you’ve horizontally scaled out, you have deployed more instances within the same job, rather than a separate job - and there’s only one subscription on the topic - correct?
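The reason the single-subscription check matters: in Pub/Sub, each subscription on a topic receives its own copy of every message, while subscribers attached to the same subscription share its messages between them. Here is a toy in-memory sketch of that behaviour, in Python. `ToyTopic`, `total_processed`, and the subscription names are hypothetical and this is not the Pub/Sub client API. The round-robin split is a simplification, and the toy ignores redelivery of unacknowledged messages.

```python
from collections import deque


class ToyTopic:
    """Toy in-memory model of a topic; this is not the Pub/Sub client API."""

    def __init__(self) -> None:
        self.subscriptions: dict[str, deque] = {}

    def add_subscription(self, name: str) -> None:
        self.subscriptions[name] = deque()

    def publish(self, message: str) -> None:
        # Every subscription gets its own copy of each published message.
        for queue in self.subscriptions.values():
            queue.append(message)

    def drain(self, name: str, workers: int) -> list[list[str]]:
        # Workers on the same subscription split its messages between them.
        received: list[list[str]] = [[] for _ in range(workers)]
        queue = self.subscriptions[name]
        turn = 0
        while queue:
            received[turn % workers].append(queue.popleft())
            turn += 1
        return received


def total_processed(subscription_names: list[str], workers: int, n_events: int) -> int:
    """Publish n_events and count how many messages all workers handle in total."""
    topic = ToyTopic()
    for name in subscription_names:
        topic.add_subscription(name)
    for i in range(n_events):
        topic.publish(f"event-{i}")
    return sum(
        len(batch)
        for name in subscription_names
        for batch in topic.drain(name, workers)
    )


# Six workers on one subscription versus two subscriptions on the same topic.
one_sub = total_processed(["enrich"], workers=6, n_events=5000)
two_subs = total_processed(["enrich", "enrich-copy"], workers=3, n_events=5000)
```

In this model, adding workers to one subscription leaves the total unchanged, whereas a second subscription doubles it.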

FabianEpic · April 4, 2023, 2:20pm

Yes, there’s only one subscription on the topic.
Where can I see the job configuration? In which container can I find it?

Colm · April 4, 2023, 2:54pm

That’s not a facet of Enrich, it’s a facet of the infrastructure that it’s running on. How did you deploy the app?

FabianEpic · April 5, 2023, 12:08pm

After searching a bit, it is apparently normal to have duplicate events.

https://snowplow.io/blog/dealing-with-duplicate-event-ids/

Is it possible this is the reason? Also, is there a deduplication feature?

Thank you
