GCP realtime pipeline - growing pubsub queue

sdbeuf · June 26, 2020, 9:15am

Hi

I have a pipeline in production (on GCP) which on I completely provision using Terraform. It has a VM group for the collector and a VM group to facilitate the beam enrich and the BigQuery Mutator & Loader.

So I have monitoring on my pubsub subscriptions and every now and then I get an error where collector.good the queue is building up. It always sorts itself out but in my opinion it’s not normal that I occasionally have message stuck in the subscription for 5-6 minutes.

I did see that the wall time of streaming inserts is long.

Do you have an idea what might be the issue here?

Thanks in advance
Sam

dilyan · June 29, 2020, 7:41am

Hi @sdbeuf, are you using a single worker or is it an auto-scaling Dataflow job? One very common cause for buildup of latency in the Pubsub topic is that the job is not consuming from it while it’s resizing (either up or down). Once the resize is complete, the job catches up.

If you go to the ‘JOB METRICS’ tab in the Dataflow job UI, you can compare the latency and auto-scaling charts to see if any auto-scaling action aligns with latency buildup.

sdbeuf · June 29, 2020, 8:37am

Hi

The autoscaling is on. Here you can find the graphs

Looking to the CPU graph it’s logical that it didn’t scale, so the cause is somewhere else.

What I changed recently is the usage of iglu-resolver-1.0.4 instead of 1.0.1.

dilyan · June 30, 2020, 1:26pm

Hi @sdbeuf,

A few questions to hopefully help diagnose this.

Which version of BigQuery Loader are you using?
Have you noticed a correlation between these latency peaks and spikes in incoming traffic?
Can you please clarify: which component do you mean by iglu-resolver-1.0.4?

sdbeuf · July 1, 2020, 7:19am

Hi @dilyan

0.4.0

No not correlation, as you also can see that the CPU of the dataflow doesn’t peak at all.

It’s the schema in the beam enrich pipeline. In my iglu_resolver.json, the schema I use is 1.0.3 (sorry not 1.0.4).

  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-3",

Thanks in advance!

dilyan · July 10, 2020, 1:32pm

Hi @sdbeuf, apologies for the long radio silence on this.

System latency is a measure of how long an element takes to be processed from source to sink. That is, it is a sum of time spent in the source (the good Pub/Sub topic) plus time spent inside the BQ Loader.

We therefore have two possible avenues of investigation:

the data may be spending too long stuck in the topic instead of being consumed
the data may be taking too long inside the Dataflow pipeline.

To further debug the first option, there is a Stackdriver metric called pubsub.googleapis.com/subscription/oldest_unacked_message_age, which shows the age (in seconds) of the oldest unacknowledged message in a subscription. Are you able to check the values of that metric for the time intervals when the spikes occur? If there is a correlation between the two, the likely reason for the build up in latency is that the data is not being consumed.

On the second option, inside the Dataflow job there is a step where events are checked against their schema and that requires a call to an Iglu server. It might be the case that some events trigger more calls (perhaps they have multiple contexts?). Could you check a chart of network IO, perhaps via the loadbalancing.googleapis.com/https/* Stackdriver metrics, to see if any extra activity is happening that coincides with the latency build up?

sdbeuf · July 10, 2020, 2:37pm

Hi @dilyan

With purple: collector.good
With blue: enriched.good

The highest peaks are around 2.5 minutes.

Backend request count:

request bytes:

I could continue giving you all the loadbalancing.googleapis.com/https/* monitoring views but these follow all the same pattern.

At the moment I cannot find a correlation between the https requests and the pubsub queues.

Thanks in advance,
Sam

dilyan · July 15, 2020, 8:04am

Thanks for sending these @sdbeuf. Unfortunately, they don’t help to establish a firm culprit.

Digging through the Dataflow metrics, it looks like there is a metric called dataflow.googleapis.com/job/per_stage_system_lag. I wonder if that could help pinpoint which step in the Dataflow job seems to contribute most to overall system lag. If you filter by job ID and group by stage, we might be able to see if one of the steps seems to rise above the others.

sdbeuf · July 15, 2020, 8:53am

The red and green line (stage f15 & f17 seem to have less spikes than the others.

Thanks in advance,
Sam

sdbeuf · September 7, 2020, 6:56am

@dilyan The thread turned a bit dead, did you manage to find a potential issue?

dilyan · September 9, 2020, 12:37pm

Hi @sdbeuf, no, we haven’t nailed a culprit in this case.

We are now on version 0.5.1 of the BigQuery Loader apps. Would it be possible to upgrade? At the very least it would mean you’d be using the latest versions of Scio / Beam. 0.4.0 is using a Beam SDK whose version is now 4 releases behind the latest one.

Incidentally, the next BQ Loader release 0.6.0 is slated to add a new (initially experimental) loader that does not rely on Beam / DataFlow, in case that is of interest.

sdbeuf · September 9, 2020, 12:50pm

Hello @dilyan
When is the release of BQ Loader 0.6.0 planned? As this is a costly DataFlow job might be a considerable cost reduction for us. If this is not too far I would postpone the upgrade to the 0.6.0 version.

dilyan · September 10, 2020, 8:38am

No firm date but we are in the final pre-release stages, so hopefully soon.

sdbeuf · September 10, 2020, 8:57am

Ok, super thanks for the information.

dilyan · September 29, 2020, 8:57am

Hi @sdbeuf, this has now been released:

Topic		Replies	Views
Issues on GCP streamloader GCP pipeline	3	1107	March 22, 2022
Why is Bigquery Stream Loader slow? GCP pipeline	5	537	December 1, 2023
BigQuery Stream Loader becomes very non-performant after processing large numbers of events GCP pipeline	4	1347	January 27, 2023
Dataflow jobs scaling out of control GCP pipeline	5	2737	April 22, 2020
Enrich pubsub and bq streamloader down often GCP pipeline	3	2366	November 7, 2021

GCP realtime pipeline - growing pubsub queue

Related topics