I have a pipeline in production (on GCP) which I provision entirely using Terraform. It has a VM group for the collector and a VM group running Beam Enrich and the BigQuery Mutator & Loader.
So I have monitoring on my Pub/Sub subscriptions, and every now and then I get an alert because the collector.good queue is building up. It always resolves itself, but in my opinion it's not normal that messages occasionally sit in the subscription for 5-6 minutes.
I did see that the wall time of streaming inserts is long.
Hi @sdbeuf, are you using a single worker or is it an auto-scaling Dataflow job? One very common cause of latency buildup in the Pub/Sub topic is that the job is not consuming from it while it's resizing (either up or down). Once the resize is complete, the job catches up.
If you go to the ‘JOB METRICS’ tab in the Dataflow job UI, you can compare the latency and auto-scaling charts to see if any auto-scaling action aligns with latency buildup.
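If you prefer Metrics Explorer over the job UI, you can plot the two series side by side with filters along these lines. (The filter syntax is standard Cloud Monitoring; the exact metric and label names are from memory, so verify them in the Metrics Explorer picker, and `YOUR_JOB_ID` is a placeholder.)

```
# System lag for the job
metric.type = "dataflow.googleapis.com/job/system_lag"
resource.type = "dataflow_job"
resource.label.job_id = "YOUR_JOB_ID"

# Worker capacity over the same window, to spot resize events
metric.type = "dataflow.googleapis.com/job/current_num_vcpus"
resource.type = "dataflow_job"
resource.label.job_id = "YOUR_JOB_ID"
```

If the vCPU count steps up or down just before each lag spike, the resize explanation above is the likely cause.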
Hi @sdbeuf, apologies for the long radio silence on this.
System latency is a measure of how long an element takes to be processed from source to sink. That is, it is a sum of time spent in the source (the good Pub/Sub topic) plus time spent inside the BQ Loader.
We therefore have two possible avenues of investigation:
- the data may be spending too long stuck in the topic instead of being consumed;
- the data may be taking too long inside the Dataflow pipeline.
To further debug the first option, there is a Stackdriver metric called pubsub.googleapis.com/subscription/oldest_unacked_message_age, which shows the age (in seconds) of the oldest unacknowledged message in a subscription. Are you able to check the values of that metric for the time intervals when the spikes occur? If there is a correlation between the two, the likely reason for the build up in latency is that the data is not being consumed.
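As an illustration of that correlation check, here is a minimal sketch with made-up sample data (the timestamps, ages, and spike windows are hypothetical; in practice you would export the metric samples from Cloud Monitoring):

```python
from datetime import datetime

# Hypothetical samples of pubsub.googleapis.com/subscription/oldest_unacked_message_age,
# as (timestamp, age_in_seconds) pairs exported from Cloud Monitoring.
unacked_age_samples = [
    (datetime(2020, 5, 1, 12, 0), 3),
    (datetime(2020, 5, 1, 12, 5), 310),   # a message ~5 minutes old: a backlog
    (datetime(2020, 5, 1, 12, 10), 2),
]

# Latency spike windows observed on the Dataflow job, as (start, end) pairs.
spike_windows = [(datetime(2020, 5, 1, 12, 4), datetime(2020, 5, 1, 12, 9))]

def samples_in_spikes(samples, windows, threshold_s=60):
    """Return samples whose unacked age exceeds the threshold inside a spike window."""
    return [
        (ts, age)
        for ts, age in samples
        if age > threshold_s and any(start <= ts <= end for start, end in windows)
    ]

correlated = samples_in_spikes(unacked_age_samples, spike_windows)
# Non-empty result: the backlog in the subscription coincides with the latency spike,
# pointing at "data not being consumed" rather than slow processing.
```

If every latency spike lines up with a high oldest-unacked-message age, the first avenue (data stuck in the topic) is the one to pursue.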
On the second option, inside the Dataflow job there is a step where events are checked against their schema and that requires a call to an Iglu server. It might be the case that some events trigger more calls (perhaps they have multiple contexts?). Could you check a chart of network IO, perhaps via the loadbalancing.googleapis.com/https/* Stackdriver metrics, to see if any extra activity is happening that coincides with the latency build up?
Thanks for sending these @sdbeuf. Unfortunately, they don’t help to establish a firm culprit.
Digging through the Dataflow metrics, it looks like there is a metric called dataflow.googleapis.com/job/per_stage_system_lag. I wonder if that could help pinpoint which step in the Dataflow job seems to contribute most to overall system lag. If you filter by job ID and group by stage, we might be able to see if one of the steps seems to rise above the others.
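To make the "group by stage" idea concrete, here is a small sketch with hypothetical data (the stage names and lag values are invented; real ones would come from exported per_stage_system_lag samples for your job ID):

```python
from collections import defaultdict

# Hypothetical samples of dataflow.googleapis.com/job/per_stage_system_lag,
# as (stage_name, lag_seconds) pairs already filtered to a single job ID.
samples = [
    ("F01_read_pubsub", 4.0),
    ("F02_validate_against_iglu", 250.0),
    ("F03_write_bigquery", 12.0),
    ("F02_validate_against_iglu", 180.0),
]

def worst_stage(samples):
    """Group lag samples by stage and return (stage, peak_lag) for the top contributor."""
    by_stage = defaultdict(float)
    for stage, lag in samples:
        by_stage[stage] = max(by_stage[stage], lag)
    return max(by_stage.items(), key=lambda kv: kv[1])

stage, peak = worst_stage(samples)
```

In this made-up example the validation stage dominates, which would point the investigation at the Iglu schema-resolution calls mentioned earlier.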
Hi @sdbeuf, no, we haven’t nailed a culprit in this case.
We are now on version 0.5.1 of the BigQuery Loader apps. Would it be possible to upgrade? At the very least it would mean you’d be using the latest versions of Scio / Beam. 0.4.0 is using a Beam SDK whose version is now 4 releases behind the latest one.
Incidentally, the next BQ Loader release, 0.6.0, is slated to add a new (initially experimental) loader that does not rely on Beam / Dataflow, in case that is of interest.
When is the release of BQ Loader 0.6.0 planned? As this is a costly Dataflow job, it might mean a considerable cost reduction for us. If it's not too far off, I would postpone the upgrade and wait for 0.6.0.