We are currently operating a Snowplow pipeline on GCP.
I know that the BigQuery Stream Loader uses the InsertAll API, but how do I resolve the situation where the InsertAll API does not consume messages as fast as the loader accumulates them? (For example, when checking the messages accumulated in the loader in Grafana, there are 5 million, but the BigQuery insert count is only about 9,000 per second.)
I would appreciate it if you could recommend a BigQuery Stream Loader config for pipelines with a lot of traffic.
The current config is as follows; in both the load test and the production pipeline, the InsertAll API only keeps up with about 10% of the messages the loader accumulates.
9,000 events per second would be on the higher end of throughput by most people’s standards. I’m not sure you could say whether the loader is slow from that information alone, however: if it’s provisioned on one small instance, that’s incredibly fast; if it’s running across 10 large VMs, then it’s slow.
All of our technology is built for a horizontal scaling model. The infrastructure you deploy the pipeline on should be configured with autoscaling appropriate to the throughput you expect.
By the sounds of it, what you did here was run a load test where the volume of data coming in outpaced the infrastructure configuration, in which case the expected behaviour is for latency to build up as you’ve described, but the data will be processed eventually (unless you’re drastically underprovisioned).
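To put rough numbers on that (the incoming rate here is purely hypothetical, since you haven’t shared it): if events arrive at 15,000 per second and the loader sinks 9,000 per second, the backlog grows by 6,000 per second, so a 5 million message backlog builds up in roughly 14 minutes of sustained load; once the incoming rate drops back below the sink rate, that backlog drains at the difference between the two.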
We run on Kubernetes, and while processing 9,000 events per second each loader pod is sized at 4 vCPU / 4Gi memory with javaOpts: “-Xms2g -Xmx2g”, scaling out to a maximum of 15 pods via HPA.
As you mentioned, we have configured the pipeline for horizontal scaling, but the curious thing is that BigQuery streaming inserts do not consume messages as fast as the loader pods pull them in. So I’m curious about the recommended options for the BigQuery Stream Loader config!
So I’m not the most knowledgeable on this topic, but I can try to help figure it out.
As regards the configuration, retrySettings just determines how the loader behaves when there are failures, and consumerSettings governs reading the input Pub/Sub topic, so neither of those seems likely to make much difference.
There are some more advanced sink settings (sketched roughly below), but I’m not sure if those are of any use here. I think the task here is to identify what the limiting factor is.
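For orientation, the relevant sections of the Stream Loader’s HOCON config look roughly like the sketch below. I’m going from memory of the 1.x configuration reference here, so treat the key names and especially the values as illustrative placeholders to check against the docs for your loader version, not as recommendations:

```hocon
# Sketch only: key names as I recall them from the BigQuery Loader 1.x
# configuration reference; values are illustrative, not tuning advice.
{
  "consumerSettings": {
    "maxQueueSize": 3000            # messages buffered from the input Pub/Sub subscription
    "parallelPullCount": 3          # parallel streaming pulls from Pub/Sub
    "maxRequestBytes": 50000000
    "maxAckExtensionPeriod": "1 hour"
  }

  "sinkSettings": {
    "good": {
      "bqWriteRequestThreshold": 500      # rows per InsertAll request before flushing
      "bqWriteRequestTimeout": "1 second" # flush a partial batch after this long
      "bqWriteRequestSizeLimit": 10000000 # size cap per InsertAll request, in bytes
      "sinkConcurrency": 1024             # InsertAll requests allowed in flight at once
    }
  }

  # retrySettings (initialDelay, delayMultiplier, maxDelay, totalTimeout) only
  # governs backoff on failed inserts, so it is unlikely to be the bottleneck here.
}
```

If those keys do what their names suggest, the good-sink batching and concurrency settings are the ones that shape how quickly batches are handed to InsertAll, so that is where I would look first if the pods are pulling much faster than they insert.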
Is there anything in the application logs that indicates errors coming from BigQuery, such as failure messages, retries, or quotas being hit?
We also tried the Snowplow BigQuery Loader (the alternative to StreamLoader, in the form of a Google Cloud Dataflow job) and got good throughput, but backed away from it because it’s unsupported.
I also ran a load test with the sink option values set to twice their defaults, but the same phenomenon occurred (BigQuery InsertAll still cannot keep up with the number of events accumulated in the loader). There is no quota exceedance or error message.
Or could this be due to a BigQuery InsertAll API quota limitation? The application log only records the following:
[io-compute-1] INFO Terminating event sink. Waiting 1 minute for it to complete
[io-compute-1] INFO com.snowplowanalytics.snowplow.storage.bigquery.streamloader.Shutdown - Completed sinking and checkpointing events
[io-compute-1] INFO com.snowplowanalytics.snowplow.storage.bigquery.streamloader.Shutdown - Source of events was canceled