We are currently implementing Snowplow on the Google Cloud Platform. Our first setup seems to be working, but we have found a problem with Dataflow: additional Dataflow jobs (BigQuery Loader) are started after a while even though only the initial one is needed, and none of the running Dataflow jobs are ever stopped. As a result, the number of Dataflow jobs grows over time and drives up costs.
We have already checked the auto-scaling settings; the parameters are set to:
--maxNumWorkers=1 --autoscalingAlgorithm=NONE
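For context, these are passed as standard Dataflow pipeline options when the job is launched. A rough, generic launch sketch with placeholder values (the binary name, project and region are made up here, not the exact BigQuery Loader invocation):

    # Generic Beam-on-Dataflow launch; binary name and values are placeholders.
    ./my-beam-pipeline \
      --runner=DataflowRunner \
      --project=my-gcp-project \
      --region=europe-west1 \
      --maxNumWorkers=1 \
      --autoscalingAlgorithm=NONE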
Since we assume this setting is correct, we would appreciate advice on what else we can do to stop the service from scaling out of control.
How are you starting / orchestrating your Dataflow jobs? The autoscaling configuration looks fine (although you may not necessarily want maxNumWorkers=1); however, it applies only to workers, not to jobs.
Subsequently, an instance group is started from this template, with autoscaling off and the number of instances set to 1.
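For reference, a minimal sketch of that part of the setup, assuming the loader start-up script lives in the instance template; the group, template and zone names below are placeholders:

    # Managed instance group with a fixed size of 1 and no autoscaler attached,
    # so the group itself should never grow beyond one VM.
    gcloud compute instance-groups managed create bq-loader-group \
      --zone=europe-west1-b \
      --template=bq-loader-template \
      --size=1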
At first, one Dataflow job (Loader) starts, and then every 10 minutes another Dataflow job (Loader) starts,
until a total of 4 jobs (BQ Loaders) are running.
We expect only one Dataflow job to start, not another one every 10 minutes.
The Enrich job, for example, is started only once, just as expected.
In this tutorial the author uses a VM to initialize the pipelines. The issue I had was that at start-up the VM used too much CPU, causing the instance group to spin up another VM, which initialized another pipeline. You can adjust the settings of the instance template (that's how I solved it). Another way is to fix the name of the Dataflow job: when something tries to spin up a new Dataflow job, a job with that name will already exist, so the duplicate will not start. Bear in mind that I have not yet figured out how to fix the BigQuery Loader Dataflow job name; just setting --job-name does not work for some reason.
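Not from the tutorial, but as an illustration of the same idea: a minimal sketch of a guard one could add to the VM start-up script, assuming gcloud is available on the instance. The job name, region and launch command below are placeholders, not the real loader invocation.

    #!/usr/bin/env bash
    # Hypothetical guard: only launch the loader if no active Dataflow job
    # with this name exists yet. Name and region are placeholders.
    JOB_NAME="snowplow-bq-loader"
    REGION="europe-west1"

    # List active jobs with the given name (empty output means none running).
    ACTIVE=$(gcloud dataflow jobs list \
      --region="$REGION" \
      --status=active \
      --filter="name=$JOB_NAME" \
      --format="value(id)")

    if [ -z "$ACTIVE" ]; then
      echo "No active job named $JOB_NAME; launching the loader."
      # Placeholder: replace with the actual BigQuery Loader launch command.
    else
      echo "Dataflow job $JOB_NAME already running (id: $ACTIVE); skipping launch."
    fi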