Dataflow jobs scaling out of control

Hi everyone!

We are currently in the process of implementing Snowplow on the Google Cloud Platform. Our first setup seems to be working, but we have found a problem with the Dataflow service: additional Dataflow jobs (BigQuery Loader) are started after a while, even though only one (the initial one) is needed, and none of the started jobs are ever stopped. As a result, the number of Dataflow jobs grows over time and drives up costs.

We have already checked the autoscaling settings; the parameters are set to:

--maxNumWorkers=1 --autoscalingAlgorithm=NONE

As we assume this setting is correct, we would appreciate advice on what else we can do to stop the service from scaling out of control.

Thanks in advance for your suggestions and help!

How are you starting / orchestrating your Dataflow jobs? The autoscaling settings look fine (although you may not necessarily want maxNumWorkers=1); however, these apply only to the workers within a job, not to the number of jobs.
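
If it helps to confirm what is actually running, something like this should list the active Dataflow jobs in a region (a sketch; swap in your own region):

# Show all currently active Dataflow jobs (name, ID, state, creation time) in the region
gcloud dataflow jobs list --region=europe-west1 --status=active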

We have followed Simo Ahava's instructions (https://www.simoahava.com/analytics/install-snowplow-on-the-google-cloud-platform/#step-1-create-the-instance-template).
That is, we create an instance template with the following startup script.

Loader-Job Startup

#! /bin/bash
enrich_version="0.3.0"
bq_version="0.1.0"
bucket_name="xxx"
project_id="xxx"
region="europe-west1"

# Install Java and unzip
sudo apt-get update
sudo apt-get -y install default-jre
sudo apt-get -y install unzip

# Download and unpack the BigQuery Loader and Mutator
wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_bigquery_loader_$bq_version.zip
unzip snowplow_bigquery_loader_$bq_version.zip
wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_bigquery_mutator_$bq_version.zip
unzip snowplow_bigquery_mutator_$bq_version.zip

# Fetch the resolver and loader configuration from GCS
gsutil cp gs://$bucket_name/iglu-resolver.json .
gsutil cp gs://$bucket_name/bigquery-config.json .

# Create the target table, start the mutator listener in the background,
# then launch the BigQuery Loader as a Dataflow job
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator create --config $(cat bigquery-config.json | base64 -w 0) --resolver $(cat iglu-resolver.json | base64 -w 0)
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator listen --config $(cat bigquery-config.json | base64 -w 0) --resolver $(cat iglu-resolver.json | base64 -w 0) &
./snowplow-bigquery-loader-$bq_version/bin/snowplow-bigquery-loader --project=$project_id --runner=DataFlowRunner --region=$region --gcpTempLocation=gs://$bucket_name/temp-files --resolver=$(cat iglu-resolver.json | base64 -w 0) --workerMachineType=n1-standard-1 --config=$(cat bigquery-config.json | base64 -w 0) --maxNumWorkers=1 --autoscalingAlgorithm=NONE

Subsequently, an instance group is started with this template (autoscaling off, number of instances = 1).
At first one Dataflow job (loader) starts, and then another loader job starts roughly every 10 minutes,
until a total of 4 BigQuery Loader jobs are running.
We expect only one Dataflow job to be started, not another one every 10 minutes.
The enrich job, for example, is started only once, just as expected.
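
For reference, the duplicate jobs can be stopped from the command line while this is being investigated (a sketch; JOB_ID is a placeholder taken from gcloud dataflow jobs list):

# Drain lets in-flight elements finish before stopping; cancel stops immediately
gcloud dataflow jobs drain JOB_ID --region=europe-west1
gcloud dataflow jobs cancel JOB_ID --region=europe-west1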

Enrich-Job Startup

#! /bin/bash
enrich_version="0.3.0"
bq_version="0.1.0"
bucket_name="xxx"
project_id="xxx"
region="europe-west1"

sudo apt-get update
sudo apt-get -y install default-jre
sudo apt-get -y install unzip

wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_beam_enrich_$enrich_version.zip
unzip snowplow_beam_enrich_$enrich_version.zip

gsutil cp gs://$bucket_name/iglu-resolver.json .
gsutil cp gs://$bucket_name/bigquery-config.json .

./beam-enrich-$enrich_version/bin/beam-enrich --project=$project_id --job-name=beam-enrich --runner=DataFlowRunner --region=$region --gcpTempLocation=gs://$bucket_name/temp-files --resolver=iglu-res

@ihor @anton
I am setting up a pipeline on GCP. I followed the pattern below:
Collector => Beam Enrich => BigQuery Loader.

I have a few questions about setting up the pipeline.

  1. Beam Enrich creates a Dataflow job. Should this job be running all the time, as a real-time pipeline?

  2. The BigQuery Loader also creates a Dataflow job. Is this a batch process? If so, when should we stop it?

  3. Can you please tell us about the real-time and batch pipeline architectures of Snowplow on GCP?

@anshratn1997, only the real-time architecture is supported on GCP; there is no batch processing.

In this tutorial he uses a VM to initialize the pipelines. The issue I had was that at start-up the VM used too much CPU, causing the instance group to spin up another VM, which initialized another pipeline. You can adjust the settings of the instance template (that's how I solved it). Another way is to fix the name of the Dataflow job: when the startup script tries to spin up a new Dataflow job, a job with that name will already exist and the launch will fail. Bear in mind that I have not yet figured out how to fix the BigQuery Loader's Dataflow job name; just setting --job-name does not work for some reason.
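
Another rough workaround, since fixing the loader's job name did not work for me, is to guard the launch in the startup script so a new loader is only submitted when no matching job is already active. This is just a sketch; the grep pattern is an assumption and needs to match however the loader job actually shows up in gcloud dataflow jobs list:

#! /bin/bash
# Guard: skip the launch when an active Dataflow job matching the pattern already exists
pattern="bigquery-loader"   # assumption: adjust to the name the loader job appears under
region="europe-west1"

# Count active Dataflow jobs whose listing mentions the pattern
running=$(gcloud dataflow jobs list --region=$region --status=active | grep -c "$pattern")

if [ "$running" -eq 0 ]; then
  echo "No active loader job found; launching the loader..."
  # ...loader launch command from the startup script above goes here...
else
  echo "An active loader job already exists; skipping launch."
fi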