GCP: Ideal setup


I’m wondering about the setup of Snowplow on GCP and hope you can help me. For example, what are the recommended sizes for the Compute Engine instances?

  • Collector?
  • Beam Enrich?
  • BigQuery Loader? Mutator?

We have around 100 million hits per month.

Is it possible to run Enrich and BQ Loader on one compute engine or do I need separate ones for every job?

What is the job of the BQ forwarder?

Is it possible to use Cloud Storage and BigQuery in parallel or does the BigQuery Loader replace the Storage loading?

Also does the web model exist for BigQuery somewhere?

Thanks for your help!


Collector sizing will depend not just on your volume but also on the number of bytes you are sending with each request. You want the collector to autoscale, but for this volume you are probably fine with a few n1-standard-1s or a smaller number of n1-standard-2s.
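As a rough back-of-the-envelope for the volume mentioned in the question, here is a minimal Python sketch that converts monthly hits into an average request rate. The 30-day month and the 5x peak-to-average factor are my own assumptions, not figures from this thread, so substitute your real traffic profile.

```python
# Rough throughput estimate to help reason about collector sizing.
# Assumptions (hypothetical, not from the thread): 30-day month, 5x peak factor.

SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(hits_per_month: int, days_in_month: int = 30) -> float:
    """Average request rate if traffic were spread evenly over the month."""
    return hits_per_month / (days_in_month * SECONDS_PER_DAY)


def peak_events_per_second(avg_eps: float, peak_factor: float = 5.0) -> float:
    """Scale the average by an assumed peak-to-average ratio."""
    return avg_eps * peak_factor


avg = average_events_per_second(100_000_000)
peak = peak_events_per_second(avg)
print(f"average: {avg:.1f} events/s, assumed peak: {peak:.1f} events/s")
```

With these assumptions, 100 million hits per month works out to about 38.6 events per second on average. The payload size per request still matters, so treat this only as a starting point for load testing.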

Beam Enrich and the BigQuery Loader both run on Dataflow (which uses Compute Engine under the hood), and Dataflow has a setting to autoscale workers. You’ll want to make sure you cap the number of workers (the Dataflow maxNumWorkers option), but in general these jobs are quite efficient in terms of the number of workers required.

You should run these as two separate Dataflow jobs - each one will have its own compute under the hood, which Dataflow will manage for you.

The forwarder’s job is to forward failed inserts back to BigQuery. Inserts most commonly fail because of table mutations, where an event has columns that do not exist yet in the destination table.

There’s nothing to stop you running Cloud Storage and BigQuery loading in parallel if required - particularly if you are doing batch inserts into BigQuery rather than streaming inserts. For streaming inserts from Pub/Sub to BigQuery you don’t really need to persist the events to Cloud Storage, though you can if required.

There is no web model for BigQuery yet, as far as I’m aware.

Wow, thank you for the elaborate answers!

Do we know the state of GCP support? We are running in GCP and I would like to run all of these. I have been working on getting everything up and running, but I ran into trouble when I got to the Loader/Mutator. Looking at https://github.com/snowplow-incubator/snowplow-bigquery-loader, there is 1 commit on master and then a bunch of commits in the release/0.2.0 branch. I noticed that the forwarder does not really seem to work because of https://github.com/snowplow-incubator/snowplow-bigquery-loader/issues/15, and the Beam SDK version is 2.6.0, which Google flags as deprecated and out of date (it is updated in the unreleased 0.2.0 branch). @volderette, were you able to get the GCP setup up and working? Thanks!

I’m not sure about the 0.2.0 RCs but the 0.1.0 release candidates for both the loader and forwarder should be fine to run in production.

Edit: See below for new links after we migrated to our new docs site!

There are also some rather well-hidden docs over here:

I only started looking at the 0.2.0 branch because when I try to run the forwarder I get:

Exception in thread "main" java.lang.IllegalArgumentException: Pubsub subscription is not in projects/<project_id>/subscriptions/<subscription_name> format: projects/***/topics/***
	at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$PubsubSubscription.fromPath(PubsubIO.java:210)
	at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.fromSubscription(PubsubIO.java:594)
	at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.fromSubscription(PubsubIO.java:587)
	at com.snowplowanalytics.snowplow.storage.bigquery.forwarder.Forwarder$.run(Forwarder.scala:40)
	at com.snowplowanalytics.snowplow.storage.bigquery.forwarder.Main$.main(Main.scala:23)
	at com.snowplowanalytics.snowplow.storage.bigquery.forwarder.Main.main(Main.scala)

I get this even though I passed in --failedInsertsSub. When I looked at the code (https://github.com/snowplow-incubator/snowplow-bigquery-loader/blob/1cf0648cfc3f9e68825a1f1a806529b49bd7162d/forwarder/src/main/scala/com/snowplowanalytics/snowplow/storage/bigquery/forwarder/Forwarder.scala#L40), it looks like it’s using the topic, not the subscription, and the ticket that I mentioned (https://github.com/snowplow-incubator/snowplow-bigquery-loader/issues/15) seems to confirm that.
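To make the difference concrete, here is a minimal Python sketch of the kind of format check the exception message describes. It is not the Beam implementation: the regular expression, the function name, and the example project and resource names are my own assumptions, based only on the projects/<project_id>/subscriptions/<subscription_name> format quoted in the error.

```python
import re

# Hypothetical pattern derived from the format named in the error message.
SUBSCRIPTION_PATH = re.compile(r"projects/[^/]+/subscriptions/[^/]+")


def is_subscription_path(path: str) -> bool:
    """Return True only if the whole string looks like a subscription path."""
    return SUBSCRIPTION_PATH.fullmatch(path) is not None


# A subscription path passes, a topic path does not (names are made up).
print(is_subscription_path("projects/my-project/subscriptions/failed-inserts-sub"))
print(is_subscription_path("projects/my-project/topics/failed-inserts"))
```

The first call prints True and the second prints False, which matches the symptom above: a topic path handed to something that expects a subscription is rejected before the job starts.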

Here are the new corresponding links (the docs site was migrated, which caused the link discrepancies):