This will depend not just on your volume but also on the number of bytes you are sending with each request. You want this to autoscale, but for something with this volume you are probably fine with a few n1-standard-1s or a smaller number of n1-standard-2s.
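For example, if the collector sits behind a managed instance group, CPU-based autoscaling can be switched on with something like this (group name, region, and thresholds are placeholders - tune them to your own traffic):

```bash
# Scale the collector's managed instance group between 1 and 4 instances,
# adding capacity once average CPU utilisation passes 75%.
gcloud compute instance-groups managed set-autoscaling snowplow-collector-group \
  --region=europe-west2 \
  --min-num-replicas=1 \
  --max-num-replicas=4 \
  --target-cpu-utilization=0.75
```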
Beam Enrich and the BigQuery Loader both run on Dataflow (which uses Compute Engine under the hood), and there is a setting to autoscale workers here. You'll want to make sure you cap the pool with Dataflow's maxNumWorkers option, but in general these jobs are quite efficient in terms of the number of workers required.
You should run these as two separate Dataflow jobs - each will have its own compute under the hood, which Dataflow will manage for you.
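As a sketch - flag names here are taken from the component READMEs plus standard Dataflow pipeline options, so double-check them against the release you're running; project, topic, and subscription names are placeholders, and $CONFIG / $RESOLVER are assumed to hold base64-encoded configs:

```bash
# Job 1: Beam Enrich, autoscaling capped at 4 workers.
./bin/beam-enrich \
  --runner=DataflowRunner \
  --project=my-project \
  --streaming=true \
  --raw=projects/my-project/subscriptions/raw-sub \
  --enriched=projects/my-project/topics/enriched \
  --bad=projects/my-project/topics/bad \
  --resolver=iglu_resolver.json \
  --autoscalingAlgorithm=THROUGHPUT_BASED \
  --maxNumWorkers=4

# Job 2: BigQuery Loader, a separate Dataflow job with its own worker pool.
./bin/snowplow-bigquery-loader \
  --config=$CONFIG \
  --resolver=$RESOLVER \
  --runner=DataflowRunner \
  --project=my-project \
  --autoscalingAlgorithm=THROUGHPUT_BASED \
  --maxNumWorkers=2
```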
To forward failed inserts - most commonly those that fail because of table mutations, where an event contains columns that do not yet exist in the destination table.
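To unpack that a little: the Mutator is the piece that adds the missing columns, and once they exist the Forwarder can replay the inserts that failed in the meantime. A sketch of the Mutator side, assuming the 0.1.x CLI (check the README for the exact subcommands and flags):

```bash
# One-off: create the destination table from the config.
./bin/snowplow-bigquery-mutator create \
  --config=$CONFIG \
  --resolver=$RESOLVER

# Long-running: watch for new event/entity types and alter the table
# to add the corresponding columns as they appear. Inserts that failed
# before a column existed land on the failedInserts topic, where the
# Forwarder picks them up.
./bin/snowplow-bigquery-mutator listen \
  --config=$CONFIG \
  --resolver=$RESOLVER
```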
There’s nothing to stop you running both in parallel if required - particularly if you are doing batch inserts into BigQuery rather than streaming inserts. For streaming inserts from Pub/Sub to BigQuery you don’t really need to persist the events to Cloud Storage first, though you can if required.
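Because Pub/Sub fans a topic out to every subscription attached to it, running both consumers in parallel is just a matter of giving each its own subscription on the enriched topic (topic and subscription names here are placeholders):

```bash
# One subscription per consumer; each receives a full copy of the stream.
gcloud pubsub subscriptions create enriched-bq-sub --topic=enriched
gcloud pubsub subscriptions create enriched-gcs-sub --topic=enriched
```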
Do we know the state of GCP support? We are running in GCP and I would like to run all of these components. I had been making good progress getting everything up and running until I got to the Loader/Mutator. Looking at https://github.com/snowplow-incubator/snowplow-bigquery-loader there is one commit on master and then a bunch of commits in the release/0.2.0 branch. I noticed that the forwarder does not really seem to work because of https://github.com/snowplow-incubator/snowplow-bigquery-loader/issues/15, and that the Beam SDK version is 2.6.0, which Google is flagging as deprecated and out of date - though it has been updated in the unreleased 0.2.0 branch. @volderette were you able to get the GCP setup up and working? Thanks!
I only started looking at the 0.2.0 branch because when I try to run the forwarder I get:
Exception in thread "main" java.lang.IllegalArgumentException: Pubsub subscription is not in projects/<project_id>/subscriptions/<subscription_name> format: projects/***/topics/***
at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$PubsubSubscription.fromPath(PubsubIO.java:210)
at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.fromSubscription(PubsubIO.java:594)
at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.fromSubscription(PubsubIO.java:587)
at com.snowplowanalytics.snowplow.storage.bigquery.forwarder.Forwarder$.run(Forwarder.scala:40)
at com.snowplowanalytics.snowplow.storage.bigquery.forwarder.Main$.main(Main.scala:23)
at com.snowplowanalytics.snowplow.storage.bigquery.forwarder.Main.main(Main.scala)
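From the trace it looks like PubsubIO.fromSubscription is being handed the failedInserts topic path, which matches issue 15. Assuming a build that actually accepts a subscription (the 0.2.0 branch appears to address this), the subscription itself can be created like so (names are placeholders):

```bash
# Create a subscription on the failedInserts topic; its full path
# (projects/<project_id>/subscriptions/failed-inserts-sub) is the
# format PubsubIO.fromSubscription expects.
gcloud pubsub subscriptions create failed-inserts-sub --topic=failed-inserts
```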