This will depend not just on your volume but also on the number of bytes you are sending with each request. You want this to autoscale, but for something with this volume you are probably fine with a few n1-standard-1s or a smaller number of n1-standard-2s.
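For example, if the collector sits behind a managed instance group, CPU-based autoscaling can be switched on with something like this (group name, region, and thresholds are placeholders - tune them to your own traffic):

```bash
# Scale the collector's managed instance group between 1 and 4 instances,
# adding capacity once average CPU utilisation passes 75%.
gcloud compute instance-groups managed set-autoscaling snowplow-collector-group \
  --region=europe-west2 \
  --min-num-replicas=1 \
  --max-num-replicas=4 \
  --target-cpu-utilization=0.75
```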
Beam Enrich and the BigQuery Loader both run on Dataflow (which uses Compute Engine under the hood), and there is a setting to autoscale workers here. You'll want to make sure you cap the pool with Dataflow's maxNumWorkers option, but in general these jobs are quite efficient in terms of the number of workers required.
You should run these as two separate Dataflow jobs - each will have its own compute under the hood, which Dataflow will manage for you.
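As a sketch - flag names here are taken from the component READMEs plus standard Dataflow pipeline options, so double-check them against the release you're running; project, topic, and subscription names are placeholders, and $CONFIG / $RESOLVER are assumed to hold base64-encoded configs:

```bash
# Job 1: Beam Enrich, autoscaling capped at 4 workers.
./bin/beam-enrich \
  --runner=DataflowRunner \
  --project=my-project \
  --streaming=true \
  --raw=projects/my-project/subscriptions/raw-sub \
  --enriched=projects/my-project/topics/enriched \
  --bad=projects/my-project/topics/bad \
  --resolver=iglu_resolver.json \
  --autoscalingAlgorithm=THROUGHPUT_BASED \
  --maxNumWorkers=4

# Job 2: BigQuery Loader, a separate Dataflow job with its own worker pool.
./bin/snowplow-bigquery-loader \
  --config=$CONFIG \
  --resolver=$RESOLVER \
  --runner=DataflowRunner \
  --project=my-project \
  --autoscalingAlgorithm=THROUGHPUT_BASED \
  --maxNumWorkers=2
```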
To forward failed inserts - most commonly those that fail because of table mutations, where an event contains columns that do not yet exist in the destination table.
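To unpack that a little: the Mutator is the piece that adds the missing columns, and once they exist the Forwarder can replay the inserts that failed in the meantime. A sketch of the Mutator side, assuming the 0.1.x CLI (check the README for the exact subcommands and flags):

```bash
# One-off: create the destination table from the config.
./bin/snowplow-bigquery-mutator create \
  --config=$CONFIG \
  --resolver=$RESOLVER

# Long-running: watch for new event/entity types and alter the table
# to add the corresponding columns as they appear. Inserts that failed
# before a column existed land on the failedInserts topic, where the
# Forwarder picks them up.
./bin/snowplow-bigquery-mutator listen \
  --config=$CONFIG \
  --resolver=$RESOLVER
```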
There’s nothing to stop you running both in parallel if required - particularly if you are doing batch inserts into BigQuery rather than streaming inserts. For streaming inserts from Pub/Sub to BigQuery you don’t really need to persist the events to Cloud Storage first, though you can if required.
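Because Pub/Sub fans a topic out to every subscription attached to it, running both consumers in parallel is just a matter of giving each its own subscription on the enriched topic (topic and subscription names here are placeholders):

```bash
# One subscription per consumer; each receives a full copy of the stream.
gcloud pubsub subscriptions create enriched-bq-sub --topic=enriched
gcloud pubsub subscriptions create enriched-gcs-sub --topic=enriched
```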
Do we know the state of GCP support? We are running in GCP and I would like to run all of these components. I had been making good progress getting everything up and running until I got to the Loader/Mutator. Looking at https://github.com/snowplow-incubator/snowplow-bigquery-loader there is one commit on master and then a bunch of commits in the release/0.2.0 branch. I noticed that the forwarder does not really seem to work because of https://github.com/snowplow-incubator/snowplow-bigquery-loader/issues/15, and that the Beam SDK version is 2.6.0, which Google is flagging as deprecated and out of date - though it has been updated in the unreleased 0.2.0 branch. @volderette were you able to get the GCP setup up and working? Thanks!
I only started looking at the 0.2.0 branch because when I try to run the forwarder I get:
Exception in thread "main" java.lang.IllegalArgumentException: Pubsub subscription is not in projects/<project_id>/subscriptions/<subscription_name> format: projects/***/topics/***
at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$PubsubSubscription.fromPath(PubsubIO.java:210)
at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.fromSubscription(PubsubIO.java:594)
at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.fromSubscription(PubsubIO.java:587)
at com.snowplowanalytics.snowplow.storage.bigquery.forwarder.Forwarder$.run(Forwarder.scala:40)
at com.snowplowanalytics.snowplow.storage.bigquery.forwarder.Main$.main(Main.scala:23)
at com.snowplowanalytics.snowplow.storage.bigquery.forwarder.Main.main(Main.scala)
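From the trace it looks like PubsubIO.fromSubscription is being handed the failedInserts topic path, which matches issue 15. Assuming a build that actually accepts a subscription (the 0.2.0 branch appears to address this), the subscription itself can be created like so (names are placeholders):

```bash
# Create a subscription on the failedInserts topic; its full path
# (projects/<project_id>/subscriptions/failed-inserts-sub) is the
# format PubsubIO.fromSubscription expects.
gcloud pubsub subscriptions create failed-inserts-sub --topic=failed-inserts
```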