Setup Snowplow on GCP

Hi All,

Just wanted to share the detailed steps I use to set up Snowplow on GCP:
(You may need to update the component versions to the latest ones)


- Create a new project
- Enable billing
- Enable Compute Engine API, Pub/Sub API, and Dataflow API
- Create these Pub/Sub topics, "good", "bad", "bq-bad-rows", "bq-failed-inserts", "bq-types", "enriched-bad", "enriched-good"
- Create a Pub/Sub subscription to the "good" topic called "good"
- Create a Pub/Sub subscription to the "bq-types" topic called "bq-types"
- Create a Pub/Sub subscription to the "enriched-good" topic called "enriched-good"
- Create a firewall rule in VPC network
  - Give it a name "snowplow"
  - Add target tag "collector"
  - Add IP range ""
  - Under Protocols and ports, check tcp and type 8080 as the port value
- Create a bucket and remember its name
- Upload application.conf, bigquery_config.json, and iglu_resolver.json to it
- Create a folder in the bucket and name it "temp"
- Set up the collector as an instance group behind a load balancer
  - Create an instance template called "collector"
  - Use Debian 10, not Ubuntu
  - Add "collector" network tag
  - Allow HTTP traffic
  - Give access to the Pub/Sub API
  - Go to and get the latest scala stream collector version
  - And add the following collector startup script with the correct bucket name and collector version
#! /bin/bash
sudo apt-get update
sudo apt-get -y install default-jre
sudo apt-get -y install unzip
sudo apt-get -y install wget
gsutil cp gs://$bucket_name/application.conf .
unzip $archive
java -jar snowplow-stream-collector-google-pubsub-$collector_version.jar --config application.conf &
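
If you would rather script the first few console steps than click through them, they could look roughly like the sketch below. The function name setup_snowplow_base and its two arguments are mine, not part of Snowplow or gcloud. The firewall source range is deliberately left as an argument you have to supply yourself, and the sketch assumes the three config files are in the current directory.

```shell
#!/bin/bash
# Sketch only: enables the APIs, creates the Pub/Sub resources, the firewall
# rule and the bucket described in the steps above.
# setup_snowplow_base is a hypothetical helper name.
# $1 = bucket name, $2 = firewall source range (you must choose this yourself).
setup_snowplow_base() {
  local bucket_name="$1" source_ranges="$2" topic sub

  gcloud services enable compute.googleapis.com pubsub.googleapis.com dataflow.googleapis.com

  # The seven topics from the list above.
  for topic in good bad bq-bad-rows bq-failed-inserts bq-types enriched-bad enriched-good; do
    gcloud pubsub topics create "$topic"
  done

  # Each subscription has the same name as the topic it reads from.
  for sub in good bq-types enriched-good; do
    gcloud pubsub subscriptions create "$sub" --topic="$sub"
  done

  # Firewall rule "snowplow": tcp:8080 for instances tagged "collector".
  gcloud compute firewall-rules create snowplow \
    --allow=tcp:8080 --target-tags=collector --source-ranges="$source_ranges"

  # Bucket plus the three config files.
  gsutil mb "gs://$bucket_name"
  gsutil cp application.conf bigquery_config.json iglu_resolver.json "gs://$bucket_name/"
}

# Usage (values are placeholders):
#   setup_snowplow_base "$bucket_name" "$source_ranges"
```

Run it once per project, after selecting the project with gcloud config set project.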

- Create an instance group from the template and name it "collectors"
  - Make sure you selected the instance template
  - Create a health check
    - Name it "snowplow"
    - Protocol "HTTP"
    - Port "8080"
    - Request path "/health"
  - Submit
  - You can now check the health of the collector using "curl http://<EXTERNAL_IP_HERE>:8080/health"
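
The collector template, health check, and instance group can also be created from the command line. This is a sketch under a few assumptions of mine: the helper name create_collector_group is hypothetical, the startup script above is saved to a local file, the group starts with one instance, and the scopes "default,pubsub" stand in for "Give access to the Pub/Sub API". The http-server tag is what the console's "Allow HTTP traffic" checkbox adds.

```shell
#!/bin/bash
# Sketch only: instance template "collector", health check "snowplow",
# managed instance group "collectors".
# create_collector_group is a hypothetical helper name.
# $1 = zone, $2 = path to the collector startup script saved locally.
create_collector_group() {
  local zone="$1" startup_script="$2"

  gcloud compute instance-templates create collector \
    --image-family=debian-10 --image-project=debian-cloud \
    --tags=collector,http-server \
    --scopes=default,pubsub \
    --metadata-from-file=startup-script="$startup_script"

  gcloud compute health-checks create http snowplow \
    --port=8080 --request-path=/health

  # --size=1 and --initial-delay=300 are assumptions; adjust to taste.
  gcloud compute instance-groups managed create collectors \
    --template=collector --size=1 --zone="$zone" \
    --health-check=snowplow --initial-delay=300
}

# Usage (values are placeholders):
#   create_collector_group "$zone" collector-startup.sh
```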

- Create an HTTP(S) load balancer from Network services
  - Choose "From Internet to my VMs"
  - Name it "collectors-load-balancer"
  - Create a backend service
    - Name it "snowplow"
    - Choose the instance group you just created
    - Set Port numbers to "8080"
    - Choose the health check you just created
    - Submit
  - Click Host and path rules, and make sure the backend service you created is visible in the rule
  - Click Frontend configuration
    - Name it "snowplow"
    - Protocol "HTTPS"
    - Select the IP Address list, and click create IP address to reserve a new static IP address
      - Name it "snowplow", and click reserve
    - Make sure the new IP address is selected in the frontend configuration, and note the address down. You’ll need it when configuring the DNS of your custom domain name!
    - Make sure 443 is set as the Port
    - In the Certificates menu, choose Create a new certificate
      - Name it "snowplow"
      - Choose Create Google-managed certificate
      - Enter your domain (with the subdomain)
      - Submit
    - Click Done
  - Submit
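
For reference, the console wizard above bundles several separate resources. A rough gcloud equivalent might look like the sketch below. Treat it as an assumption-heavy outline rather than a tested recipe: create_collector_lb and the proxy name snowplow-https-proxy are names I made up, and the named port "http" is how the backend service is told to use port 8080.

```shell
#!/bin/bash
# Sketch only: HTTPS load balancer in front of the "collectors" group.
# create_collector_lb and snowplow-https-proxy are hypothetical names.
# $1 = zone of the instance group, $2 = full domain for the certificate.
create_collector_lb() {
  local zone="$1" domain="$2"

  # Map the named port "http" to 8080 on the instance group.
  gcloud compute instance-groups set-named-ports collectors \
    --named-ports=http:8080 --zone="$zone"

  # Backend service "snowplow" using the "snowplow" health check.
  gcloud compute backend-services create snowplow \
    --protocol=HTTP --port-name=http --health-checks=snowplow --global
  gcloud compute backend-services add-backend snowplow \
    --instance-group=collectors --instance-group-zone="$zone" --global

  # URL map (this is the load balancer name shown in the console).
  gcloud compute url-maps create collectors-load-balancer --default-service=snowplow

  # Static IP and Google-managed certificate, both named "snowplow".
  gcloud compute addresses create snowplow --global
  gcloud compute ssl-certificates create snowplow --domains="$domain" --global

  # HTTPS frontend on port 443.
  gcloud compute target-https-proxies create snowplow-https-proxy \
    --url-map=collectors-load-balancer --ssl-certificates=snowplow
  gcloud compute forwarding-rules create snowplow --global \
    --address=snowplow --target-https-proxy=snowplow-https-proxy --ports=443
}

# Usage (values are placeholders):
#   create_collector_lb "$zone" "<subdomain>.<domain>"
```

The reserved IP can then be read back with gcloud compute addresses describe snowplow --global.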

- Now go to your domain provider, add an A record for the subdomain you want, and point it to the IP address you noted earlier
- You can test it with "host <subdomain>.<domain>"
- Remember that creating a new certificate takes some time
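
Since the certificate can take a while, a small polling loop saves you from re-running curl by hand. This is just a sketch: wait_for_collector is a helper name I made up, and the defaults (30 attempts, 60 seconds apart) are arbitrary. It assumes the load balancer forwards /health to the collector.

```shell
#!/bin/bash
# Sketch only: poll the collector's /health endpoint over HTTPS until it
# answers 200. wait_for_collector is a hypothetical helper name.
# $1 = host, $2 = max attempts (default 30), $3 = seconds between tries (default 60).
wait_for_collector() {
  local host="$1" attempts="${2:-30}" delay="${3:-60}" code i

  for ((i = 1; i <= attempts; i++)); do
    # While DNS or the certificate is not ready, curl reports code 000.
    code="$(curl -s -o /dev/null -w '%{http_code}' "https://$host/health")"
    if [ "$code" = "200" ]; then
      echo "collector healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done

  echo "collector not healthy after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_for_collector "<subdomain>.<domain>"
```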

- Create a new BigQuery dataset called "snowplow"

- Create a new instance template
  - Name it "etl"
  - Give access to:
    - BigQuery: Enabled
    - Cloud Pub/Sub: Enabled
    - Compute Engine: Read Write
    - Storage: Full
  - Add the following startup script after setting the correct variable values

#! /bin/bash
sudo apt-get update
sudo apt-get -y install unzip
sudo apt-get -y install wget
sudo apt-get install software-properties-common -y
sudo apt-add-repository 'deb stretch/updates main'
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y
unzip snowplow_beam_enrich_$
unzip snowplow_bigquery_loader_$
unzip snowplow_bigquery_mutator_$
gsutil cp gs://$bucket_name/iglu_resolver.json .
gsutil cp gs://$bucket_name/bigquery_config.json .
./beam-enrich-$enrich_version/bin/beam-enrich --runner=DataFlowRunner --project=$project_id --streaming=true --region=$region --gcpTempLocation=gs://$bucket_name/temp --job-name=beam-enrich --raw=projects/$project_id/subscriptions/good --enriched=projects/$project_id/topics/enriched-good --bad=projects/$project_id/topics/enriched-bad --resolver=iglu_resolver.json --workerMachineType=n1-standard-1 -Dscio.ignoreVersionWarning=true
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator create --config $(cat bigquery_config.json | base64 -w 0) --resolver $(cat iglu_resolver.json | base64 -w 0)
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator listen --config $(cat bigquery_config.json | base64 -w 0) --resolver $(cat iglu_resolver.json | base64 -w 0) &
./snowplow-bigquery-loader-$bq_version/bin/snowplow-bigquery-loader --config=$(cat bigquery_config.json | base64 -w 0) --resolver=$(cat iglu_resolver.json | base64 -w 0) --runner=DataFlowRunner --project=$project_id --region=$region --gcpTempLocation=gs://$bucket_name/temp --maxNumWorkers=2 --workerMachineType=n1-standard-1 --autoscalingAlgorithm=NONE
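
The script above base64-encodes the same two files three times. If you prefer, you can encode each once and reuse the result. A minimal sketch, where encode_config is a helper name of my own and base64 -w 0 assumes GNU coreutils (as on Debian):

```shell
#!/bin/bash
# Sketch only: encode a file as single-line base64, the form the mutator
# and loader flags above are given. encode_config is a hypothetical helper.
encode_config() {
  base64 -w 0 < "$1"
}

# Usage inside the startup script:
#   config_b64="$(encode_config bigquery_config.json)"
#   resolver_b64="$(encode_config iglu_resolver.json)"
#   ./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator create \
#     --config "$config_b64" --resolver "$resolver_b64"
```

Quoting the variables also protects the arguments from word splitting.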

- Create a new instance group from that template
  - Name it "etl"
  - Turn off autoscaling
  - Submit

- You are done!
- Now send some events, and check them in BigQuery
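
For a quick smoke test without a tracker, something like the sketch below should work. It is based on my assumptions, not on the configs above: both function names are made up, the field values are arbitrary test data sent to the collector's /i pixel endpoint, and the table name depends on what you put in bigquery_config.json, so it is passed in as an argument.

```shell
#!/bin/bash
# Sketch only: send one page view to the collector and count rows in BigQuery.
# send_test_pageview and count_events are hypothetical helper names.

# $1 = collector host. Prints the HTTP status code returned by the collector.
send_test_pageview() {
  local collector_host="$1"
  curl -s -o /dev/null -w '%{http_code}\n' -G "https://$collector_host/i" \
    --data-urlencode "e=pv" \
    --data-urlencode "p=web" \
    --data-urlencode "tv=manual-test" \
    --data-urlencode "aid=setup-check" \
    --data-urlencode "url=https://example.com/test"
}

# $1 = fully qualified table, e.g. project.dataset.table from bigquery_config.json.
count_events() {
  local table="$1"
  bq query --use_legacy_sql=false "SELECT COUNT(*) AS events FROM \`$table\`"
}

# Usage (values are placeholders):
#   send_test_pageview "<subdomain>.<domain>"
#   count_events "$project_id.snowplow.<table>"
```

Give the Dataflow jobs a few minutes before expecting the row to show up. If it never does, check the "enriched-bad" and "bq-bad-rows" topics.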

Thanks for this!


You’re Welcome
