Hi All,
Just wanted to share the detailed steps I use to set up Snowplow on GCP.
(You may just need to update the component versions to the latest releases.)
References:
- https://www.simoahava.com/analytics/install-snowplow-on-the-google-cloud-platform/
- https://docs.snowplowanalytics.com/docs/getting-started-on-snowplow-open-source/setup-snowplow-on-gcp/
- https://github.com/kayalardanmehmet/Snowplow-GCloud-Tutorial
Installation:
- Create a new project
- Enable billing
- Enable Compute Engine API, Pub/Sub API, and Dataflow API
- Create these Pub/Sub topics: "good", "bad", "bq-bad-rows", "bq-failed-inserts", "bq-types", "enriched-bad", "enriched-good"
- Create a Pub/Sub subscription called "good" to the "good" topic
- Create a Pub/Sub subscription called "bq-types" to the "bq-types" topic
- Create a Pub/Sub subscription called "enriched-good" to the "enriched-good" topic
- Create a firewall rule in VPC network
- Give it a name "snowplow"
- Add target tag "collector"
- Add IP range "0.0.0.0/0"
- Under Protocols and ports, check tcp and type 8080 as the port value
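- A rough gcloud equivalent of the same firewall rule, using the values above:
gcloud compute firewall-rules create snowplow --direction=INGRESS --action=ALLOW --rules=tcp:8080 --source-ranges=0.0.0.0/0 --target-tags=collector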
- Create a bucket and remember its name
- Upload application.conf, bigquery_config.json, and iglu_resolver.json to it
- Create a folder in the bucket and name it "temp"
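- A gsutil sketch of the bucket step (the bucket name and location here are placeholders of mine). GCS has no real folders, so the "temp" prefix simply appears once Dataflow starts writing to gs://<bucket-name>/temp:
gsutil mb -l us-central1 gs://<bucket-name>
gsutil cp application.conf bigquery_config.json iglu_resolver.json gs://<bucket-name>/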
- Setup the collector as an instance group in front of a load balancer
- Create an instance template called "collector"
- Use Debian (10), not Ubuntu
- Add "collector" network tag
- Allow HTTP traffic
- Give access to the Pub/Sub API
- Go to https://dl.bintray.com/snowplow/snowplow-generic/ and note the latest Scala Stream Collector version
- Add the following collector startup script, with the correct bucket name and collector version filled in
#!/bin/bash
# Collector startup script: installs Java, downloads the Scala Stream Collector,
# pulls application.conf from the bucket, and starts the collector, which listens
# on the port configured in application.conf (8080 in this setup).
collector_version="1.0.1"
bucket_name="<bucket-name>"
sudo apt-get update
sudo apt-get -y install default-jre
sudo apt-get -y install unzip
sudo apt-get -y install wget
archive=snowplow_scala_stream_collector_google_pubsub_$collector_version.zip
wget https://dl.bintray.com/snowplow/snowplow-generic/$archive
gsutil cp gs://$bucket_name/application.conf .
unzip $archive
java -jar snowplow-stream-collector-google-pubsub-$collector_version.jar --config application.conf &
- Create an instance group from the template and name it "collectors"
- Make sure you selected the instance template
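- The template and group above can also be scripted. This is only a sketch: the machine type, zone, and startup-script file name (the script from the previous step saved locally) are my assumptions, and the "http-server" tag is what the console adds when you allow HTTP traffic:
gcloud compute instance-templates create collector \
  --machine-type=e2-small \
  --image-family=debian-10 --image-project=debian-cloud \
  --tags=collector,http-server \
  --scopes=pubsub \
  --metadata-from-file=startup-script=collector-startup.sh
gcloud compute instance-groups managed create collectors \
  --template=collector --size=1 --zone=us-central1-a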
- Create a health check
- Name it "snowplow"
- Protocol "HTTP"
- Port "8080"
- Request path "/health"
- Submit
- You can now check the health of the collector using "curl http://<EXTERNAL_IP_HERE>:8080/health"
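- A rough gcloud equivalent of the health check, mirroring the console values above:
gcloud compute health-checks create http snowplow --port=8080 --request-path=/health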
- Create an HTTP(S) load balancer from Network services
- Choose "From Internet to my VMs"
- Name it "collectors-load-balancer"
- Create a backend service
- Name it "snowplow"
- Choose the instance group you just created
- Set Port numbers to "8080"
- Choose the health check you just created
- Submit
- Click Host and path rules, and make sure the backend service you created is visible in the rule
- Click Frontend configuration
- Name it "snowplow"
- Protocol "HTTPS"
- Open the IP address list and click "Create IP address" to reserve a new static IP address
- Name it "snowplow" and click Reserve
- Make sure the new IP address is selected in the frontend configuration, and copy the IP address to a text editor or something. You’ll need it when configuring the DNS of your custom domain name!
- Make sure 443 is set as the Port
- In the Certificates menu, choose Create a new certificate
- Name it "snowplow"
- Choose Create Google-managed certificate
- Enter your domain (with the subdomain)
- Submit
- Click Done
- Submit
- Now go to your domain provider, add an A record for the subdomain you like, and point it to the IP you copied before
- You can test it with "host <subdomain>.<domain>"
- Remember that creating a new certificate takes some time
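- If you want to script the static IP and the Google-managed certificate parts of the frontend configuration, here is a sketch (<subdomain>.<domain> is your own hostname; the certificate only becomes active once the A record points at the load balancer IP):
gcloud compute addresses create snowplow --global
gcloud compute addresses describe snowplow --global --format='get(address)'
gcloud compute ssl-certificates create snowplow --domains=<subdomain>.<domain>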
- Create a new BigQuery dataset called "snowplow"
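- CLI equivalent, assuming your default project is set:
bq mk --dataset snowplow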
- Create a new instance template
- Name it "etl"
- Give access to:
  - BigQuery: Enabled
  - Cloud Pub/Sub: Enabled
  - Compute Engine: Read Write
  - Storage: Full
- Add the following startup script after setting the correct variable values
#!/bin/bash
# ETL startup script: installs Java 8 (pulled from the Debian stretch security repo,
# since Debian 10 only ships OpenJDK 11), downloads Beam Enrich and the BigQuery
# Loader/Mutator, fetches the configs from the bucket, and launches the Dataflow jobs.
enrich_version="1.2.3"
bq_version="0.6.1"
bucket_name="<bucket-name>"
project_id="<project-id>"
region="us-central1"
sudo apt-get update
sudo apt-get -y install unzip
sudo apt-get -y install wget
sudo apt-get install software-properties-common -y
sudo apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main'
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y
wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_beam_enrich_$enrich_version.zip
unzip snowplow_beam_enrich_$enrich_version.zip
wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_bigquery_loader_$bq_version.zip
unzip snowplow_bigquery_loader_$bq_version.zip
wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_bigquery_mutator_$bq_version.zip
unzip snowplow_bigquery_mutator_$bq_version.zip
gsutil cp gs://$bucket_name/iglu_resolver.json .
gsutil cp gs://$bucket_name/bigquery_config.json .
# Submit the Beam Enrich Dataflow job: raw events from the "good" subscription are
# enriched and written to "enriched-good"; failures go to "enriched-bad".
./beam-enrich-$enrich_version/bin/beam-enrich --runner=DataFlowRunner --project=$project_id --streaming=true --region=$region --gcpTempLocation=gs://$bucket_name/temp --job-name=beam-enrich --raw=projects/$project_id/subscriptions/good --enriched=projects/$project_id/topics/enriched-good --bad=projects/$project_id/topics/enriched-bad --resolver=iglu_resolver.json --workerMachineType=n1-standard-1 -Dscio.ignoreVersionWarning=true
# Create the BigQuery events table, then keep the mutator listening for new column
# types (via the types subscription configured in bigquery_config.json).
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator create --config $(cat bigquery_config.json | base64 -w 0) --resolver $(cat iglu_resolver.json | base64 -w 0)
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator listen --config $(cat bigquery_config.json | base64 -w 0) --resolver $(cat iglu_resolver.json | base64 -w 0) &
# Submit the BigQuery Loader Dataflow job: streams enriched events (from the
# subscription set in bigquery_config.json, "enriched-good" in this setup) into BigQuery.
./snowplow-bigquery-loader-$bq_version/bin/snowplow-bigquery-loader --config=$(cat bigquery_config.json | base64 -w 0) --resolver=$(cat iglu_resolver.json | base64 -w 0) --runner=DataFlowRunner --project=$project_id --region=$region --gcpTempLocation=gs://$bucket_name/temp --maxNumWorkers=2 --workerMachineType=n1-standard-1 --autoscalingAlgorithm=NONE
- Create a new instance group from that template
- Name it "etl"
- Turn off autoscaling
- Submit
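- A gcloud sketch of the same ETL template and group; the machine type, zone, image, and startup-script file name are my assumptions. The scope aliases map to the access settings listed above, and a managed group created this way has no autoscaler attached, which matches turning autoscaling off:
gcloud compute instance-templates create etl \
  --machine-type=n1-standard-1 \
  --image-family=debian-10 --image-project=debian-cloud \
  --scopes=bigquery,pubsub,compute-rw,storage-full \
  --metadata-from-file=startup-script=etl-startup.sh
gcloud compute instance-groups managed create etl \
  --template=etl --size=1 --zone=us-central1-a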
- You are done!
- Now send some events, and check that they arrive in BigQuery
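- A quick way to smoke-test the pipeline from the command line. The /i pixel endpoint and the e/p/tv/aid/url parameters are part of the Snowplow tracker protocol; the query assumes your bigquery_config.json points the loader at a snowplow.events table:
curl "https://<subdomain>.<domain>/i?e=pv&p=web&tv=curl&aid=test-app&url=https%3A%2F%2Fexample.com%2F"
bq query --use_legacy_sql=false 'SELECT app_id, event, collector_tstamp FROM snowplow.events ORDER BY collector_tstamp DESC LIMIT 10'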