Hi All,
Just wanted to share the detailed steps I use to set up Snowplow on GCP.
(You may just need to update the component versions to the latest releases.)
References:
- https://www.simoahava.com/analytics/install-snowplow-on-the-google-cloud-platform/
- https://docs.snowplowanalytics.com/docs/getting-started-on-snowplow-open-source/setup-snowplow-on-gcp/
- https://github.com/kayalardanmehmet/Snowplow-GCloud-Tutorial
Installation:
- Create a new project
- Enable billing
- Enable Compute Engine API, Pub/Sub API, and Dataflow API
- Create these Pub/Sub topics: "good", "bad", "bq-bad-rows", "bq-failed-inserts", "bq-types", "enriched-bad", "enriched-good"
- Create a Pub/Sub subscription called "good" to the "good" topic
- Create a Pub/Sub subscription called "bq-types" to the "bq-types" topic
- Create a Pub/Sub subscription called "enriched-good" to the "enriched-good" topic
- Create a firewall rule in VPC network
- Give it a name "snowplow"
- Add target tag "collector"
- Add IP range "0.0.0.0/0"
- Under Protocols and ports, check tcp and type 8080 as the port value
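- A rough gcloud equivalent of the same firewall rule, using the values above:
gcloud compute firewall-rules create snowplow --direction=INGRESS --action=ALLOW --rules=tcp:8080 --source-ranges=0.0.0.0/0 --target-tags=collector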
- Create a bucket and remember its name
- Upload application.conf, bigquery_config.json, and iglu_resolver.json to it
- Create a folder in the bucket and name it "temp"
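- A gsutil sketch of the bucket step (the bucket name and location here are placeholders of mine). GCS has no real folders, so the "temp" prefix simply appears once Dataflow starts writing to gs://<bucket-name>/temp:
gsutil mb -l us-central1 gs://<bucket-name>
gsutil cp application.conf bigquery_config.json iglu_resolver.json gs://<bucket-name>/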
- Setup the collector as an instance group in front of a load balancer
- Create an instance template called "collector"
- Use Debian (10), not Ubuntu
- Add "collector" network tag
- Allow HTTP traffic
- Give access to the Pub/Sub API
- Go to https://dl.bintray.com/snowplow/snowplow-generic/ and note the latest Scala Stream Collector version
- Add the following collector startup script, with the correct bucket name and collector version filled in
#!/bin/bash
# Collector startup script: installs Java, downloads the Scala Stream Collector,
# pulls application.conf from the bucket, and starts the collector, which listens
# on the port configured in application.conf (8080 in this setup).
collector_version="1.0.1"
bucket_name="<bucket-name>"
sudo apt-get update
sudo apt-get -y install default-jre
sudo apt-get -y install unzip
sudo apt-get -y install wget
archive=snowplow_scala_stream_collector_google_pubsub_$collector_version.zip
wget https://dl.bintray.com/snowplow/snowplow-generic/$archive
gsutil cp gs://$bucket_name/application.conf .
unzip $archive
java -jar snowplow-stream-collector-google-pubsub-$collector_version.jar --config application.conf &
- Create an instance group from the template and name it "collectors"
- Make sure you selected the instance template
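- The template and group above can also be scripted. This is only a sketch: the machine type, zone, and startup-script file name (the script from the previous step saved locally) are my assumptions, and the "http-server" tag is what the console adds when you allow HTTP traffic:
gcloud compute instance-templates create collector \
  --machine-type=e2-small \
  --image-family=debian-10 --image-project=debian-cloud \
  --tags=collector,http-server \
  --scopes=pubsub \
  --metadata-from-file=startup-script=collector-startup.sh
gcloud compute instance-groups managed create collectors \
  --template=collector --size=1 --zone=us-central1-a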
- Create a health check
- Name it "snowplow"
- Protocol "HTTP"
- Port "8080"
- Request path "/health"
- Submit
- You can now check the health of the collector using "curl http://<EXTERNAL_IP_HERE>:8080/health"
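- A rough gcloud equivalent of the health check, mirroring the console values above:
gcloud compute health-checks create http snowplow --port=8080 --request-path=/health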
- Create an HTTP(S) load balancer from Network services
- Choose "From Internet to my VMs"
- Name it "collectors-load-balancer"
- Create a backend service
- Name it "snowplow"
- Choose the instance group you just created
- Set Port numbers to "8080"
- Choose the health check you just created
- Submit
- Click Host and path rules, and make sure the backend service you created is visible in the rule
- Click Frontend configuration
- Name it "snowplow"
- Protocol "HTTPS"
- Open the IP address list and click "Create IP address" to reserve a new static IP address
- Name it "snowplow" and click Reserve
- Make sure the new IP address is selected in the frontend configuration, and copy the IP address to a text editor or something. You’ll need it when configuring the DNS of your custom domain name!
- Make sure 443 is set as the Port
- In the Certificates menu, choose Create a new certificate
- Name it "snowplow"
- Choose Create Google-managed certificate
- Enter your domain (with the subdomain)
- Submit
- Click Done
- Submit
- Now go to your domain provider, add an A record for the subdomain you like, and point it to the IP you copied before
- You can test it with "host <subdomain>.<domain>"
- Remember that creating a new certificate takes some time
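- If you want to script the static IP and the Google-managed certificate parts of the frontend configuration, here is a sketch (<subdomain>.<domain> is your own hostname; the certificate only becomes active once the A record points at the load balancer IP):
gcloud compute addresses create snowplow --global
gcloud compute addresses describe snowplow --global --format='get(address)'
gcloud compute ssl-certificates create snowplow --domains=<subdomain>.<domain>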
- Create a new BigQuery dataset called "snowplow"
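- CLI equivalent, assuming your default project is set:
bq mk --dataset snowplow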
- Create a new instance template
- Name it "etl"
- Give access to:
  - BigQuery: Enabled
  - Cloud Pub/Sub: Enabled
  - Compute Engine: Read Write
  - Storage: Full
- Add the following startup script after setting the correct variable values
#!/bin/bash
# ETL startup script: installs Java 8 (pulled from the Debian stretch security repo,
# since Debian 10 only ships OpenJDK 11), downloads Beam Enrich and the BigQuery
# Loader/Mutator, fetches the configs from the bucket, and launches the Dataflow jobs.
enrich_version="1.2.3"
bq_version="0.6.1"
bucket_name="<bucket-name>"
project_id="<project-id>"
region="us-central1"
sudo apt-get update
sudo apt-get -y install unzip
sudo apt-get -y install wget
sudo apt-get install software-properties-common -y
sudo apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main'
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y
wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_beam_enrich_$enrich_version.zip
unzip snowplow_beam_enrich_$enrich_version.zip
wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_bigquery_loader_$bq_version.zip
unzip snowplow_bigquery_loader_$bq_version.zip
wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_bigquery_mutator_$bq_version.zip
unzip snowplow_bigquery_mutator_$bq_version.zip
gsutil cp gs://$bucket_name/iglu_resolver.json .
gsutil cp gs://$bucket_name/bigquery_config.json .
# Submit the Beam Enrich Dataflow job: raw events from the "good" subscription are
# enriched and written to "enriched-good"; failures go to "enriched-bad".
./beam-enrich-$enrich_version/bin/beam-enrich --runner=DataFlowRunner --project=$project_id --streaming=true --region=$region --gcpTempLocation=gs://$bucket_name/temp --job-name=beam-enrich --raw=projects/$project_id/subscriptions/good --enriched=projects/$project_id/topics/enriched-good --bad=projects/$project_id/topics/enriched-bad --resolver=iglu_resolver.json --workerMachineType=n1-standard-1 -Dscio.ignoreVersionWarning=true
# Create the BigQuery events table, then keep the mutator listening for new column
# types (via the types subscription configured in bigquery_config.json).
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator create --config $(cat bigquery_config.json | base64 -w 0) --resolver $(cat iglu_resolver.json | base64 -w 0)
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator listen --config $(cat bigquery_config.json | base64 -w 0) --resolver $(cat iglu_resolver.json | base64 -w 0) &
# Submit the BigQuery Loader Dataflow job: streams enriched events (from the
# subscription set in bigquery_config.json, "enriched-good" in this setup) into BigQuery.
./snowplow-bigquery-loader-$bq_version/bin/snowplow-bigquery-loader --config=$(cat bigquery_config.json | base64 -w 0) --resolver=$(cat iglu_resolver.json | base64 -w 0) --runner=DataFlowRunner --project=$project_id --region=$region --gcpTempLocation=gs://$bucket_name/temp --maxNumWorkers=2 --workerMachineType=n1-standard-1 --autoscalingAlgorithm=NONE
- Create a new instance group from that template
- Name it "etl"
- Turn off autoscaling
- Submit
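- A gcloud sketch of the same ETL template and group; the machine type, zone, image, and startup-script file name are my assumptions. The scope aliases map to the access settings listed above, and a managed group created this way has no autoscaler attached, which matches turning autoscaling off:
gcloud compute instance-templates create etl \
  --machine-type=n1-standard-1 \
  --image-family=debian-10 --image-project=debian-cloud \
  --scopes=bigquery,pubsub,compute-rw,storage-full \
  --metadata-from-file=startup-script=etl-startup.sh
gcloud compute instance-groups managed create etl \
  --template=etl --size=1 --zone=us-central1-a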
- You are done!
- Now send some events, and check that they arrive in BigQuery
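- A quick way to smoke-test the pipeline from the command line. The /i pixel endpoint and the e/p/tv/aid/url parameters are part of the Snowplow tracker protocol; the query assumes your bigquery_config.json points the loader at a snowplow.events table:
curl "https://<subdomain>.<domain>/i?e=pv&p=web&tv=curl&aid=test-app&url=https%3A%2F%2Fexample.com%2F"
bq query --use_legacy_sql=false 'SELECT app_id, event, collector_tstamp FROM snowplow.events ORDER BY collector_tstamp DESC LIMIT 10'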