GCP Iglu Server Health Checks Failing

I am going through the GCP Quick Start guide and have hit an issue where the Iglu Server instance fails its health checks and keeps restarting. The Terraform deploy times out after 20 minutes waiting for the health checks to pass:
module.iglu_server.module.service.google_compute_region_instance_group_manager.grp: Still creating…

  • I can SSH into the instance with the GCP UI.
  • I have activated the 5 appropriate APIs in the quick start guide (Compute Engine API, Cloud Resource Manager API, Identity and Access Management (IAM) API, Cloud Pub/Sub API, Cloud SQL Admin API)
  • I set up a Cloud NAT, and confirmed that I get a response when I curl example.com while SSH’d into the instance
  • The logs for the instance don’t appear to give any errors, only system message notices about it booting and being recreated

There are several topics with similar issues, but none of them solved it. I am on Windows and even tried converting the .tf files with dos2unix.

I am not sure how to verify that the server is running correctly. When I SSH in, there is no /opt/snowplow folder, which is mentioned in the startup-script of the iglu_server module.

Appreciate any help, thanks!

Here is my config, stripped of credentials:

# Please accept the terms of the Snowplow Limited Use License Agreement to proceed. (https://docs.snowplow.io/limited-use-license-1.0/)
accept_limited_use_license = true

# Will be prefixed to all resource names
# Use this to easily identify the resources created and provide entropy for subsequent environments
prefix = "sp"

# The project to deploy the infrastructure into
project_id = "project-4358349857394"

# Where to deploy the infrastructure
region = "us-central1"

# --- Network
# NOTE: The network & sub-network configured must be configured with a Cloud NAT to allow the deployed Compute Engine instances to
#       connect to the internet to download the required assets
network    = "default"
subnetwork = ""

# --- SSH
# Update this to the internal IP of your Bastion Host
ssh_ip_allowlist = ["XX.XX.XX.XX/32"]
# Generate a new SSH key locally with `ssh-keygen`
# ssh-keygen -t rsa -b 4096 
# ssh_key_pairs = []
ssh_key_pairs = [
  {
    user_name  = "snowplow"
    public_key = "MY_PUBLIC_KEY"
  }
]

# --- Snowplow Iglu Server
iglu_db_name     = "iglu"
iglu_db_username = "iglu"
# Change and keep this secret!
iglu_db_password = "MY_PASSWORD"

# Used for API actions on the Iglu Server
# Change this to a new UUID and keep it secret!
iglu_super_api_key = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# NOTE: To push schemas to your Iglu Server, you can use igluctl
# igluctl: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/iglu/igluctl
# igluctl static push --public schemas/ http://CHANGE-TO-MY-IGLU-IP 00000000-0000-0000-0000-000000000000

# See for more information: https://github.com/snowplow-devops/terraform-google-iglu-server-ce#telemetry
# Telemetry principles: https://docs.snowplowanalytics.com/docs/open-source-quick-start/what-is-the-quick-start-for-open-source/telemetry-principles/
user_provided_id  = ""
telemetry_enabled = false

# --- SSL Configuration (optional)
ssl_information = {
  certificate_id = ""
  enabled        = false
}

# --- Extra Labels to append to created resources (optional)
labels = {}

Edit - I noticed HTTP traffic is off for the instances, not sure if that matters

If there’s no /opt/snowplow/config dir then something may have failed earlier in the startup script.

Is Docker successfully installed on the instance when you SSH into it? Do you get any error output from the startup script?
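
Both questions can be checked from an SSH session with something like the sketch below. The helper names are made up for illustration, and the log command assumes a systemd-based image where the guest agent records startup-script output under the google-startup-scripts.service unit.

```shell
# Report whether the docker CLI is on PATH.
# docker_status is a hypothetical helper name used only in this sketch.
docker_status() {
  if command -v docker >/dev/null 2>&1; then
    echo "docker: installed ($(command -v docker))"
  else
    echo "docker: not found"
  fi
}

# Print the startup-script output recorded by the guest agent.
# Assumption: the image uses systemd and the unit is google-startup-scripts.service.
startup_script_logs() {
  sudo journalctl -u google-startup-scripts.service --no-pager
}
```

Run docker_status first, then startup_script_logs to see where the script stopped.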

Appreciate the help!
Docker is not installed, and no errors are coming from the startup script.

Hm. There has been some churn in the Debian packages for Docker recently: the docker.io package has been split into docker.io and docker-cli. I’m not sure if, how, or why that would have affected the Ubuntu image used here, though.

Maybe try re-running the script manually with sudo google_metadata_script_runner startup (per this) to see if/where it continues to fail.

Looks like there is an error returned when I run that

Starting startup scripts (version 20231004.02-0ubuntu1~20.04.4).
Found startup-script in metadata.
startup-script: /bin/bash: /tmp/metadata-scripts583314085/startup-script: /bin/bash^M: bad interpreter: No such file or directory
startup-script exit status 126
Finished running startup scripts.

Funnily enough, maybe it is related to line endings after all? I was going to see if I could fix the line endings, but the script is in /tmp and its path changes each time: newline - Bash script – "/bin/bash^M: bad interpreter: No such file or directory" - Stack Overflow

Edit - Ok maybe not, I ran that command on /usr/bin/google_metadata_script_runner and tried again and got a Segmentation Fault
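
Because the copy under /tmp moves around, one option is to read the startup script straight from the instance metadata server and look at the raw bytes of its first line. This is only a sketch: show_shebang_bytes is a hypothetical helper name, and the curl line has to be run on the instance itself.

```shell
# Dump the raw bytes of the first line read from stdin.
# A "\r" just before the "\n" means the shebang line has Windows (CRLF) endings,
# which is what produces the "/bin/bash^M: bad interpreter" error.
# show_shebang_bytes is a hypothetical helper name used only in this sketch.
show_shebang_bytes() {
  head -n 1 | od -c
}

# Usage on the instance (needs access to the GCE metadata server):
#   curl -s -H "Metadata-Flavor: Google" \
#     "http://metadata.google.internal/computeMetadata/v1/instance/attributes/startup-script" \
#     | show_shebang_bytes
```

This only inspects the script; it does not change anything on the instance.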

Ah, right. OK, so the line endings of the .tf files themselves shouldn’t matter, but those of the template files that the startup script is built from probably do.

So you probably want at least:

  • dos2unix .terraform/modules/iglu_server.service/templates/startup-script.sh.tmpl
  • dos2unix .terraform/modules/iglu_server/templates/startup-script.sh.tmpl
  • dos2unix .terraform/modules/iglu_server.telemetry/templates/user-data.sh.tmpl
  • dos2unix .terraform/modules/iglu_server/templates/config.hocon.tmpl
  • dos2unix .terraform/modules/iglu_server.telemetry/templates/gcp_ubuntu_20_04.sh.tmpl

after you have done terraform init, and then do another terraform apply to update the userdata and try again.
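
To confirm which templates still carry Windows line endings before re-applying, a read-only check along these lines may help. It is a sketch: list_crlf_templates is a hypothetical helper name, and the default directory is assumed from the module paths above.

```shell
# List every *.tmpl file under a directory that still contains a carriage return.
# list_crlf_templates is a hypothetical helper name used only in this sketch.
# Default directory is an assumption based on the module paths in this thread.
list_crlf_templates() {
  cr=$(printf '\r')
  find "${1:-.terraform/modules}" -type f -name '*.tmpl' -exec grep -l "$cr" {} +
}
```

Calling list_crlf_templates from the Terraform working directory after dos2unix should print nothing if every template was converted.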


You are amazing. That solved my issue and allowed the startup script to start working. The logs also showed I had to activate the Cloud Logging API and now it is all working.

Thank you so much
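
For anyone who hits the same log message, the Cloud Logging API can also be enabled from the command line rather than the console. A minimal sketch, assuming the gcloud CLI is installed and authenticated; enable_logging_api is a hypothetical wrapper name and the project ID is passed in by the caller.

```shell
# Enable the Cloud Logging API for a project.
# enable_logging_api is a hypothetical wrapper name used only in this sketch.
# $1: the GCP project ID to enable the API in.
enable_logging_api() {
  gcloud services enable logging.googleapis.com --project "$1"
}

# Usage:
#   enable_logging_api "my-project-id"
```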

Running this in Git Bash was a quick way to convert all the .tmpl files up to 4 levels deep:

find . -maxdepth 4 -type f -name "*.tmpl" -exec dos2unix -v {} +