Real-time visualization

Hi. We are exploring migrating from GA3 to our own Snowplow pipeline running in AWS. Our users want real-time stats: the most popular pages, how many visitors each page gets, etc.

Has anyone implemented a visualization tool like OpenSearch Dashboards, Kibana, etc., or implemented some other architecture to make this happen? If so, how did you make it work?

Right now we’ve set up the AWS bootstrap, which uses Kinesis. We have that loading into Redshift but need a different tool/architecture for real-time.

Thanks

Hi @yangabanga - loading into Elasticsearch / OpenSearch would likely be the best strategy here to deliver real-time dashboards. We also have an OS Terraform Module for this which you can use as a way to get started quickly.

The alternative, if Elasticsearch / Kibana is not something you are interested in, would be to set up the Postgres Loader, which is also real-time / streaming. It is not battle-tested for really high ingress rates, however, so how well it works might vary.


With OpenSearch you get Kibana included, so you can easily spin up dashboards that way, but with Postgres you are likely going to have an easier time plugging it into your existing BI tools and querying it in much the same syntax as Redshift.

Would either of these options work for what you are looking to implement?

Yes, looks like OpenSearch/Kibana could be the way forward. Thanks!

@josh might be a dumb question, but will this module work with AWS OpenSearch and OpenSearch Dashboards, or is it Elasticsearch only?

Internally we exclusively use the Elasticsearch Loader with AWS OpenSearch clusters, which come with Kibana out of the box as well. So yes, it for sure works!

Hi @josh, thank you for the link above. I also have a question related to this. My infra was launched via the AWS Quickstart with the Postgres Loader. I would like to add the capability for a real-time dashboard with AWS OpenSearch / Kibana. Can the module (Terraform Registry) be added as-is to the quick-start Terraform, or do I have to build new infra with the Elasticsearch Loader? Thank you!

Hi @alvin, so the quick-start won’t support it, but you can simply edit the Terraform to add the extra module for the ES Loader and plug it into the existing Kinesis streams you have already created → ultimately forking the quick-start repo to make it your own.

So you don’t need to start from scratch - just add the extra Terraform in for the loader you want!
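
For illustration only, a minimal sketch of what that extra module block could look like once plugged into the existing quick-start variables. The loader variable names come from the module itself; the stream names, key pair, endpoint and index values below are placeholders/assumptions, not actual quick-start outputs:

module "elasticsearch_loader" {
  source  = "snowplow-devops/elasticsearch-loader-kinesis-ec2/aws"
  version = "0.1.1"

  name       = "${var.prefix}-es-loader"
  vpc_id     = var.vpc_id              # reuse the quick-start VPC
  subnet_ids = var.private_subnet_ids  # reuse the quick-start private subnets

  ssh_key_name = "pipeline-key"        # placeholder: name of an existing EC2 key pair

  # Plug into the Kinesis streams the quick-start already created (names are placeholders)
  in_stream_name  = "${var.prefix}-enriched-stream"
  in_stream_type  = "good"             # assumption: the type used for enriched events
  bad_stream_name = "${var.prefix}-bad-1-stream"

  # The target OpenSearch / Elasticsearch domain, created separately
  es_cluster_name          = "snowplow-es"
  es_cluster_endpoint      = "vpc-snowplow-es-xxxx.eu-west-1.es.amazonaws.com"  # placeholder
  es_cluster_port          = 443
  es_cluster_index         = "snowplow-enriched"
  es_cluster_document_type = "good"    # assumption
}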

Hi @josh, thank you for your response! I truly appreciate it. I am encountering a blocking issue with the Terraform module (Terraform Registry).

│ The given value is not suitable for module.elasticsearch-loader-kinesis-ec2.var.subnet_ids declared at
│ .terraform/modules/elasticsearch-loader-kinesis-ec2/variables.tf:11,1-22: list of string required.

I am confident that I declared the variable subnet_ids as a list of strings. The values I used for subnet_ids are the private subnets that PostgreSQL is using. I have also tried two different sets of values for the required Elasticsearch variables, but both the AWS OpenSearch engine (OpenSearch 1.3) and Elasticsearch 7.10 throw the same error for subnet_ids in the module.

subnet_ids = ["subnet-XXXXXXXXXXX", "subnet-XXXXXXXXXXX"]

Would you be able to help me check this module? I can only use versions 0.1.0 and 0.1.1 due to version constraints on my current AWS Quickstart infrastructure. Thank you!

Hi @alvin, can you share your Terraform here please, as it will make it much easier to debug?

yes sure!

Here it is @josh

I added this to my main.tf (pipeline\main.tf):

module "elasticsearch-loader-kinesis-ec2" {
  source  = "snowplow-devops/elasticsearch-loader-kinesis-ec2/aws"
  version = "0.1.0"
  bad_stream_name          = var.bad_stream_name
  es_cluster_endpoint      = var.es_cluster_endpoint
  es_cluster_index         = var.es_cluster_index
  es_cluster_port          = var.es_cluster_port
  in_stream_name           = var.in_stream_name
  in_stream_type           = var.in_stream_type
  name                     = var.name
  ssh_key_name             = var.ssh_key_name
  subnet_ids               = var.subnet_ids
  vpc_id                   = var.vpc_id
  es_cluster_document_type = var.es_cluster_document_type
  es_cluster_name          = var.es_cluster_name
}

Then I declared the variables for the ES values in variables.tf (pipeline\variables.tf) and added the values to the same tfvars file used for Postgres (pipeline\postgres.terraform.tfvars).

When I run the plan below, the error occurs:

terraform plan -var-file=postgres.terraform.tfvars

So the type on the module side is set correctly - can you also share the vars file that is being used, redacting anything sensitive?

Upgrading to the latest module versions is always recommended here, so it’s worth upgrading the other quick-start modules to remove this limitation.
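
As a rough illustration, removing that kind of limitation in a forked quick-start usually comes down to relaxing the aws provider pin in the stack's versions file. A minimal sketch, assuming the pin lives in a terraform block like this (file location and exact versions are assumptions):

terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Previously pinned to an older 3.x release (e.g. "~> 3.45.0"), which
      # conflicts with newer loader modules that require ">= 3.75.0".
      version = ">= 3.75.0"
    }
  }
}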

Hi Josh, thanks for checking the module I provided. It seems that my quick-start modules are on the latest version, but starting from es-loader v0.2.0 it throws a version constraint error.

iglu_server | "provider_aws" | ~> 3.45.0 |
pipeline | "provider_aws" | ~> 3.45.0 |

$ terraform init -upgrade

Initializing provider plugins...
- Finding hashicorp/aws versions matching ">= 3.25.0, ~> 3.45.0, >= 3.75.0"...
- Finding hashicorp/random versions matching ">= 3.0.0, ~> 3.1.0"...
- Finding snowplow-devops/snowplow versions matching ">= 0.4.0"...
- Using previously-installed snowplow-devops/snowplow v0.7.1
- Using previously-installed hashicorp/random v3.1.3
╷
│ Error: Failed to query available provider packages
│
│ Could not retrieve the list of available versions for provider hashicorp/aws: no available releases match the given constraints >= 3.25.0, ~> 3.45.0, >= 3.75.0

which is forcing me to use the lower es-loader v0.1.0/v0.1.1.

Also, please see the vars file:

# Will be prefixed to all resource names
# Use this to easily identify the resources created and provide entropy for subsequent environments
prefix = "XXXXXX"

# --- S3
s3_bucket_name = "XXXXXX"

# To use an existing bucket set this to false
s3_bucket_deploy = true

# To save objects in a particular sub-directory you can pass in an optional prefix (e.g. 'foo/' )
s3_bucket_object_prefix = ""

# --- VPC
# Update to the VPC you would like to deploy into which must have public & private subnet layers across which to deploy
# different layers of the application
vpc_id = "XXXXXX"

# Load Balancer will be deployed in this layer
public_subnet_ids = ["subnet-XXXXXX", "subnet-XXXXXX"]
# EC2 Servers & RDS will be deployed in this layer
private_subnet_ids = ["subnet-XXXXXX", "subnet-XXXXXX"]

# --- SSH
# Update this to the internal IP of your Bastion Host
ssh_ip_allowlist = ["XXXXXX"]
# Generate a new SSH key locally with `ssh-keygen`
# ssh-keygen -t rsa -b 4096 
ssh_public_key = "XXXXXX"

# --- Iglu Server Configuration
# Iglu Server DNS output from the Iglu Server stack
iglu_server_dns_name = "XXXXXX"
# Used for API actions on the Iglu Server
# Change this to the same UUID from when you created the Iglu Server
iglu_super_api_key = "XXXXXX"

# --- Snowplow Postgres Loader
pipeline_db          = "XXXXXX"
postgres_db_name     = "XXXXXX"
postgres_db_username = "XXXXXX"
# Change and keep this secret!
postgres_db_password = "XXXXXX"
# IP ranges that you want to query the Pipeline Postgres RDS from
# Note: these IP ranges will need to be internal to your VPC like from a Bastion Host
postgres_db_ip_allowlist = ["XXXXXX"]

# Controls the write throughput of the KCL tables maintained by the various consumers deployed
pipeline_kcl_write_max_capacity = 50

# See for more information: https://registry.terraform.io/modules/snowplow-devops/collector-kinesis-ec2/aws/latest#telemetry
# Telemetry principles: https://docs.snowplowanalytics.com/docs/open-source-quick-start/what-is-the-quick-start-for-open-source/telemetry-principles/
user_provided_id  = ""
telemetry_enabled = false

# --- AWS IAM (advanced setting)
iam_permissions_boundary = "" # e.g. "arn:aws:iam::0000000000:policy/MyAccountBoundary"

# --- SSL Configuration (optional)
ssl_information = {
  certificate_arn = "XXXXXX"
  enabled         = true
}

# --- Extra Tags to append to created resources (optional)
tags = {}

# --- CloudWatch logging to ensure logs are saved outside of the server
cloudwatch_logs_enabled = false
#cloudwatch_logs_retention_days = 7
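
# --- Elasticsearch Loader (values appended for the ES Loader module)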
bad_stream_name          = "XXXXXX"
es_cluster_name          = "XXXXXX"
es_cluster_endpoint      = "XXXXXX" #same error for both internet and vpc endpoint
es_cluster_index         = "XXXXXX"
es_cluster_port          = XXXXXX
es_cluster_document_type = "XXXXXX"
in_stream_name           = "XXXXXX"
in_stream_type           = "XXXXXX"
name                     = "XXXXXX"
ssh_key_name             = "XXXXXX"
subnet_ids               = ["subnet-XXXXXX", "subnet-XXXXXX"]


And lastly, can you share the new var declarations? These are the configured values, but I need to see the actual types you have assigned to these new variables.

Hi @josh, yes you are right! It was declared as “string” only in the var file. Thank you for sorting this!
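
For reference, a minimal sketch of that fix in pipeline\variables.tf (the description text is illustrative):

# Before: declared as a plain string, which rejects ["subnet-...", "subnet-..."]
# variable "subnet_ids" {
#   type = string
# }

# After: a list of strings, matching what the loader module requires
variable "subnet_ids" {
  description = "The list of private subnet IDs to deploy the Elasticsearch Loader into"
  type        = list(string)
}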

One question: when I plan it, it seems that the S3 enriched loader will be removed too. Why is that? I want to keep enriched data in S3.

Hey @alvin, that I would not know - have you potentially moved / changed that S3 Loader module in some fashion?

Thank you @josh, it is fully deployed now. Appreciate all your support!
