The only exception is that rather than using Postgres or Snowflake, we’re using Redshift via the rdb-redshift-loader app (version 5.2.1).
Looking at S3, our “raw” and “enriched” paths are getting updated with new data every few minutes, basically in real time. But as of this writing, our “transformed” data hasn’t been updated in about 12 hours. Further, the most recent data loaded into Redshift is now about 36 hours old.
Ideally, we’re looking for events to be in Redshift no more than an hour after they’re received.
What could be causing the slowness in the transform and load steps? How can we debug this?
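For reference, here is roughly how I’ve been checking how stale each S3 path is. It’s just a minimal boto3 sketch; the bucket name and prefixes are placeholders for our real ones:
# Sketch: report how stale each stage's S3 path is by finding the newest
# object under each prefix. Bucket name and prefixes are placeholders.
from datetime import datetime, timezone

import boto3

BUCKET = "my-snowplow-bucket"  # placeholder
PREFIXES = ["raw/", "enriched/", "transformed/good/"]  # placeholders

s3 = boto3.client("s3")

for prefix in PREFIXES:
    newest = None
    # For a big bucket you would narrow the prefix (e.g. to today's partition)
    # instead of paginating over everything under it.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest:
                newest = obj["LastModified"]
    if newest is None:
        print(f"{prefix}: no objects found")
    else:
        lag = datetime.now(timezone.utc) - newest
        print(f"{prefix}: newest object {newest.isoformat()} (lag {lag})")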
How often are you running the transformer / loader? If you are getting data to S3 reasonably quickly, I’d say the bottleneck is going to be either the transformer / loader or the Redshift cluster itself.
It looks like the transformer runs every 5 minutes but sometimes misses several runs, or even a whole bunch of runs, in a row. (I’m inferring this from the names of the folders in the S3 bucket.)
These are the last 10 folder names in the /transformed/good/ folder in S3 as of right now (there’s a quick sketch for spotting the gaps after the list):
run=2023-04-19-08-30-00/
run=2023-04-19-08-35-00/
run=2023-04-19-08-40-00/
run=2023-04-19-08-45-00/
run=2023-04-19-10-20-00/
run=2023-04-19-10-25-00/
run=2023-04-19-10-45-00/
run=2023-04-19-10-50-00/
run=2023-04-19-10-55-00/
run=2023-04-19-11-00-00/
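And this is roughly how I spotted the gaps: another small boto3 sketch, again with a placeholder bucket name, which lists the run= folders and flags any gap bigger than the expected 5-minute cadence:
# Sketch: list the run=... folders under transformed/good/ and flag any gap
# larger than the expected 5-minute cadence. Bucket name is a placeholder.
from datetime import datetime, timedelta

import boto3

BUCKET = "my-snowplow-bucket"  # placeholder
PREFIX = "transformed/good/"

s3 = boto3.client("s3")
runs = []
for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"
):
    for cp in page.get("CommonPrefixes", []):
        # CommonPrefixes look like "transformed/good/run=2023-04-19-08-30-00/"
        if "run=" not in cp["Prefix"]:
            continue
        stamp = cp["Prefix"].split("run=")[1].rstrip("/")
        runs.append(datetime.strptime(stamp, "%Y-%m-%d-%H-%M-%S"))

runs.sort()
for previous, current in zip(runs, runs[1:]):
    if current - previous > timedelta(minutes=5):
        print(f"gap: {previous} -> {current} ({current - previous})")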
I don’t know how often or under what circumstances either the transformer or the loader runs, or why they might miss a run. Where is that configured?
Looking at the logs, the Redshift loader seems to be loading the transformed events within a few minutes of them becoming available in S3. For example, the 9:05am run this morning was processed by the Redshift loader at 9:10am:
2023-04-20T09:10:45.831Z|INFO DataDiscovery: Received a new message
2023-04-20T09:10:45.831Z|INFO DataDiscovery: Total 2206 messages received, 2146 loaded…
2023-04-20T09:10:45.831Z|INFO DataDiscovery: New data discovery at run=2023-04-20-09-05-00 with following shredded types…
[… a bunch of lines specific to loading the new data …]
2023-04-20T09:12:48.535Z|INFO Loader: Total 2206 messages received, 2147 loaded; Loader is in Idle state; Last state update at 2023-04-20 09:10:51.252
2023-04-20T09:17:48.537Z|INFO snowplow.rdbloader.minimum_age_of_loaded_data = 86689
Interestingly, there is one more folder in S3 (a 9:10am run), which has not been processed by Redshift even several hours later. Perhaps the message to import the previous run into Redshift is sent at the same time the next transformer run goes into S3?
Regardless, the problem seems to be with the transformer, not Redshift.
Hi @WTravisH, would you mind sharing how you have configured the Transformer, perhaps by dropping the relevant Terraform code for it here, so we can better debug this?
Looking at S3, our “raw” and “enriched” paths are getting updated with new data every few minutes, basically in real time. But as of this writing, our “transformed” data hasn’t been updated in about 12 hours. Further, the most recent data loaded into Redshift is now about 36 hours old.
Raw and Enriched are not really related paths when using the “streaming transformer” - this transformer reads directly from the stream, unlike our old “batch transformer”, which used the “enriched” output written by the S3 Loader to get data into a warehouse. It does mean the issue is in the transformer configuration, but I just wanted to clarify that these paths are not strictly related.
So the first thing I would try is upgrading that module (and the loader module) to the latest available versions. This hasn’t gone into the docs / quickstart guides yet, but I have been working hard to bring the modules up to speed with a lot of performance, stability and security improvements - this in fact applies to ALL AWS modules so far.
My hunch here is that your transformer is likely underprovisioned and potentially crashing silently, cycling, or running out of memory (all of which has been by and large solved in the later module versions).
Could you give that a try? If the lag persists, I would bump the instance_type from a t3a.small to a beefier t3a.large, or even an m5.xlarge for large enough volumes of traffic or a very big backlog.
Okay, this seems to have done the trick, at least tentatively.
I upgraded all the Docker containers to the latest versions across our entire Snowplow infrastructure. Since then, the transformed folders have been populating every 5 minutes, and all of the warnings are gone from the transformer logs.
Redshift still appears to be loading every 5 minutes with no problems, and now that the transformer is catching up, Redshift is catching up too.
Also, I believe the transformer instance was previously a t3a.micro and is now a t3a.small, which can’t have hurt with the memory issues.
Ahh, that’s great news @WTravisH! I’ll be doing a webinar sometime soon on scaling the AWS pipeline, which will touch on using these exact modules and how to keep them working at quite large throughput - that could be useful to help explain these apps in greater detail.