Significant lag with transformer and redshift loader

We’re processing on the order of 1MM events per day, using the basic Terraform quickstart from the snowplow/quickstart-examples repo on GitHub with the AWS “secure” setup.

The only exception is that rather than using Postgres or Snowflake, we’re using Redshift via the rdb-redshift-loader app (version 5.2.1).

Looking at S3, our “raw” and “enriched” paths are getting updated with new data every few minutes, basically in real time. But as of this writing, our “transformed” data hasn’t been updated in about 12 hours. Further, the most recent data that has been loaded into Redshift is now about 36 hours old.

We’re looking ideally for events to be in Redshift no more than an hour after they’re received.

What could be causing the slowness with the transform and loading steps? How can we debug?

How often are you running the transformer / loader? If you are getting data to S3 reasonably quickly, I’d say the bottleneck is either the transformer / loader or the Redshift cluster itself.

It looks like the transformer runs every 5 minutes but sometimes skips several runs, or a whole stretch of them, in a row. (I’m inferring this from the names of the folders in the S3 bucket.)

These are the last 10 folder names in the /transformed/good/ folder in S3 as of right now:

  • run=2023-04-19-08-30-00/
  • run=2023-04-19-08-35-00/
  • run=2023-04-19-08-40-00/
  • run=2023-04-19-08-45-00/
  • run=2023-04-19-10-20-00/
  • run=2023-04-19-10-25-00/
  • run=2023-04-19-10-45-00/
  • run=2023-04-19-10-50-00/
  • run=2023-04-19-10-55-00/
  • run=2023-04-19-11-00-00/

I don’t know how often or under what circumstances either the transformer or loader run, or why they might miss a run. Where is that configured?

Some more detail here.

Looking at the logs, the Redshift loader seems to be loading the transformed events within a few minutes of them becoming available in S3. For example, the 9:05am run this morning was processed by the Redshift loader at 9:10am:

  • 2023-04-20T09:10:45.831Z|INFO DataDiscovery: Received a new message
  • 2023-04-20T09:10:45.831Z|INFO DataDiscovery: Total 2206 messages received, 2146 loaded…
  • 2023-04-20T09:10:45.831Z|INFO DataDiscovery: New data discovery at run=2023-04-20-09-05-00 with following shredded types…
  • (a bunch of lines specific to loading the new data)
  • 2023-04-20T09:12:48.535Z|INFO Loader: Total 2206 messages received, 2147 loaded; Loader is in Idle state; Last state update at 2023-04-20 09:10:51.252
  • 2023-04-20T09:17:48.537Z|INFO snowplow.rdbloader.minimum_age_of_loaded_data = 86689

Interestingly, there is one more folder in S3 (a 9:10am run) that still hadn’t been loaded into Redshift several hours later. Perhaps the message telling the loader to import a run is sent at the same time the next transformer run lands in S3?

Regardless, the problem seems to be with the transformer, not Redshift.

Meanwhile, the transformer logs are full of warnings that seem more like errors. Here’s just the last minute: [cats-effect-blocker-0] WARN -

Any clues in here?

Hi @WTravisH, would you mind sharing how you have configured the Transformer, perhaps by dropping the relevant Terraform code here, so we can better debug this?

Looking at S3, our “raw” and “enriched” paths are getting updated with new data every few minutes, basically in real time. But as of this writing, our “transformed” data hasn’t been updated in about 12 hours. Further, the most recent data that has been loaded into Redshift is now about 36 hours old.

Raw and Enriched are not really related paths when using the “streaming transformer” - this transformer reads directly from the stream, unlike our old “batch transformer”, which used the “enriched” output written to S3 by the S3 Loader to get data into a warehouse. The issue is still in the transformer configuration, but I just wanted to clarify that these are not strictly related things.

I believe this is everything we have configured for the transformer. Let me know if this is what you needed.

module "transformer_enriched" {
  source  = "snowplow-devops/transformer-kinesis-ec2/aws"
  version = "0.1.0"
  count   = 1

  name                    = "snowplow-transformer-kinesis-enriched-server"
  vpc_id                  = "vpc-xxx"
  subnet_ids              = ["subnet-xxx", "subnet-xxx", "subnet-xxx"]
  ssh_key_name            = "snowplow-pipeline"
  ssh_ip_allowlist        = ["xx.xx.xx.xx/32", "10.0.1.xx/32"]
  stream_name             = "snowplow-enriched-stream"
  s3_bucket_name          = "dv3-snowplow-terraform-bucket"
  s3_bucket_object_prefix = "transformed/good"
  window_period_min       = 5
  sqs_queue_name          = "snowplow-loader.fifo"
  transformation_type     = "shred"

  custom_iglu_resolvers = [
    {
      name            = "Iglu Server"
      priority        = 0
      uri             = ""
      api_key         = "xxxxxx"
      vendor_prefixes = []
    }
  ]

  kcl_write_max_capacity   = 50
  iam_permissions_boundary = ""
  telemetry_enabled        = false
  user_provided_id         = ""

  tags = {
    dv3_app = "snowplow"
  }

  cloudwatch_logs_enabled        = true
  cloudwatch_logs_retention_days = 7
}
So the first thing I would try is to upgrade that module (and the loader module) to the latest available versions. This hasn’t made it into the docs / quickstart guides yet, but I have been working hard to bring them up to speed with a lot of performance, stability and security improvements - this applies to ALL AWS modules so far.

My hunch here is that your transformer is likely underprovisioned / potentially silently crashing or cycling and running out of memory (all of which have been by and large solved in the later module versions).

For the transformer you want module v0.3.3, and for the loader it’s just v0.1.0 (snowplow-devops/terraform-aws-redshift-loader-ec2 on GitHub).

Could you give that a try? If the lag persists, I would bump the instance_type from a t3a.small to a beefier t3a.large, or even an m5.xlarge for large enough volumes of traffic or a very big backlog.
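For reference, a minimal sketch of what those two changes might look like against the module block shared above - the version and instance sizes are the ones suggested in this reply, and `instance_type` is assumed to be a module input matching the variable named here; everything else stays as you already have it:

module "transformer_enriched" {
  source  = "snowplow-devops/transformer-kinesis-ec2/aws"
  version = "0.3.3" # upgraded from 0.1.0

  # Larger instance = more memory headroom for the transformer
  instance_type = "t3a.large"

  # ... rest of the existing configuration unchanged ...
}

The loader module gets the same treatment with its own version pinned to v0.1.0. After changing the pins, run `terraform init -upgrade` so Terraform pulls the new module versions before you apply.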


Okay, this seems to have done the trick, at least tentatively.

I upgraded all the Docker containers to the latest versions across our entire Snowplow infrastructure. Since then, the transformed folders have been populated every 5 minutes, and all of the warnings are gone from the transformer logs.

Redshift still appears to be loading every 5 minutes with no problems, and now that the transformer is catching up, Redshift is catching up too.

Also, I believe the transformer instance was previously a t3a.micro and is now a t3a.small, which can’t have hurt with the memory issues.

Thank you for your help!


Ahh, that’s great news @WTravisH! I’ll be doing a webinar sometime soon on scaling the AWS pipeline, which will touch on using these exact modules and how to keep them working up to quite large throughput - that could be useful to help explain these apps in greater detail.
