The only exception is that rather than using Postgres or Snowflake, we’re using Redshift via the rdb-redshift-loader app (version 5.2.1).
Looking at S3, our “raw” and “enriched” paths are getting updated with new data every few minutes, basically in real time. But as of this writing, our “transformed” data hasn’t been updated in about 12 hours. Further, the most recent data loaded into Redshift is now about 36 hours old.
Ideally, we’re looking for events to be in Redshift no more than an hour after they’re received.
What could be causing the slowness in the transform and load steps? How can we debug this?
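For reference, here is roughly how I’ve been checking how stale each S3 path is. It’s just a minimal boto3 sketch; the bucket name and prefixes are placeholders for our real ones:
# Sketch: report how stale each stage's S3 path is by finding the newest
# object under each prefix. Bucket name and prefixes are placeholders.
from datetime import datetime, timezone

import boto3

BUCKET = "my-snowplow-bucket"  # placeholder
PREFIXES = ["raw/", "enriched/", "transformed/good/"]  # placeholders

s3 = boto3.client("s3")

for prefix in PREFIXES:
    newest = None
    # For a big bucket you would narrow the prefix (e.g. to today's partition)
    # instead of paginating over everything under it.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest:
                newest = obj["LastModified"]
    if newest is None:
        print(f"{prefix}: no objects found")
    else:
        lag = datetime.now(timezone.utc) - newest
        print(f"{prefix}: newest object {newest.isoformat()} (lag {lag})")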
How often are you running the transformer / loader? If you are getting data to S3 reasonably quickly, I’d say the bottleneck is going to be either the transformer / loader or the Redshift cluster itself.
It looks like the transformer runs every 5 minutes but sometimes misses several runs, or even a whole bunch of runs, in a row. (I’m inferring this from the names of the folders in the S3 bucket.)
These are the last 10 folder names in the /transformed/good/ folder in S3 as of right now (there’s a quick sketch for spotting the gaps after the list):
run=2023-04-19-08-30-00/
run=2023-04-19-08-35-00/
run=2023-04-19-08-40-00/
run=2023-04-19-08-45-00/
run=2023-04-19-10-20-00/
run=2023-04-19-10-25-00/
run=2023-04-19-10-45-00/
run=2023-04-19-10-50-00/
run=2023-04-19-10-55-00/
run=2023-04-19-11-00-00/
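And this is roughly how I spotted the gaps: another small boto3 sketch, again with a placeholder bucket name, which lists the run= folders and flags any gap bigger than the expected 5-minute cadence:
# Sketch: list the run=... folders under transformed/good/ and flag any gap
# larger than the expected 5-minute cadence. Bucket name is a placeholder.
from datetime import datetime, timedelta

import boto3

BUCKET = "my-snowplow-bucket"  # placeholder
PREFIX = "transformed/good/"

s3 = boto3.client("s3")
runs = []
for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"
):
    for cp in page.get("CommonPrefixes", []):
        # CommonPrefixes look like "transformed/good/run=2023-04-19-08-30-00/"
        if "run=" not in cp["Prefix"]:
            continue
        stamp = cp["Prefix"].split("run=")[1].rstrip("/")
        runs.append(datetime.strptime(stamp, "%Y-%m-%d-%H-%M-%S"))

runs.sort()
for previous, current in zip(runs, runs[1:]):
    if current - previous > timedelta(minutes=5):
        print(f"gap: {previous} -> {current} ({current - previous})")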
I don’t know how often or under what circumstances either the transformer or the loader runs, or why they might miss a run. Where is that configured?
Looking at the logs, the Redshift loader seems to be loading the transformed events within a few minutes of them becoming available in S3. For example, the 9:05am run this morning was processed by the Redshift loader at 9:10am:
2023-04-20T09:10:45.831Z|INFO DataDiscovery: Received a new message
2023-04-20T09:10:45.831Z|INFO DataDiscovery: Total 2206 messages received, 2146 loaded…
2023-04-20T09:10:45.831Z|INFO DataDiscovery: New data discovery at run=2023-04-20-09-05-00 with following shredded types…
[… a bunch of lines specific to loading the new data …]
2023-04-20T09:12:48.535Z|INFO Loader: Total 2206 messages received, 2147 loaded; Loader is in Idle state; Last state update at 2023-04-20 09:10:51.252
2023-04-20T09:17:48.537Z|INFO snowplow.rdbloader.minimum_age_of_loaded_data = 86689
Interestingly, there is one more folder in S3 (a 9:10am run), which has not been processed by Redshift even several hours later. Perhaps the message to import the previous run into Redshift is sent at the same time the next transformer run goes into S3?
Regardless, the problem seems to be with the transformer, not Redshift.
Hi @WTravisH, would you mind sharing how you have configured the Transformer, perhaps by dropping the relevant Terraform code for it here, so we can better debug this?
Looking at S3, our “raw” and “enriched” paths are getting updated with new data every few minutes, basically in real time. But as of this writing, our “transformed” data hasn’t been updated in about 12 hours. Further, the most recent data loaded into Redshift is now about 36 hours old.
Raw and Enriched are not really related paths when using the “streaming transformer” - this transformer reads directly from the stream, unlike our old “batch transformer”, which used the “enriched” output written by the S3 Loader to get data into a warehouse. It does mean the issue is in the transformer configuration, but I just wanted to clarify that these paths are not strictly related.
So the first thing I would try is upgrading that module (and the loader module) to the latest available versions. This hasn’t gone into the docs / quickstart guides yet, but I have been working hard to bring the modules up to speed with a lot of performance, stability and security improvements - this in fact applies to ALL AWS modules so far.
My hunch here is that your transformer is likely underprovisioned and potentially crashing silently, cycling, or running out of memory (all of which has been by and large solved in the later module versions).
Could you give that a try? If the lag persists, I would bump the instance_type from a t3a.small to a beefier t3a.large, or even an m5.xlarge for large enough volumes of traffic or a very big backlog.
Okay, this seems to have done the trick, at least tentatively.
I upgraded all the Docker containers to the latest versions across our entire Snowplow infrastructure. Since then, the transformed folders have been populating every 5 minutes, and all of the warnings are gone from the transformer logs.
Redshift still appears to be loading every 5 minutes with no problems, and now that the transformer is catching up, Redshift is catching up too.
Also, I believe the transformer instance was previously a t3a.micro and is now a t3a.small, which can’t have hurt with the memory issues.
Ahh, that’s great news @WTravisH! I’ll be doing a webinar sometime soon on scaling the AWS pipeline, which will touch on using these exact modules and how to keep them working at quite large throughput - that could be useful to help explain these apps in greater detail.