Scaling quickstart

So, I’m attempting a PoC with the Snowplow Quickstart on AWS + Snowflake using a couple of enrichments, and it seems I have a bottleneck in the kinesis-stream-enrich.
I’m counting 1M events a day, and the GetRecords iterator age (maximum) shows it is always 86.4M milliseconds (i.e. 24 hours).
Now I’m not quite sure whether what I need to scale up is the snowflake-loader EC2, the kinesis-transformer EC2, or the enricher EC2.

The Kinesis enricher stream has 4 shards, and I have 4 instances of each of those EC2 services up and running showing little to no CPU activity. On top of that I see these alarms, which don’t make a lot of sense to me:
snowplow-transformer-kinesis-enriched-server ConsumedReadCapacityUnits < 3900 for 15 datapoints within 15 minutes

Hey @tomascaraccia, which stream is showing the latency? Is it the “raw” stream (which Enrich consumes) or the “enriched” stream (which the transformer consumes)?

From what you are describing, Enrich is keeping up fine (given the low CPU), and this would be confirmed if the latency on the “raw” stream is also low.

If that’s the case, then following the chain downstream you need to find the consumer of the “enriched” stream that is falling behind. This is very likely to be the transformer in this case (unless you have another service consuming that stream?).
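
If it helps to script this rather than clicking through CloudWatch, something like the following boto3 sketch should work. The stream names and region below are placeholders, not the ones from your deployment, so swap in your own:

```python
# Rough sketch: compare the max GetRecords iterator age for the two streams over the
# last hour. Stream names and region are assumptions - substitute yours.
from datetime import datetime, timedelta, timezone
import boto3

STREAMS = ["sp-raw-stream", "sp-enriched-stream"]          # assumed names
cw = boto3.client("cloudwatch", region_name="eu-west-1")   # assumed region

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

for stream in STREAMS:
    resp = cw.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="GetRecords.IteratorAgeMilliseconds",
        Dimensions=[{"Name": "StreamName", "Value": stream}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Maximum"],
    )
    worst = max((dp["Maximum"] for dp in resp["Datapoints"]), default=0.0)
    print(f"{stream}: max iterator age in the last hour = {worst / 1000:.0f}s")
```

Whichever stream shows a high iterator age tells you which consumer is falling behind.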

Once you have identified which service is falling behind, the next step is to figure out why! I imagine each part of the pipeline as a node in a graph, so you want to look at the service itself but also at its peripheral connections.

  • Is the CPU running high?
  • Is the Memory allocation running high? Has enough memory been allocated to the process to let it run properly?
  • Can you see anything in the application logs? Is it throwing OutOfMemory exceptions or other types of exceptions?
  • Are there any bottlenecks downstream of the transformer? Is it failing to write to its target destinations? (Hopefully the logs will tell you what’s going wrong here.)
  • Is the DynamoDB table used for the Kinesis shard synchronization sufficiently scaled with read/write units to keep up with check-pointing? (This table should be named the same as the application you are debugging; see the sketch just after this list.)
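
As an example of that last check, here is a minimal boto3 sketch. The table name and region are assumptions (the KCL names the table after the consumer application):

```python
# Rough sketch: inspect the KCL checkpoint table's capacity and recent throttling.
from datetime import datetime, timedelta, timezone
import boto3

TABLE = "snowplow-transformer-kinesis-enriched-server"   # assumed app/table name
REGION = "eu-west-1"                                      # assumed region

ddb = boto3.client("dynamodb", region_name=REGION)
cw = boto3.client("cloudwatch", region_name=REGION)

# Current provisioned throughput (0/0 means the table is in on-demand mode)
throughput = ddb.describe_table(TableName=TABLE)["Table"]["ProvisionedThroughput"]
print("read capacity :", throughput["ReadCapacityUnits"])
print("write capacity:", throughput["WriteCapacityUnits"])

# Any throttled reads/writes recently? Throttling here slows down check-pointing.
end = datetime.now(timezone.utc)
start = end - timedelta(hours=3)
for metric in ("ReadThrottleEvents", "WriteThrottleEvents"):
    resp = cw.get_metric_statistics(
        Namespace="AWS/DynamoDB",
        MetricName=metric,
        Dimensions=[{"Name": "TableName", "Value": TABLE}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    print(f"{metric}: {sum(dp['Sum'] for dp in resp['Datapoints']):.0f} in the last 3 hours")
```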

Depending on what information you can glean from these checks, it should be pretty easy to figure out what’s going wrong and what the next steps are.

Most of these checks you can do from outside the service, but checking memory allocation will require you to SSH into the VM and validate the JVM allocation directly.
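
Once you’re on the box, a rough Python sketch of that check looks like this (it assumes a Linux AMI with a Java process running for the service):

```python
# Rough sketch: compare the JVM's -Xmx flag against the instance's total memory.
import re
import subprocess

# Full command line(s) of any running java processes
ps = subprocess.run(["ps", "-C", "java", "-o", "args="],
                    capture_output=True, text=True, check=False)
for cmdline in ps.stdout.splitlines():
    xmx = re.search(r"-Xmx\S+", cmdline)
    print("java process:", cmdline[:100])
    print("  max heap  :", xmx.group(0) if xmx else "not set (JVM default)")

# Total memory available on the instance, from /proc/meminfo (value is in kB)
with open("/proc/meminfo") as meminfo:
    mem_total_kb = int(meminfo.readline().split()[1])
print(f"instance MemTotal: {mem_total_kb / 1024 / 1024:.1f} GiB")
```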

Hope this helps - and please do report back with the symptoms you manage to find!


Footnote: What is this alarm?

The alarm snowplow-transformer-kinesis-enriched-server ConsumedReadCapacityUnits < 3900 for 15 datapoints within 15 minutes refers to the DynamoDB table used for check-pointing and shard synchronization for this Kinesis consumer application. It’s likely that this table cannot scale any lower, so it’s perpetually in alarm (this is normal and expected behavior!).
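
If you want to confirm that floor for yourself, a quick boto3 sketch against the Application Auto Scaling API will show the table’s configured minimum and maximum capacity (the table name and region below are assumptions based on the alarm name above):

```python
# Rough sketch: show the autoscaling floor/ceiling for the checkpoint table.
import boto3

TABLE = "snowplow-transformer-kinesis-enriched-server"   # assumed app/table name
aas = boto3.client("application-autoscaling", region_name="eu-west-1")  # assumed region

targets = aas.describe_scalable_targets(
    ServiceNamespace="dynamodb",
    ResourceIds=[f"table/{TABLE}"],
)["ScalableTargets"]

for target in targets:
    print(target["ScalableDimension"],
          "min =", target["MinCapacity"],
          "max =", target["MaxCapacity"])
```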

@josh Thank you so much for your reply!

So my findings are:

Raw stream:
  • GetRecords iterator age: 0 ms
  • GetRecords latency (average): 9 ms
  • EC2 raw loader: around 0.4% avg.

The latency I talked about in my previous post is for the enriched stream.

Right, so if the enriched stream is lagging, then it is the consumers pulling from that stream that are struggling. Presumably this would be the transformer? What are the statistics for this application?

It’s literally under no load.
In case it helps, these are the values I use for buffers:
pipeline_kcl_read_min_capacity = 5
pipeline_kcl_write_min_capacity = 50
pipeline_kcl_read_max_capacity = 3000
pipeline_kcl_write_max_capacity = 3000
enrich_record_buffer = 1500
enrich_byte_buffer = 2000000
enrich_time_buffer = 1500

Is the Transformer actually producing any output, @tomascaraccia? What are the logs saying for it? Lagging but not doing anything sounds like it is simply not working at all.

It is!
I see actual rows being loaded into Snowflake with wide_row transformations.
But the logs look like this:

```
[pool-15-thread-1] INFO software.amazon.kinesis.leases.LeaseCleanupManager - Number of pending leases to clean after the scan : 1
[ioapp-compute-1] INFO com.snowplowanalytics.snowplow.rdbloader.transformer.kinesis.sinks.generic.Partitioned - Rotating window from Window(2022,10,17,21,55) to Window(2022,10,17,22,0)
[ioapp-compute-1] INFO com.snowplowanalytics.snowplow.rdbloader.transformer.kinesis.sinks.generic.KeyedEnqueue - Sinks initialised for output=good/
[ioapp-compute-1] INFO com.snowplowanalytics.snowplow.rdbloader.transformer.kinesis.sinks.generic.ItemEnqueue - Pulling 1 elements
[ioapp-compute-1] INFO com.snowplowanalytics.snowplow.rdbloader.transformer.kinesis.sinks.generic.KeyedEnqueue - KeyedEnqueue has been closed
[ioapp-compute-1] INFO com.snowplowanalytics.snowplow.rdbloader.transformer.kinesis.sinks.generic.KeyedEnqueue - Creating new KeyedEnqueue
[ioapp-compute-1] INFO com.snowplowanalytics.snowplow.rdbloader.transformer.kinesis.sinks.generic.Partitioned - Tried to terminate Window(2022,10,17,21,55) window's KeyedEnqueue as resource
[cats-effect-blocker-0] INFO software.amazon.kinesis.coordinator.DiagnosticEventLogger - Current thread pool executor state: ExecutorStateEvent(executorName=SchedulerThreadPoolExecutor, currentQueueSize=0, activeThreads=0, coreThreads=0, leasesOwned=2, largestPoolSize=1, maximumPoolSize=2147483647)
[cats-effect-blocker-0] INFO software.amazon.kinesis.coordinator.Scheduler - Current stream shard assignments: shardId-000000000016, shardId-000000000010
[cats-effect-blocker-0] INFO software.amazon.kinesis.coordinator.Scheduler - Sleeping ...
```