Hi! I'm having a problem scaling up the pipeline. The Enrich app (EC2) is running at 40% CPU, and the Kinesis raw stream shows GetRecords.IteratorAgeMilliseconds > 80 million milliseconds (~22 h).
So Kinesis has a long backlog and it seems the app can't catch up. I've increased the number of shards on the Kinesis stream from 1 to 5 and added an additional Enrich instance, but nothing happens - its CPU load is under 1%.
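For reference, this is roughly how I check the iterator age - a minimal boto3 sketch, assuming the raw stream is simply named "raw" (a placeholder for whatever your stream is actually called):

```python
# Sketch (placeholder stream name "raw"): pull the max
# GetRecords.IteratorAgeMilliseconds over the last hour to see
# how far behind the consumer is.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.datetime.utcnow()
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "raw"}],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    hours_behind = point["Maximum"] / 1000 / 3600
    print(point["Timestamp"], f"{hours_behind:.1f} h behind")
```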
In DynamoDB, in the 'enrich-server' table, I can also see only one record, for just one shard, with the owner set to one of the EC2 instances (the one with 40% load).
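In case it helps anyone doing the same check, a small boto3 sketch of scanning that lease table; leaseKey, leaseOwner and checkpoint are the standard KCL lease-table attributes (an assumption on my part, adjust if your table differs):

```python
# Sketch: scan the KCL lease table to see which shards have a lease and
# which worker (EC2 instance) currently owns each one.
import boto3

dynamodb = boto3.client("dynamodb")

resp = dynamodb.scan(TableName="enrich-server")
for item in resp["Items"]:
    shard_id = item["leaseKey"]["S"]
    owner = item.get("leaseOwner", {}).get("S", "<unassigned>")
    checkpoint = item.get("checkpoint", {}).get("S", "?")
    print(f"{shard_id}  owner={owner}  checkpoint={checkpoint}")
```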
The logs from this 'sleeping' EC2 look like this:
- Starting periodic shard sync background process for SHARD_END shard sync strategy
- Number of pending leases to clean before the scan : 0
- No activities assigned
- Sleeping ...
Am I missing something? I can't figure out why the additional EC2 instance doesn't start taking on some of the load. I saw this description of KCL behaviour in another topic:
The way the KCL is implemented we need to process all parent shards before it can start on the active child shards, so until the parent shards have been fully processed it won't be able to access the higher bandwidth.
Could this be the problem? And if so, does it mean I just need to wait for all the current messages to be processed? It still seems strange, though, that nothing was added to the DynamoDB table.
Hey @pavvell, that is exactly the problem - you will need to wait until that first parent shard has been fully consumed before any of the child shards can start to be worked on (which is when things should start to speed up).
Once that shard is complete, the new EC2 nodes will start to fan out across the larger number of child shards and your processing speed should pick up quite dramatically.
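As a rough sketch of how to see that lineage (with "raw" standing in for your actual stream name): after the reshard the original shard is closed and each new shard points back at it through ParentShardId, and the KCL won't start on the children until that closed parent has been checkpointed at SHARD_END - which also matches only seeing a single lease in the DynamoDB table.

```python
# Sketch: list the shards of the raw stream and show which are closed
# parents vs open children. A closed shard has an EndingSequenceNumber.
import boto3

kinesis = boto3.client("kinesis")

for shard in kinesis.list_shards(StreamName="raw")["Shards"]:
    closed = "EndingSequenceNumber" in shard["SequenceNumberRange"]
    print(
        shard["ShardId"],
        "CLOSED (parent)" if closed else "OPEN",
        "parent:",
        shard.get("ParentShardId", "-"),
    )
```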
One word of caution though: dig into why Enrich couldn't keep up - generally this would mean that you have a downstream bottleneck (especially if you are only running at 40% load). Have you checked that you do not have any throttling on the "enriched" and "bad" output streams from Enrich that could be causing processing to go slower than expected?
If these are throttled then it's very unlikely Enrich will ever catch up.
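As a sketch of that check (the stream names "enriched" and "bad" are placeholders for your actual output streams), you can sum the write-throttle metric for each stream:

```python
# Sketch: count write throttles on the Enrich output streams over the
# last hour. Anything above zero means writes were being rejected.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

def write_throttles(stream_name: str) -> float:
    now = datetime.datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="WriteProvisionedThroughputExceeded",
        Dimensions=[{"Name": "StreamName", "Value": stream_name}],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(p["Sum"] for p in resp["Datapoints"])

for stream in ("enriched", "bad"):
    print(stream, write_throttles(stream))
```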
Thanks @josh! Eventually everything was processed, and now I can see the shards in DynamoDB as well.
Enrich couldn’t keep up - generally this would mean that you have a downstream bottleneck (especially if you are only running at 40% load)
Yes, that was strange. The bad stream was fine; we have almost no bad data coming in. The enriched stream was in its usual state, with no spikes in any of the metrics and no throttles, and the loader instances to both Snowflake and S3 were running at less than 1% CPU. So for now it remains a mystery what it was.