When working with the Scala Stream Enricher, I noticed that after splitting shards the enricher was not pulling any data from the collector good output stream. Here’s what happened:
- upped the capacity of the collector good output stream from 1 shard to 2 shards
  - shard-00000000 was closed, and shard-00000001 and shard-00000002 were created
- the enricher was still connected to shard-00000000, and did not connect to either of the new shards
- I had to restart the enricher application for it to connect to the new shards
My question: is this expected behavior?
Assuming it is, I know I can use TRIM_HORIZON as the initial starting position to make sure I don’t lose data and all records get processed. However, this does mean I need to restart the application and/or drop existing enricher servers once they’ve caught up and finished processing their (now closed) shards. I wanted to check in case there’s a different or better solution!
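For reference, the starting position lives in the Stream Enrich HOCON config. A minimal sketch, assuming a version that uses an `initial-position` key under `streams` (key names and nesting vary between releases, so check the example config shipped with yours):

```hocon
enrich {
  streams {
    # Only consulted when the KCL's DynamoDB table has no checkpoints yet;
    # once checkpoints exist, processing resumes from them instead.
    initial-position = "TRIM_HORIZON"
  }
}
```

Worth noting: because the initial position only applies on a fresh start, clearing or renaming the KCL’s DynamoDB table is what actually forces a re-read from the trim horizon.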
Hey @chrisprijic - no, this isn’t expected behavior. Under the hood, Stream Enrich uses the Kinesis Client Library, which handles Kinesis stream shard splits and merges automatically. A single EC2 instance running Stream Enrich can fall behind on the stream, but it shouldn’t ignore child shards.
When you say the enricher was still connected to shard-00000000, do you mean that it was idling at the head of the stream? If it was still working through a shard-00000000 backlog, then that is expected behavior - the KCL won’t start processing child shards until the parent shard has been fully processed; this is to maintain ordering guarantees across shard ancestries.
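To make that ordering rule concrete, here’s a toy model (illustrative only, not actual KCL code) of when a child shard becomes eligible, using the shard IDs from this thread:

```python
# Toy model of the KCL's resharding rule -- illustrative only, not real KCL code.
# A child shard may be processed only once every parent shard in its lineage
# has been fully processed (checkpointed through to the end of the shard).

def eligible(shard_id, fully_processed, shards):
    """A shard is eligible for processing when all of its parents are done."""
    return all(parent in fully_processed for parent in shards[shard_id]["parents"])

# The split from this thread: shard-00000000 closed, two children created.
shards = {
    "shard-00000000": {"parents": []},
    "shard-00000001": {"parents": ["shard-00000000"]},
    "shard-00000002": {"parents": ["shard-00000000"]},
}

# While a backlog remains on the parent, neither child is started:
assert not eligible("shard-00000001", set(), shards)
assert not eligible("shard-00000002", set(), shards)

# Once the parent is drained, both children become eligible:
done = {"shard-00000000"}
assert eligible("shard-00000001", done, shards)
assert eligible("shard-00000002", done, shards)
```

The takeaway: a healthy enricher briefly keeps reading only the closed parent after a split, but once it reaches the parent’s end it should move on to the children, not idle there indefinitely.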
My recommendation is to try this experiment again:
- Split the shards into 4, or merge back to 1
- Observe what happens in the Stream Enrich logs and crucially also in the DynamoDB table that the KCL maintains
- Confirm that Stream Enrich does or doesn’t process the child shards once the parent shard has been fully processed
Then report back here!
Thanks for the quick response!
It was idling at the head of the stream. This happened overnight on our test pipeline, so no records made it from the collector stream to Redshift until I restarted the app.
Something else to note: I used ‘Update Shard Count’ to split the shards, but under the hood AWS is doing the typical splitting and merging anyway.
I’ll do the test and get back to you!
@alex I did the test two ways:
- I merged into 1 shard – this worked
- I split back into 2 shards – this worked
The logs on the (now stopped) original enricher show it never crashed and never stopped running - it was still polling shard-00000000, retrieving no records since it was up to date. The total downtime was 7 hours, until I restarted the application this morning.
We are using TRIM_HORIZON, so all the records got processed with no data loss. Took a bit to catch up (of course) but it didn’t lose a thing!
For some reason I can’t reproduce the issue I had - however, the old enricher had been running for weeks, so it could have just been a stale application. I’ll set up some alarm-based notifications in case this happens again and handle it manually for now. I’ll report back if it happens again or if I can reproduce the issue.
Interesting - that’s not a failure state we’ve seen across our Managed Service RT customers, and we have auto-scaling tech which leads to a lot of automated stream splits and merges for some of these customers.
Do let us know if you see it happening again or if you can reproduce it!