After upgrading the ES loaders from version 0.10.1 to the latest 2.0.5/6, we are seeing the following happen randomly: the loaders’ KCL apps become idle on random shards for unpredictable periods of time.
That effectively leads to spikes in the GetRecords.IteratorAgeMilliseconds metric on the source Kinesis stream and results in incomplete data in Elasticsearch: some events are simply missing there, because their respective Kinesis shards have not been read yet.
Restarting a loader alleviates this for some time, but later it happens again. The loader’s error log doesn’t show any errors.
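In case it is useful to anyone watching for the same symptom, here is a rough sketch of a CloudWatch alarm on that iterator-age metric, using the AWS SDK for Java v2; the stream name and threshold are placeholders, and this is independent of the loader itself:

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.ComparisonOperator;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricAlarmRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class IteratorAgeAlarm {
    public static void main(String[] args) {
        try (CloudWatchClient cw = CloudWatchClient.create()) {
            // Alarm when the oldest unread record on the stream is more than 15 minutes old.
            // "enriched-good" is a placeholder stream name.
            cw.putMetricAlarm(PutMetricAlarmRequest.builder()
                    .alarmName("es-loader-iterator-age")
                    .namespace("AWS/Kinesis")
                    .metricName("GetRecords.IteratorAgeMilliseconds")
                    .dimensions(Dimension.builder().name("StreamName").value("enriched-good").build())
                    .statistic(Statistic.MAXIMUM)
                    .period(300)                     // 5-minute evaluation window
                    .evaluationPeriods(1)
                    .threshold(15 * 60 * 1000.0)     // 15 minutes, in milliseconds
                    .comparisonOperator(ComparisonOperator.GREATER_THAN_THRESHOLD)
                    .build());
        }
    }
}
```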
Please see the screenshots attached to illustrate the problem. The upper chart shows incoming bytes, per shard, into the source Kinesis stream. The bottom chart shows the records processed per shard by the ES loader’s KCL app.
In the first screenshot, before the spikes on the bottom chart, incoming bytes are evenly distributed across all the shards.
In the second screenshot, before the spikes on the bottom chart, there are no records-processed metrics for some of the shards.
In the third screenshot, right after the spike on the bottom chart, the missing metrics have appeared.
Thank you! We didn’t see any warnings or errors at the INFO logging level, and the issue still persists. We reached out to AWS support and provided them with the DEBUG-level logs; they escalated to their engineering team for investigation.
The problem you describe seems very similar to this one. We were seeing that problem with oversized events. May I ask if you have any estimate of the average size of the items sent to the Elasticsearch Loader?
Another thing we can recommend at the moment is using version 1.0.7. We think this problem started with 2.0.0. If you switch to 1.0.7 and hit this problem again, could you let us know, please?
AWS engineers investigated the logs and found situations where our ES cluster sporadically responds with Forbidden to a write request. Soon after that we begin to see log lines like:
2022-07-29 09:48:11:938 +0000 [main] DEBUG com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer - Previous PROCESS task still pending for shard shardId-00000000000X since MILLISECONDS ms ago. Not submitting new task.
where the MILLISECONDS value keeps growing.
Other shards can keep working normally, i.e. their records keep being loaded to the cluster, but shard X is stuck.
From AWS:
KCL does not populate metrics when there is no activity. KCL waits for the customer’s record processor to return before it will attempt to retrieve any more records from Kinesis. Here the record processor faced authorization issues from OpenSearch and could not complete its work.
To prevent KCL from getting blocked, make sure the RecordProcessor is not stuck when it faces an authorization issue from the ES cluster.
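To make that failure mode concrete, here is a minimal sketch of a KCL 1.x record processor that bounds its retries instead of blocking forever. This is not how the Snowplow loader is implemented; `indexBatch` is a hypothetical stand-in for the loader’s Elasticsearch write. It only illustrates the contract AWS describes: `processRecords` must eventually return, otherwise KCL stops fetching from that shard and logs the “Previous PROCESS task still pending” message above.

```java
import java.util.List;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;
import com.amazonaws.services.kinesis.model.Record;

public class BoundedRetryProcessor implements IRecordProcessor {
    private static final int MAX_ATTEMPTS = 5;
    private String shardId;

    @Override
    public void initialize(InitializationInput input) {
        this.shardId = input.getShardId();
    }

    @Override
    public void processRecords(ProcessRecordsInput input) {
        List<Record> records = input.getRecords();
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                indexBatch(records);                   // hypothetical ES/OpenSearch write
                input.getCheckpointer().checkpoint();  // only checkpoint on success
                return;
            } catch (Exception e) {                    // e.g. a 403 from the cluster
                // Back off, but give up after MAX_ATTEMPTS so this method returns
                // and KCL can keep scheduling work for this shard.
                sleepQuietly(1000L * attempt);
            }
        }
        // Retries exhausted: surface the failure (dead-letter, alert, etc.) and return
        // without checkpointing, so the batch is re-delivered later rather than
        // blocking the shard indefinitely.
    }

    @Override
    public void shutdown(ShutdownInput input) {
        // Checkpointing on graceful shutdown omitted for brevity.
    }

    private void indexBatch(List<Record> records) {
        // Placeholder for the actual bulk write to the cluster.
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }
}
```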
I understand that the cluster’s Forbidden response is an issue in itself on our end. Please let me know if you need more info.
Response from AWS support, which suggests the ES loader itself may be responsible for the 403 responses from the ES cluster:
After reviewing your response I can see that you are having occasional authentication issues when making HTTP requests to your OpenSearch cluster.
A 403 error is due to authorization and is not indicative of any issues with the domain. This specific error relates to the fact that the provided signature on the HTTP request does not match what the service has computed [1].
I have reviewed the GitHub link you provided and have identified that the software is using its own calculation for the AWS signature. Because of this, it might be worth creating an issue in the software provider’s repo and providing the following information for a failed request to allow the owners of the repo to troubleshoot:
HTTP Request Method
Parameters of the request
HTTP Request Headers
HTTP Request Body
As the client being used to sign these requests has not been developed by AWS, it is outside the scope of AWS support.
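In case it helps with that investigation, here is a rough sketch of signing an equivalent request with the AWS SDK’s own SigV4 implementation (`Aws4Signer` from the Java SDK v2), so the method, path, headers and body it produces can be compared against what the loader actually sends. The endpoint, index and bulk payload are placeholders:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.auth.signer.Aws4Signer;
import software.amazon.awssdk.auth.signer.params.Aws4SignerParams;
import software.amazon.awssdk.http.SdkHttpFullRequest;
import software.amazon.awssdk.http.SdkHttpMethod;
import software.amazon.awssdk.regions.Region;

public class SignedRequestDump {
    public static void main(String[] args) {
        String body = "{\"index\":{}}\n{\"field\":\"value\"}\n";  // placeholder bulk payload

        // Placeholder OpenSearch endpoint and index.
        SdkHttpFullRequest unsigned = SdkHttpFullRequest.builder()
                .method(SdkHttpMethod.POST)
                .protocol("https")
                .host("my-domain.eu-west-1.es.amazonaws.com")
                .encodedPath("/my-index/_bulk")
                .putHeader("Content-Type", "application/json")
                .contentStreamProvider(() ->
                        new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)))
                .build();

        Aws4SignerParams params = Aws4SignerParams.builder()
                .awsCredentials(DefaultCredentialsProvider.create().resolveCredentials())
                .signingName("es")               // signing name for Amazon OpenSearch Service
                .signingRegion(Region.EU_WEST_1)
                .build();

        SdkHttpFullRequest signed = Aws4Signer.create().sign(unsigned, params);

        // The items AWS support asked for, as computed by the SDK's signer:
        System.out.println("Method:  " + signed.method());
        System.out.println("Path:    " + signed.encodedPath());
        System.out.println("Headers: " + signed.headers());  // includes Authorization and X-Amz-Date
        System.out.println("Body:    " + body);
    }
}
```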
For anyone coming to this thread: we’ve been discussing the same thing over in this GitHub issue thread. Summary: upgrading to 2.0.7 didn’t fix things yet, so we need to dig into it a bit deeper.