Enricher doesn't scale out correctly and keeps getting errors

Hi,

We run the Snowplow collector and enricher Docker containers on ECS. Both have 4 tasks running in parallel with 2 vCPUs and 4 GB of RAM each, but when I add more tasks to the enricher (for instance 6 tasks instead of 4), I do not see any difference in the total amount of data read by the enrichers. It seems like adding more enricher tasks does not help to read more data from the Kinesis stream after the collector. Any idea why?

Also, looking at the logs of the enrichers, I see these 2 errors coming back all the time:
1) [RecordProcessor-0005] ERROR com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Caught shutdown exception, skipping checkpoint.
2) com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard.
Does anyone have an idea of what could cause these errors?

Thanks in advance,

Alexandre

Hi @Alexandre5602,

It seems like adding more enricher tasks does not help to read more data from the Kinesis stream after the collector. Any idea why?

How many shards do you have in the Kinesis stream that holds the collector payloads? A shard can be consumed by only one instance at a time, so if you have more Enrich instances than shards, the additional instances will do nothing (you should be able to confirm that from the logs).
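
If it helps, a quick way to check the open shard count is something like this (a boto3 sketch; the stream name below is a placeholder for your raw stream):

import boto3

kinesis = boto3.client("kinesis")

# "collector-payloads" is a placeholder for the name of the raw stream
# that the collector writes to and Enrich reads from.
summary = kinesis.describe_stream_summary(StreamName="collector-payloads")
print("open shards:", summary["StreamDescriptionSummary"]["OpenShardCount"])

If the open shard count is lower than the number of Enrich tasks, the extra tasks will sit idle.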

Also, looking at the logs of the enrichers, I see these 2 errors coming back all the time:
1) [RecordProcessor-0005] ERROR com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Caught shutdown exception, skipping checkpoint.
2) com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard.
Does anyone have an idea of what could cause these errors?

It might be that the additional instances try to steal the leases of the other instances.
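
To see which worker currently holds which lease, you can look at the KCL lease table in DynamoDB (a boto3 sketch; the table name defaults to the appName configured for Enrich, so "snowplow-enrich" below is a placeholder):

import boto3

dynamodb = boto3.client("dynamodb")

# The KCL keeps one item per shard in its DynamoDB lease table.
# Scanning it shows which worker currently owns each lease.
paginator = dynamodb.get_paginator("scan")
for page in paginator.paginate(TableName="snowplow-enrich"):
    for item in page["Items"]:
        shard = item["leaseKey"]["S"]
        owner = item.get("leaseOwner", {}).get("S", "(unassigned)")
        print(shard, "->", owner)

Each shard should be owned by exactly one worker; if the owners keep changing between scans, the workers are fighting over the leases.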

We’ve been seeing a similar issue, I think. I’ve attached the log below. I’m running the enricher on EKS, and we do have more shards than containers. This was first spotted when reviewing the iterator age on the collected good stream.

We don’t have on-demand streams and scale them using a Lambda. I’ve noticed this happens when scaling operations occur, but that isn’t always the case.
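
For reference, the iterator age over a recent window can be pulled with something like this, to correlate it with the scaling runs (a boto3 sketch; the stream name and time window are placeholders):

import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Maximum iterator age on the stream over the last 6 hours, in 5-minute buckets.
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=6)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "collector-good"}],
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])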

Do you have any advice on this?

[LeaseCoordinator-0000] INFO com.amazonaws.services.kinesis.leases.impl.LeaseTaker - Worker snowplowenrichereks-5c5db89d44-8wh57:048304f2-3847-4757-bfdc-2e123a78abcc successfully took 2 leases: shardId-000000002321, shardId-000000002319
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Created new shardConsumer for : ShardInfo [shardId=shardId-000000002321, concurrencyToken=8899206d-efe2-4c42-a190-b84216fe6c3b, parentShardIds=[shardId-000000002292, shardId-000000002293], checkpoint={SequenceNumber: 49651560759438308688055198851092527528857757062172545298,SubsequenceNumber: 0}]
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Created new shardConsumer for : ShardInfo [shardId=shardId-000000002319, concurrencyToken=52667d32-7f2b-4c8e-99be-a5c55ffcbb1e, parentShardIds=[shardId-000000002288, shardId-000000002289], checkpoint={SequenceNumber: 49651552554860345677459711552456726607503694522889572594,SubsequenceNumber: 0}]
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No lease found for shard shardId-000000002292. Not blocking on completion of this shard.
[RecordProcessor-0058] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No lease found for shard shardId-000000002288. Not blocking on completion of this shard.
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No lease found for shard shardId-000000002293. Not blocking on completion of this shard.
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No need to block on parents [shardId-000000002292, shardId-000000002293] of shard shardId-000000002321
[RecordProcessor-0058] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No lease found for shard shardId-000000002289. Not blocking on completion of this shard.
[RecordProcessor-0058] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No need to block on parents [shardId-000000002288, shardId-000000002289] of shard shardId-000000002319
[RecordProcessor-0058] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher - Initializing shard shardId-000000002321 with 49651560759438308688055198851092527528857757062172545298
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher - Initializing shard shardId-000000002319 with 49651552554860345677459711552456726607503694522889572594
[RecordProcessor-0058] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Initializing record processor for shard: shardId-000000002321
[RecordProcessor-0056] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Initializing record processor for shard: shardId-000000002319
[pool-7-thread-1] INFO com.amazonaws.services.kinesis.leases.impl.LeaseCleanupManager - Number of pending leases to clean before the scan : 0
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Current stream shard assignments: shardId-000000002331, shardId-000000002321, shardId-000000002332, shardId-000000002333, shardId-000000002328, shardId-000000002318, shardId-000000002329, shardId-000000002319
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Sleeping ...
[LeaseRenewer-0001] INFO com.amazonaws.services.kinesis.leases.impl.LeaseRenewer - Worker snowplowenrichereks-5c5db89d44-8wh57:048304f2-3847-4757-bfdc-2e123a78abcc lost lease with key shardId-000000002333
[pool-7-thread-1] INFO com.amazonaws.services.kinesis.leases.impl.LeaseCleanupManager - Number of pending leases to clean before the scan : 0
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator - Worker snowplowenrichereks-5c5db89d44-8wh57:048304f2-3847-4757-bfdc-2e123a78abcc could not update checkpoint for shard shardId-000000002333 because it does not hold the lease
[RecordProcessor-0056] ERROR com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Caught shutdown exception, skipping checkpoint.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:198)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:307)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:85)
at com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource$RawEventProcessor.$anonfun$checkpoint$2(KinesisSource.scala:242)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource$RawEventProcessor.$anonfun$checkpoint$1(KinesisSource.scala:240)
at scala.util.control.Breaks.breakable(Breaks.scala:42)
at com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource$RawEventProcessor.checkpoint(KinesisSource.scala:240)
at com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource$RawEventProcessor.processRecords(KinesisSource.scala:219)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.processRecords(V1ToV2RecordProcessorAdapter.java:42)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.callProcessRecords(ProcessTask.java:221)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:176)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask - Invoking shutdown() for shard shardId-000000002333, concurrencyToken: b20911cd-529d-4145-9c77-971afc8abd10, original Shutdown reason: ZOMBIE. childShards:null
[RecordProcessor-0056] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Shutting down record processor for shard: shardId-000000002333
[LeaseRenewer-0000] INFO com.amazonaws.services.kinesis.leases.impl.LeaseRenewer - Worker snowplowenrichereks-5c5db89d44-8wh57:048304f2-3847-4757-bfdc-2e123a78abcc lost lease with key shardId-000000002321
[pool-5-thread-1] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.DeterministicShuffleShardSyncLeaderDecider - Elected leaders: snowplowenrichereks-5c5db89d44-qlj7f:7c8282b7-cb0e-4640-83e8-68125d3d853d
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Current stream shard assignments: shardId-000000002331, shardId-000000002332, shardId-000000002328, shardId-000000002318, shardId-000000002329, shardId-000000002319
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Sleeping ...