Enricher doesn't scale out correctly and keeps getting errors

Hi,

We run the Snowplow collector and enricher Docker containers on ECS. Both have 4 tasks running in parallel with 2 vCPUs and 4 GB of RAM each, but when I add more tasks to the enricher (for instance 6 tasks instead of 4), I do not see any difference in the total amount of data read by the enrichers. It seems like adding more enricher tasks does not help to read more data from the Kinesis stream after the collector. Any idea why?

Also, looking at the logs of the enrichers, I see these 2 errors coming back all the time:
1) [RecordProcessor-0005] ERROR com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Caught shutdown exception, skipping checkpoint.
2) com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard.
Does anyone have an idea of what could cause these errors?

Thanks in advance,

Alexandre

Hi @Alexandre5602,

It seems like adding more enricher tasks does not help to read more data from the Kinesis stream after the collector. Any idea why?

How many shards do you have in the Kinesis stream that holds the collector payloads? A shard can be consumed by only one instance at a time, so if you have more Enrich instances than shards, the additional instances will do nothing (you should be able to confirm that from the logs).
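
If it helps, a quick way to check the open shard count is something like this (a boto3 sketch; the stream name below is a placeholder for your raw stream):

import boto3

kinesis = boto3.client("kinesis")

# "collector-payloads" is a placeholder for the name of the raw stream
# that the collector writes to and Enrich reads from.
summary = kinesis.describe_stream_summary(StreamName="collector-payloads")
print("open shards:", summary["StreamDescriptionSummary"]["OpenShardCount"])

If the open shard count is lower than the number of Enrich tasks, the extra tasks will sit idle.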

Also, looking at the logs of the enrichers, I see these 2 errors coming back all the time:
1) [RecordProcessor-0005] ERROR com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Caught shutdown exception, skipping checkpoint.
2) com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard.
Does anyone have an idea of what could cause these errors?

It might be that the additional instances try to steal the leases of the other instances.
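
To see which worker currently holds which lease, you can look at the KCL lease table in DynamoDB (a boto3 sketch; the table name defaults to the appName configured for Enrich, so "snowplow-enrich" below is a placeholder):

import boto3

dynamodb = boto3.client("dynamodb")

# The KCL keeps one item per shard in its DynamoDB lease table.
# Scanning it shows which worker currently owns each lease.
paginator = dynamodb.get_paginator("scan")
for page in paginator.paginate(TableName="snowplow-enrich"):
    for item in page["Items"]:
        shard = item["leaseKey"]["S"]
        owner = item.get("leaseOwner", {}).get("S", "(unassigned)")
        print(shard, "->", owner)

Each shard should be owned by exactly one worker; if the owners keep changing between scans, the workers are fighting over the leases.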

We’ve been seeing a similar issue, I think. I’ve attached the log below. I’m running the enricher on EKS, and we do have more shards than containers. This was first spotted when reviewing the iterator age on the collected good stream.

We don’t have on-demand streams and scale them using a Lambda. I’ve noticed this happens when scaling operations occur, but that isn’t always the case.
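
For reference, the iterator age over a recent window can be pulled with something like this, to correlate it with the scaling runs (a boto3 sketch; the stream name and time window are placeholders):

import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Maximum iterator age on the stream over the last 6 hours, in 5-minute buckets.
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=6)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "collector-good"}],
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])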

Do you have any advice on this?

[LeaseCoordinator-0000] INFO com.amazonaws.services.kinesis.leases.impl.LeaseTaker - Worker snowplowenrichereks-5c5db89d44-8wh57:048304f2-3847-4757-bfdc-2e123a78abcc successfully took 2 leases: shardId-000000002321, shardId-000000002319
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Created new shardConsumer for : ShardInfo [shardId=shardId-000000002321, concurrencyToken=8899206d-efe2-4c42-a190-b84216fe6c3b, parentShardIds=[shardId-000000002292, shardId-000000002293], checkpoint={SequenceNumber: 49651560759438308688055198851092527528857757062172545298,SubsequenceNumber: 0}]
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Created new shardConsumer for : ShardInfo [shardId=shardId-000000002319, concurrencyToken=52667d32-7f2b-4c8e-99be-a5c55ffcbb1e, parentShardIds=[shardId-000000002288, shardId-000000002289], checkpoint={SequenceNumber: 49651552554860345677459711552456726607503694522889572594,SubsequenceNumber: 0}]
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No lease found for shard shardId-000000002292. Not blocking on completion of this shard.
[RecordProcessor-0058] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No lease found for shard shardId-000000002288. Not blocking on completion of this shard.
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No lease found for shard shardId-000000002293. Not blocking on completion of this shard.
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No need to block on parents [shardId-000000002292, shardId-000000002293] of shard shardId-000000002321
[RecordProcessor-0058] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No lease found for shard shardId-000000002289. Not blocking on completion of this shard.
[RecordProcessor-0058] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockOnParentShardTask - No need to block on parents [shardId-000000002288, shardId-000000002289] of shard shardId-000000002319
[RecordProcessor-0058] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher - Initializing shard shardId-000000002321 with 49651560759438308688055198851092527528857757062172545298
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher - Initializing shard shardId-000000002319 with 49651552554860345677459711552456726607503694522889572594
[RecordProcessor-0058] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Initializing record processor for shard: shardId-000000002321
[RecordProcessor-0056] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Initializing record processor for shard: shardId-000000002319
[pool-7-thread-1] INFO com.amazonaws.services.kinesis.leases.impl.LeaseCleanupManager - Number of pending leases to clean before the scan : 0
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Current stream shard assignments: shardId-000000002331, shardId-000000002321, shardId-000000002332, shardId-000000002333, shardId-000000002328, shardId-000000002318, shardId-000000002329, shardId-000000002319
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Sleeping ...
[LeaseRenewer-0001] INFO com.amazonaws.services.kinesis.leases.impl.LeaseRenewer - Worker snowplowenrichereks-5c5db89d44-8wh57:048304f2-3847-4757-bfdc-2e123a78abcc lost lease with key shardId-000000002333
[pool-7-thread-1] INFO com.amazonaws.services.kinesis.leases.impl.LeaseCleanupManager - Number of pending leases to clean before the scan : 0
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator - Worker snowplowenrichereks-5c5db89d44-8wh57:048304f2-3847-4757-bfdc-2e123a78abcc could not update checkpoint for shard shardId-000000002333 because it does not hold the lease
[RecordProcessor-0056] ERROR com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Caught shutdown exception, skipping checkpoint.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:198)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:307)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:85)
at com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource$RawEventProcessor.$anonfun$checkpoint$2(KinesisSource.scala:242)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource$RawEventProcessor.$anonfun$checkpoint$1(KinesisSource.scala:240)
at scala.util.control.Breaks.breakable(Breaks.scala:42)
at com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource$RawEventProcessor.checkpoint(KinesisSource.scala:240)
at com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource$RawEventProcessor.processRecords(KinesisSource.scala:219)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.processRecords(V1ToV2RecordProcessorAdapter.java:42)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.callProcessRecords(ProcessTask.java:221)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:176)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
[RecordProcessor-0056] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask - Invoking shutdown() for shard shardId-000000002333, concurrencyToken: b20911cd-529d-4145-9c77-971afc8abd10, original Shutdown reason: ZOMBIE. childShards:null
[RecordProcessor-0056] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Shutting down record processor for shard: shardId-000000002333
[LeaseRenewer-0000] INFO com.amazonaws.services.kinesis.leases.impl.LeaseRenewer - Worker snowplowenrichereks-5c5db89d44-8wh57:048304f2-3847-4757-bfdc-2e123a78abcc lost lease with key shardId-000000002321
[pool-5-thread-1] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.DeterministicShuffleShardSyncLeaderDecider - Elected leaders: snowplowenrichereks-5c5db89d44-qlj7f:7c8282b7-cb0e-4640-83e8-68125d3d853d
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Current stream shard assignments: shardId-000000002331, shardId-000000002332, shardId-000000002328, shardId-000000002318, shardId-000000002329, shardId-000000002319
[main] INFO com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker - Sleeping ...