S3 Loader cannot update checkpoint error

Hi,

We are receiving this error on the S3 loader module 2-3 times per day:

Is this something to worry about? Will the data still be processed and moved to S3, or could we have data loss there? How can we prevent this from happening, in our Kinesis stream or S3 loader configuration?

Hey @mgloel,

Is this something to worry about? Will the data still be processed and moved to S3, or could we have data loss there? How can we prevent this from happening, in our Kinesis stream or S3 loader configuration?

This is generally due to scaling actions in your consumer group, with shards being rebalanced across the available consumers. Is any scaling activity happening close to when this error occurs?

In terms of data loss, no, there should not be any: we only progress on a shard after a successful checkpoint. As such, in this case you might see duplicates entering your bucket, but there should not be any data loss.
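To make that ordering concrete, here is a minimal sketch of the "write first, checkpoint second" flow. This is not the actual S3 loader source; `flushToS3`, `checkpoint`, and the `leaseLost` flag are hypothetical stand-ins for the real KCL calls:

```scala
object CheckpointOrdering {

  final case class Record(sequenceNumber: Long, payload: String)

  /** Pretend S3 write: always succeeds in this sketch. */
  def flushToS3(batch: Seq[Record]): Unit =
    println(s"wrote ${batch.size} records up to seq ${batch.last.sequenceNumber}")

  /** Pretend checkpoint: fails when the shard lease was lost (e.g. during
    * a rebalance), mirroring the "cannot update checkpoint" error. */
  def checkpoint(seq: Long, leaseLost: Boolean): Either[String, Long] =
    if (leaseLost) Left(s"cannot update checkpoint at seq $seq: lease lost")
    else Right(seq)

  def main(args: Array[String]): Unit = {
    val batch = (1L to 5L).map(n => Record(n, s"event-$n"))

    // 1. Flush the batch to S3 first...
    flushToS3(batch)

    // 2. ...then checkpoint. If this fails, the data is ALREADY in S3; the
    //    next lease owner restarts from the previous checkpoint and re-writes
    //    the same records: duplicates, but no data loss.
    checkpoint(batch.last.sequenceNumber, leaseLost = true) match {
      case Right(seq) => println(s"checkpointed at $seq")
      case Left(err)  => println(s"$err -> batch re-processed (duplicates, not loss)")
    }
  }
}
```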


Great, thanks for the info.
Just one question regarding the potential duplicates: they will be removed by the shredder deduplication, right?

They should be, yes, but I'll see if @anton can add any extra details on deduplication.

Hi @mgloel,

They will be removed by the shredder deduplication, right?

Very likely, yes, but there is a small chance they won't be. In short: if the duplicates end up in the same batch, they are certainly removed. If not, e.g. if S3DistCp starts between the flushes of the two files containing the duplicates, they will only be deduplicated if you have cross-batch deduplication enabled.
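As a rough illustration of the in-batch case, here is a small sketch assuming natural duplicates are events sharing both `event_id` and `event_fingerprint`. The field names are illustrative, not the shredder's actual code:

```scala
object InBatchDedup {

  final case class Event(eventId: String, fingerprint: String, payload: String)

  /** Keep only one event per (event_id, event_fingerprint) pair. */
  def dedupe(batch: Seq[Event]): Seq[Event] =
    batch
      .groupBy(e => (e.eventId, e.fingerprint))
      .values
      .map(_.head) // first occurrence within each group survives
      .toSeq

  def main(args: Array[String]): Unit = {
    val batch = Seq(
      Event("id-1", "fp-a", "original"),
      Event("id-1", "fp-a", "duplicate from a re-processed shard"), // natural dup
      Event("id-2", "fp-b", "unrelated event")
    )
    dedupe(batch).foreach(println) // the second id-1/fp-a event is dropped
  }
}
```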

You can find more details about deduplication here:

https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/event-deduplication/

(You're interested in natural in-batch and cross-batch deduplication, not synthetic.)
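And a similarly hedged sketch of the cross-batch case: a manifest remembers the `(event_id, event_fingerprint)` pairs seen in earlier batches, so a duplicate landing in a later batch can still be caught. The real pipeline keeps this manifest in DynamoDB; the in-memory set below is just a stand-in. Without such a manifest, only in-batch duplicates are removed:

```scala
import scala.collection.mutable

object CrossBatchDedup {

  final case class Event(eventId: String, fingerprint: String)

  // Stand-in for the DynamoDB manifest table.
  private val manifest = mutable.Set.empty[(String, String)]

  /** Returns events not seen in any previous batch, recording the new ones. */
  def dedupeAcrossBatches(batch: Seq[Event]): Seq[Event] =
    batch.filter { e =>
      manifest.add((e.eventId, e.fingerprint)) // false if already present
    }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq(Event("id-1", "fp-a"))
    val batch2 = Seq(Event("id-1", "fp-a"), Event("id-2", "fp-b")) // cross-batch dup

    println(dedupeAcrossBatches(batch1)) // List(Event(id-1,fp-a))
    println(dedupeAcrossBatches(batch2)) // List(Event(id-2,fp-b)): dup filtered
  }
}
```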
