Hello,
I just discovered that my RDB Stream Shredder has stopped working since Nov 16 (my bad for not monitoring it) with the following error: java.lang.OutOfMemoryError: Java heap space. It’s weird because it had been working without any issue for more than a month prior to that date (see my other topic). It could be due to an increase in traffic during this period, but I’m not sure.
I’m adding the logs at the end.
It was running on a t3a.small and I just tried to upgrade it to a t3a.medium and then to a t3a.large, without success: after a few minutes the error comes back.
I eventually installed htop on the instance, and the memory usage is not even close to reaching the instance’s limit.
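One thing I realise I have never set explicitly is the JVM heap size. If the container is running with the JVM’s default maximum heap (typically only a fraction of the instance RAM), upgrading the instance wouldn’t raise that limit, which would explain why htop shows plenty of free memory while the process still dies with Java heap space. A minimal sketch of what I’m planning to try in the Dockerfile, assuming the base image runs a standard HotSpot JVM that picks up JAVA_TOOL_OPTIONS (the variable and the sizes are my own guess, not something I found in the Snowplow docs):

FROM snowplow/snowplow-rdb-stream-shredder:1.2.0
# Required by Beanstalk
EXPOSE 8000
# Assumption: a HotSpot JVM reads JAVA_TOOL_OPTIONS automatically
# ("Picked up JAVA_TOOL_OPTIONS: ..." should show up in the logs).
# Example sizing for a t3a.large (8 GiB of RAM); the exact numbers are a guess.
ENV JAVA_TOOL_OPTIONS="-Xms2g -Xmx6g"
# ...same CMD as below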
Some more details:
- I’m running the shredder in AWS Elastic Beanstalk Docker
- I have 1 day of data retention on the Kinesis stream, so I guess when I restart the shredder now it shreds the events from yesterday (as it had been failing since Nov 16 and I only saw the problem today) - see the sketch right after this list
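Since the stream is read with TRIM_HORIZON and the KCL checkpoints into the DynamoDB table named by appName, I suppose that on restart it tries to chew through the whole remaining backlog in one go. If that backlog turns out to be the trigger, one test I could run is to start from the tip of the stream instead, accepting that the unprocessed events would be skipped. A sketch of the change in the shredder input block, assuming (as far as I understand the KCL) that the initial position is only honoured when there is no existing checkpoint, so a fresh appName / DynamoDB table would also be needed (DYNAMO_TABLE_NEW is a made-up variable name):

"input": {
  "type": "kinesis",
  # fresh KCL app name so there is no old checkpoint to resume from
  "appName": ${DYNAMO_TABLE_NEW},
  "streamName": ${KINESIS_STREAM_ENRICHER_GOOD},
  "region": ${KINESIS_REGION},
  # start from the tip of the stream instead of the oldest available records
  "position": "LATEST"
}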
Dockerfile:
FROM snowplow/snowplow-rdb-stream-shredder:1.2.0
# Required by Beanstalk
EXPOSE 8000
CMD [\
    "--config", \
    # Must be in base 64. To generate it: base64 config.hocon
    "ewogICMgSHVtYW4t...", \
    "--iglu-config", \
    # Must be in base 64. To generate it: base64 iglu-resolver.json
    "ewogICJzY2hlbWEiOiAia..." \
]
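In case the encoding matters: GNU coreutils base64 wraps its output at 76 characters by default, and a wrapped (multi-line) value would not work as a single CMD argument, so the safe way to produce those strings is with wrapping disabled (the -w0 flag is GNU-specific, so this is an assumption about the encoding environment):

# encode the configs as single-line base64 strings (GNU coreutils)
base64 -w0 config.hocon
base64 -w0 iglu-resolver.json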
config.hocon:
{
  # Human-readable identifier, can be random
  "name": "Acme Redshift",
  # Machine-readable unique identifier, must be a UUID
  "id": "123e4567-e89b-12d3-a456-426655440000",
  # Data Lake (S3) region
  "region": ${S3_REGION},
  # SQS topic name used by Shredder and Loader to communicate
  "messageQueue": ${SQS_TOPIC},
  # Shredder-specific configs
  "shredder": {
    # "batch" for Spark job and "stream" for fs2 streaming app
    "type" : "stream",
    # For batch: path to enriched archive (must be populated separately with run=YYYY-MM-DD-hh-mm-ss directories) for S3 input
    # "input": "s3://bucket/input/",
    # For stream: appName, streamName, region triple for kinesis
    "input": {
      # kinesis and file are the only options for stream shredder
      "type": "kinesis",
      # KCL app name - a DynamoDB table will be created with the same name
      "appName": ${DYNAMO_TABLE},
      # Kinesis Stream name
      "streamName": ${KINESIS_STREAM_ENRICHER_GOOD},
      # Kinesis region
      "region": ${KINESIS_REGION},
      # Kinesis position: LATEST or TRIM_HORIZON
      "position": "TRIM_HORIZON"
    },
    # For stream shredder: frequency to emit loading finished message - 5, 10, 15, 20, 30, 60 etc. minutes
    "windowing": "5 minutes",
    # Path to shredded archive
    "output": {
      # Path to shredded output
      "path": ${S3_BUCKET_SCHREDDED},
      # Shredder output compression, GZIP or NONE
      "compression": "GZIP"
    }
  },
  # Schema-specific format settings (recommended to leave all three groups empty and use TSV as default)
  "formats": {
    # Format used by default (TSV or JSON)
    "default": "TSV",
    # Schemas to be shredded as JSONs, corresponding JSONPath files must be present. Automigrations will be disabled
    "json": [
      "iglu:com.acme/json-event/jsonschema/1-0-0",
      "iglu:com.acme/json-event/jsonschema/2-*-*"
    ],
    # Schemas to be shredded as TSVs, presence of the schema on Iglu Server is necessary. Automigrations enabled
    "tsv": [ ],
    # Schemas that won't be loaded
    "skip": [
      "iglu:com.acme/skip-event/jsonschema/1-*-*"
    ]
  },
  # Optional. S3 path that holds JSONPaths
  # "jsonpaths": "s3://bucket/jsonpaths/",
  # Warehouse connection details
  "storage" : {
    # Database, redshift is the only acceptable option
    "type": "redshift",
    # Redshift hostname
    "host": ${REDSHIFT_HOST},
    # Database name
    "database": ${REDSHIFT_DATABASE},
    # Database port
    "port": 5439,
    # AWS Role ARN allowing Redshift to load data from S3
    "roleArn": ${REDSHIFT_IAM_ROLE_ARN},
    # DB schema name
    "schema": ${REDSHIFT_SCHEMA},
    # DB user with permissions to load data
    "username": ${REDSHIFT_USER},
    # DB password
    "password": ${REDSHIFT_PASSWORD},
    # Custom JDBC configuration
    "jdbc": {"ssl": true},
    # MAXERROR, amount of acceptable loading errors
    "maxError": 10
  },
  # Additional steps. analyze, vacuum and transit_load are valid values
  "steps": ["analyze"],
  # Observability and reporting options
  "monitoring": {
    # # Snowplow tracking (optional)
    # "snowplow": {
    #   "appId": "redshift-loader",
    #   "collector": "snplow.acme.com",
    # }
    # # Optional, for tracking runtime exceptions
    # "sentry": {
    #   "dsn": "http://sentry.acme.com"
    # },
    # # Optional, configure how metrics are reported
    # "metrics": {
    #   # Optional, send metrics to StatsD server
    #   "statsd": {
    #     "hostname": "localhost",
    #     "port": 8125,
    #     # Any key-value pairs to be tagged on every StatsD metric
    #     "tags": {
    #       "app": "rdb-loader"
    #     }
    #     # Optional, override the default metric prefix
    #     # "prefix": "snowplow.rdbloader."
    #   },
    #   # Optional, print metrics on stdout (with slf4j)
    #   "stdout": {
    #     # Optional, override the default metric prefix
    #     # "prefix": "snowplow.rdbloader."
    #   }
    # }
  }
}
Logs when I discovered the incident: https://pastebin.com/raw/Ezzd4d9K
Note the last log: INFO software.amazon.kinesis.leases.LeaseCleanupManager - Number of pending leases to clean before the scan : 0. This message had been looping from Nov 16 to Dec 01, and there was nothing else at the tail of the logs.
Logs when I restart the shredder, after a while: https://pastebin.com/raw/gvvuCeuc
It works for about 5-10 minutes and eventually java.lang.OutOfMemoryError: Java heap space comes back.
Any idea how I could solve the issue?
I might try to create a new Kinesis stream and SQS queue in the meantime, in case the problem comes from there.
Edit: the last restart failed again but with a different error, and what’s interesting in this case is that it eventually shuts itself down, so Docker was able to restart it automatically: https://pastebin.com/raw/Yw7bsQXD
Edit2: there is a new error about the KinesisAsyncClient: https://pastebin.com/raw/RD5g3YPW. It’s now looping on this error.
Edit3: from what I’m starting to understand:
- After a few minutes there is a message: INFO software.amazon.kinesis.coordinator.PeriodicShardSyncManager - Skipping shard sync for ${kinesis_stream} due to the reason - Hash Ranges are complete for ${kinesis_stream} (I don’t know what that means, though; maybe a sync issue between Kinesis, DynamoDB and the Shredder?)
- It is directly followed by something about “elected leaders”: INFO software.amazon.kinesis.coordinator.DeterministicShuffleShardSyncLeaderDecider - Elected leaders: ${uuid}
- That is directly followed by an attempt to send some metrics to CloudWatch, which always fails: WARN software.amazon.kinesis.metrics.CloudWatchMetricsPublisher - Could not publish 20 datums to CloudWatch ... java.util.concurrent.TimeoutException (is there something to configure in the hocon, or do I need to modify the IAM role?)
- From there it stays in error forever, and eventually java.lang.OutOfMemoryError: Java heap space messages start to appear.
Edit4: I added cloudwatch:* on resource:* to the IAM role and sending the metrics to CloudWatch still fails: https://pastebin.com/raw/3BfMTRna. Or maybe it’s a more general network issue (all the ping and connection errors)? It seems to be restarting by itself though, so at least it keeps shredding (for now).
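For reference, the statement I added to the instance role is essentially the wildcard below; I believe the KCL metrics publisher only needs cloudwatch:PutMetricData, so this should be more than enough on the permission side (which is why I now suspect the network rather than IAM):

{
  "Effect": "Allow",
  "Action": "cloudwatch:*",
  "Resource": "*"
}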