Issue while upgrading the collector from version 2.10.0 to 3.1.0

I am getting the error below very frequently with the Docker image snowplow/scala-stream-collector-kinesis:3.1.0, but it does not happen with snowplow/scala-stream-collector-kinesis:2.10.0.

Issue:

[io-compute-0] ERROR org.http4s.server.service-errors - Error servicing request: POST /com.snowplowanalytics.snowplow/tp2 from 10.x.x.x
org.http4s.InvalidBodyException: Received premature EOF.

From the release notes, I understand that the HTTP framework has changed from Akka HTTP to http4s.

We run the collector application in an ECS Fargate container, with an ALB handling incoming traffic. I cannot get the input payload, since these events are not sent to the collector's bad events Kinesis stream. Can someone help me resolve this issue?

Hey @Vishal_Periyasamy, thank you for reaching out. The collector 3.x series is a major update that has slightly different performance characteristics than its predecessor. We tend to configure and tune it carefully to meet our customers' usage patterns. However, we haven’t seen the error you’re hitting.

To be able to provide any useful suggestions, it’d be good to understand your runtime environment.
What does your ALB configuration look like (idle timeout settings, active connections)?
Have you set any overrides to the networking configuration? Previously this setting was available through the akka.networking section, but it has moved since 3.0.0.
What resources are available to the collector container? Are you overriding any JVM options?

The error you are seeing occurs when the incoming POST connection is shut down before the full request body has been read by the collector. So anything related to network settings and behaviour is of the essence here.

@peel let me know if I need to update the configuration.
Sharing the collector configuration below:

collector {
  license { accept = true }
  interface = "0.0.0.0"
  port = 8080
  ssl {
    enable = false
    redirect = false
    port = 9543
  }
  p3p {
    policyRef = "/w3c/p3p.xml"
    CP = "NOI DSP COR NID PSA OUR IND COM NAV STA"
  }
  crossDomain {
    enabled = false
    domains = [ "*" ]
    secure = true
  }
  cookie {
    enabled = true
    expiration = "365 days" # e.g. "365 days"
    name = collector_cookie
    secure = false
    httpOnly = false
  }
  doNotTrackCookie {
    enabled = false
    name = collector-do-not-track-cookie
    value = collector-do-not-track-cookie-value
  }
  cookieBounce {
    enabled = false
    name = "n3pc"
    fallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"
    forwardedProtocolHeader = "X-Forwarded-Proto"
  }
  enableDefaultRedirect = true
  redirectMacro {
    enabled = false
    placeholder = "[TOKEN]"
  }
  rootResponse {
    enabled = false
    statusCode = 302
    headers = {
      Location = "https://127.0.0.1/",
      X-Custom = "something"
    }
    body = "302, redirecting"
  }
  cors {
    accessControlMaxAge = 5 seconds
  }
  streams {
    good = "collected-good-events-stream"
    bad = "collected-bad-events-stream"
    useIpAddressAsPartitionKey = false
    sink {
      enabled = kinesis
      region = us-east-1
      threadPoolSize = 10
      aws {
        accessKey = default
        secretKey = default
      }
      backoffPolicy {
        minBackoff = 10
        maxBackoff = 10
      }
    }
    buffer {
      byteLimit = 3000000
      recordLimit = 300
      timeLimit = 5000
    }
  }
}
akka {
  loglevel = DEBUG
  loggers = ["akka.event.slf4j.Slf4jLogger"]
  http.server {
    remote-address-header = on
    raw-request-uri-header = on
    parsing {
      max-uri-length = 32768
      uri-parsing-mode = relaxed
    }
  }
}

Hey @Vishal_Periyasamy, thank you for providing collector configuration.

Could you also provide answers to the rest of the questions from @peel? What I mean specifically:

  1. What does your ALB configuration look like, e.g. idle timeout settings, active connections?
  2. What resources are available to the collector container?
  3. Are you overriding any JVM options?

Collector configuration is important, but in general, as stated by @peel, understanding your runtime environment is crucial here.

That would be really helpful and allow us to analyze the problem properly :slight_smile:

  1. We are using the AWS ALB default parameters:
  • Idle timeout: 60 seconds
  • At peak time, the sum of active connections over a 5-minute range is around 2k.
  2. Collector container resources:
CPU: "1024",
memory: "2048"
  • At peak, the CPU usage could be 65%
  • At peak, the memory usage could be 32%
  3. Yes, we are overriding the log level only:
ENV JAVA_OPTS="-Dorg.slf4j.simpleLogger.defaultLogLevel=error"

Hi @pondzix and @peel any update on the above issue?

Hi @pondzix and @peel it’s been more than two weeks, any update on the above issue?

Hi @Vishal_Periyasamy, I believe you could try setting the following configuration:

networking {
    maxConnections = 8126
    idleTimeout = 610 seconds
}

You should also have similar settings on the AWS side so that connections don’t get terminated there, as that is what seems to be happening.
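To compare the two timeouts, a small Python sketch like the one below might help. It assumes the attribute list has the shape returned by boto3's elbv2 `describe_load_balancer_attributes` call (a list of `Key`/`Value` pairs). The rule that the collector's idle timeout should be longer than the ALB's is an assumption based on common load balancer guidance, not something confirmed for the collector, and both function names are hypothetical.

```python
def alb_idle_timeout_seconds(attributes):
    """Extract the ALB idle timeout from a list of {"Key": ..., "Value": ...} dicts.

    The list could come from, for example:
        boto3.client("elbv2").describe_load_balancer_attributes(
            LoadBalancerArn=arn)["Attributes"]
    Returns None when the attribute is not present.
    """
    for attr in attributes:
        if attr["Key"] == "idle_timeout.timeout_seconds":
            return int(attr["Value"])
    return None


def collector_outlives_alb(collector_idle_s, alb_idle_s):
    """Assumed rule of thumb: the backend should keep idle connections
    open longer than the load balancer does, so the ALB never reuses a
    connection that the collector has already closed."""
    return collector_idle_s > alb_idle_s


if __name__ == "__main__":
    # Values taken from this thread: ALB default of 60 s, collector set to 610 s.
    attrs = [{"Key": "idle_timeout.timeout_seconds", "Value": "60"}]
    alb = alb_idle_timeout_seconds(attrs)
    print(alb, collector_outlives_alb(610, alb))
```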

Also, have you observed any relationship between the failure and load?

Another approach that could help investigate this would be enabling trace logging in the collector to see the behaviour and the input that cause the requests to fail.
This can be done by setting the flags -Dorg.slf4j.simpleLogger.showDateTime=true -Dorg.slf4j.simpleLogger.dateTimeFormat=HH:mm:ss.SSSZ -Dorg.slf4j.simpleLogger.log.org.http4s.blaze=TRACE -Dorg.slf4j.simpleLogger.levelInBrackets=true for the collector container.

We’ve been able to reproduce the kind of error you’re reporting with POST requests whose Content-Length is longer than the actual body.

An example that reliably causes this kind of error is the following request:

curl -i --http1.1 -X POST -H 'Content-Length: 10' -H 'Connection: keep-alive' -H 'Referer: https://local2.host/' -H 'Origin: https://local2.host' -H 'Accept: */*' -H 'Content-Type: application/json; charset=UTF-8' -d '{}' 'http://localhost:9090/com.snowplowanalytics.snowplow/tp2'

Can you see requests of this kind in your pipeline?
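If curl is not convenient, the same kind of request can be sketched with a raw socket in Python. This is only an illustration: the function names are hypothetical, the host and port are whatever your collector listens on, and the expectation that closing the socket early produces the "premature EOF" error is based on the explanation earlier in this thread rather than on a verified run.

```python
import socket


def build_request(host, path, body, declared_length):
    """Build a raw HTTP/1.1 POST whose Content-Length may differ from len(body)."""
    head = "\r\n".join([
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/json; charset=UTF-8",
        f"Content-Length: {declared_length}",
        "Connection: keep-alive",
        "",  # blank line that terminates the header block
        "",
    ])
    return head.encode("ascii") + body


def send_truncated_post(host, port,
                        path="/com.snowplowanalytics.snowplow/tp2",
                        body=b"{}", declared_length=10):
    """Send fewer body bytes than declared, then close the connection."""
    request = build_request(host, path, body, declared_length)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
    # Leaving the with-block closes the socket while the server is
    # still waiting for the remaining declared bytes.


if __name__ == "__main__":
    print(build_request("localhost", "/com.snowplowanalytics.snowplow/tp2", b"{}", 10))
```

Calling `send_truncated_post("localhost", 9090)` against a test collector would send the request and drop the connection immediately.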

Hi @peel, the suggested configuration changes didn’t resolve the issue. I tried setting the logger to TRACE level to identify any common patterns of failure. However, I’m having trouble differentiating each request. Is there a way to modify the logging format to include a request_id or any unique_id for distinguishing each log message?

Have you tried the most recent version, 3.2.0?
We believe the unusual behaviour is due to long-standing idle connections that block the server when it receives a Content-Length that is longer than the actual body. The server will wait for the connection to complete the body for the duration of idleTimeout. Tuning the values in the networking section should prevent the issue from happening.
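To make the two outcomes concrete, here is a generic Python model of a server reading a body with a declared length. It is not the collector's or http4s's code, just a sketch under the assumption that the server reads until it has Content-Length bytes; the names `read_body`, `PrematureEOF` and `demo` are made up for illustration.

```python
import socket


class PrematureEOF(Exception):
    """The peer closed the connection before the declared body arrived."""


def read_body(conn, content_length, idle_timeout_s):
    """Read exactly content_length bytes, or fail on early close or idleness."""
    conn.settimeout(idle_timeout_s)
    chunks = []
    remaining = content_length
    while remaining > 0:
        chunk = conn.recv(remaining)  # raises socket.timeout when idle too long
        if not chunk:                 # empty read means the peer closed
            received = content_length - remaining
            raise PrematureEOF(f"expected {content_length} bytes, got {received}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def demo(close_early):
    """Client declares 10 bytes but sends only 2, then closes or stays idle."""
    client, server = socket.socketpair()
    try:
        client.sendall(b"{}")
        if close_early:
            client.close()
        try:
            read_body(server, content_length=10, idle_timeout_s=0.2)
            return "ok"
        except PrematureEOF:
            return "premature EOF"
        except socket.timeout:
            return "idle timeout"
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(demo(close_early=True))
    print(demo(close_early=False))
```

In this model, a client that closes early triggers the EOF error right away, while a client that keeps the connection open holds the reader until the idle timeout expires, which is why a long idle timeout can tie up connections.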

Also, idleTimeout is usually set to a high value in GCP, where the load balancer uses idle connections for pool management. Keeping a long idleTimeout is not encouraged in deployments where it’s not strictly necessary.

Thanks for your response, @peel.
We currently have the connection idle timeout set to 60 seconds. Should we reduce it to less than 60 seconds?

I updated the configuration on both the Snowplow collector and the ALB:

  networking {
    maxConnections = 1024
    idleTimeout = 60 seconds
    responseHeaderTimeout = 5 seconds
    bodyReadTimeout = 5 second
    maxRequestLineLength = 20480
    maxHeadersLength = 40960
  }

But we are still facing the same issue on collector version 3.2.0.

Hi @peel & @pondzix any update on the above issue?