Snowbridge HTTP Retry failure_target problems

Hi,

We have set up Snowbridge to send HTTP requests to server-side GTM. We are seeing failures for a very small proportion of traffic, and that proportion stays fairly constant.

level=info msg="TargetResults:521,MsgFiltered:493,MsgSent:27,MsgFailed:1,OversizedTargetResults:0,OversizedMsgSent:0,OversizedMsgFailed:0,InvalidTargetResults:0,InvalidMsgSent:0,InvalidMsgFailed:0,MaxProcLatency:5005,MaxMsgLatency:5115,MaxFilterLatency:19,MaxTransformLatency:1,SumTransformLatency:12,SumProcLatency:5492,SumMsgLatency:9831,MinReqLatency:0,MaxReqLatency:5005,SumReqLatency:5478" name=Observer
level=warning msg="Retrying func (attempts: 5): target.Write: Error sending http requests: 1 error occurred:\n\t* Post \"https://example.domain.com/com.snowplowanalytics.snowplow/enriched\": EOF\n\n"

I tried setting the failure_target to stdout so I could inspect the failing events in the logs and look for commonalities, but nothing is output for them. The log level is set to debug, and I can see logs for the non-failed events.

I have since tried another approach: sending failed events to an SQS queue, since we will likely need to diagnose or reprocess them. Snowbridge starts successfully with this config. However, I still see the errors in the logs and nothing arrives in SQS.

config.hcl
source {
  use "kinesis" {
    stream_name       = "${env("SB_ENV")}-snowplow-analytics-enriched-good"
    region            = "${env("AWS_REGION")}"
    app_name          = "${env("SB_APPNAME")}"
  }
}

transform {
  use "spGtmssPreview" {}
}

transform {
  use "spEnrichedToJson" {}
}

target {
  use "http" {
    url                        = "https://example.domain.com/com.snowplowanalytics.snowplow/enriched"
    request_timeout_in_seconds = 10
    content_type               = "application/json"
    dynamic_headers            = true
  }
}

failure_target {
  use "sqs" {
    # SQS queue name
    queue_name = "${env("SB_ENV")}-snowbridge-failevent"

    # AWS region of SQS queue
    region     = "${env("AWS_REGION")}"
  }
}

log_level = "${env("LOG_LEVEL")}"

Do you have any thoughts on what the issue could be?

Thanks,
Rob

The Istio proxy sidecar on the destination component is reporting 503 responses with the UC (upstream connection termination) response flag. This suggests the connection is being terminated prematurely by Google Tag Manager.

This occurs after 5 seconds. I think that matches the default Node.js timeout, and Node.js is what GTM is written in.

In theory, though, these events should be sent to the failure_target, and that doesn’t appear to be happening. As a result, there could be data loss here, right?

Hey @Rob_Ellison, I think I can help explain!

At present, any failure response from the http target is treated as retryable, and gets retried eventually (we’re working on changes to improve this).

MsgFailed denotes these. It’s definitely a misleading name for the metric (we’re not working on this yet, but I’m very keen to simplify the metrics too, and it is on the list). So anything reported as MsgFailed is retried and doesn’t go to the failure target.

In the metrics, Invalid and Oversized are what go to the failure target.
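
To make that distinction concrete, here is a minimal Python sketch that parses an Observer line like the one quoted at the top of the thread and separates what was retried from what would have gone to the failure target. The function names parse_observer and summarize are hypothetical, and the sketch assumes that InvalidMsgSent and OversizedMsgSent count messages written to the failure target, which the thread does not explicitly confirm.

```python
import re

# Counter portion of the Observer line quoted earlier in the thread.
OBSERVER_LINE = (
    "TargetResults:521,MsgFiltered:493,MsgSent:27,MsgFailed:1,"
    "OversizedTargetResults:0,OversizedMsgSent:0,OversizedMsgFailed:0,"
    "InvalidTargetResults:0,InvalidMsgSent:0,InvalidMsgFailed:0"
)


def parse_observer(line):
    """Turn 'Name:123,Other:4' pairs into a dict of integer counters."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", line)}


def summarize(metrics):
    """Split the counters into 'retried' and 'sent to the failure target'.

    Assumption: InvalidMsgSent and OversizedMsgSent are the counters for
    messages written to the failure target; MsgFailed counts writes that
    failed and are retried instead.
    """
    return {
        "retried": metrics.get("MsgFailed", 0),
        "to_failure_target": metrics.get("InvalidMsgSent", 0)
        + metrics.get("OversizedMsgSent", 0),
    }


if __name__ == "__main__":
    print(summarize(parse_observer(OBSERVER_LINE)))
```

For the sample line this reports one retried message and nothing sent to the failure target, which is consistent with an empty SQS queue.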

It’s hard to tell what caused the EOF, but the timeout is configurable on the Snowbridge side. If you set a shorter timeout for the request, you’ll get a failure sooner, but the error will just be "context deadline exceeded" instead. I expect that if your explanation for the EOF error is correct, you’ll see lots of those in the logs too.

I hope this helps!

Thanks @Colm ,

We have managed to resolve the issue now. The problem was between the Istio gateway and the GTM component: we believe the connection was being terminated prematurely due to an idle timeout.

The odd thing, looking at the logs though, is that we were only seeing the 5th attempt and not attempts 1-4.

We also saw no invalid messages sent. This should explain why there are no messages in the queue.