Snowbridge HTTP Retry failure_target problems

Rob_Ellison · October 7, 2024, 6:52pm

Hi,

We have set up Snowbridge to send HTTP requests to ServerSide GTM. We are seeing failures for a very low proportion of traffic, but that proportion remains fairly constant.

level=info msg="TargetResults:521,MsgFiltered:493,MsgSent:27,MsgFailed:1,OversizedTargetResults:0,OversizedMsgSent:0,OversizedMsgFailed:0,InvalidTargetResults:0,InvalidMsgSent:0,InvalidMsgFailed:0,MaxProcLatency:5005,MaxMsgLatency:5115,MaxFilterLatency:19,MaxTransformLatency:1,SumTransformLatency:12,SumProcLatency:5492,SumMsgLatency:9831,MinReqLatency:0,MaxReqLatency:5005,SumReqLatency:5478" name=Observer

level=warning msg="Retrying func (attempts: 5): target.Write: Error sending http requests: 1 error occurred:\n\t* Post \"https://example.domain.com/com.snowplowanalytics.snowplow/enriched\": EOF\n\n"

I tried setting the failure_target to stdout so I could inspect the failing events for commonalities, but I can’t find any log output for them. The log level is set to debug, and I can see logs for the non-failed events.

I have since tried another approach: sending failed events to an SQS queue, as we will likely need to diagnose or reprocess them. Snowbridge starts successfully with this config. However, I still see the errors in the logs and nothing arrives in SQS.

config.hcl

source {
  use "kinesis" {
    stream_name       = "${env("SB_ENV")}-snowplow-analytics-enriched-good"
    region            = "${env("AWS_REGION")}"
    app_name          = "${env("SB_APPNAME")}"
  }
}

transform {
  use "spGtmssPreview" {}
}

transform {
  use "spEnrichedToJson" {}
}

target {
  use "http" {
    url                        = "https://example.domain.com/com.snowplowanalytics.snowplow/enriched"
    request_timeout_in_seconds = 10
    content_type               = "application/json"
    dynamic_headers            = true
  }
}

failure_target {
  use "sqs" {
    # SQS queue name
    queue_name = "${env("SB_ENV")}-snowbridge-failevent"

    # AWS region of SQS queue
    region     = "${env("AWS_REGION")}"
  }
}

log_level = "${env("LOG_LEVEL")}"

Do you have any thoughts on what the issue could be?

Thanks,
Rob

Rob_Ellison · October 8, 2024, 3:46pm

The istio proxy sidecar on the destination component is reporting 503 with UC response. This suggests the connection is being prematurely terminated by GoogleTagManager.

This occurs after 5s. I think this is the default timeout in Node.js, which is what GTM is written in.

In theory, though, these events should be sent to the failure_target, and that doesn’t appear to be happening. As a result there could be a loss of data here, right?

Colm · October 10, 2024, 12:41pm

Hey @Rob_Ellison, I think I can help explain!

At present, any failure response from the http target is treated as retryable, and gets retried eventually (we’re working on changes to improve this).

MsgFailed denotes these - it’s definitely a misleading name for the metric (we’re not yet working on this but I’m very keen to simplify metrics too! It is on the list). So anything reported as MsgFailed is retried, and doesn’t go to the failure target.

In the metrics, Invalid and Oversized are what goes to the failure target.
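To make that distinction concrete, here is a minimal Python sketch (not part of Snowbridge) that parses the Observer line from the first post and separates what gets retried from what ends up in the failure target. The function names are made up for illustration, and summing InvalidMsgSent and OversizedMsgSent as "sent to the failure target" is an assumption based on the explanation above rather than on documented metric semantics.

```python
import re

# Observer log line copied verbatim from the first post in this thread.
OBSERVER_LINE = (
    'level=info msg="TargetResults:521,MsgFiltered:493,MsgSent:27,MsgFailed:1,'
    'OversizedTargetResults:0,OversizedMsgSent:0,OversizedMsgFailed:0,'
    'InvalidTargetResults:0,InvalidMsgSent:0,InvalidMsgFailed:0,'
    'MaxProcLatency:5005,MaxMsgLatency:5115,MaxFilterLatency:19,'
    'MaxTransformLatency:1,SumTransformLatency:12,SumProcLatency:5492,'
    'SumMsgLatency:9831,MinReqLatency:0,MaxReqLatency:5005,SumReqLatency:5478" '
    'name=Observer'
)


def parse_observer(line):
    """Return the Name:Value counters found in the msg="..." field as a dict of ints."""
    match = re.search(r'msg="([^"]*)"', line)
    if match is None:
        raise ValueError("no msg field found in log line")
    counters = {}
    for pair in match.group(1).split(","):
        name, _, value = pair.partition(":")
        counters[name] = int(value)
    return counters


def summarize(counters):
    """Split the counters into 'retried' and 'sent to failure target'.

    Assumption: MsgFailed counts messages that will be retried, while
    InvalidMsgSent and OversizedMsgSent count messages written to the
    failure target.
    """
    return {
        "retried": counters.get("MsgFailed", 0),
        "sent_to_failure_target": counters.get("InvalidMsgSent", 0)
        + counters.get("OversizedMsgSent", 0),
    }


if __name__ == "__main__":
    print(summarize(parse_observer(OBSERVER_LINE)))
    # prints {'retried': 1, 'sent_to_failure_target': 0}
```

Under that reading, the one MsgFailed in the log line is a retry, and nothing was written to the failure target, which would match an empty SQS queue.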

It’s hard to understand what caused the EOF - but the timeout is configurable on the Snowbridge side. If you set a shorter timeout for the request, you’ll get a failure sooner but it’ll just be context deadline exceeded. I expect that if your explanation for the EOF error is correct, you’ll see lots of those in the logs too.

I hope this helps!

Rob_Ellison · October 15, 2024, 7:29am

Thanks @Colm ,

We have managed to resolve the issue now. The problem was between the istio gateway and the GTM component: the connection was being terminated prematurely, we believe due to an idle timeout.

The odd thing, looking at the logs though, is that we were only seeing the 5th attempt and not attempts 1-4.

We also saw no invalid messages sent, which would explain why there are no messages in the queue.
