Howdy folks. I’ve been attempting to troubleshoot intermittent ELB_5XX errors (affecting less than one percent of traffic) on the ELB that sits in front of our Scala collector ASG. Presently we have two collectors (c4.large), one in each AZ, and the ELB performs cross-AZ load balancing. The collector consumes events produced by the JavaScript tracker. We are not processing what I would consider large volumes of requests: the maximum we’ve seen is around 6,000 requests per hour, and busy periods typically hover around 2,500 requests per hour. Regardless of traffic levels, we tend to observe about one ELB_5XX per hour on average; there isn’t a regular point in time at which the error occurs.
We are noticing on our backend that we are occasionally missing Snowplow data that our own tracking says should be present, and I suspect these ELB_5XX errors are the source. As an example, our own conversion data might indicate that a given interaction was associated with some domain_sessionid or userid value, yet atomic.events is either empty for those IDs or at least missing our custom Snowplow conversion event (typically the last event fired in the funnel).
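For context, this is the kind of spot check I’m running against atomic.events (a minimal sketch assuming Redshift; the connection details and the session ID are placeholders for our setup):

```python
# Spot check: does atomic.events contain the events our own conversion data
# says should exist for a given session? Host, dbname, user, and the session
# id below are placeholders.
import os
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="redshift-cluster.example.us-west-2.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="snowplow",  # placeholder
    user="analyst",  # placeholder
    password=os.environ.get("REDSHIFT_PASSWORD"),
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT event_name, collector_tstamp
        FROM atomic.events
        WHERE domain_sessionid = %s
        ORDER BY collector_tstamp
        """,
        ("9b28ef7c-0000-0000-0000-000000000000",),  # placeholder session id
    )
    # typical result: some of the session's events are present, but the final
    # conversion event (last in the funnel) is missing entirely
    for event_name, tstamp in cur.fetchall():
        print(event_name, tstamp)
```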
After enabling ELB access logs to investigate further, I found that every 5XX is actually a 504, and the client interaction typically looks like a couple of requests with a 504 mixed in somewhere:
2017-03-28T15:30:46.477991Z snowplow-collector-elb 24.93.XXX.XXX:53085 172.31.XXX.XXX:8080 0.000045 0.001724 0.000039 200 200 0 0 "OPTIONS http://snowplow-collector-elb-XXX.us-west-2.elb.amazonaws.com:80/com.snowplowanalytics.snowplow/tp2 HTTP/1.1" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB7.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)" - -
2017-03-28T15:31:00.717323Z snowplow-collector-elb 24.93.XXX.XXX:53085 - -1 -1 -1 504 0 0 0 "POST http://snowplow-collector-elb-XXX.us-west-2.elb.amazonaws.com:80/com.snowplowanalytics.snowplow/tp2 HTTP/1.1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" - -
Sometimes, there will be a few successful POSTs, but one fails. Sometimes it’s the last POST from the IP that fails. Sometimes, like above, it’s only one POST, and it fails.
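In case it helps anyone reproduce this, here’s roughly how I’m pulling the 504 rows out of the access logs (a minimal sketch; the log directory is a placeholder and I’m assuming the classic ELB access log field order):

```python
# Filter classic ELB access logs down to the 504 rows.
# Field order assumed: timestamp, elb, client:port, backend:port,
# request/backend/response processing times, elb_status_code,
# backend_status_code, received_bytes, sent_bytes, "request", "user_agent", ...
import glob
import shlex

for path in glob.glob("elb-logs/*.log"):  # placeholder directory
    with open(path) as f:
        for line in f:
            fields = shlex.split(line)  # respects the quoted request/user-agent
            if len(fields) < 13:
                continue
            timestamp, client = fields[0], fields[2]
            elb_status, backend_status = fields[7], fields[8]
            if elb_status == "504":
                # backend_status is 0 here: the ELB never got a response back
                print(timestamp, client, elb_status, backend_status, fields[11])
```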
Correlating the ELB_5XX metric with the ELB latency metric, each 504 corresponds to a spike in the maximum latency that hits the ELB’s idle timeout (currently 45 seconds), which would typically indicate a problem with the backend application (to me, anyway).
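A minimal sketch of that correlation, assuming the classic ELB CloudWatch namespace and metric names (the load balancer name is a placeholder):

```python
# Pull HTTPCode_ELB_5XX and max Latency at one-minute resolution and line
# them up: every minute with a 5XX shows max latency right at the idle timeout.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-west-2")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)
dims = [{"Name": "LoadBalancerName", "Value": "snowplow-collector-elb"}]  # placeholder

def series(metric, stat):
    resp = cw.get_metric_statistics(
        Namespace="AWS/ELB", MetricName=metric, Dimensions=dims,
        StartTime=start, EndTime=end, Period=60, Statistics=[stat])
    return {p["Timestamp"]: p[stat] for p in resp["Datapoints"]}

errors = series("HTTPCode_ELB_5XX", "Sum")
latency = series("Latency", "Maximum")
for ts in sorted(errors):
    print(ts, errors[ts], latency.get(ts))
```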
===========================================================
Amazon’s own documentation mentions two possibilities for these 504s:
- Cause 1: The application takes longer to respond than the configured idle timeout.
- Cause 2: Registered instances closing the connection to Elastic Load Balancing.
I have spent a fair amount of time attempting to find an explanation for these 504s, and the latency spikes paired with the ELB_5XX errors led me to believe that Cause 1 must be the issue.
I first…
- correlated ELB, Kinesis, and EC2 logs in CloudWatch under the hypothesis that perhaps we’re exceeding throughput on the Kinesis stream and the collector is stalling until it can write to the stream again (a sketch of that check is below, after this list). Dynamic scaling of the shards is something I’ve put off given our low volume. No correlations, though.
- correlated the 504s with the stdout and stderr of the collector (currently with log level at INFO). No indication of errors.
- correlated the 504s with the instance itself. Maybe the app was getting bounced? Some other resource constraint on the instance? Negative.
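Here’s the Kinesis side of that correlation (a sketch; the stream name is a placeholder for ours). If the collector were stalling on writes, I’d expect WriteProvisionedThroughputExceeded to be non-zero around the 504 timestamps:

```python
# Check whether the Kinesis stream ever rejected writes for throughput reasons
# in the window where the 504s occurred.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-west-2")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

resp = cw.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="WriteProvisionedThroughputExceeded",
    Dimensions=[{"Name": "StreamName", "Value": "snowplow-collector-good"}],  # placeholder
    StartTime=start, EndTime=end, Period=300, Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    if point["Sum"] > 0:
        print(point["Timestamp"], point["Sum"])
```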
I also looked into Cause 2. The default timeouts for spray-can and the ELB were initially the same in our deployment (60 seconds), and AWS documentation indicated that the ELB should be the one responsible for closing connections to the backend, so I set the ELB idle timeout to 45 seconds. Maybe we were hitting edge cases where the collector app tore down the connection while the ELB thought it was still valid? Granted, that wouldn’t be paired with a latency spike (it would be an immediate rejection from the backend that got passed to the client, I’d think), but still… Alas, no. Still receiving 504s.
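For reference, the timeout change itself was just this (sketch for a classic ELB; the load balancer name is a placeholder):

```python
# Drop the ELB idle timeout below spray-can's 60-second default so the ELB
# tears down idle connections before the backend does.
import boto3

elb = boto3.client("elb", region_name="us-west-2")
elb.modify_load_balancer_attributes(
    LoadBalancerName="snowplow-collector-elb",  # placeholder
    LoadBalancerAttributes={"ConnectionSettings": {"IdleTimeout": 45}},
)
```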
Then I decided to dive deeper: I enabled flow logs on the relevant ENIs for the collectors and the ELB, and performed a full tcpdump on each collector instance’s eth0 interface (minus SSH traffic). The flow logs proved not to be particularly helpful, since the ENIs associated with the load balancer in each AZ appear to be its outside interfaces, where the ELB access logs already provide more information. As an aside, the flow log rows seem too coarse to help troubleshoot this particular issue anyway, since they aggregate and summarize IP traffic into capture windows.
===========================================================
However, the tcpdumps seem the most helpful, and they are showing me… no POST attempt at all. I’ll take a problem IP from the ELB access logs and filter the capture on its X-Forwarded-For header, and I’ll see the same traffic on the collector, minus the POST that the ELB says timed out. It’s as if the POST never makes it to the collector. I’ll look at the associated TCP stream and not see any other issues with it; sometimes other POSTs (for other client IPs) come in afterwards on that same stream, before the LB eventually FINs the stream from inactivity. Checking the other collector instance doesn’t show any X-Forwarded-For traffic for the problem IP either, so it’s not as if the request just got routed to the other collector. And checking all traffic on the other instance during the timeframe of the successful tracker-collector interactions doesn’t show anything out of order either. Everything appears nominal.
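Roughly how I’m checking the captures for the “missing” POST (a sketch using scapy; the pcap file name and client IP are placeholders, and it only looks at single-segment HTTP payloads, which has been enough for these small tp2 requests):

```python
# Scan a collector-side pcap for HTTP payloads carrying an X-Forwarded-For
# header with the problem client IP, and print the request line for each.
from scapy.all import rdpcap, Raw, TCP  # pip install scapy

CLIENT_IP = b"24.93.0.0"  # placeholder for the problem IP from the ELB logs

for pkt in rdpcap("collector-eth0.pcap"):  # placeholder capture file
    if not (pkt.haslayer(TCP) and pkt.haslayer(Raw)):
        continue
    payload = bytes(pkt[Raw].load)
    if b"X-Forwarded-For: " + CLIENT_IP in payload:
        # the first request line tells me whether it's the OPTIONS or the POST
        print(pkt.time, payload.split(b"\r\n", 1)[0])
```

For the problem IPs, this turns up the OPTIONS (and any successful POSTs) but never the POST the ELB reported as a 504.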
At this point, I checked for dropped packets on the collector instances’ interfaces. Nothing. tcpdump dropping packets from capture? Ugh, yes, over the past five days or so a few were dropped. Could I be so unlucky that every last one of those dropped packets was a “missing” POST, without any other helpful logs from my earlier correlation efforts? Doubtful. Maybe MTU fragmentation was to blame? I at least enabled ICMP, thinking that perhaps the client was attempting path MTU discovery and couldn’t, because the security groups on the ELB and collector didn’t permit ICMP traffic. No change.
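(By “checked for dropped packets on the interfaces” I mean something like this quick sketch against /proc/net/dev on a collector instance; the interface name is the default eth0.)

```python
# Read the NIC drop counters for a given interface from /proc/net/dev.
# Layout after the "ifname:" prefix is 8 RX fields then 8 TX fields;
# "drop" is the 4th field in each group.
def drops(interface="eth0"):
    with open("/proc/net/dev") as f:
        for line in f:
            if ":" not in line:
                continue  # skip the two header lines
            name, counters = line.split(":", 1)
            if name.strip() == interface:
                fields = counters.split()
                return {"rx_drop": int(fields[3]), "tx_drop": int(fields[11])}
    return None

print(drops("eth0"))  # both counters sit at 0 on our collectors
```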
===========================================================
At this point, I’m kind of at a loss as to what might be causing these ELB 504s. My best guess is some sort of network wonkiness between the ELBs’ inside interfaces and the collector interfaces. Has anyone else run into this issue before? Is there something I missed in my troubleshooting? Do I need to pay AWS for a month of the Developer-level support plan so they can take a look? If it turns out I’ve missed something, that’s one thing. If it turns out I have to pay AWS to tell them about something that’s on their end… well, that’s a more bitter pill to swallow.