Clojure Collector bottleneck

T_P · September 19, 2017, 11:49am

Hi guys,

We’re using Amazon Elastic Beanstalk with a Clojure-based collector.
We used to have an M1.small instance but since the number of tracking events have been growing it was replaced with an M4.large EC2 instance.
This current instance has 450 Mbps bandwidth and also Enhanced Networking by default, also has 8GB RAM and 2 vCPUs.

The action taken by the collector seems relatively simple: log the event, in case of click action, redirect to the indicated URL, in case of open action, serve a pixel.

The problem? We’re facing a huge bottleneck.

We’ve compiled some benchmarks that can illustrate the huge TTFB (Time To First Byte) when clicking on a tracking URL:

M1.small - 58.02ms (2.2 req/sec)
M4.large - 1.05s (9.2 req/sec)
M1.small - 56.13ms (5.4 req/sec)
M4.large - 7.60s (17.0 req/sec)
M1.small - 88.27ms (3.2 req/sec)
M4.large - 7.77s (13.7 req/sec)
M4.large - 110.33ms (5.5 req/sec)

It is impractical that you click on a link and have to wait 7 seconds until something finally happens!

Can anyone helps us by-pass this situation?

TIA,

T_P · September 19, 2017, 5:09pm

If that helps, I’ve attached the monitoring for the past 8 hours:

At first sight everything looks normal…
Those spikes on Network Out occur every hour when the logs are deployed to S3.

Also some info regarding response rates from /status:

   "ring.responses.rate":{
      "type":"meter",
      "rates":{
         "1":4.609870456257155,
         "5":4.770831762632824,
         "15":4.8355326453594305
      }
   },
   "ring.responses.rate.2xx":{
      "type":"meter",
      "rates":{
         "1":3.539407799370498,
         "5":3.7155430291082774,
         "15":3.7829888548645725
      }
   },
   "ring.responses.rate.3xx":{
      "type":"meter",
      "rates":{
         "1":1.0256593339035567,
         "5":1.0292773068203749,
         "15":1.018990987053793
      }
   }

josh · September 20, 2017, 8:28am

Hi @T_P,

A few follow-up questions to see if we can get to the bottom of this:

What Trackers are you using?
Are you using GET or POST requests? If POST how many are being bundled?
With the m4.large server what type of EBS volume have you attached? Standard, gp2, io1?
Is the latency something you are seeing in the Elastic Beanstalk UI or are these time measurements at your host? Does sending the request via cURL at the CLI result in the same roundtrip time?

T_P · September 20, 2017, 9:29am

Hi @josh

We’re using Pixel Tracker, currently tracking pixels and clicks (example below)
GET
100GB gp2
Benchmarks were collected via HTTP (location Portugal, EBS AZ Ireland) the response rate were collected at EBS using /status. Either via cURL or HTTP the results has very similar TTFB.

Request example:
trck.domain.com/r/tp2?e=ue&ue_pr={“schema”%3A"iglu%3Acom.snowplowanalytics.snowplow%2Funstruct_event%2Fjsonschema%2F1-0-0"%2C"data"%3A{“schema”%3A"iglu%3Acom.XYZ%2Fclick%2Fjsonschema%2F1-0-4"%2C"data"%3A{“cid”%3A"8633"%2C"eid"%3A"31238"%2C"uid"%3A"12345"%2C"geo"%3A"PT"}}}&tv=custom&p=web&u=https%3A%2F%2Fwww.XYZ.com%2Fhome%2C24682%2CNL%2CNL%2C54219.html

josh · September 20, 2017, 11:29am

Hi @T_P,

Thanks for that … all of that looks fine but there are a few other things we can have a look at.

Could you share your collector endpoint so that I can test it as well for TTFB (just to remove any localised latency issues)
Could you share the exact configuration for both the m1.small and m4.large environments:

RAM allocated, instance count etc

Are you running the Collectors in a Private Subnet behind a NAT Gateway / Instance?
How many requests per second is the collector seeing at the Load Balancer currently and what is the average reported Latency for the Load Balancer?

T_P · September 20, 2017, 11:54am

Thank you for your help, @josh

Take a look: trck.eu-west-1.elasticbeanstalk.com. Currently with ±5 req/sec the latency isn’t fully experienced since we opted to disable the tracking of most events.
At the moment only m4.large is used that consists in 1 instance with 8GB RAM, 2 vCPUs, dedicated 450 Mbps bandwidth.
No NAT Gateway, but using a VPC with a subnet.
Not using a Load Balancer at the moment.

Instance metrics from the past 24 hours:

Volume metrics from the same 24 hours period:

For reference, at 19/09 16h00 we got something like TTFB 6.60s (8.5 req/sec)

T_P · September 25, 2017, 9:04am

Hi @josh, any clue with the given metrics?

josh · September 25, 2017, 11:16am

Hi @T_P,

Sending requests to the endpoint you have supplied above resolves in just a few milliseconds so not sure where the issue might be.

Is the traffic pattern from the pixel tracker very spiky? Are you sending sudden influxes of data to the collector? If you do manage to overwhelm the server then response times can rise quite sharply.

At the moment only m4.large is used that consists in 1 instance with 8GB RAM, 2 vCPUs, dedicated 450 Mbps bandwidth.

I was asking more for the Elastic Beanstalk configuration. How much ram have you allocated to the Clojure Collector server itself?

Topic		Replies	Views
Handful of 5XX responses from Clojure Collector deployed in elastic beanstalk Collectors	3	1610	November 10, 2016
POST request to Clojure collector from JS tracker is timed out Collectors	6	2046	August 8, 2018
Troubleshooting Clojure Collector instances to prevent data loss Troubleshooting	0	1628	May 2, 2016
Clojure collector outside ELB Collectors	3	1255	July 11, 2018
Logs when deploying Clojure Collector as a Docker container in Beanstalk Collectors	2	1479	December 13, 2016

Clojure Collector bottleneck

At the moment only m4.large is used that consists in 1 instance with 8GB RAM, 2 vCPUs, dedicated 450 Mbps bandwidth.

Related topics