Collector GET request 400 failure caused by long querystring?

Hey,

We have a Snowplow (community) pipeline in production. It runs on AWS and is heavily based on the quickstart (the secure variant).

To be able to backfill/replay failed requests (in case of an incident), we put the collector ALB behind a CloudFront distribution. To be able to use the CloudFront logs, the client trackers send requests as GET (because CloudFront doesn’t log POST request bodies). So, the event data ends up appended to the URL as a querystring.

Almost everything works with this setup. But since yesterday, some requests have been failing with status 400. At first I thought it might be a CloudFront limit, but sending the same request directly to the ALB DNS gave the same result. Finally, today I ran snowplow/scala-stream-collector-stdout via Docker, and sending the request to localhost failed in the same way.
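For anyone who wants to reproduce this locally, here is a minimal sketch (assuming the collector is listening on localhost:8080 and using the standard `/i` pixel endpoint; the `dummy` parameter is just padding to reach the failing querystring length):

```python
import urllib.request
import urllib.error

# Build a querystring of roughly the same length as the failing request
# (~4150 characters). Real Snowplow GET requests carry tracker protocol
# fields (e=pv, ue_px=..., etc.) instead of a padding parameter.
base = "http://localhost:8080/i"
qs = "e=pv&dummy=" + "x" * 4150
url = f"{base}?{qs}"

try:
    with urllib.request.urlopen(url) as resp:
        print(resp.status)  # 200 while the querystring is under the limit
except urllib.error.HTTPError as err:
    print(err.code)  # 400 once the querystring exceeds the limit
```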

When I send the request to Snowplow Micro, it doesn’t fail, probably because Micro is quite different from the scala-stream-collector.

So, AWS aside, it is evidently the scala-stream-collector itself that rejects my request.

The problematic request’s querystring length is around 4150 characters.

My question is: are there any known limits in the scala-stream-collector (or http4s in general) on GET request size? If so, is there a way/config to work around it?

Hi @akocukcu, generally speaking we would only recommend GET requests when you cannot use POST; we would always lean towards POST requests. There are a number of benefits:

  1. More reliable encoding (no need to serialize complex structures in a URL-safe way)
  2. POST requests allow for batching (fewer TCP connections from your clients to the Collector, which is much more efficient for I/O; see the sketch below)

In short, I would not recommend using only GET requests in production!
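To illustrate the batching benefit from point 2, here is a rough sketch of sending several events in one POST (assuming the standard tracker protocol POST endpoint `/com.snowplowanalytics.snowplow/tp2` and the `payload_data` wrapper schema; the field values are made up, and in practice the official trackers handle all of this for you):

```python
import json
import urllib.request

# Several events batched into a single POST body, wrapped in the
# payload_data self-describing JSON envelope the Collector expects.
# Per the tracker protocol, all field values are strings.
events = [
    {"e": "pv", "url": "https://example.com/a", "p": "web", "tv": "demo-0.1"},
    {"e": "pv", "url": "https://example.com/b", "p": "web", "tv": "demo-0.1"},
]
body = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
    "data": events,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8080/com.snowplowanalytics.snowplow/tp2",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # one TCP exchange for the whole batch
```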

However, to answer your question: this is currently a hard-coded setting in the Collector. The limit is 4096 characters, which is why your ~4150-character querystring is rejected with a 400.

In the upcoming release v3.2.1 we are increasing the default and making this configurable as well - you can follow the PR here: Release/3.2.1 by peel · Pull Request #430 · snowplow/stream-collector · GitHub


So, summing up: this limit will be increased and made configurable in a future release. The recommendation, however, is to use POST requests instead, as they bring extra benefits.
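If you do need to stay on GET for now, one workaround sketch is to check the encoded querystring length client-side and fall back to POST for oversized events. This is a hypothetical helper for illustration, not part of any Snowplow SDK, and the exact accounting of the 4096-character limit may differ slightly from a plain length check:

```python
import json
import urllib.parse
import urllib.request

COLLECTOR = "http://localhost:8080"  # assumption: adjust to your collector
GET_LIMIT = 4096  # the hard-coded querystring limit discussed above

def send_event(params: dict) -> int:
    """Send one event via GET, falling back to POST when the URL is too long.

    Hypothetical helper for illustration; the official trackers implement
    their own logic for this. Per the tracker protocol, all values in
    `params` should be strings.
    """
    qs = urllib.parse.urlencode(params)
    if len(qs) <= GET_LIMIT:
        req = urllib.request.Request(f"{COLLECTOR}/i?{qs}")
    else:
        body = json.dumps({
            "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
            "data": [params],
        }).encode("utf-8")
        req = urllib.request.Request(
            f"{COLLECTOR}/com.snowplowanalytics.snowplow/tp2",
            data=body,
            headers={"Content-Type": "application/json"},
        )
    with urllib.request.urlopen(req) as resp:
        return resp.status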

Thanks for the detailed answer, @josh.

We wanted to use the GET method with CloudFront to get serverless logging. Another reason was sane limits: CloudFront rejects any request whose URL exceeds a certain size (the maximum URL length is 8,192 bytes), which automatically protects the application and server.

In that regard, may I ask if there are any limits inside the Snowplow Collector for POST requests?

From what I can see, there is a 1 MB limit in the Collector via collector.streams.{good,bad}.maxBytes.
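For reference, here is a sketch of where that setting lives in the Collector’s HOCON config (key names as in the 3.x series; the stream names are placeholders, so check the reference config shipped with your version):

```hocon
collector {
  streams {
    good {
      name = "raw-good-stream"   # placeholder stream name
      # Maximum serialized record size in bytes accepted by the sink;
      # Kinesis caps a single record at 1 MB, hence the default.
      maxBytes = 1000000
    }
    bad {
      name = "raw-bad-stream"    # placeholder stream name
      maxBytes = 1000000
    }
  }
}
```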

@josh, just to be 100% sure, I want to ask about this part. Doesn’t the latest release, v3.2.0, already allow us to configure those limits?

If so, will using app_version = 3.2.0 allow us to change the limit through the config of collector-kinesis-ec2?

Technically yes, but that release introduced a couple of bugs around overly strict timeouts, which will be fixed in the upcoming one. I would suggest waiting for v3.2.1 to be safe on that front.

In that regard, may I ask if there are any limits inside the Snowplow Collector for POST requests?
From what I can see, there is a 1 MB limit in the Collector via collector.streams.{good,bad}.maxBytes.

Yep that is the setting exactly.
