Collector GET request 400 failure caused by long querystring?

Hey,

We have a Snowplow (community) pipeline in production. It runs on AWS and is heavily based on the quickstart (the secure variant).

To be able to backfill/replay failed requests (in case of an incident), we put the collector ALB behind a CloudFront distribution. To be able to use the CloudFront logs, the client trackers send requests as GET (because CloudFront doesn’t log POST request bodies). So, the event data ends up appended to the URL as a querystring.

Almost everything works with this setup. But since yesterday, some requests have been failing with status 400. At first I thought it might be a CloudFront limit, but sending the same request directly to the ALB DNS gave the same result. Finally, today I ran snowplow/scala-stream-collector-stdout via Docker, and sending the request to localhost failed in the same way.
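For anyone who wants to reproduce this locally, here is a minimal sketch (assuming the collector is listening on localhost:8080 and using the standard `/i` pixel endpoint; the `dummy` parameter is just padding to reach the failing querystring length):

```python
import urllib.request
import urllib.error

# Build a querystring of roughly the same length as the failing request
# (~4150 characters). Real Snowplow GET requests carry tracker protocol
# fields (e=pv, ue_px=..., etc.) instead of a padding parameter.
base = "http://localhost:8080/i"
qs = "e=pv&dummy=" + "x" * 4150
url = f"{base}?{qs}"

try:
    with urllib.request.urlopen(url) as resp:
        print(resp.status)  # 200 while the querystring is under the limit
except urllib.error.HTTPError as err:
    print(err.code)  # 400 once the querystring exceeds the limit
```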

When I send the request to Snowplow Micro, it doesn’t fail, probably because Micro is quite different from the scala-stream-collector.

So, AWS aside, it is evidently the scala-stream-collector itself that rejects my request.

The problematic request’s querystring length is around 4150 characters.

My question is: are there any known limits in the scala-stream-collector (or http4s in general) on GET request size? If so, is there a way/config to work around it?

Hi @akocukcu, generally speaking we would only recommend GET requests when you cannot use POST; we would always lean towards POST requests. There are a number of benefits:

  1. More reliable encoding (no need to serialize complex structures in a URL-safe way)
  2. POST requests allow for batching (fewer TCP connections from your clients to the Collector, which is much more efficient for I/O; see the sketch below)

In short, I would not recommend using only GET requests in production!
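To illustrate the batching benefit from point 2, here is a rough sketch of sending several events in one POST (assuming the standard tracker protocol POST endpoint `/com.snowplowanalytics.snowplow/tp2` and the `payload_data` wrapper schema; the field values are made up, and in practice the official trackers handle all of this for you):

```python
import json
import urllib.request

# Several events batched into a single POST body, wrapped in the
# payload_data self-describing JSON envelope the Collector expects.
# Per the tracker protocol, all field values are strings.
events = [
    {"e": "pv", "url": "https://example.com/a", "p": "web", "tv": "demo-0.1"},
    {"e": "pv", "url": "https://example.com/b", "p": "web", "tv": "demo-0.1"},
]
body = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
    "data": events,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8080/com.snowplowanalytics.snowplow/tp2",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # one TCP exchange for the whole batch
```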

However, to answer your question: this is currently a hard-coded setting in the Collector. The limit is 4096 characters, which is why your ~4150-character querystring is rejected with a 400.

In the upcoming release v3.2.1 we are increasing the default and making this configurable as well - you can follow the PR here: Release/3.2.1 by peel · Pull Request #430 · snowplow/stream-collector · GitHub


So, summing up: this limit will be increased and made configurable in a future release. The recommendation, however, is to use POST requests instead, as they bring extra benefits.
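If you do need to stay on GET for now, one workaround sketch is to check the encoded querystring length client-side and fall back to POST for oversized events. This is a hypothetical helper for illustration, not part of any Snowplow SDK, and the exact accounting of the 4096-character limit may differ slightly from a plain length check:

```python
import json
import urllib.parse
import urllib.request

COLLECTOR = "http://localhost:8080"  # assumption: adjust to your collector
GET_LIMIT = 4096  # the hard-coded querystring limit discussed above

def send_event(params: dict) -> int:
    """Send one event via GET, falling back to POST when the URL is too long.

    Hypothetical helper for illustration; the official trackers implement
    their own logic for this. Per the tracker protocol, all values in
    `params` should be strings.
    """
    qs = urllib.parse.urlencode(params)
    if len(qs) <= GET_LIMIT:
        req = urllib.request.Request(f"{COLLECTOR}/i?{qs}")
    else:
        body = json.dumps({
            "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
            "data": [params],
        }).encode("utf-8")
        req = urllib.request.Request(
            f"{COLLECTOR}/com.snowplowanalytics.snowplow/tp2",
            data=body,
            headers={"Content-Type": "application/json"},
        )
    with urllib.request.urlopen(req) as resp:
        return resp.status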

Thanks for the detailed answer, @josh.

We wanted to use the GET method with CloudFront to get serverless logging. Another reason was sane limits: CloudFront rejects any request whose URL exceeds a certain size (the maximum URL length is 8,192 bytes), which automatically protects the application and server.

In that regard, may I ask if there are any limits inside the Snowplow Collector for POST requests?

From what I can see, there is a 1 MB limit in the Collector via collector.streams.{good,bad}.maxBytes.
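For reference, here is a sketch of where that setting lives in the Collector’s HOCON config (key names as in the 3.x series; the stream names are placeholders, so check the reference config shipped with your version):

```hocon
collector {
  streams {
    good {
      name = "raw-good-stream"   # placeholder stream name
      # Maximum serialized record size in bytes accepted by the sink;
      # Kinesis caps a single record at 1 MB, hence the default.
      maxBytes = 1000000
    }
    bad {
      name = "raw-bad-stream"    # placeholder stream name
      maxBytes = 1000000
    }
  }
}
```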

@josh, just to be 100% sure, I want to ask about this part. Doesn’t the latest release, v3.2.0, already allow us to configure those limits?

If so, will using app_version = 3.2.0 allow us to change the limit through the config of collector-kinesis-ec2?

Technically yes, but that release introduced a couple of bugs around overly strict timeouts, which will be fixed in the upcoming one. I would suggest waiting for v3.2.1 to be safe on that front.

In that regard, may I ask if there are any limits inside the Snowplow Collector for POST requests?
From what I can see, there is a 1 MB limit in the Collector via collector.streams.{good,bad}.maxBytes.

Yep that is the setting exactly.
