I was doing a bit of testing on a website that has Snowplow installed via Stacktome when I noticed something a bit odd. There appears to be no real authentication or response validation happening at the API level: manipulating various data fields still returns a 200 OK from a basic Postman request. When I approached the Stacktome team, they suggested that false data injected this way is handled by them in the backend, perhaps via validation of certain unique keys? If that is happening, and users of your product know it is happening, it is of course a major concern. But even if we suppose such validation is occurring, it can only ever be a partial validation, because the transmitted data is not in a form that would allow a full validation of the payload’s integrity to occur…
To be clear: I am particularly wondering why there is no integrity validation on the data as a whole. The closest thing to encoding is base64, which is of course trivially decoded, and intended to be so. I really would have expected the endpoint to check whether the data passed an integrity check of some sort based on a unique identifier for the body of the request. If that were occurring I would probably have expected a 401, 403 or (questionably?) a 400 response for the POST, rather than a 200, which of course I take to mean the POST action was successful and the data has been added to the requisite database or similar.
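To illustrate the concern, here's a quick Python sketch of what I mean about base64 being trivially reversible. The field names and payload shape here are purely illustrative, not necessarily the actual wire format:

```python
import base64
import json

# Hypothetical tracker payload with a base64-encoded context field
# (illustrative names, not claimed to be the real protocol).
payload = {"e": "pv", "url": "https://example.com/"}
context = {"schema": "iglu:com.example/ctx/jsonschema/1-0-0",
           "data": {"user_tier": "free"}}
payload["cx"] = base64.b64encode(json.dumps(context).encode()).decode()

# Anyone intercepting the request can decode, alter, and re-encode it:
# base64 is an encoding, not an integrity mechanism.
tampered = json.loads(base64.b64decode(payload["cx"]))
tampered["data"]["user_tier"] = "enterprise"
payload["cx"] = base64.b64encode(json.dumps(tampered).encode()).decode()

# The altered value round-trips cleanly; nothing in the payload itself
# lets the server detect the modification.
decoded = json.loads(base64.b64decode(payload["cx"]))
print(decoded["data"]["user_tier"])  # prints "enterprise"
```

Nothing in the request would let the receiving end tell this apart from a payload the tracker genuinely emitted.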
Regarding contribution guidelines for your project, these seem to be missing from the repo for your product. A representative of Stacktome suggested, however, that significant contributors might receive compensation of some sort, which would of course make the idea of contributing to a not particularly high-profile repository a lot more appealing.
Hi Peter. Thanks for your interest in the Snowplow API design!
On Collector responses: the Collector performs a single responsibility - collecting the event - at huge volumes. If the Collector had no problem receiving the payload it returns a 200. All validation (including compliance with schemas) happens after this point, most of it in downstream services, which kick failing events out of the pipeline and into a holding location for what we call “bad rows” for further analysis and/or recovery. If the client had to wait for all possible validations to happen before load, it could take seconds, if not minutes. Data is emitted from throwaway client environments, so the idea of validation in the API is rendered meaningless - there is nothing the client could do with the information that the data it sent was incorrectly formed.
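As a rough sketch of that separation of responsibilities (hypothetical code for illustration, not the actual Collector implementation):

```python
from queue import Queue

# Stands in for the collector's durable sink (in practice a stream or log).
raw_events = Queue()

def collect(payload: bytes) -> int:
    """Sketch of the collector's single responsibility: if the payload
    arrived intact, persist it and acknowledge with a 200. No schema or
    content validation happens at this stage - that is downstream work."""
    raw_events.put(payload)
    return 200

# Even a payload full of nonsense gets a 200, because "received intact"
# is the only thing the collector is asserting.
status = collect(b'{"e": "pv", "page": "/home", "tier": "totally-made-up"}')
print(status, raw_events.qsize())  # prints "200 1"
```

The 200 means "I have your bytes", nothing more; validity is decided later, where a rejection can actually be acted on.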
On authentication in web analytics: data collection is from untrusted client environments, so no authentication as you’d expect in a traditional API is possible.
Sounds like you’re passing in bad data through the fields in the request and the server is responding 200.
The “collector” server will log these events for processing during “enrichment”, but they will fail validation there. Failed events are stored in “bad rows”, giving you the opportunity to reprocess/recover the erroneous data later, just in case. Validation in the backend is typical of web analytics tools - you’ll find Google Analytics et al. handle it similarly, because they’re designed to make it easy to collect huge volumes of data from anywhere.
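A toy illustration of that enrichment flow (the required fields and structures here are invented for the example, not the real schemas):

```python
import json

good_rows, bad_rows = [], []

# Illustrative required fields - not the actual event schema.
REQUIRED = {"e", "page"}

def enrich(raw: bytes) -> None:
    """Sketch of downstream validation: events failing checks are not
    silently dropped but routed to a "bad rows" store so they can be
    inspected and potentially recovered later."""
    try:
        event = json.loads(raw)
        if not REQUIRED <= event.keys():
            raise ValueError("missing required fields")
        good_rows.append(event)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        bad_rows.append({"raw": raw, "reason": "failed validation"})

enrich(b'{"e": "pv", "page": "/home"}')  # passes validation
enrich(b'{"e": "pv"}')                   # fails: missing "page"
enrich(b'not json at all')               # fails: unparseable
print(len(good_rows), len(bad_rows))     # prints "1 2"
```

So the 200 from the collector and the eventual fate of the event are two separate questions: the bad data you injected most likely ended up in a bad-rows bucket rather than the warehouse.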
Thanks for these replies - it’s interesting to know. From a cursory glance at the documentation I did wonder if there might be something happening during the enrichment process, but it still seems to me that it would be possible to craft requests en masse that seem genuine but nevertheless contain incorrect information, i.e. not a malformed request but one crafted to appear genuine.
I understand your point about the browser being essentially untrusted from the get-go, but I did have the vague notion that additional information could be included to facilitate extra checks without increasing payload size too much…? Perhaps you have something which performs a similar function already, but I didn’t initially find anything in the specs which seemed to be designed to carry validation information for the object as a whole.
I’ll maybe have a delve into the details of the code and/or run some tests on a mock production system to see if I can get altered packets through and past any backend validation. If that proves possible in practice, I might then see if there’s any way to further account for spoofing without major increases in payload size, processing time etc.
Even if I don’t find anything it’ll be a useful and interesting learning exercise.
Again thanks for taking the time to read and respond. Much appreciated.
This is quite difficult to do reliably unless events are being sent server side, in which case they can be considered to come from a trusted environment. Any hashing/signing that happens on the client can be reasonably easily reproduced, so you’ll find that most web-based analytics tools (GA, Adobe, A/B testing, display advertising, retargeting) don’t attempt to sign requests. The latest state-of-the-art approaches tend to use machine learning on the data to determine whether a request/event is fraudulent or untrusted, based on a variety of signals.
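To see why client-side signing doesn’t help, consider a hypothetical HMAC scheme (invented purely for illustration): the signing key has to ship to the browser, so anyone who loads the page can produce signatures the server will accept.

```python
import hashlib
import hmac

# Hypothetical: the key is embedded in the JavaScript bundle, and is
# therefore visible to every visitor, including an attacker.
SHIPPED_KEY = b"key-embedded-in-client-js"

def sign(payload: bytes, key: bytes) -> str:
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str, key: bytes) -> bool:
    return hmac.compare_digest(sign(payload, key), signature)

legit = sign(b'{"plan": "free"}', SHIPPED_KEY)
# The attacker signs fabricated data with the very same exposed key.
forged = sign(b'{"plan": "enterprise"}', SHIPPED_KEY)

# The server-side check cannot tell the forged request from a real one.
print(verify(b'{"plan": "enterprise"}', forged, SHIPPED_KEY))  # prints "True"
```

The signature proves only that the sender had the key, and in a browser everyone has the key - which is why fraud detection moves to statistical signals on the server instead.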
Hmmmmmmm, I take it Snowplow doesn’t use machine learning yet? I’ve been looking to get into TensorFlow etc. for a while, though some of that stuff is pretty hardcore. I’m wondering: if the big-money client demand is generally for machine learning, that might be worth adding support for, but I also notice that we’ve got AWS support but no Azure. A bit off topic, but I know a lot of people are moving to Azure these days… Comparatively, if I were to look to contribute to this repo in the next few months, what would be the bigger win: improved security, perhaps via machine learning or some other technique, or Azure support? I believe Azure also does machine learning out of the box, though I’m not sure what the implications would be for integration with this service.