Snowplow JS Authentication

Hello everyone!

I am writing to discuss our ongoing implementation of Snowplow and seek further clarification on a specific aspect related to the authentication of front-end calls using JavaScript.

Firstly, I would like to express our appreciation for the progress we have made so far with Snowplow. It appears to be a promising solution that aligns well with our needs. While we are not yet in production, we have encountered a point that requires a more in-depth understanding.

Our concern revolves around the authentication of front-end calls made through JavaScript. We would greatly appreciate it if you could provide us with further insights into this matter. Specifically, we would like to understand how the validation of sent requests is performed and any recommended best practices for ensuring the integrity and security of these requests.

Our utmost concern is the integrity and security of our Snowplow implementation. It is crucial for us to prevent a malicious agent from inspecting the page, capturing calls, or gaining unauthorized access to the collector URL. We want to avoid situations where events are sent in random patterns, generating a significant amount of false data or even disrupting our infrastructure. We would therefore appreciate your guidance and any recommended best practices for defending against these risks.

Regards

Hi there Guilherme. This is an interesting subject, but I think we need to understand the threat model you’re dealing with in a bit more detail.

Snowplow—and other behavioural data sources—ultimately source their behavioural events from clients that we do not control. In the case of web, the code that generates the events (and the rest of the application) is available for viewing by anyone who wants to learn how it works. Anyone can work out where data is being sent.

Any approach to making this somehow “secure” is going to rely on obfuscation—making it difficult to work out what’s going on, but not impossible to a determined adversary.

One approach might be using server-side calls, which means you can be sure it was generated by your servers. But then how does the server know the client asked for something? By responding to a call from the client, which has the same flaws.

So perhaps you could explain the threats as you see them in a little more detail.

Well, on our side, these are the impacts we are considering if a malicious agent inspects the page and obtains the collector URL and the payload being sent:

  • Sending non-compliant data (false data that pollutes the database)

  • Possibility of a collector/loader crash caused by a flood of requests: how do we block such attacks and avoid large volumes of repeated requests?

For example, by inspecting this page I was able to obtain the collector URL and the payload being sent. From there, I could start sending a large number of requests, generating a mass of fake data and possibly disrupting the collector.


Hi @ggasque,

Regarding your “non-compliant data” point: in my opinion, if you leverage schemas the right way and only send events based on custom schemas, the risk of non-compliant data is low, because your custom schemas are not exposed in the client. A spammer cannot know how events are validated in Enrich.
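To make that concrete, here is a minimal sketch of sending an event against a private custom schema with the JavaScript tracker. The schema URI is a hypothetical example, and it assumes the standard v3 tracker tag is already loaded; the schema definition itself lives only in your Iglu registry, so someone inspecting the page sees the payload but not the validation rules it must satisfy:

```javascript
// A minimal sketch, assuming the standard Snowplow JS tracker v3 tag and a
// hypothetical private schema 'iglu:com.acme/checkout_step/jsonschema/1-0-0'.
window.snowplow('trackSelfDescribingEvent', {
  event: {
    schema: 'iglu:com.acme/checkout_step/jsonschema/1-0-0', // hypothetical
    data: {
      stepId: 2,
      cartValue: 149.9,
    },
  },
});
```

Events that fail validation against the private schema are routed to the bad stream rather than landing in your warehouse.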

Regarding your collector/loader crash point: the easiest way to mitigate this is to route the tracker endpoint via a CDN/WAF like Cloudflare, Akamai, or Fastly. An additional benefit, if set up properly: cookies set via the Snowplow endpoint (the sp cookie) are protected against Safari ITP.
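For illustration, the reverse-proxy part can be as small as a Cloudflare Worker like the sketch below (the origin hostname is a placeholder for your own setup); the WAF and rate-limiting rules you configure in the dashboard then apply before any request reaches the collector:

```javascript
// A minimal sketch of fronting the collector with a Cloudflare Worker.
// 'collector.internal.acme.com' is a hypothetical origin hostname.
export default {
  async fetch(request) {
    const url = new URL(request.url);
    // Forward only the tracker paths we expect; reject everything else.
    const isTrackerPath =
      url.pathname === '/com.snowplowanalytics.snowplow/tp2' || // POST events
      url.pathname === '/i';                                    // GET pixel
    if (!isTrackerPath) {
      return new Response('Not found', { status: 404 });
    }
    url.hostname = 'collector.internal.acme.com'; // hypothetical origin
    return fetch(new Request(url, request));
  },
};
```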

Hope that helps.
David


Just to add to David’s answer and list a couple of other options:

In case you are dealing with authenticated users, there is also the option to send auth tokens to the collector as a context entity and validate the tokens using the JS Enrichment or the API Enrichment. This would also enable you to enrich the events with additional user data based on the auth token.
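As a sketch of what that could look like (the schema URI, token source, and validateToken() helper are all hypothetical): the client attaches the token as a context entity, and a JavaScript enrichment rejects events whose token does not check out:

```javascript
// Client side: attach the auth token to every event as a context entity.
// The schema URI and sessionAuthToken are placeholders.
window.snowplow('trackPageView', {
  context: [{
    schema: 'iglu:com.acme/auth_context/jsonschema/1-0-0', // hypothetical
    data: { token: sessionAuthToken },
  }],
});

// Server side: a JavaScript enrichment that drops events with a bad token.
// The exact shape of getContexts() and validateToken() are assumptions here.
function process(event) {
  const contexts = JSON.parse(event.getContexts() || '{}');
  const auth = (contexts.data || [])
    .find((c) => c.schema.indexOf('com.acme/auth_context') !== -1);
  if (!auth || !validateToken(auth.data.token)) {
    // Throwing routes the event to the bad stream instead of the warehouse.
    throw new Error('invalid or missing auth token');
  }
}
```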

Another option is to implement the reCaptcha API to validate data attached to each event as proposed in this reCaptcha v3 enrichment RFC.
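Sketched out, the client half of that RFC’s idea could look like the following (the site key and both schema URIs are placeholders; the token would then be verified server side, e.g. by the API enrichment calling Google’s siteverify endpoint):

```javascript
// A sketch of attaching a reCAPTCHA v3 token to each event as a context
// entity. RECAPTCHA_SITE_KEY and the schema URIs are hypothetical.
grecaptcha.ready(() => {
  grecaptcha.execute(RECAPTCHA_SITE_KEY, { action: 'track' }).then((token) => {
    window.snowplow('trackSelfDescribingEvent', {
      event: {
        schema: 'iglu:com.acme/button_click/jsonschema/1-0-0', // hypothetical
        data: { id: 'signup' },
      },
      context: [{
        schema: 'iglu:com.google/recaptcha_token/jsonschema/1-0-0', // hypothetical
        data: { token },
      }],
    });
  });
});
```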

@davidher_mann

Regarding the topic of non-compliant data: it should not be our biggest problem, considering that we have auth implemented.

As for routing the tracker endpoint via a CDN/WAF, I am going to explore this possibility in depth.

@matus

We are not dealing only with authenticated users, but since we are using AWS, perhaps SigV4 signing could be used to generate the auth token.

I will also consider implementing the reCaptcha API.

While I was exploring some possible ways to authenticate, a possibility arose:
I’m thinking about performing AWS SigV4 signing on the client side, because Kinesis by default accepts our unsigned requests. To enforce SigV4 authentication, we could enable enhanced fan-out on the Kinesis stream, since enhanced fan-out requires signed requests.
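For reference, client-side SigV4 signing of a Kinesis PutRecord call could be sketched like this, using the aws4fetch library (the stream name, region, and credentials are placeholders). Note that the credentials have to ship to the browser, which is exactly the weakness discussed in the reply below:

```javascript
// A sketch of SigV4-signing a Kinesis PutRecord request in the browser with
// aws4fetch. Credentials embedded like this are visible to anyone who
// inspects the page, which undermines the scheme.
import { AwsClient } from 'aws4fetch';

const aws = new AwsClient({
  accessKeyId: AWS_ACCESS_KEY_ID,         // placeholder; exposed to the client
  secretAccessKey: AWS_SECRET_ACCESS_KEY, // placeholder; exposed to the client
  service: 'kinesis',
  region: 'us-east-1',
});

async function putEvent(payload) {
  return aws.fetch('https://kinesis.us-east-1.amazonaws.com/', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/x-amz-json-1.1',
      'X-Amz-Target': 'Kinesis_20131202.PutRecord',
    },
    body: JSON.stringify({
      StreamName: 'raw-events',            // hypothetical stream name
      PartitionKey: crypto.randomUUID(),
      Data: btoa(JSON.stringify(payload)), // Kinesis expects base64 data
    }),
  });
}
```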

You can certainly sign requests client side, but irrespective of the signing method, signing depends on having a secret to sign the message with. Signing is likely to reduce the number of users sending targeted data, but if the signing method executes client side, the secret must be available on the client. An attacker who is determined enough can extract the secret, work out the signing method, and still send dummy data. As far as I’m aware, there aren’t any analytics tools (or many other tools, for that matter) that prevent request tampering. Data sent from the client is assumed to be untrusted by default, so folks who want to prevent tampering tend to move these events server side rather than relying on code that executes on the client.

If you do come up with a way that you think prevents this I’d love to hear about it as it’s certainly something we could consider implementing.

Nice point of view. For sure, if I figure out something to handle this, I’ll share it with the community.


Regarding this topic, we found a possible solution (on AWS). Here it goes:

Snowplow, like other players, has no way to perform authenticated requests from the front end. This raises concerns about data security and the potential risks associated with unauthorized use of the URL that data is sent to, and it underlines the importance of implementing security measures around data collection.
To address this concern, we have been exploring ways to secure the collector URL, focusing specifically on the Kinesis Data Stream. By implementing proper authentication protocols, we aim to ensure that only authorized individuals or systems can reach the data stream endpoint. This approach should significantly enhance the security of our analytics infrastructure and protect against potential breaches.

Solution

To route the Amazon Kinesis Data Stream endpoint via a CDN/WAF

  1. Set up a CDN: Choose a CDN provider that supports custom origins, such as Amazon CloudFront, Cloudflare, or Fastly. Configure the CDN to act as a reverse proxy for the Kinesis Data Stream endpoint (see the CDK sketch after this list).
  2. Create a distribution: In the CDN provider’s console, create a new distribution and configure it to use the Kinesis Data Stream endpoint as the origin server. Set appropriate caching settings and choose edge locations that provide optimal coverage for the target audience.
  3. Configure caching behavior: Determine which content should be cached by the CDN. Event traffic itself is dynamic and should not be cached; caching typically only makes sense for static assets. Configure caching headers and rules to control cache duration and behavior.
  4. Implement WAF rules: If the CDN provider offers a built-in WAF or allows integration with a separate WAF service, configure security rules to protect against common web application vulnerabilities. This can include rules to detect and block malicious requests such as SQL injection or cross-site scripting (XSS), as well as rate limits on repeated requests.
  5. Update DNS settings: Once the CDN distribution is set up, update the DNS settings to point the relevant domain or subdomain at the CDN’s edge servers. This ensures that incoming requests are directed to the CDN instead of hitting the Kinesis Data Stream endpoint directly.
  6. Test and monitor: Verify that the CDN/WAF integration is functioning correctly. Test different scenarios, such as caching behavior, WAF rule enforcement, and performance improvements. Monitor the CDN/WAF logs and analytics for insight into traffic patterns and security events.
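As a concrete (hypothetical) illustration of steps 1 to 5 on AWS, here is a minimal CDK sketch in JavaScript that puts CloudFront in front of an origin with an AWS WAF web ACL attached. The origin hostname and web ACL ARN are placeholders for your own resources:

```javascript
// A minimal sketch, not a production setup: CloudFront as a reverse proxy
// with a WAF web ACL attached. All names and ARNs are placeholders.
const cdk = require('aws-cdk-lib');
const cloudfront = require('aws-cdk-lib/aws-cloudfront');
const origins = require('aws-cdk-lib/aws-cloudfront-origins');

const app = new cdk.App();
const stack = new cdk.Stack(app, 'TrackerEdgeStack');

new cloudfront.Distribution(stack, 'TrackerDistribution', {
  defaultBehavior: {
    origin: new origins.HttpOrigin('collector.internal.acme.com'), // placeholder origin
    allowedMethods: cloudfront.AllowedMethods.ALLOW_ALL,           // trackers POST events
    cachePolicy: cloudfront.CachePolicy.CACHING_DISABLED,          // event traffic is dynamic
  },
  // Pre-existing WAFv2 web ACL with rate-limiting and managed rules (placeholder ARN).
  webAclId: 'arn:aws:wafv2:us-east-1:123456789012:global/webacl/tracker/abc123',
});
```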

So this is definitely a good approach (and a WAF is generally what we recommend to most customers) for blocking things like exploits and malicious requests. Unfortunately, I don’t think it addresses the authentication or request-spoofing components: WAFs don’t really have a good way to introspect what is genuine versus what is manually constructed data.