RFC: Improve bot detection with Google reCaptcha v3

Google has recently launched version 3 of its reCaptcha service. It returns a score that helps you detect bots or abusive traffic and make your own decisions based on that score, all with no user interaction.

You can get more detail from Google’s reCaptcha v3 docs and their introductory video.

This RFC will look at how we can leverage this new service to further improve bot detection within your Snowplow pipeline.

First, you will need to register for Google’s reCaptcha v3 service and add your site in the Admin Console. This will generate a SITE_KEY and a SECRET_KEY for you. Make a note of these, as you will need them later.

Then you will be able to add the recaptcha script to your frontend application:

<script src="https://www.google.com/recaptcha/api.js?render=_reCAPTCHA_site_key_"></script>
<script>
grecaptcha.ready(function() {
    grecaptcha.execute('_reCAPTCHA_site_key_', {action: 'snowplow'}).then(function(token) {
       ...
    });
});
</script>

This script will return a TOKEN, which can later be used to get the user’s score, but this must be done server side. We propose that you track this TOKEN with the Snowplow JavaScript Tracker, so that we can then use it to retrieve the user’s score in an Enrichment.
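To make "server side" concrete, here is a minimal sketch of what a verification call could look like. It is not the proposed enrichment, just an illustration. The endpoint and the secret, response and remoteip parameters are the ones used in the Proof of Concept configuration in this post; the function names buildVerifyBody and verifyToken are hypothetical, and a runtime with a global fetch (for example Node.js 18+) is assumed.

```javascript
// Sketch only: illustrates a server-side siteverify call, not the enrichment itself.
const SITEVERIFY_URL = 'https://www.google.com/recaptcha/api/siteverify';

// Build the form-encoded request body. remoteIp is optional.
function buildVerifyBody(secretKey, token, remoteIp) {
  const params = new URLSearchParams({ secret: secretKey, response: token });
  if (remoteIp) {
    params.set('remoteip', remoteIp);
  }
  return params;
}

// POST the token to Google and return the parsed JSON response.
// Assumes a global fetch is available (for example Node.js 18+).
async function verifyToken(secretKey, token, remoteIp) {
  const res = await fetch(SITEVERIFY_URL, {
    method: 'POST',
    body: buildVerifyBody(secretKey, token, remoteIp),
  });
  if (!res.ok) {
    throw new Error('siteverify request failed with HTTP ' + res.status);
  }
  return res.json();
}
```

The SECRET_KEY never leaves the server in this flow, which is why the browser can only hand over the TOKEN.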

We currently have a working branch that adds a new tracking function (trackRecaptchaToken) to capture the TOKEN; the code is available on that branch.
With this modification, you can then change your reCaptcha snippet to the following:

<script src="https://www.google.com/recaptcha/api.js?render=_reCAPTCHA_site_key_"></script>
<script>
grecaptcha.ready(function() {
    grecaptcha.execute('_reCAPTCHA_site_key_', {action: 'snowplow'}).then(function(token) {
       window.snowplow('trackRecaptchaToken', token);
    });
});
</script>

This will capture an unstructured event with the following (new) schema: com.google.recaptcha/token/jsonschema/1-0-0.
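For illustration, the event payload would presumably look something like the output of the sketch below. The schema itself is not shown in this post, so the single token property is an assumption inferred from the "$.token" jsonPath in the Proof of Concept configuration, and buildRecaptchaTokenEvent is a hypothetical helper name.

```javascript
// Hypothetical helper: wraps a reCaptcha token in a self-describing JSON.
// The single `token` property is an assumption inferred from the "$.token"
// jsonPath used in the Proof of Concept enrichment configuration.
const TOKEN_SCHEMA = 'iglu:com.google.recaptcha/token/jsonschema/1-0-0';

function buildRecaptchaTokenEvent(token) {
  if (typeof token !== 'string' || token.length === 0) {
    throw new TypeError('token must be a non-empty string');
  }
  return { schema: TOKEN_SCHEMA, data: { token: token } };
}

// Print the payload for a placeholder token.
console.log(JSON.stringify(buildRecaptchaTokenEvent('example-token')));
```

If your tracker version exposes a generic self-describing event call, a payload of this shape could probably be sent through it as well, but that depends on the tracker version you run.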

We then propose a new Snowplow reCaptcha Enrichment. This will require the SECRET_KEY that you received earlier from Google’s reCaptcha Admin Console.
When this enrichment executes, it will call the Google siteverify API and add a new context to the token event: com.google.recaptcha/site_verify/jsonschema/1-0-0.
The response from Google, which is attached as that context, will look something like:

{
  "success": true,
  "action": "snowplow",
  "score": 0.9,
  "challenge_ts": "2019-12-10T10:00:00Z",
  "hostname": "www.snowplowanalytics.com",
  "error-codes": []
}

This context will then be loaded into your Data Warehouse along with all your other event data. Using the score, you can then decide how to treat this user in your data modelling. The event will contain the pageview_id, user_id, domain_userid, etc., so you should be able to join it with the other events that have been tracked on this page or for this user.
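As a rough sketch of the kind of decision you might make with the score, the function below labels a site_verify context. Google describes scores near 1.0 as very likely a good interaction and near 0.0 as very likely a bot, and suggests 0.5 as a starting threshold. The function name, the labels and the check on the action name are assumptions made for this example, not part of the proposal; in practice this logic would more likely live in your SQL data model.

```javascript
// Hypothetical classifier for a site_verify context. The labels, the default
// threshold of 0.5 and the expected action name are illustrative choices.
function classifySiteVerify(context, { threshold = 0.5, expectedAction = 'snowplow' } = {}) {
  // A failed verification or a missing score tells us nothing about the user.
  if (!context || context.success !== true || typeof context.score !== 'number') {
    return 'unverified';
  }
  // A token issued for a different action should not be trusted for this one.
  if (context.action !== expectedAction) {
    return 'unverified';
  }
  return context.score < threshold ? 'likely_bot' : 'likely_human';
}

// Example with the response shown above.
const exampleContext = {
  success: true,
  action: 'snowplow',
  score: 0.9,
  challenge_ts: '2019-12-10T10:00:00Z',
  hostname: 'www.snowplowanalytics.com',
};
console.log(classifySiteVerify(exampleContext)); // likely_human
```

You would want to tune the threshold against your own traffic rather than treat 0.5 as a fixed rule.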

To test this theory, we’ve used our existing API Request Enrichment as a Proof of Concept, which you can see below. If we do accept this feature, the end goal would be to add a new reCaptcha Enrichment that requires only your SECRET_KEY as configuration.

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/api_request_enrichment_config/jsonschema/1-0-0",
  "data": {
    "name": "api_request_enrichment_config",
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "enabled": true,
    "parameters": {
      "inputs": [
        {
          "key": "token",
          "json": {
            "field": "unstruct_event",
            "schemaCriterion": "iglu:com.google.recaptcha/token/jsonschema/1-*-*",
            "jsonPath": "$.token"
          }
        },
        {
          "key": "ipAddress",
          "pojo": {
            "field": "user_ipaddress"
          }
        }
      ],
      "api": {
        "http": {
          "method": "POST",
          "uri": "https://www.google.com/recaptcha/api/siteverify?secret=<<SECRET>>&response={{token}}&remoteip={{ipAddress}}",
          "timeout": 5000,
          "authentication": {
            "httpBasic": {
              "username": "",
              "password": ""
            }
          }
        }
      },
      "outputs": [ {
        "schema": "iglu:com.google.recaptcha/site_verify/jsonschema/1-0-0",
        "json": {
          "jsonPath": "$"
        }
      } ],
      "cache": {
        "size": 3000,
        "ttl": 60
      }
    }
  }
}
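To show how the inputs in this configuration relate to the uri, here is a rough illustration of placeholder substitution. It is not the enrichment’s actual code: renderUri is a hypothetical name, and both URL-encoding the values and returning null when an input is missing are choices made for this sketch.

```javascript
// Illustration only: fill {{key}} placeholders in a URI template from a map
// of input values. Returns null if any placeholder has no matching input.
function renderUri(template, inputs) {
  let missing = false;
  const uri = template.replace(/\{\{(\w+)\}\}/g, (match, key) => {
    if (!Object.prototype.hasOwnProperty.call(inputs, key)) {
      missing = true;
      return match;
    }
    return encodeURIComponent(String(inputs[key]));
  });
  return missing ? null : uri;
}

// The template mirrors the one in the configuration above, with a dummy secret.
const uriTemplate =
  'https://www.google.com/recaptcha/api/siteverify?secret=DUMMY&response={{token}}&remoteip={{ipAddress}}';
console.log(renderUri(uriTemplate, { token: 'example-token', ipAddress: '203.0.113.7' }));
```

Here token comes from the unstruct_event input and ipAddress from the user_ipaddress field, matching the two entries under "inputs".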

The last thing to note is that there are some costs associated with Google reCaptcha. It’s free for most users, but there is a cost for high throughput beyond 1k calls per second or 1m calls per month.
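A quick back-of-the-envelope check can tell you which side of those limits you are likely to land on. The sketch below assumes one siteverify call per tracked token event, since each grecaptcha.execute call produces a fresh token and the enrichment cache is therefore unlikely to save many calls. The limits are the ones quoted in this post, and exceedsFreeTier is a hypothetical helper name.

```javascript
// Free-tier limits as quoted in this post; check Google's current terms
// before relying on them.
const FREE_CALLS_PER_MONTH = 1000000;
const FREE_CALLS_PER_SECOND = 1000;

// Hypothetical helper: assumes one siteverify call per tracked token event.
function exceedsFreeTier(tokenEventsPerDay, peakTokenEventsPerSecond, daysInMonth = 30) {
  const callsPerMonth = tokenEventsPerDay * daysInMonth;
  return callsPerMonth > FREE_CALLS_PER_MONTH || peakTokenEventsPerSecond > FREE_CALLS_PER_SECOND;
}

console.log(exceedsFreeTier(20000, 5)); // 600,000 calls per month: false
console.log(exceedsFreeTier(50000, 5)); // 1,500,000 calls per month: true
```

If you are near the limit, tracking the token on only a sample of page views would be one way to reduce the call volume.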

We’d love to know whether you would find this useful, what your potential use cases are, and whether there are any other suggestions or improvements you would like to see.

This would be perfect for catching fraudulent publishers with their pants down.

Early on, we used to be able to spot bot traffic through IPs and browser fingerprinting. Now that bots are more sophisticated, this would be a better, more modern tool for sniffing out bots from display ads.

Like it!

We are currently looking for a way to identify bots and I really like this idea. Are there any plans to implement this feature?

You can technically implement this already if you are not using the API enrichment for other purposes.

You’d have to upload the site_verify schema (see above) to your own Iglu server and then follow the steps suggested above.

We’re currently rethinking our enrichment architecture so it might be a while before we add new enrichments to the pipeline, although with enough demand there might be an opportunity to add new enrichments to our existing architecture.