We’ve got an SEO consultant telling us that having a lot of bot traffic on our collector endpoint is a Bad Thing. (Googlebot hits it pretty regularly.)
Is it possible to configure the Scala collector to respond to a request for robots.txt?
Does it make any sense to do this?
How regularly, out of interest? Getting crawled by Googlebot is pretty standard. If it’s super frequent, you can adjust the crawl rate in Google Search Console.
No, not at the moment, but it wouldn’t be particularly complicated to add, as the collector would just need to serve a static file. You can, however, set the X-Robots-Tag header on the root path, though you may need to set it for every URL if you don’t want any of them indexed.
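To make the two options concrete, here is a rough Python sketch rather than anything the Scala collector does today. It assumes a hypothetical front end (for example, a proxy in front of the collector) that serves a static robots.txt and adds an X-Robots-Tag header. The names ROBOTS_TXT, respond and crawler_may_fetch, and the collector.example.com URL, are made up for illustration. The last function uses the standard library's robots.txt parser to show how a well-behaved crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: ask every crawler to stay away from all paths.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

# Header asking search engines not to index a response.
X_ROBOTS_TAG = ("X-Robots-Tag", "noindex")


def respond(path):
    """Sketch of a front-end handler (an assumption, not collector code).

    Returns (status, headers, body) for /robots.txt, or None for any other
    path, meaning "pass the request through to the real collector".
    """
    if path == "/robots.txt":
        headers = [("Content-Type", "text/plain"), X_ROBOTS_TAG]
        return 200, headers, ROBOTS_TXT
    return None


def crawler_may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether a crawler that honours robots.txt would request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # collector.example.com is a placeholder hostname.
    print(crawler_may_fetch("Googlebot", "https://collector.example.com/i"))
```

Note that robots.txt only stops crawling, while X-Robots-Tag only affects indexing of responses that a crawler has already fetched, so the two are complementary rather than interchangeable.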
I’m not an SEO expert by any means, and I don’t think it’d hurt to do this, but I also don’t think Google would be penalising you in any way, in the same way it doesn’t penalise any other API service for not having a robots.txt. You aren’t really serving any HTML content, so I can’t imagine that Google is actually going to be indexing any of this content, though it may crawl it.
Thanks Mike. I will look into the X-Robots-Tag header.
I am not convinced that Googlebot hitting our collector is a real problem, but I have to respond to the SEO Guy. You probably know how that goes …
We have a fair number of sites, and the level of bot traffic that runs our tracker and hits our collector has never been considered excessive. (Also, bots tend to cache their collector requests, which makes it pretty easy to find them in the events table.)
What I’m dealing with now is that we are launching a new site that logs 5 to 10 events on each page. SEO Guy is analyzing a dev instance, running some client application that acts like Googlebot. He has raised a flag about all the traffic on the collector, and Management Is Concerned.
I’ve created a GitHub issue for this because I’ve had more of a think about it, and I think there is a legitimate case to be made for serving a robots.txt.
Having robots (Googlebot or otherwise) crawl any of the collector endpoints means that the collector needs to respond, and there’s a non-zero cost to doing so.
Robots (both good and bad) may inadvertently create bad rows, mostly via sending empty payloads and creating adapter failures which then trickle downstream to any bad rows sinks. This creates additional noise + network transfer + data storage for these events which nobody is really ever going to use. Bad robots are likely to ignore robots.txt, but if it reduces the volume of well-behaved robots I think that is still beneficial overall.
For your new site, though, I don’t think robots.txt will fix the problem: it’ll prevent crawling of the collector, but not necessarily crawling of the website and execution of the tracker JS, which will still fire bot events. In those instances I’d highly recommend using the IAB enrichment, which will flag bots, and then either filtering or excluding this traffic downstream. In some instances it is quite useful to retain this data, particularly if you’d like to analyse how, and at what frequency, bots are crawling the site.
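As a rough illustration of the "flag, then filter or keep" idea, here is a minimal Python sketch. The event shape is a simplified assumption: each event is a dict with a derived_contexts list, and the IAB enrichment is assumed to attach a spiders_and_robots context with a boolean spiderOrRobot field. Check the schema and field names against your own enriched data before relying on them. The function names are hypothetical.

```python
# Assumed schema prefix for the context attached by the IAB enrichment.
IAB_SCHEMA_PREFIX = "iglu:com.iab.snowplow/spiders_and_robots/"


def is_bot(event):
    """Return True if the (assumed) IAB context flags the event as a bot."""
    for ctx in event.get("derived_contexts", []):
        if ctx.get("schema", "").startswith(IAB_SCHEMA_PREFIX):
            return bool(ctx.get("data", {}).get("spiderOrRobot", False))
    # No IAB context present: treat the event as not flagged.
    return False


def split_bot_traffic(events):
    """Split events into (humans, bots) instead of discarding bot traffic."""
    humans, bots = [], []
    for event in events:
        (bots if is_bot(event) else humans).append(event)
    return humans, bots
```

Splitting rather than dropping keeps the bot events available if you later want to analyse crawl frequency, while reporting can read from the human-only set.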