Robots.txt disallowing user-guide crawling

I was searching tonight for a user guide I wrote for Tailscale, and I noticed it's nowhere to be found on Google.

This appears to be due to the robots.txt disallowing crawling of user guides:

User-agent: *
Disallow: /docs/user-guide/
Disallow: /*/docs/user-guide/

This seems like a disservice to new users, who are more likely to use Google than the wiki's own search. Does anyone know whether this is intentional, or what the background on it is? If it's not intentional, who can remove these Disallow lines so the content becomes indexable?
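
(For reference, the live rules can be double-checked straight from the site with something like the following.)

$ curl -s https://openwrt.org/robots.txt | grep -n 'user-guide'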

Thanks!

user-guide != guide-user

Oops, you're absolutely correct! That is not the line blocking the Googlebot crawl, but something in the robots.txt is still disallowing it.

I've compiled Google's robots.txt parser and run it against the URL, and all of Googlebot's desktop user-agent strings come back disallowed.
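
For anyone who wants to reproduce this, a build along these lines should work. Treat it as a sketch rather than exactly what I ran: the README of https://github.com/google/robotstxt has the authoritative steps (it supports both Bazel and CMake), and the resulting binary name may differ from the ./robots used below.

$ git clone https://github.com/google/robotstxt.git
$ cd robotstxt && mkdir c-build && cd c-build
$ cmake .. && make
$ curl -s https://openwrt.org/robots.txt -o /tmp/robots.txt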

$ ./robots /tmp/robots.txt https://openwrt.org/docs/guide-user/services/vpn/tailscale/start 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
user-agent 'https://openwrt.org/docs/guide-user/services/vpn/tailscale/start' with URI 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)': DISALLOWED

$ ./robots /tmp/robots.txt https://openwrt.org/docs/guide-user/services/vpn/tailscale/start 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/110.0.5481.177 Safari/537.36'
user-agent 'https://openwrt.org/docs/guide-user/services/vpn/tailscale/start' with URI 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/110.0.5481.177 Safari/537.36': DISALLOWED

$ ./robots /tmp/robots.txt https://openwrt.org/docs/guide-user/services/vpn/tailscale/start 'Googlebot/2.1 (+http://www.google.com/bot.html)'
user-agent 'https://openwrt.org/docs/guide-user/services/vpn/tailscale/start' with URI 'Googlebot/2.1 (+http://www.google.com/bot.html)': DISALLOWED

This is also evidenced by the results returned for the same query across multiple search engines, which, in my experience, return roughly similar results for well-optimized content.

Google: 👎

Yandex: 👎

Bing: 👍

Yahoo: 👍

I'll dig a little deeper and get to the bottom of this, but just to be clear: I'm not particularly concerned about my individual contributions here; I'm more concerned about the overall search-engine visibility of the documentation on the wiki. Arch Linux is one of the most well-known distributions in large part due to its excellent wiki and documentation. Everybody is doing a great job with OpenWrt, and I'd love to see the work appreciated by a larger audience.

Okay, well that didn't take nearly as long as I expected.

The offending rule is:

User-agent: *
Disallow: /*.html

I determined this by patching Google's open source robots.txt parser to log when a match occurred:

diff --git a/robots.cc b/robots.cc
index 20cd51c..d825916 100644
--- a/robots.cc
+++ b/robots.cc
@@ -28,6 +28,8 @@
 #include <cstddef>
 #include <vector>
 
+#include <iostream>
+
 #include "absl/base/macros.h"
 #include "absl/container/fixed_array.h"
 #include "absl/strings/ascii.h"
@@ -108,7 +110,7 @@ class RobotsMatchStrategy {
       if (numpos == 0) return false;
     }
   }
-
+  std::cout << "Matched: " << pattern << std::endl;
   return true;
 }

Now we recompile and rerun with the same robots.txt, URL, and User-Agent, and we get:

$ ./robots /tmp/robots.txt https://openwrt.org/docs/guide-user/services/vpn/tailscale/start 'Googlebot/2.1 (+http://www.google.com/bot.html)'
Matched: /*.html
user-agent 'https://openwrt.org/docs/guide-user/services/vpn/tailscale/start' with URI 'Googlebot/2.1 (+http://www.google.com/bot.html)': DISALLOWED

If we remove the Disallow: /*.html line and run it again:

user-agent 'https://openwrt.org/docs/guide-user/services/vpn/tailscale/start' with URI 'Googlebot/2.1 (+http://www.google.com/bot.html)': ALLOWED

I think this line should be removed from the robots.txt for optimal SEO. Just keep in mind that this will increase server load, as Google (and others) have likely been crawling almost none of the wiki's pages until now.

I just did a quick test against the sitemap:

$ wc -l /tmp/sitemap.txt 
   26811 /tmp/sitemap.txt
$ while read line; do ./robots /tmp/robots.txt "${line}" 'Googlebot/2.1 (+http://www.google.com/bot.html)' | grep Matched; done < /tmp/sitemap.txt  | wc -l
   26811
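
(For completeness: /tmp/sitemap.txt is just a flat list of URLs, one per line. If you only have the XML sitemap, something along these lines produces such a list; the sitemap.xml path here is illustrative, not the wiki's actual sitemap location.)

$ grep -o '<loc>[^<]*</loc>' sitemap.xml | sed 's/<[^>]*>//g' > /tmp/sitemap.txt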

It looks like every single URL in the sitemap is matched by Disallow: /*.html, so removing this rule is likely to drastically increase crawler traffic.

What also concerns me is that I see the following cache-related headers on most of these URLs:

< date: Wed, 08 Mar 2023 15:03:02 GMT
< expires: Thu, 19 Nov 1981 08:52:00 GMT
< cache-control: no-store, no-cache, must-revalidate
< pragma: no-cache
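
(The above is verbose-curl style output; a check along these lines reproduces it, give or take the exact formatting.)

$ curl -sI https://openwrt.org/docs/guide-user/services/vpn/tailscale/start | grep -iE '^(date|expires|cache-control|pragma):'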

These pages are safe to cache and highly cacheable, so I would update DokuWiki's cache settings before changing the robots.txt to make sure the servers can accommodate the new load.

Looking at DokuWiki's documentation on caching, it's not entirely clear to me how to influence downstream HTTP cache headers, as those docs cover page-generation caching. It seems like this may be functionality that needs to be added by a plugin.
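
To make it concrete, what I'd hope to see on anonymous page views is something closer to the header below. The value is purely illustrative, and authenticated or editing sessions would of course still need to stay uncached.

cache-control: public, max-age=3600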

I don't get it: why does /*.html match https://openwrt.org/docs/guide-user/services/vpn/tailscale/start? There is no ".html" in this URL.

Anyway, I have now removed the /*.html rule from robots.txt.

Google Search Console is telling me that this page has not been indexed due to the "noindex" tag, which is correct: the page had been edited on 14.02.2023 and the crawl took place on 15.02.2023, i.e. within the index delay.

See also https://www.dokuwiki.org/config:indexdelay, which is set to 2 days for the OpenWrt wiki.
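
(For anyone checking their own pages: while a page is inside the index delay window, DokuWiki puts a noindex robots meta tag in the page head, and you can see whether it is still there with something like the command below.)

$ curl -s https://openwrt.org/docs/guide-user/services/vpn/tailscale/start | grep -io '<meta name="robots"[^>]*>'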

Thanks for taking the time to look into this tmomas. I really appreciate it.

Based on what you're seeing in Search Console, the meta tag does look like the more likely root cause here. I'm still not sure why Disallow: /*.html matches; that doesn't make sense to me either, but I can add more logging and try to figure out why.

With regard to the indexdelay option in DokuWiki, this setting looks like it's designed to prevent vandalism and link spam from new accounts on wikis with open registration. From the docs:

If you have a quick community (eg. at this wiki spam usually never lasts longer than a day) or have a closed user group you may want to lower the indexdelay option or even set it to 0 for disabling delayed indexing.

Since OpenWrt registration is closed, does it make sense to set this to 0?
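
(For reference, if this were changed directly in conf/local.php rather than through the Config Manager, my understanding is that the value is in seconds, so it would look something like the line below. Treat it as a sketch rather than a verified snippet.)

$conf['indexdelay'] = 0;    // 0 disables delayed indexing; 2 days would be 172800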

Good point. I have now set indexdelay to 0 and requested indexing by Google (it will take some time).

Just checked, and the rankings on Google and Yandex have recovered for the pages I was monitoring. Thanks again for your help!
