PITS /
00563: Don't show ?action= links to web spiders/robots
Summary: Don't show ?action= links to web spiders/robots
Created: 2005-10-20 00:31
Status: Closed - added to 2.1.beta8
Category: CoreCandidate
From: Pm
Assigned:
Priority: 55433
Version: 2.0.12
OS:
Description: Currently when a web spider such as Googlebot, Yahoo! Slurp, or others visit a PmWiki site, they tend to follow all of the ?action= links on a page (including ?action=edit, ?action=diff, etc.).
PmWiki's default configuration provides a <meta> tag telling robots not to index pages when ?action= is specified in the link; however, by the time this occurs the server has already incurred the expense of generating the page and sending it to the robot.
Pm proposes a module that detects when a robot is retrieving a page, and strips all "?action=" parameters from page links within the page. This prevents robots from seeing the ?action= links in the first place, reducing server overhead and bandwidth.
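The proposed behavior can be sketched outside of PmWiki itself. This is not the actual PmWiki implementation (which would be PHP in a local customization); it is a minimal Python illustration, assuming a hypothetical ROBOT_PATTERN for User-Agent matching:

```python
import re

# Hypothetical user-agent pattern; a real deployment would use a much
# fuller list of robot names (see the list later in this discussion).
ROBOT_PATTERN = re.compile(r'googlebot|slurp|msnbot|teoma', re.IGNORECASE)

def is_robot(user_agent):
    """Return True when the User-Agent header looks like a crawler."""
    return bool(ROBOT_PATTERN.search(user_agent or ''))

def strip_action_links(html, user_agent):
    """For robots, drop any ?action=... query from hrefs so the crawler
    never sees edit/diff/etc. links; humans get the page unchanged."""
    if not is_robot(user_agent):
        return html
    return re.sub(r'(href="[^"?]*)\?action=[^"#]*', r'\1', html)
```

Because the links are removed before the page is sent, the robot never requests the ?action= URLs at all, which is where the bandwidth and CPU savings come from.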
Comments?
- Stripping the action when a robot arrives is a good idea. I add a rel=nofollow to the edit link; it works for Google, my best friend.
- Some actions should be excluded from the stripping:
- browse ( :) )
- rss
- dc
- Some aftercare is needed as well: what should happen when a robot arrives using a ?action= link? Most search indexes already hold many such links in their databases, and a robot may also arrive via an external page (such as site statistics). On my site I return a 401 (Unauthorized) whenever a robot requests an ?action= URL.
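The aftercare decision described above can be sketched as a small status-code check. This is an illustration of the commenter's policy, not PmWiki code; ROBOT_PATTERN and ALLOWED_ACTIONS are assumed names, and the allowed set mirrors the browse/rss/dc exceptions mentioned earlier:

```python
import re

# Assumed names for this sketch.
ROBOT_PATTERN = re.compile(r'googlebot|slurp|msnbot', re.IGNORECASE)
ALLOWED_ACTIONS = {'browse', 'rss', 'dc'}  # actions robots may still use

def response_status(user_agent, query):
    """Return the HTTP status to serve: 401 when a robot requests a
    disallowed ?action=, 200 otherwise."""
    m = re.search(r'(?:^|&)action=([^&]*)', query or '')
    if m and m.group(1) not in ALLOWED_ACTIONS \
         and ROBOT_PATTERN.search(user_agent or ''):
        return 401
    return 200
```

Answering with an early 401 means the server skips page generation entirely for stale ?action= links already sitting in search indexes.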
- List of robots I catch at the moment:
- slurp
- googlebot
- mediapartners
- xenu
- grub
- ingrid
- baiduspider
- metaweb
- nutch
- aipbot
- societyrobot
- teoma
- zoekybot
- gigabot
- yahoo
- vagabondo
- msnbot
- mirago
- omni
- zyborg
- (and a bunch in robots.txt and .htaccess)
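The list above can be folded into a single case-insensitive pattern matched against the User-Agent header; a sketch:

```python
import re

# The robot names from the list above, joined into one pattern.
ROBOT_NAMES = [
    'slurp', 'googlebot', 'mediapartners', 'xenu', 'grub', 'ingrid',
    'baiduspider', 'metaweb', 'nutch', 'aipbot', 'societyrobot',
    'teoma', 'zoekybot', 'gigabot', 'yahoo', 'vagabondo', 'msnbot',
    'mirago', 'omni', 'zyborg',
]
ROBOT_PATTERN = re.compile('|'.join(map(re.escape, ROBOT_NAMES)),
                           re.IGNORECASE)

def is_robot(user_agent):
    """True when any known robot name appears in the User-Agent."""
    return bool(ROBOT_PATTERN.search(user_agent or ''))
```

A substring match like this is deliberately loose: real crawler User-Agent strings vary in version and formatting, so matching on the bare name catches more variants than an exact-string comparison would.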
good luck BrBrBr