00563: Don't show ?action= links to web spiders/robots

Summary: Don't show ?action= links to web spiders/robots
Created: 2005-10-20 00:31
Status: Closed - added to 2.1.beta8
Category: CoreCandidate
From: Pm
Assigned:
Priority: 55433
Version: 2.0.12
OS:

Description: Currently, when a web spider such as Googlebot or Yahoo! Slurp visits a PmWiki site, it tends to follow all of the ?action= links on a page (including ?action=edit, ?action=diff, etc.).

PmWiki's default configuration provides a <meta> tag that tells robots not to index pages when ?action= is specified in the link. However, by the time the robot sees that tag, the server has already incurred the expense of generating the page and sending it to the robot.

Pm proposes a module that detects when a robot is retrieving a page, and strips all "?action=" parameters from page links within the page. This prevents robots from seeing the ?action= links in the first place, reducing server overhead and bandwidth.
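The proposal above can be illustrated with a short, language-neutral sketch. It is written in Python purely for illustration and is not PmWiki's actual implementation. The names `strip_action`, `strip_action_links`, and `render_for` are hypothetical, and the sketch assumes links appear as quoted href attributes in the generated HTML.

```python
import re

# Matches href="..." or href='...' and captures the quote and the URL.
HREF = re.compile(r'href=(["\'])(.*?)\1', re.IGNORECASE)


def strip_action(url):
    """Remove any action=... query parameter from a single URL."""
    base, hash_sep, fragment = url.partition("#")
    path, qmark, query = base.partition("?")
    if not qmark:
        return url  # no query string, nothing to strip
    # HTML may separate parameters with "&amp;" as well as "&".
    params = [p for p in re.split(r"&amp;|&", query) if p]
    kept = [p for p in params if not p.startswith("action=")]
    if len(kept) == len(params):
        return url  # no action parameter present, leave the URL untouched
    result = path + ("?" + "&amp;".join(kept) if kept else "")
    return result + hash_sep + fragment


def strip_action_links(html):
    """Rewrite every href in the page so it carries no action parameter."""
    return HREF.sub(
        lambda m: "href=" + m.group(1) + strip_action(m.group(2)) + m.group(1),
        html,
    )


def render_for(html, visitor_is_robot):
    """Robots get the stripped page; everyone else gets the page unchanged."""
    return strip_action_links(html) if visitor_is_robot else html
```

With this sketch, a robot requesting a page would see `href="/Main/HomePage"` where a human visitor sees `href="/Main/HomePage?action=edit"`, so the robot never learns that the action link exists.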

Comments?

  1. Stripping the action when a robot arrives is a good idea. I add rel=nofollow to the edit link, which works for Google, my best friend.
  2. Some actions should be exempt from the stripping:
    1. browse ( :) )
    2. rss
    3. dc
  3. Some aftercare is needed as well: what to do when a robot arrives using a ?action= link (most indexes already have a lot of these links in their databases, or a robot may arrive via an external page such as site stats). On my site I return a 401 (Unauthorized) whenever a robot arrives.
  4. List of robots I catch at the moment:
    • slurp
    • googlebot
    • mediapartners
    • xenu
    • grub
    • ingrid
    • baiduspider
    • metaweb
    • nutch
    • aipbot
    • societyrobot
    • teoma
    • zoekybot
    • gigabot
    • yahoo
    • vagabondo
    • msnbot
    • mirago
    • omni
    • zyborg
    • (and a bunch in robots.txt and .htaccess)
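The detection and aftercare described in the comments above could look roughly like the following sketch. It is an illustration in Python, not the commenter's actual code. It assumes the robot names are matched as case-insensitive substrings of the User-Agent header, and it assumes the 401 applies only to robots requesting a non-exempt action. The names `ROBOT_TOKENS`, `EXEMPT_ACTIONS`, `is_robot`, and `status_for` are hypothetical.

```python
# Tokens taken from the list above; matching them as User-Agent substrings
# is an assumption of this sketch.
ROBOT_TOKENS = (
    "slurp", "googlebot", "mediapartners", "xenu", "grub", "ingrid",
    "baiduspider", "metaweb", "nutch", "aipbot", "societyrobot", "teoma",
    "zoekybot", "gigabot", "yahoo", "vagabondo", "msnbot", "mirago",
    "omni", "zyborg",
)

# Actions the comments suggest leaving alone for robots.
EXEMPT_ACTIONS = {"browse", "rss", "dc"}


def is_robot(user_agent):
    """Return True if the User-Agent contains any known robot token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in ROBOT_TOKENS)


def status_for(user_agent, action="browse"):
    """Pick an HTTP status: 401 for robots on non-exempt actions, else 200."""
    if is_robot(user_agent) and action not in EXEMPT_ACTIONS:
        return 401
    return 200
```

Substring matching is simple but coarse: a short token such as "omni" may also match user agents that are not robots, so a real deployment would probably want to review the list.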

good luck BrBrBr