GoogleSitemaps

Summary: How to submit a complete list of web pages to Google
Version:
Prerequisites:
Status:
Maintainer:
Categories: RSS, Integration, Robots, Google

How can we assist search engine crawlers in finding new and updated pages on our website?

Search engines, especially Google, are a major source of visitors for many if not most websites. Having the search engine's spider (for example Googlebot) index your pages thoroughly is key to achieving good search engine results.

A spider visits a web page, indexes it, and then crawls on by following the links on that page. PmWiki ensures proper linkage between wiki pages and makes it easy to generate a sitemap with the (:pagelist:) directive, but a spider still indexes a website step by step, so it can take a while before it discovers newly added or updated pages.

Google allows you to assist robots in discovering updated pages through a specific site index "file" or feed: see https://support.google.com/webmasters/answer/183668 for details.

Using RSS

One way to provide an index of the desired pages to Google Sitemaps is to use the RSS feed provided by PmWiki, based for example on Main.AllRecentChanges:

http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=rss
NOTE: In recent versions of PmWiki the page is Site.AllRecentChanges, so you should now use: http://yoursite.com/pmwiki.php?n=Site.AllRecentChanges&action=rss or, with .htaccess rewrites, possibly http://yoursite.com/?n=Site.AllRecentChanges&action=rss.

  • the rss module must be enabled:
include_once("scripts/rss.php");

Do not use a path-style URL such as ..../pmwiki.php/Main/AllRecentChanges?action=rss. The reason, according to Google:

The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://yoursite.com/catalog/sitemap.xml can include any URLs starting with http://yoursite.com/catalog/ but cannot include URLs starting with http://yoursite.com/images/.

Thus a feed at the path-style URL above could only cover pages under .../pmwiki.php/Main/, and pages such as .../pmwiki.php/Cookbook/... would not be added to the index.
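To make the scope rule quoted from Google concrete, here is a minimal sketch in PHP. The function name sitemap_scope_allows is hypothetical (it is not part of PmWiki), and the check is a simplified reading of the rule: it only compares URL prefixes and does not validate hosts or schemes.

```php
<?php
// Hypothetical helper, not part of PmWiki: returns true when $pageUrl lies
// inside the directory that contains the sitemap (or feed) at $sitemapUrl.
function sitemap_scope_allows(string $sitemapUrl, string $pageUrl): bool {
    // Drop the query string, then keep everything up to and including the last "/".
    $path = strtok($sitemapUrl, '?');
    $base = substr($path, 0, strrpos($path, '/') + 1);
    // The page is in scope when its URL starts with that base directory.
    return strncmp($pageUrl, $base, strlen($base)) === 0;
}

$page = 'http://yoursite.com/pmwiki.php/Cookbook/GoogleSitemaps';

// Path-style feed URL: the base is .../pmwiki.php/Main/, so Cookbook pages are out of scope.
var_dump(sitemap_scope_allows(
    'http://yoursite.com/pmwiki.php/Main/AllRecentChanges?action=rss', $page)); // bool(false)

// Query-style feed URL: the base is http://yoursite.com/, so every page is in scope.
var_dump(sitemap_scope_allows(
    'http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=rss', $page)); // bool(true)
```

This is why the ?n=Group.Page form of the feed URL is the safer choice for a sitemap.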

Set parameters for a more complete list

It might be useful to tweak the RSS settings a little, because by default the feed only displays the last 20 changes:

  if ( $action=="sitemap" ) {
    $RssMaxItems=50000;                           # maximum items to display
    $RssSourceSize=0;                        # max size to build desc from
    $RssDescSize=0;                          # max desc size
    $action="rss";
  }
  include_once("scripts/rss.php");

The code above didn't work for me; I used the following instead:

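  if ( $action=="sitemap" ) {
    # SDVA only sets keys that are not already present, so an explicit ?count= in the URL still wins
    SDVA($_REQUEST, array('count' => 50000));
    $action="rss";                           # let the feed code handle the request
  }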

Set .htaccess to overcome directory layout restrictions

Google is quite strict about the directory layout: the sitemap URL must be in the top directory of your website. However, redirects are accepted, so a little tweak in .htaccess can overcome that restriction:

Redirect /sitemap.rss http://yoursite.com/index.php/Site/AllRecentChanges?action=sitemap

Now use a syntax like:

http://yoursite.com/pmwiki.php?n=Main.AllRecentChanges&action=sitemap

Submit this link to Google Sitemaps using https://www.google.com/webmasters/tools/sitemap-list or the web form (see the Google pages for details).

Using XML-Sitemap

Google provides a special XML schema for this purpose.

The benefit of using the XML-Sitemap schema is the following tags:

  • priority: how important this page is (relative to the other pages on the site)
  • changefreq: how often the page is updated
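As an illustration of what such an XML sitemap looks like, here is a minimal sketch that renders one entry with the priority and changefreq tags. The function name sitemap_url_entry and the sample values are hypothetical, and this is not the attached sitemap.php script. The loc and lastmod tags and the urlset namespace come from the sitemap protocol at sitemaps.org.

```php
<?php
// Hypothetical helper: renders one <url> entry of an XML sitemap.
// $mtime is a Unix timestamp; $priority is expected to be between 0.0 and 1.0.
function sitemap_url_entry(string $loc, int $mtime, string $changefreq, float $priority): string {
    return "  <url>\n"
         // URLs must be entity-escaped, e.g. "&" becomes "&amp;".
         . "    <loc>" . htmlspecialchars($loc, ENT_QUOTES | ENT_XML1, 'UTF-8') . "</loc>\n"
         // W3C datetime in UTC, e.g. 1970-01-01T00:00:00Z.
         . "    <lastmod>" . gmdate('Y-m-d\TH:i:s\Z', $mtime) . "</lastmod>\n"
         . "    <changefreq>" . $changefreq . "</changefreq>\n"
         . "    <priority>" . number_format($priority, 1, '.', '') . "</priority>\n"
         . "  </url>\n";
}

// Wrap the entries in the <urlset> root element.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n"
     . sitemap_url_entry('http://yoursite.com/pmwiki.php?n=Main.HomePage', time(), 'weekly', 0.8)
     . "</urlset>\n";

echo $xml;
```

In a real recipe the location, modification time, and the two hint values would come from the wiki's page list rather than being hard-coded.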

The changefreq could be derived from the values in the page history. I'm not sure yet how to determine the priority of a page, probably by using some pattern array.
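One possible way to approach both ideas is sketched below. Everything here is an assumption for illustration: the function names, the thresholds between frequency classes, and the shape of the pattern array are hypothetical, and the edit timestamps are assumed to have been collected from the page history already.

```php
<?php
// Hypothetical helper: maps a list of edit timestamps (Unix time) to a
// changefreq value, based on the average interval between edits.
function changefreq_from_history(array $editTimes): string {
    if (count($editTimes) < 2) {
        return 'yearly';              // assumption: too little history to estimate
    }
    sort($editTimes);
    $span = end($editTimes) - reset($editTimes);
    $avg  = $span / (count($editTimes) - 1);   // average seconds between edits
    $day  = 86400;
    if ($avg <= 3600)      return 'hourly';
    if ($avg <= $day)      return 'daily';
    if ($avg <= 7 * $day)  return 'weekly';
    if ($avg <= 31 * $day) return 'monthly';
    return 'yearly';
}

// Hypothetical helper: the first regular expression matching the page name
// decides the priority; $default applies when nothing matches.
function priority_for_page(string $pagename, array $patterns, float $default = 0.5): float {
    foreach ($patterns as $pattern => $priority) {
        if (preg_match($pattern, $pagename)) {
            return $priority;
        }
    }
    return $default;
}

// Example pattern array (illustrative values only).
$patterns = array(
    '/^Main\.HomePage$/' => 1.0,
    '/^Main\./'          => 0.8,
    '/^Site\./'          => 0.1,
);

echo changefreq_from_history(array(0, 7 * 86400, 14 * 86400)), "\n";    // weekly
echo priority_for_page('Main.WikiSandbox', $patterns), "\n";            // 0.8
```

Ordering the patterns from most to least specific keeps the first-match rule predictable.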

Any thoughts are welcome.

A Basic script

sitemap.php

Note: This script expects the page file to have additional "name" and "time" entries, see PageFileFormat.

Changelog

  • 1.7: support EnablePageListProtect
  • Tested with pmwiki 2.1beta14-15
  • Added Site to exclude pattern

Contributors

Comments

See discussion at GoogleSitemaps-Talk