Creating a Google Sitemap
The following documentation refers to Google's sitemap_gen script. It has been supplanted by the Google Site Map Generator usage described above.
To facilitate the indexing of WormBase content, a Google Sitemap file must be created. Google uses this file to index dynamic content URLs on a specified basis.
A number of scripts located at wormbase/util/google_indexing help in the creation of this file.
A new sitemap should be created on the primary production server with each new release of the database. Currently, I run the sitemap script under cron several times a week (see "Running under cron").
Object classes included in the Sitemap
See dump_urls.pl for a full list of all classes exported.
Outline of procedure
1. Create a list of URLs of the most common objects in the database
The dump_urls.pl script creates the file url_lists/VERSION-urllist.txt and updates symlinks as appropriate.
todd> dump_urls.pl /path/to/database/version
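The file-and-symlink bookkeeping described in this step can be sketched in Python. This is a minimal illustration only, not the code in dump_urls.pl: the function name, the `current-urllist.txt` link name, and the sample URLs and version are all hypothetical.

```python
import os


def write_url_list(urls, version, out_dir="url_lists",
                   link_name="current-urllist.txt"):
    """Write one URL per line to VERSION-urllist.txt and repoint a symlink at it.

    link_name is a hypothetical name; the real script may use another.
    """
    os.makedirs(out_dir, exist_ok=True)
    filename = f"{version}-urllist.txt"
    target = os.path.join(out_dir, filename)
    with open(target, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")

    # Create the new symlink under a temporary name, then rename it over the
    # old one so the link is never missing while it is being updated.
    link_path = os.path.join(out_dir, link_name)
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(filename, tmp_link)  # relative link inside out_dir
    os.replace(tmp_link, link_path)
    return target


if __name__ == "__main__":
    # Placeholder URLs and version label, for illustration only.
    write_url_list(["http://www.example.org/gene?name=unc-26"], "WS150")
```

Keeping a stable symlink means the sitemap configuration can point at one fixed path while the versioned files accumulate beside it.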
2. Use the sitemap_gen.py script to generate the site map.
To capture dynamic pages, this script uses the file created above. To capture static pages, the configuration file (wormbase_config.xml) also specifies paths to select directories and their corresponding URLs.
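To make the shape of such a configuration concrete, here is a small Python sketch that builds one. The element and attribute names (`site`, `urllist`, `directory`, `base_url`, `store_into`) are assumed to match the example configuration that ships with sitemap_gen; every path and URL is a placeholder, and this is not the content of wormbase_config.xml.

```python
import xml.etree.ElementTree as ET


def build_config(base_url, store_into, urllist_path, static_dirs):
    """Return a sitemap_gen-style configuration as an XML string.

    static_dirs is a list of (filesystem_path, url) pairs for static pages.
    Element and attribute names are assumptions; check them against the
    example config distributed with the script.
    """
    site = ET.Element("site", {
        "base_url": base_url,
        "store_into": store_into,
        "verbose": "1",
    })
    # Dynamic pages: the URL list produced in step 1.
    ET.SubElement(site, "urllist", {"path": urllist_path, "encoding": "UTF-8"})
    # Static pages: directories on disk mapped to their public URLs.
    for path, url in static_dirs:
        ET.SubElement(site, "directory", {"path": path, "url": url})
    return ET.tostring(site, encoding="unicode")


if __name__ == "__main__":
    print(build_config(
        "http://www.example.org/",
        "/path/to/docroot/sitemap.xml.gz",
        "url_lists/current-urllist.txt",
        [("/path/to/docroot/about", "http://www.example.org/about/")],
    ))
```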
Test the script by running:
todd> python sitemap_gen/sitemap_gen.py \
          --config=sitemap_gen/wormbase_config.xml --testing
The '--testing' flag prevents the script from contacting Google. If everything looks good, run the script again without the flag:
todd> python sitemap_gen/sitemap_gen.py --config=sitemap_gen/wormbase_config.xml
The script will automatically contact Google and let them know that we have a new sitemap.
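The notification is an ordinary HTTP request that carries the sitemap's URL as a query parameter. The sketch below only builds such a request URL and sends nothing. The endpoint is an assumption based on older sitemap_gen releases and the sitemap URL is a placeholder; check the script's own notification settings for the real values.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the script's notification settings.
PING_ENDPOINT = "http://www.google.com/webmasters/sitemaps/ping"


def build_ping_url(sitemap_url, endpoint=PING_ENDPOINT):
    """Return the notification URL for a sitemap, with the URL percent-encoded."""
    return endpoint + "?" + urlencode({"sitemap": sitemap_url})


if __name__ == "__main__":
    # Placeholder sitemap location, for illustration only.
    print(build_ping_url("http://www.example.org/sitemap.xml.gz"))
```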
Running under cron
# Generate new sitemap thrice weekly
0 4 * * 0,2,4 /usr/local/wormbase/util/google_indexing/create_sitemap.sh
This will dump out URLs, create the site maps, and send an appropriate HTTP request to Google.
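The sequence that the cron job drives can be sketched as a short Python wrapper. This is not the content of create_sitemap.sh, which is not reproduced here; the function names are hypothetical, and the commands are simply the two from the outline above, run in order from the google_indexing directory.

```python
import subprocess


def build_commands(db_path, testing=False):
    """Return the two commands from the outline, in order, as argv lists."""
    generate = [
        "python",
        "sitemap_gen/sitemap_gen.py",
        "--config=sitemap_gen/wormbase_config.xml",
    ]
    if testing:
        # With --testing the generator does not contact Google.
        generate.append("--testing")
    return [["dump_urls.pl", db_path], generate]


def run_pipeline(db_path, testing=False):
    """Run each step and stop at the first one that fails."""
    for argv in build_commands(db_path, testing):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Show the commands instead of executing them.
    for argv in build_commands("/path/to/database/version", testing=True):
        print(" ".join(argv))
```

Using `check=True` makes a failed URL dump abort the run, so a stale or empty URL list is not turned into a sitemap.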
Indexing of the FTP site and RSS feeds
Specific classes (like ?Sequence and ?Protein) should probably be restricted to select objects.
--Tharris 23:54, 2 February 2006 (EST)