Creating a Google Sitemap
Google Site Map Generator
This document assumes that Apache2 with SSL support is already installed and configured.
Installation
Fetch the appropriate code from:
http://code.google.com/p/googlesitemapgenerator/
cd build
tar xzf googlesitemapgenerator.tgz
cd g*
sudo sitemap-install/install.sh   # Follow the prompts
Configuration
sudo emacs /usr/local/google-sitemap-generator/conf/httpd.conf
Add the following to the VirtualHost section:
SSLEngine on
- SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL
SSLCertificateFile "/usr/local/apache2/conf/server.crt"
SSLCertificateKeyFile "/usr/local/apache2/conf/server.key"
Here's a shell script that runs through the installation and configuration steps above:
- cd ~/src/httpd-2.2.11
- ./configure --enable-mods-shared=all --enable-ssl
- make
- sudo make install
- sudo openssl req -new -x509 -days 30 -keyout /usr/local/apache2/conf/server.key -out /usr/local/apache2/conf/server.crt -subj '/CN=Test-On$
wget http://googlesitemapgenerator.googlecode.com/files/sitemap_linux-x86_64-beta1-20090225.tar.gz
- wget http://googlesitemapgenerator.googlecode.com/files/sitemap_linux-i386-beta1-20090225.tar.gz
- tar xzf sitemap_linux-i386-beta1-20090225.tar.gz
tar xzf sitemap*
sudo sitemap-install/install.sh
echo "add SSL config"
sudo emacs /usr/local/google-sitemap-generator/conf/httpd.conf
echo "add SSL config"
sudo emacs /usr/local/apache2/conf/httpd.conf
sudo /usr/local/apache2/bin/apachectl stop
sudo /usr/local/apache2/bin/apachectl start
Deprecated
The following documentation refers to Google's sitemapgen. It has been superseded by the Google Site Map Generator procedure described above.
Synopsis
To facilitate the indexing of WormBase content, a Google Sitemap file must be created. Google uses this file to index dynamic content URLs on a specified schedule.
A number of scripts located at wormbase/util/google_indexing help in the creation of this file.
A new sitemap should be created on the primary production server with each new release of the database. Currently, I run the sitemap script under cron three times a week (see "Running under cron" below).
Object classes included in the Sitemap
See dump_urls.pl for a full list of all classes exported.
Outline of procedure
1. Create a list of URLs of the most common objects in the database
The dump_urls.pl script creates the file url_lists/VERSION-urllist.txt and updates symlinks as appropriate.
todd> dump_urls.pl /path/to/database/version
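Before moving on, it can be worth sanity-checking the generated list. The following is a minimal sketch, assuming the list holds one absolute http or https URL per line (the file format is not documented on this page, so treat that as an assumption). The function name check_urllist is hypothetical and not part of the WormBase scripts.

```shell
#!/bin/sh
# Sanity-check a URL list before handing it to the sitemap generator.
# Assumption: the file contains one absolute http(s) URL per line.
check_urllist() {
    file="$1"
    # -s is true only if the file exists and is not empty.
    if [ ! -s "$file" ]; then
        echo "URL list is missing or empty: $file" >&2
        return 1
    fi
    # Count the lines that do not start with http:// or https://.
    # grep exits non-zero when the count is 0, hence the "|| true".
    bad=$(grep -c -v -E '^https?://' "$file" || true)
    total=$(wc -l < "$file" | tr -d ' ')
    echo "$total lines, $bad not starting with http:// or https://"
    # Succeed only if every line looked like a URL.
    [ "$bad" -eq 0 ]
}
```

It would be called as check_urllist url_lists/VERSION-urllist.txt, with VERSION replaced by the actual release.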
2. Use the sitemap_gen.py script to generate the site map.
To capture dynamic pages, this script uses the file created above. To capture static pages, the configuration file (wormbase_config.xml) also specifies paths to select directories and their corresponding URLs.
Test the script by:
todd> python sitemap_gen/sitemap_gen.py \
        --config=sitemap_gen/wormbase_config.xml --testing
The '--testing' flag prevents the script from contacting Google. If everything looks good, run the script again without the flag:
todd> python sitemap_gen/sitemap_gen.py --config=sitemap_gen/wormbase_config.xml
The script will automatically contact Google and let them know that we have a new sitemap.
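The test-then-run sequence above can be wrapped so that Google is contacted only after a clean dry run. This is a rough sketch, not the contents of any existing WormBase script. It assumes that sitemap_gen.py exits with a non-zero status when something goes wrong, which has not been verified here. The names build_sitemap, SITEMAP_GEN and CONFIG are hypothetical.

```shell
#!/bin/sh
# Run a dry run first; do the real run only if the dry run succeeds.
# SITEMAP_GEN and CONFIG default to the paths used above and can be
# overridden in the environment.
SITEMAP_GEN="${SITEMAP_GEN:-python sitemap_gen/sitemap_gen.py}"
CONFIG="${CONFIG:-sitemap_gen/wormbase_config.xml}"

build_sitemap() {
    # Dry run: --testing keeps the script from contacting Google.
    # Assumption: a failed run is reported through the exit status.
    if ! $SITEMAP_GEN --config="$CONFIG" --testing; then
        echo "dry run failed; not contacting Google" >&2
        return 1
    fi
    # Real run: the same command without --testing.
    $SITEMAP_GEN --config="$CONFIG"
}
```

Calling build_sitemap from the wormbase/util/google_indexing directory then performs both steps in order.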
Running under cron
# Generate new sitemap thrice weekly
0 4 * * 0,2,4 /usr/local/wormbase/util/google_indexing/create_sitemap.sh
This will dump out URLs, create the site maps, and send an appropriate HTTP request to Google.
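To double-check the schedule in the crontab entry above: the first two fields (0 4) mean minute 0 of hour 4, and the fifth field lists the days of the week. The sketch below expands that fifth field into day names. It handles only a comma-separated list of numbers such as 0,2,4 (ranges and step values are not covered), and the function names are hypothetical. In standard cron syntax both 0 and 7 mean Sunday.

```shell
#!/bin/sh
# Map one cron day-of-week number to a short day name.
day_name() {
    case "$1" in
        0|7) echo Sun ;;
        1)   echo Mon ;;
        2)   echo Tue ;;
        3)   echo Wed ;;
        4)   echo Thu ;;
        5)   echo Fri ;;
        6)   echo Sat ;;
        *)   echo "?" ;;
    esac
}

# Expand a comma-separated day-of-week field into day names.
cron_days() {
    result=""
    # Turn the commas into spaces so the shell splits the list.
    for d in $(printf '%s' "$1" | tr ',' ' '); do
        result="$result${result:+ }$(day_name "$d")"
    done
    printf '%s\n' "$result"
}

cron_days "0,2,4"   # prints: Sun Tue Thu
```

So the entry runs create_sitemap.sh at 04:00 on Sundays, Tuesdays and Thursdays.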
TODO
Indexing of the FTP site and RSS feeds.
Specific classes (like ?Sequence and ?Protein) should probably be restricted to select objects.
--Tharris 23:54, 2 February 2006 (EST)