Creating a Google Sitemap

From WormBaseWiki

Google Site Map Generator

This document assumes that Apache2 with SSL support is already installed and configured.

Installation

Fetch the appropriate code from:

http://code.google.com/p/googlesitemapgenerator/

cd build
tar xzf googlesitemapgenerator.tgz
cd g*
sudo sitemap-install/install.sh  # follow the prompts

Configuration

sudo emacs /usr/local/google-sitemap-generator/conf/httpd.conf

Add the following to the VirtualHost section:

 SSLEngine on
 SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL
 SSLCertificateFile "/usr/local/apache2/conf/server.crt"
 SSLCertificateKeyFile "/usr/local/apache2/conf/server.key"
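
For context, here is a minimal sketch of where these directives sit. The ServerName and DocumentRoot values are placeholders, not WormBase's actual configuration; only the four SSL directives come from this document:

```apache
# Hypothetical VirtualHost showing where the SSL directives belong.
# ServerName/DocumentRoot are placeholders for your own site.
<VirtualHost _default_:443>
    ServerName www.example.org
    DocumentRoot "/usr/local/apache2/htdocs"

    SSLEngine on
    SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL
    SSLCertificateFile "/usr/local/apache2/conf/server.crt"
    SSLCertificateKeyFile "/usr/local/apache2/conf/server.key"
</VirtualHost>
```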


Here's a shell script to build Apache with SSL support and create a self-signed certificate:

 cd ~/src/httpd-2.2.11
 ./configure --enable-mods-shared=all --enable-ssl
 make
 sudo make install
 sudo openssl req -new -x509 -days 30 -keyout /usr/local/apache2/conf/server.key -out /usr/local/apache2/conf/server.crt -subj '/CN=Test-On$

The same download-and-install sequence as a script. Choose the tarball that matches your architecture:

 # 64-bit:
 wget http://googlesitemapgenerator.googlecode.com/files/sitemap_linux-x86_64-beta1-20090225.tar.gz
 # 32-bit:
 wget http://googlesitemapgenerator.googlecode.com/files/sitemap_linux-i386-beta1-20090225.tar.gz

 tar xzf sitemap*
 sudo sitemap-install/install.sh

 # Add the SSL configuration shown above to both files:
 sudo emacs /usr/local/google-sitemap-generator/conf/httpd.conf
 sudo emacs /usr/local/apache2/conf/httpd.conf

 # Restart Apache:
 sudo /usr/local/apache2/bin/apachectl stop
 sudo /usr/local/apache2/bin/apachectl start

Deprecated

The following documentation refers to Google's sitemapgen. It has been superseded by the Google Sitemap Generator procedure above.

Synopsis

To facilitate the indexing of WormBase content, a Google Sitemap file must be created. Google uses this file to discover and index dynamic content URLs, recrawling at the frequency the sitemap specifies.

A number of scripts located in wormbase/util/google_indexing assist in creating this file.

A new sitemap should be created on the primary production server with each new release of the database. Currently, I run the sitemap script once a day under cron.

Object classes included in the Sitemap

See dump_urls.pl for a full list of all classes exported.

Outline of procedure

1. Create a list of URLs of the most common objects in the database

This script will create a file in url_lists/VERSION-urllist.txt and update symlinks as appropriate.

   todd> dump_urls.pl /path/to/database/version
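
As an illustration, the layout might look like the following. The WS123 release name, the example URLs, and the symlink name are all invented for illustration; only the url_lists/VERSION-urllist.txt naming pattern comes from this document:

```shell
# Hypothetical sketch of the url_lists/ layout dump_urls.pl maintains.
# WS123 and the example URLs are made up for illustration.
mkdir -p url_lists
cat > url_lists/WS123-urllist.txt <<'EOF'
http://www.wormbase.org/db/gene/gene?name=WBGene00006763
http://www.wormbase.org/db/seq/protein?name=WP:CE15900
EOF
# Point a stable symlink at the latest release's list (assumed convention):
ln -sf WS123-urllist.txt url_lists/current-urllist.txt
```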

2. Use the sitemap_gen.py script to generate the site map.

To capture dynamic pages, this script uses the file created above. To capture static pages, the configuration file (wormbase_config.xml) also specifies paths to select directories and their corresponding URLs.

Test the script by:

  todd> python sitemap_gen/sitemap_gen.py \
            --config=sitemap_gen/wormbase_config.xml --testing

The '--testing' flag prevents the script from contacting Google. If everything looks good, run the site indexer again without it:

   todd> python sitemap_gen/sitemap_gen.py  --config=sitemap_gen/wormbase_config.xml

The script will automatically contact Google and let them know that we have a new sitemap.
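
The notification itself is just an HTTP GET against Google's sitemap ping endpoint; sitemap_gen.py handles this for you, but constructing the request by hand looks roughly like this. The WormBase sitemap location is an assumption, and in practice the sitemap parameter should be URL-encoded:

```shell
# Build the ping URL that notifies Google of a new sitemap.
# SITEMAP_URL is a placeholder, not the actual WormBase sitemap location.
SITEMAP_URL="http://www.wormbase.org/sitemap_index.xml"
PING_URL="http://www.google.com/webmasters/sitemaps/ping?sitemap=${SITEMAP_URL}"
echo "$PING_URL"
# To actually notify Google you would fetch it, e.g.:
#   wget -q -O /dev/null "$PING_URL"
```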

Running under cron

# Generate new sitemap thrice weekly
 0 4 * * 0,2,4 /usr/local/wormbase/util/google_indexing/create_sitemap.sh
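
create_sitemap.sh itself is not reproduced here; the following is a minimal sketch of what such a wrapper does, assuming the paths used elsewhere in this document. The function defaults to a dry run that only prints the commands, so it can be inspected safely:

```shell
#!/bin/sh
# Hypothetical sketch of create_sitemap.sh: dump URLs, build the sitemap,
# and notify Google. Paths follow the examples above; DRY_RUN=1 (the
# default) prints the commands instead of executing them.
BASE=/usr/local/wormbase/util/google_indexing

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"          # dry run: show what would be executed
    else
        "$@"
    fi
}

generate_sitemap() {
    run "$BASE/dump_urls.pl" /path/to/database/version
    run python "$BASE/sitemap_gen/sitemap_gen.py" \
        --config="$BASE/sitemap_gen/wormbase_config.xml"
}

generate_sitemap
```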
 

This will dump out URLs, create the site maps, and send an appropriate HTTP request to Google.

TODO

* Indexing of the FTP site and RSS feeds.
* Specific classes (like ?Sequence and ?Protein) should probably be restricted to select objects.


--Tharris 23:54, 2 February 2006 (EST)