Difference between revisions of "FAQs"

From WormBaseWiki
Revision as of 09:22, 30 June 2011


General Questions

How should I cite WormBase?

see Citing and Acknowledging WormBase

What database technology are you using for WormBase?

1) At the back end, WormBase data are deposited in an object-oriented database, ACeDB, which is the "master" database containing all data. ACeDB can be accessed both remotely and locally, through both the command line and a web server.

2) Some data (especially sequence data, including genomic sequence, ESTs, OSTs, SNPs, genes, RNAs, etc.) are extracted from ACeDB and deposited in a "slave" MySQL database to support some key features such as gbrowse (see below).

3) At the front end sits the Apache server with mod_perl. The WormBase software package, containing configuration files and a series of CGI scripts, runs on the Apache server. The CGI scripts provide users with a number of ways to browse and search WormBase.

4) Some key features of the WormBase package:
i. gbrowse (http://www.wormbase.org/db/seq/gbrowse?source=wormbase): developed by Lincoln Stein for the GMOD consortium and widely used for other model organisms. It allows users to browse through the whole genome for feature tracks corresponding to specific genome regions. gbrowse is highly configurable and supports multiple foreign languages.
ii. synteny browser (http://www.wormbase.org/db/seq/ebsyn?name=CBG22984): recently developed by Lincoln Stein, also for the GMOD consortium. It allows a comparative view of two genomes side by side, focusing on the syntenic regions.

How are the WormBase entries created and maintained?

There is no simple answer to that. WormBase has a team of about 30 people who generate and curate data in many different ways. The genome sequence of C. elegans was determined at two of the four WormBase groups, and so a lot of data pertaining to gene predictions and other features annotated on to the genome are created and maintained by those groups.

The group at Caltech do a lot of literature curation and extract all sorts of information from the published literature (from hand-curated descriptions of gene function to details of individual RNAi experiments).

Also a lot of data comes from 3rd party collaborators who submit bulk datasets direct to WormBase (e.g. Orfeome data, 'knockout' deletion alleles). In contrast we also get directly submitted data from users at a very small level, e.g. individual allele submissions.

Finally, we also generate data de novo as part of the database build procedure, e.g. calculating molecular weights of proteins.

Can you give me medical advice on how to deal with infectious or parasitic worms?

Unfortunately, no; WormBase is specifically dedicated to the biology of Caenorhabditis elegans, is staffed by Ph.D.s rather than physicians, and would not lawfully be able to provide medical advice over the Internet even if it were an M.D.-staffed database oriented towards pathogenic worms. Please consult your local physician for all medical advice.

Gene Summary Page questions

What does the "% length" mean in the Best Blast Hits table?

BLAST queries can have matches with multiple regions on the same hit. WormBase attempts to reconcile this information and present a value which represents the extent of coverage of all matches on the target sequence.
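As an illustration of the idea only, the sketch below merges overlapping match intervals on a target and reports the fraction of the target they cover. The function name, the 1-based inclusive coordinates, and the sample numbers are assumptions made for this example; the exact calculation WormBase uses is not described here.

```python
def percent_length(hits, target_length):
    """Percent of the target covered by the union of (start, end) matches.

    Coordinates are assumed to be 1-based and inclusive.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(hits):
        if current_end is None or start > current_end + 1:
            # Disjoint from the running block: bank it and start a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / target_length


# Three hypothetical matches on a 200-residue target; the first two overlap.
print(percent_length([(1, 50), (40, 100), (151, 200)], 200))
```

Merging the intervals first keeps overlapping matches from being counted twice.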

-- Tharris 15:13, 9 February 2006 (EST)

What do the different gene model Status lines (e.g. confirmed) shown on the Gene Page and in the GFF files mean?

confirmed_inconsistent - means that a curator has decided that the intron doesn't fit with what we consider to be a valid transcript or there appears to be something wrong with the Transcript that confirms the intron.

confirmed_est - an intron confirmed by EST transcript sequence data

confirmed_cdna - an intron confirmed by cDNA/mRNA transcript sequence data

confirmed_false - these are where curators have confirmed that the confirmed intron is false or artefactual.

other confirmed_* types that a curator can add are:

confirmed_UTR - used when a confirmed intron looks like it is in the UTR of a gene.

confirmed_Homology - although I don't think this has ever been used.

Fetching Sequences

I'd like to fetch the DNA sequence of (some feature or coordinates). How do I go about this?

WormBase offers many ways to fetch sequences of features.

  • Using the Genome Browser
  1. Enter the desired coordinates into the Genome Browser using the format "CHROMOSOME:START..STOP" (eg. X:10..1000). If you don't know the coordinates of the feature of interest, just search for the feature itself.
  2. Select "Display Decorated FASTA File" from the "Reports & Analysis" pop-up menu. Click Go to retrieve the sequence of the region. You can also specify optional formatting of features contained within the sequence by clicking "configure...".
  3. HINT: You can adjust the coordinates of the segment to be retrieved manually or by zooming in or out.

-- Tharris 11:15, 16 February 2006 (EST)

  • Using the Genome Browser for a batch of sequences
  1. On the Genome Browser page, under "Reports & Analysis:", select "Download Sequence File" from the pull-down list; click "Configure"; paste your coordinates (e.g. X:10..1000, one per line) into the "Sequence IDs" box; select the output options you want; then click Go.
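
If you prepare coordinates in a script before pasting them into the Genome Browser, a small helper can check the "CHROMOSOME:START..STOP" format and write one region per line. This is a minimal sketch; the function names are hypothetical and it does not contact WormBase.

```python
import re

# CHROMOSOME:START..STOP, e.g. X:10..1000
REGION = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)\.\.(?P<stop>\d+)$")


def parse_region(text):
    """Split 'X:10..1000' into ('X', 10, 1000); raise ValueError if malformed."""
    match = REGION.match(text.strip())
    if match is None:
        raise ValueError(f"not in CHROMOSOME:START..STOP format: {text!r}")
    return match.group("chrom"), int(match.group("start")), int(match.group("stop"))


def format_batch(regions):
    """Render (chrom, start, stop) tuples one per line for the 'Sequence IDs' box."""
    return "\n".join(f"{chrom}:{start}..{stop}" for chrom, start, stop in regions)


print(parse_region("X:10..1000"))
print(format_batch([("X", 10, 1000), ("I", 5000, 6000)]))
```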

--Raymond 13:01, 14 July 2009 (EDT)

  • Using WormMart

Click Here for some example WormMart Queries.

-- Tharris 11:15, 16 February 2006 (EST)

How can I download all the [3' UTR] sequences from the C. elegans genome?

The best way to download all sequences (for example the 3' UTR sequences) is through WormMart.

Here are the steps: Go to:


- Select the most recent version of WormBase and then select the 'Gene' dataset.  Hit 'next'.
- In the identification section on the next page, check the box to Limit to Gene IDs of Type:
  and select the appropriate gene identifier for how the genes are represented in your list,
  i.e. CGC names such as pal-1, sequence names such as R12B2.4, or the WBGene IDs, such
  as WBGene00000200.  Hit 'next'.
- At the top of the next page, select 'Gene, CDS and Protein Sequences' from the attribute page menu.
- Let the page reload and then select '3' UTR' from the sequence type menu.  Hit 'export'.

Where can I get repeatmasked genomic sequences for C. elegans, C. briggsae, or C. remanei?

How are the repeats determined?

For C. elegans, using the current (17 July 2007) most recent archival release of the database, you can get repeatmasked chromosomal sequences here:


You will need to uncompress the relevant files with "gunzip CHROMOSOME_*_masked.dna.gz" or a similar command.

The C. briggsae repeatmasked genomic sequence is here, for assembly cb25.agp8:


At some point there should be a repeatmasked version of the cb3 assembly, but as of 17 July 2007 there isn't yet.

C. remanei is here:


Note that in all of these cases, the sequences are hardmasked: i.e., the repeat sequences have been replaced by stretches of "N" residues, instead of being marked in some less information-destroying way. By contrast, softmasked sequences would keep the repeat sequences but distinguish them by changing their case: non-repeat sequences would be UPPERCASE, while repeat sequences embedded between the non-repeat sequences would be lowercase.
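
The difference between the two masking styles can be shown on a toy sequence. In this sketch the function names, the 0-based end-exclusive repeat coordinates, and the sequence itself are made up for illustration; real repeat coordinates would come from a repeat-finding program.

```python
def hardmask(sequence, repeats):
    """Replace every repeat interval (0-based, end-exclusive) with N residues."""
    chars = list(sequence.upper())
    for start, end in repeats:
        for i in range(start, min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def softmask(sequence, repeats):
    """Keep the repeat residues but lowercase them; everything else is uppercase."""
    chars = list(sequence.upper())
    for start, end in repeats:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


toy = "ACGTACGTAC"
print(hardmask(toy, [(2, 6)]))  # repeat residues are lost
print(softmask(toy, [(2, 6)]))  # repeat residues are kept, in lowercase
```

A softmasked sequence can always be converted to a hardmasked one, but not the reverse, which is why hardmasking is described above as information-destroying.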

How do I get the sequence of a Brugia malayi protein?

For instance, sna-1 is annotated as being orthologous to B. malayi 13258.m00169, and paralogous to B. malayi 14704.m00455; but where do I go to get these sequences?

For the time being, the best method for getting B. malayi sequences quickly (and without having to download the entire predicted B. malayi proteome by 'licensed FTP') is to do a BlastP search, against the "BMA1_pep" protein sequence set, on TIGR's B. malayi Blast server at:


A successful BlastP search will give a report that has hypertext links to individual protein sequences such as 13258.m00169 and 14704.m00455.

This workaround should stop being needed when TIGR and WormBase have worked out some agreement for WormBase to present individual protein sequences through WormBase's own FTP site and interface; but, as of 20 July 2007, this is the best we can (legally) do.

Another option would be to do a bulk download of the relevant sequence data from TIGR itself. See TIGR's data release policy at http://www.tigr.org/tdb/e2k1/bma1/ for more details.

To give you a glimpse of the things to come, Brugia malayi data imported from Genbank which will soon appear on the main site can be found at:

  • cds sequences (fasta) based on WS187
  • GFF3 genome annotation based on WS187
  • Protein Set based on translations of the GenBank submissions (first imported for WS185)
  • GBrowse of the updated data is soon to come(tm)
  • orthology predictions of Brugia malayi are included in the TreeView pages of C.elegans / C.briggsae and C.remanei genes starting with WS187

Gene structures and gene predictions

I think there should be a gene at the end of clone 'X' but WormBase doesn't show any genes in this region. Why not?

A full and complete description of all C. elegans genes is not known (and may not be accurately known for many years). WormBase attempts to represent all genes that have good experimental evidence, plus a number of genes which have less experimental evidence but which were generated using gene-finding software. If there are any publicly available transcript data (EST, mRNA, etc.) then WormBase should nearly always have attempted to make a gene prediction in that region. However, many poorly expressed genes may not have any transcript evidence and so may not be represented in WormBase at this time. Please help us by letting us know if you have any evidence for a gene that is currently not displayed in WormBase. Aside from transcript evidence (which we always encourage people to submit to GenBank/EMBL/DDBJ), a strong case can be made for creating a new gene if there is good conservation with other species (particularly C. briggsae or C. remanei) and if there are other supporting data (such as a positive RNAi phenotype).

Please also note that your gene may be there but may not be represented in the standard set of tracks in the Genome Browser. Check alternative gene predictions by turning on tracks for the 'GeneFinder' and 'Twinscan' gene predictions. Also consider turning on the 'Obsolete gene models' track as the gene may have existed in WormBase in the past but may have been removed.

-- Kbradnam 17:47, 12 February 2006 (EST)

I have found experimentally that the transcript of gene X is different from the gene model reported in WormBase. We would therefore like to update the gene model on WormBase.

Please send the new transcript sequence with a brief description of the required gene model change to wormbase-help@wormbase.org and a curator will make the appropriate change.

Please also submit your sequence to the EMBL/GenBank/DDBJ database. Because we routinely retrieve sequence data from the public databases, this provides confirming evidence for the WormBase gene prediction. It also makes the data public, allowing appropriate reference and acknowledgement to yourself.

-- Gw3 09:22, 17 August 2007 (GMT)

What criteria does WormBase use to classify a gene as a Pseudogene?

See entry in Glossary_of_terms#P

I'd like to create a diagram of a genomic region similar to that shown on the Genome Browser. How can I do this?

One approach is to write your own scripts in Perl using the Bio::Graphics modules that are part of BioPerl. A second approach is to use the web interface to this software.

In the gff files I see confirmed_inconsistent and other confirmed statuses for introns, what do they all mean?

confirmed_est - an intron confirmed by EST transcript sequence data

confirmed_cdna - an intron confirmed by cDNA/mRNA transcript sequence data

confirmed_inconsistent - means that a curator has decided that the intron doesn't fit with what we consider to be a valid transcript or there appears to be something wrong with the Transcript that confirms the intron.

confirmed_false - these are where curators have confirmed that the confirmed intron is false or artefactual.

confirmed_UTR - used when a confirmed intron looks like it is in the UTR of a gene.

confirmed_Homology - where protein homology looks to confirm the intron, this has seen limited use.

What are seg, signalp and tmhmm motifs on the Protein report page?

  • seg - low complexity regions e.g. homopolymer runs - explanation

Old database releases

How do I remap the chromosomal coordinates between releases?

There is a page describing a perl script and the data to change the coordinates of GFF files here

Gene Model Naming

What do all the different gene names mean?

  • All genes have a corresponding sequence name, which is derived from the cosmid, fosmid or YAC clone on which they reside. For instance, the gene bli-4 has a sequence name of K04F10.4, indicating that it was identified when the cosmid K04F10 was sequenced and annotated, and that there are at least 3 other genes associated with that cosmid.
  • Any gene can code for multiple proteins (CDS) as a result of alternative splicing. In the case of bli-4 there are 6 known isoforms, called K04F10.4a, K04F10.4b, ..... K04F10.4f.
  • The corresponding transcripts for the isoforms are called K04F10.4a.1, K04F10.4b.1, ..... K04F10.4f.1.
  • However, if there is alternative splicing in the UTRs which doesn't change the protein sequence, the alternatively spliced transcripts are named K04F10.4a.1 and K04F10.4a.2.
  • If there are no isoforms of the coding gene (for example AC3.5) but there is alternative splicing in the UTRs, there will be multiple transcripts named AC3.5.1, AC3.5.2, etc.
  • Note, however, that if there are no alternative UTR transcripts, the single coding_transcript is named the same as the CDS and does not have the .1 appended, as in the case of K04F10.4f.
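
The naming scheme above can be taken apart mechanically. The following is a rough sketch that assumes names follow the pattern CLONE.NUMBER, optionally followed by an isoform letter and a transcript number; the function name and the returned keys are hypothetical, and names that do not fit this pattern are rejected.

```python
import re

# CLONE.NUMBER[isoform letter][.transcript number], e.g. K04F10.4a.1
SEQUENCE_NAME = re.compile(r"^([A-Za-z0-9]+)\.(\d+)([a-z])?(?:\.(\d+))?$")


def parse_sequence_name(name):
    """Split a sequence-style name into clone, gene, CDS and transcript parts."""
    match = SEQUENCE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised sequence name: {name!r}")
    clone, number, isoform, transcript = match.groups()
    gene = f"{clone}.{number}"
    return {
        "clone": clone,                      # clone the gene was annotated on
        "gene": gene,                        # sequence name of the gene
        "cds": gene + (isoform or ""),       # CDS name, with isoform letter if any
        "isoform": isoform,                  # None when the gene has no isoforms
        "transcript_number": int(transcript) if transcript else None,
    }


print(parse_sequence_name("K04F10.4a.1"))
print(parse_sequence_name("AC3.5.2"))
print(parse_sequence_name("K04F10.4f"))
```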

C. remanei

state of C.remanei integration

Interpolated Map Positions

For genes without known genetic map positions, we calculate a theoretical position by linear interpolation between the surrounding genetic markers. A gene is considered a genetic marker if:

  • it has a physical map position
  • it has a CGC name
  • it has a genetic map position (experimental or promoted)

Promoted map positions are made for genes that fulfil all other genetic marker requirements and were interpolated during previous builds. The logical order of the genetic markers is checked and curated during the build process to accommodate new experimental data as well as changes in the genomic sequence.
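
Linear interpolation between two flanking markers can be written in a few lines. This is a sketch of the general technique only, not the build code itself; the function name and the marker values are invented for the example.

```python
def interpolate_map_position(physical_pos, left_marker, right_marker):
    """Estimate a genetic position from two flanking markers.

    Each marker is a (physical_position_bp, genetic_position_cM) pair, and the
    gene's physical position is assumed to lie between the two markers.
    """
    (left_bp, left_cm), (right_bp, right_cm) = left_marker, right_marker
    if right_bp == left_bp:
        raise ValueError("markers must have different physical positions")
    # Fraction of the physical distance travelled from the left marker.
    fraction = (physical_pos - left_bp) / (right_bp - left_bp)
    return left_cm + fraction * (right_cm - left_cm)


# Hypothetical markers at 1,000 bp (1.0 cM) and 2,000 bp (3.0 cM).
print(interpolate_map_position(1500, (1000, 1.0), (2000, 3.0)))
```

The estimate is only as good as the ordering of the markers, which is why that ordering is checked during each build.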

Obtaining WormBook articles

WormBook articles are available directly from the WormBook website, the development site, the Sanger WormBook mirror, as well as via Textpresso searches in WormBase.

The entire WormBook may also be downloaded as a zip archive (~400 Mb).

How do I get a list of genes with alternatively spliced isoforms?

Go to the WormBase Query Language Search page and enter the query 'find CDS; Isoform; follow Gene '

If you want to do further investigation of this group of genes set the output to 'Text' and copy the list to use in WormMart

How do I get a list of coding gene transcripts together with the supporting cDNA evidence for them?

Go to the WormBase Query Language Search page and enter the query:

'select l, cdna from l in class Transcript where l->Method = "Coding_transcript", cdna in l->Matching_cDNA '

What are WormBase's precomputed BLAST parameters?

These parameters refer to the precomputed BLAST results shown on the gene and protein report pages.

For BLASTP and BLASTX we use WU-BLAST2.0 with the following parameters

Z=10000000 (sets size of database in letters)
V=1000 (sets the number of one line summaries)
B=1000000 (sets number of database hits to report)
E=0.1 (E from the Karlin-Altschul equation - will not report hits with E-value greater than this)
cpus=1 (sets number of processors to use)
hitdist=40 (Max distance between word hits for 2-word seeding algorithm)

No low-complexity filtering is done for BLASTP. DNA sequences are masked with RepeatMasker and TRF before BLASTing.
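
For anyone reproducing a comparable search locally, the parameters listed above can be assembled into a WU-BLAST style command line, where options are given as name=value pairs after the database and query. This sketch only builds the argument list; the database and query file names are placeholders, and it assumes a WU-BLAST 2.0 installation is available if you go on to run it.

```python
# Parameters copied from the list above.
PRECOMPUTED_BLAST_OPTIONS = {
    "Z": 10000000,
    "V": 1000,
    "B": 1000000,
    "E": 0.1,
    "cpus": 1,
    "hitdist": 40,
}


def wublast_command(program, database, query):
    """Return the argument list: program, database, query, then name=value options."""
    options = [f"{name}={value}" for name, value in PRECOMPUTED_BLAST_OPTIONS.items()]
    return [program, database, query] + options


# "wormpep" and "query.fa" are placeholder names, not real WormBase paths.
print(" ".join(wublast_command("blastp", "wormpep", "query.fa")))
```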

About Cell and Anatomy

How do I find out cell lineage pedigrees?

There are two kinds of pedigree display: the Cell pedigree tree (located on the Cell Page) and the Lineage pedigree tree (located on the Pedigree Browser). The Cell Page is simple and easy to use, with a full description of the cell layout, while the advantage of the Pedigree Browser is that it displays complete lineage pathways (from P0) with the cells of interest highlighted.

Start from the Search box on the WormBase home page: select "Cell" from the pull-down menu and enter the cell name. A "cell summary" display will appear with a Cell pedigree display box showing three generations of cells. Your cell will appear in red on the pedigree. You can move the pedigree tree up or down in the lineage by clicking on the parent cell or daughter cells. Another way to access the pedigree is from "Cell and Pedigree Search" (under the More Searches menu), which searches for specific cells, cell groups, or lineages.

What's the nomenclature for C. elegans cells?

There is a very good article explaining everything about embryonic cell lineage and nomenclature:

Sulston JE et al (1983) Dev Biol. "The embryonic cell lineage of the nematode C. elegans."

That article is the "dictionary" everyone refers to.

P0 is the founder cell for C. elegans. It is the zygote after fertilization. The first few rounds of divisions produce six "founder cells": E, MS, AB, C, D and P4. Each of these founder cells generates different tissues. From then on, cells are named after these founder cells. For example, the daughters of AB are called ABa ('a' means anterior) and ABp ('p' means posterior). ABa will generate daughters ABal ('l' means left) and ABar ('r' means right)... If a cell divides dorsal-ventrally, 'd' or 'v' will be added to the names of its daughters.

Now you know when you see ABalppp , it comes from:


Not only will you see the lineage pathway from the cell name, you will also see in which direction cells have divided and what the sister cells are for each step of the division.
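
Reading a lineage name this way can be sketched in code. The example below assumes a name is one of the six founder cells followed by direction letters (a, p, l, r, d, v); the function names are hypothetical.

```python
FOUNDERS = ("AB", "MS", "P4", "E", "C", "D")
OPPOSITE = {"a": "p", "p": "a", "l": "r", "r": "l", "d": "v", "v": "d"}


def lineage_path(name):
    """List every ancestor from the founder cell down to the named cell."""
    for founder in FOUNDERS:
        if name.startswith(founder):
            suffix = name[len(founder):]
            break
    else:
        raise ValueError(f"{name!r} does not start with a founder cell name")
    path = [founder]
    for letter in suffix:
        if letter not in OPPOSITE:
            raise ValueError(f"unknown division letter {letter!r} in {name!r}")
        path.append(path[-1] + letter)
    return path


def sister(name):
    """Return the sister cell by flipping the last division letter."""
    if name in FOUNDERS or name[-1] not in OPPOSITE:
        raise ValueError(f"{name!r} has no division letter to flip")
    return name[:-1] + OPPOSITE[name[-1]]


print(" -> ".join(lineage_path("ABalppp")))
print(sister("ABal"))
```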

How can I know each C. elegans cell's function and exactly at which stage of the embryonic lineage it appears?

Most of the information you need for a cell should be contained on Cell Report, which can be located by "Cell and Pedigree" search. In WormBase, if you read the Tree Display of a Cell Report, there is a tag called "Embryo_division_time"; it is the time when the cell divides or dies. Unfortunately, for cells generated after hatch, there is no such information in WormBase.

What is the connection between the cell P0, and the cells P1, P2, P3, ..., P7, P8, etc?

There are two sets of P cells. One set arises from early embryonic divisions; these cells are called P0, P1', P2', P3' ... in WormBase, and these are the lineage names. The other set is called P1, P2, P3, ... These are postembryonic blast cells, which are not related to the embryonic founder cells.

P1, P2, P3, ... are adult names for postembryonic blast cells present from hatching until the middle of the first larval stage (L1). Many cells have two names: a lineage name and an adult name. An adult name is the name given to cells that become terminal and differentiate (such as neurons), or that do not differentiate but divide to found an important lineage (such as the P1, P2, ... lineages). Adult names are assigned by cell position and function, so they form a different naming system. Cells with the same adult name can come from different lineages depending on how bilateral symmetry is broken; for example, P7 can develop from either AB.plappapp or AB.prappapp.

Lineage names are accurate and unique but hard for most people to remember, so researchers usually use adult names when running queries. That is why, in WormBase cell nomenclature, whenever an adult name exists we use it to name the cell and store its lineage name in a data field.

How to get all the cell types (neurons, actually) in which a gene is expressed?

When we curate a gene, we enter all the cells and cell groups that express the gene. This information can be easily viewed by clicking the "details" button at the gene page. For example, if you search for eat-16, which is expressed in neurons:

1. At the WormBase home page, select "Any gene", search for "eat-16", and select "Exact Match". This will take you to the Gene Summary page for eat-16.
2. In the Function section, you will see "Anatomic Expression Pattern". Here you will see some information about the eat-16 expression pattern; at the very end of the entry there is a "Details" link.
3. Clicking this link brings you to the Expression Pattern page for eat-16. On this page you will see the detailed cell and cell group information associated with eat-16. (To keep annotation easy, when a gene is expressed in many cells we enter the cell group name instead of all the cell names one by one. Each cell group includes the list of associated cells.)

Is there a file showing the lineage map of the worm?

Leon Avery has something like that on the Web: http://elegans.swmed.edu/parts/

About Orthologs and Homologs

How do I find the ortholog / paralog / etc. of gene X?

We do not explicitly make ortholog assignments in WormBase. This is a non-trivial task and something that we leave to external experts whose results we try to make available. There are several sources that may be useful in WormBase. NCBI COGs, InParanoid and TreeFam are all programs that attempt to predict orthologous relationships. InParanoid and TreeFam are visible from the gene pages (see the cdk-1 page for an example). The COGs are found on the respective protein page.




There are also the precomputed BLAST results that are summarised on the gene pages. Each release we also produce a file of best blastp hits for each worm protein which can be found on the FTP site called best_blastp_hits.WSXXX.gz

In addition, since WS164 we have included predicted orthologue assignments based on Ensembl COMPARA, which predicts orthology of the longest isoform based on homology as well as conserved gene order.

You can run this prepared query in WormMart for compara orthologs

How do I get a list of all C. elegans orthologs of H. sapiens disease genes?

One possible solution is to use EnsMart from Ensembl to query the EnsEMBL databases, which include C. elegans. Go to BioMart, pick Caenorhabditis elegans (homology), and choose C. elegans and H. sapiens as the set. Select orthologues_gene and filter as needed by the different types of orthology (one2one and/or one2many). Then pick H. sapiens as the second dataset and select "associated with disease". That should return a list of all orthologues of human disease genes. The only possible problem is that, due to the different release cycles of EnsEMBL and WormBase, the EnsEMBL data might be slightly out of date (the CELXXX on the EnsEMBL pages refers to the corresponding WSXXX WormBase release).

How can I retrieve nematode specific genes with no homology to yeast, fly, mouse, and human?

/// out dated (Raymond 20080214) /// From the "advanced query" WormBase page, construct the following query:

find predicted_gene NOT Pep_homol

What is the meaning of several abbreviations for proteins that are used by WormBase, like "Protein SW"?

Or "Protein TR" and "Protein WP"? In addition, using the TR Database, sometimes the species origin (e.g., C. elegans) is missing - how can I find out? Furthermore, how can I get from a TR Database entry to the corresponding predicted gene in the C. elegans genome?

SW stands for Swiss-Prot, TR stands for TrEMBL and WP stands for WormPep. In case you're not familiar with any of these protein databases, you can go to http://www.expasy.org/sprot/ and http://www.sanger.ac.uk/Projects/C_elegans/wormpep/ for an explanation and access to them.

Inside Protein SW or Protein TR, you may find the accession number of Swiss-Prot or TrEMBL. You can get all details of the protein (including species origin) by going to http://www.expasy.org/sprot/ and entering the accession numbers.

What do those colorful bars for C. briggsae alignments mean?

Dark blue bars are regions of strong similarity. Light blue bars are regions of weak similarity. Dashed areas don't match.

When there are multiple bars in the same region, it means that there are several C. briggsae clones that all match the region.

How can I retrieve the best blast_p scored homologies of worm genes?

(I.e., the homologies produced automatically with each WormBase build -- roughly every 2 weeks?)

a. Go to the WormBase FTP site by following the "Bulk Downloads" link in the "Web Site Directory" section of the WormBase homepage or by entering the following URL in your browser: ftp://ftp.sanger.ac.uk/pub/wormbase. Select the most current WormBase release (e.g. WS130).

b. Download the two best blastp files in this folder: best_blastp_hits.WS130.gz (elegans homologies) and best_blastp_hits_brigpep.WS130.gz (briggsae homologies).

c. Unpack the compressed files using suitable software, e.g. gunzip (Linux).

d. The files have 15 columns delimited by commas. The contents of the columns are as follows:

   Column 1 : WormBase peptide accession number for elegans peptide
   Column 2 : WormBase peptide accession number for highest homology elegans peptide
   Column 3 : e value for best elegans peptide/worm peptide hit
   Column 4 : Ensembl accession number for highest homology Ensembl sequence
   Column 5 : e value for best elegans peptide/Ensembl sequence hit
   Column 6 : WormBase peptide accession number for highest homology briggsae peptide
   Column 7 : e value for best elegans peptide/briggsae peptide hit
   Column 8 : FlyBase accession number for highest homology fly protein
   Column 9 : e value for best elegans peptide/fly protein hit
   Column 10: Saccharomyces Genome Database accession number for highest homology yeast protein
   Column 11: e value for best elegans peptide/yeast protein hit
   Column 12: Swiss-Prot/UniProt name for highest homology sequence
   Column 13: e value for best elegans peptide/Swiss-Prot sequence hit
   Column 14: TrEMBL accession number for highest homology sequence
   Column 15: e value for best elegans peptide/TrEMBL sequence hit

e. You might also want a file that maps WormBase peptide accession numbers to the corresponding gene in WormBase (warning: a single gene may correspond to multiple peptides). For this you will have to perform an AQL query on WormBase:
- On the banner at the top of the WormBase homepage, select "Searches".
- Select the top search from the resulting list, "Acedb Searches(AQL)".
- Copy and paste the following text into the search text box: select a, a->Cgc_name, c from a in class Gene, c in a->Molecular_name where c like "CE*" order by :1 asc
- Choose the "Text output" radio button and click Query ACeDB (the search may take a few minutes).
- The resulting file contains a tab-delimited mapping of WormBase gene accession numbers to the CGC-approved name for that gene (if it has one) and to the peptide accession number for that gene. Save the results file to your hard drive.
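
The 15-column layout described in step d can be read with a short script. The sketch below assumes the columns appear exactly in the order listed; the field names are invented labels for this example, and the identifiers in the sample row are made up, not real accessions.

```python
import csv

# Labels for the 15 columns, in the order listed in step d (names are hypothetical).
COLUMNS = [
    "elegans_peptide", "best_elegans_peptide", "elegans_evalue",
    "ensembl_id", "ensembl_evalue",
    "briggsae_peptide", "briggsae_evalue",
    "flybase_id", "fly_evalue",
    "sgd_id", "yeast_evalue",
    "swissprot_name", "swissprot_evalue",
    "trembl_id", "trembl_evalue",
]


def read_best_hits(lines):
    """Yield one dict per well-formed row; rows without 15 fields are skipped."""
    for row in csv.reader(lines):
        if len(row) == len(COLUMNS):
            # Values are kept as strings because empty fields may occur.
            yield dict(zip(COLUMNS, row))


# A made-up row, used only to show the shape of the output.
sample = ["CE00001,CE00002,1e-50,ENS0001,1e-20,CBP0001,1e-40,"
          "FB0001,1e-10,SGD0001,1e-5,SW_NAME,1e-30,TR0001,1e-25"]
for hit in read_best_hits(sample):
    print(hit["elegans_peptide"], hit["briggsae_peptide"], hit["briggsae_evalue"])
```

To read the downloaded file directly, pass `gzip.open(path, "rt", newline="")` from the standard library as `lines` instead of unpacking it first.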

How can I download the C. elegans-human gene homology map?

You can download a file that lists the best blastp match to human, fly, yeast, C. briggsae, SwissProt, and TrEMBL proteins for every C. elegans protein from the WormBase FTP site:


The file name is best_blastp_hits.WSXXX.gz where XXX is the release number.

How can I download C. elegans-C.briggsae orthologs and their protein-coding DNA sequences?

One possible way to retrieve those would be to download a C. elegans-C.briggsae ortholog file:

ftp://ftp.wormbase.org/pub/wormbase/briggsae/supporting_data_stein_2003/orthologs_and_orphans/orthologs.txt and C. briggsae gene sequences in fasta format (briggenes.fa.gz):


and write a script that would parse C. briggsae ortholog sequences based on C. elegans gene names.
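
Such a script might look like the sketch below. It makes a strong assumption about the ortholog file: that each line holds a C. elegans gene name followed by its C. briggsae ortholog, separated by whitespace. Check the real orthologs.txt before relying on this. All function names and the sample identifiers are hypothetical.

```python
def parse_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_ortholog_pairs(lines):
    """Map elegans gene -> briggsae gene (ASSUMED two-column layout)."""
    pairs = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and not line.startswith("#"):
            pairs[fields[0]] = fields[1]
    return pairs


def briggsae_sequences_for(elegans_genes, pairs, fasta_records):
    """Return {elegans gene: (briggsae gene, sequence)} for genes with an ortholog."""
    sequences = dict(fasta_records)
    found = {}
    for gene in elegans_genes:
        ortholog = pairs.get(gene)
        if ortholog in sequences:
            found[gene] = (ortholog, sequences[ortholog])
    return found


# Made-up data standing in for orthologs.txt and briggenes.fa.
pairs = read_ortholog_pairs(["geneA CBG_X1", "geneB CBG_X2"])
records = parse_fasta([">CBG_X1 example", "ATGAAA", "TTTTAA", ">CBG_X2", "ATGCCC"])
print(briggsae_sequences_for(["geneA", "geneC"], pairs, records))
```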

Another way would be to use WormMart to get a list of genes with orthologs (filter by Homolog/Ortholog -> Homolog[Compara Ortholog]). In the Attribute section you can select whether you want the sequences or just a table of orthologs.

About the User Interface

What's the difference between the sequence displayed in "Sequence Report" and that in "Genome Browser"?

(For example: Sequence F35E8) The coordinates given in the Sequence Report under 'Genomic Location' are for the sequence F35E8, which is not the full sequence of the clone F35E8. The clone is represented under the diagram of the sequence features and has an arrow pointing off the left end, indicating that the clone extends to the left.

When you click through to the Genome Browser, you're seeing the sequence of F35E8 with the clone again represented under the diagram of sequence features with an arrow pointing left. You have to zoom out to see the full extent of the clone F35E8.

About Nomenclature

How can I register a new lab?

New lab and allele designations should be registered directly with Jonathan Hodgkin (jah@bioch.ox.ac.uk), WormBase genetic nomenclature coordinator.

About Reagents (such as cosmid clones)

What are the different types of clones in WormBase?

1. There are seven different types of "Clone" objects in WormBase:

Type                  Nomenclature
Cosmid                A*, B*, C*, D*, F*, J*, K*, M*, R*, T*, W*, Z*
Fosmid                H*, WRM*
cDNA                  yk*, EC*, EB*, OST*, CK*, EF*, CEE*, CEM*, CES*, CB*, CN*, cm*
Plasmid: PCR clones   V*, EGAP*
Other                 telo clones, 1 BAC, plasmid

Most cosmids, fosmids, YACs can be requested from Sanger, cDNA (yk*) from Dr. Yuji Kohara. The EGAP* plasmids can be obtained from MRC Geneservice. The V* plasmids are no longer available.

Whom could I contact about getting a cDNA clone?

All of the cDNA clones with a yk prefix can be ordered by the following method. All other cDNA clones will have to be requested from the submitting party (found by looking at the EMBL/GenBank entry).

Please go to NextDB(http://nematode.lab.nig.ac.jp), Yuji Kohara cDNA database repository. You can obtain cDNA clones from Yuji Kohara at the National Institute of Genetics, Mishima, Japan: ykohara@LAB.nig.ac.jp

Useful Clone info

1) How do I find out about the vectors used in the genome sequencing project?

If you want the actual sequences of the vectors used, they are on the Sanger FTP site, and this should help you identify the vector for the clone you are interested in.

How do I order C. elegans Cosmids Fosmids and Yacs?

Cosmids and Fosmids are available to the community via these routes:

1) How do I obtain C.elegans cosmids/Yacs?

Information can be found here: Cosmids/YACs

2) How do I obtain C.elegans fosmid from the Moerman fosmid library?

Information can be found here: Fosmids

3) How do I obtain C. elegans fosmid from the Incyte Genomics Inc. fosmid library?

Information can be found Here

How can I find allele information?

We do have information on many thousands of alleles in WormBase. We have also tried to extract the molecular details of the mutations (where known) and add those to WormBase. Some examples:

Go to a gene page: http://www.wormbase.org/db/gene/gene?name=unc-71;class=Locus Then click on the link to the 'ay47' allele (near the bottom of the page), this takes you to: http://www.wormbase.org/db/gene/allele?name=ay47;class=Allele You can see that there is a 'c' to 't' substitution in this gene. If you go to the genome browser display for this gene: http://www.wormbase.org/db/seq/gbrowse/wormbase?name=unc-71 Then turn on the 'SNPs, Knockouts, and other Alleles' track and you will see the positions of the alleles in this gene.

To find other alleles, you can go to the query page: http://www.wormbase.org/db/searches/wb_query and type the following queries (everything between the single quotes): 'Find Allele; Substitution' 'Find Allele; Deletion' 'Find Allele; Insertion'

I'm interested in the CB4858 pas* SNP data. Can I get a bulk download?

A complete dataset of pas SNP data is available from here.

Explanation of dataset:

Substitution - the SNP sits between Flank1 and Flank2 (gggtAtcg), and together these make up the N2 genomic hit. This is a 1 bp feature because the SNP position is contained in the N2 genomic sequence.

ID              Type            N2/CB   Chrom   Coordinate1     Coordinate2     Flank1             Flank2
pas10021        Substitution    A/G     V       19689782        19689782        -cut-aattttgggt    tcgaccttgaaa-cut-

Deletion - the deleted base sits between Flank1 and Flank2 (ttttCacacttt), and together these make up the N2 genomic hit. This is a 1 bp feature because the deleted base is present in the N2 genomic sequence.

ID              Type            N2      Chrom   Coordinate1     Coordinate2     Flank1             Flank2
pas44643        Deletion        C       X       193667          193667          -cut-aaccattttt    acactttttggctta-cut-

Insertion - the insertion is in CB4858, so the two flanking sequences abut each other in N2 (accttaaaaaaaa). You therefore get a 2 bp feature, because the N2 bases to the left and right of the insertion are marked up (note the pair of coordinates). In this case, CB4858 has an A between N2 positions 116070 and 116071.

ID              Type            CB4858  Chrom   Coordinate1     Coordinate2     Flank1             Flank2
pas44644        Insertion       A       I       116070          116071          -cut-aactcaaaacctt aaaaaaaa-cut-
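The relationship between record type and feature length described above can be checked with a short script. The following is a minimal Python sketch, assuming the download is a whitespace-separated table with the eight columns in the order shown; the names PasSnp and parse_pas_line are made up for illustration and are not part of any WormBase tool.

```python
from typing import NamedTuple


class PasSnp(NamedTuple):
    """One row of the pas SNP table, using the column order shown above."""
    snp_id: str
    snp_type: str   # Substitution, Deletion or Insertion
    allele: str     # e.g. "A/G" for a substitution, "C" for a deletion
    chrom: str
    coord1: int
    coord2: int
    flank1: str
    flank2: str

    @property
    def length(self) -> int:
        # Inclusive N2 coordinates: 1 bp for substitutions and deletions,
        # 2 bp for insertions (the N2 bases on either side of the insertion).
        return self.coord2 - self.coord1 + 1


def parse_pas_line(line: str) -> PasSnp:
    """Parse one whitespace-separated data row (assumed layout, 8 columns)."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}: {line!r}")
    snp_id, snp_type, allele, chrom, c1, c2, flank1, flank2 = fields
    return PasSnp(snp_id, snp_type, allele, chrom, int(c1), int(c2), flank1, flank2)


if __name__ == "__main__":
    row = "pas44644 Insertion A I 116070 116071 -cut-aactcaaaacctt aaaaaaaa-cut-"
    snp = parse_pas_line(row)
    print(snp.snp_id, snp.snp_type, snp.length)
```

Header lines would need to be skipped before calling parse_pas_line, since the sketch expects the two coordinate columns to be integers.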

What's the easiest way for users to find the 'TRUE' ends, and thus the insert size, of a clone?

The set of clone ends is dumped as part of the gff files:


This is the source for the extents displayed in WormBase.

The caveat with this is that the 'true' end is not marked up for all clones. The early cosmids do not have such annotations because nobody thought about marking them up. Later cosmids do have clone left and right ends as this became part of the standard procedure. Finally, many of the YACs do not have clone ends because the segment submitted to GenBank/EMBL is much smaller than the full clone, and hence the true ends lie within sequences already finished at that stage of the sequencing (i.e. we never went back to update clone ends in sequence already finished).

How can I FTP-download the genomic DNA database and the EST database for C. elegans?

Our underlying database for WormBase is built on the acedb software (available freely from www.acedb.org). If you have acedb installed locally, you can download the entirety of our database from: ftp://ftp.sanger.ac.uk/pub2/wormbase/live_release

However, a simpler approach may be to just download a GFF file and DNA file for each chromosome from: ftp://ftp.sanger.ac.uk/pub2/wormbase/live_release/CHROMOSOMES/

Where are the flat files for the gene annotation of each chromosome of C. elegans?

You should take a look at the Feature Tables (GFF), which you can pick up from the same 'WormBase Downloads' page where you found the "Summary Tables" (http://www.wormbase.org/downloads.html).

You should also look at the 'Batch Downloads' page at WormBase (http://www.wormbase.org/db/searches/info_dump), where you can build your own tables of gene annotations.

One other WormBase page you should look at is the "Genome Dumper" (http://www.wormbase.org/db/searches/advanced/dumper).

Where can I find details of the methods used to create wormpep?

We make WormPep during each release of WormBase and the starting point is always a translation of our latest set of gene predictions. Gene predictions are initially based on the GeneFinder prediction program with human modification as is deemed necessary. The level of human involvement really depends on what other supporting data is available. Aside from routine inspection of gene predictions based on EST/mRNA data we also evaluate our predictions based on information from published papers and direct contact from the worm community. All gene predictions have been looked at by a human to some level.

We have started to distinguish subsets of WormPep. Thus all WormPep proteins can be thought of as either 'CONFIRMED', 'PARTIALLY CONFIRMED', or 'PREDICTED'. The first set contains all genes where there is transcript evidence for every base of every exon of the gene (note that this can still - in theory - mean that there are unpredicted exons in a 'CONFIRMED' gene). The second set contains genes for which there is some transcript evidence but the whole gene is not yet supported...either due to lack of transcript evidence or errors in our current gene prediction. The last set is everything else, i.e. genes with no transcript support. In the future we may expand this classification system to take account of other evidence (e.g. homology info from C. briggsae).

Each new build usually sees a slight increase in the first two categories and a drop in the third category. The relevant status of each Wormpep entry is added into the FASTA header of every entry in each WormPep release.

I cannot find information for a gene mentioned in early microarray or RNAi studies: why?

In early versions of microarray and RNAi libraries, clone and gene names were often used synonymously. Because gene models and names change between versions of WormBase and the history has not been carefully preserved, this has caused much confusion. Fortunately, for those clones for which we have sequence information, we provide an up-to-date mapping from each clone to the current gene models. However, sometimes we don't have sufficient sequence information for a given clone and are thus unable to provide any information about its identity; in that case one must ask the primary generators of that clone (the corresponding authors of the publication) for more information. Below is an example of a 'lost' clone:

"what is the present location of Y41D4A_2491.a?"

Simple answer: there is none. A simple search for "Anything" "Y41D4A_2491.a" produces hits indicating that Y41D4A_2491.a was used as a clone/gene name in the Stanford microarray library. For sequence information, WormBase177 only has sequences for the oligos, not for a PCR_product. The pair of oligo sequences (Oligo: sjj_Y41D4A_2491.a_b ; Oligo: sjj_Y41D4A_2491.a_f) fails to produce an ePCR product, and each oligo individually fails to map to the genome when searched with the Genome Browser oligo mapping tool.

We keep up-to-date mapping files here: ftp://caltech.wormbase.org/pub/annots/rnai/

I am trying to figure out what convention was used when the gene names were changed from a letter code to a number code (for example Y17G9B.a-i to Y17G9B.1-9).

This naming convention change occurred following the initial annotation phase back in the 90's. Genomic clones were originally submitted with cosmid.letter annotations prior to 1998 but this was changed to increase the depth of the nomenclature as some clones started to have more than 26 genes.

There are 3 approaches to identify the current gene that corresponds to an original letter code gene locus.

1) Search through the wormpep.history file within the Wormpep190.tar.gz archive found here.

If you look for your gene, e.g. Y17G9B.g, you get:
Y17G9B.g CE21394 17 18

Then look for all occurrences of the CE21394 number:
Y17G9B.g CE21394 17 18
Y17G9B.4 CE21394 18 72

From this you can assume that .g was renamed .4, as the gene encodes the same protein.
This doesn't always work, as the gene may have undergone annotation changes that break this link.

2) If the above doesn't work and you have a small list you can blast the old peptide against the genome and see which gene it overlaps.

Look in the wormpep.fasta190 file for the peptide sequence, using the protein ID found with grep as before:

grep Y17G9B.a wormpep.history190
Y17G9B.a CE21388 17 18

Then TBLASTN the peptide sequence against the C. elegans genome and see where it hits the current assembly.

3) If you are interested in genes used in microarray and RNAi libraries see the previous FAQ.
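Approach 1 above (finding other gene names that share a protein ID in the history file) can be sketched in a few lines of Python. This is only an illustration: it assumes each history line is whitespace-separated with the gene name in the first column and the CE protein ID in the second, as in the examples above, and the function names are hypothetical.

```python
from collections import defaultdict


def index_by_protein(history_lines):
    """Map each protein ID (column 2) to the gene names (column 1) that carried it."""
    index = defaultdict(list)
    for line in history_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, protein_id = fields[0], fields[1]
        if name not in index[protein_id]:
            index[protein_id].append(name)
    return index


def other_names(old_name, history_lines):
    """Return the other gene names that share a protein ID with old_name."""
    index = index_by_protein(history_lines)
    matches = []
    for names in index.values():
        if old_name not in names:
            continue
        for name in names:
            if name != old_name and name not in matches:
                matches.append(name)
    return matches


if __name__ == "__main__":
    history = [
        "Y17G9B.g CE21394 17 18",
        "Y17G9B.4 CE21394 18 72",
        "Y17G9B.a CE21388 17 18",
    ]
    print(other_names("Y17G9B.g", history))
```

An empty result means the protein link has been broken by later annotation changes, in which case approach 2 is the fallback.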

Where can I obtain C. elegans strains?

Please refer to the following link to obtain C. elegans strains:


About Database Queries

How do I get AQL to search data in hashes?

Like this example:

select p->Standard_name, a[Institution], a[Email] from p in class Person, a in p->Address[0] where exists p->Supervised

More documentation is available at http://www.acedb.org/Software/whelp/AQL/examples_worm.shtml; scroll down to "Queries on objects containing hash structures."

How can I obtain all the abstracts on Wormbase and the particular genes that they are associated with?

There are two ways:

1) go to ftp://ftp.sanger.ac.uk/pub/wormbase/current_release/ and get the acedb data files.

2) Use AcePerl to get the abstracts. You can do this easily with an Aceperl script:

   use Ace;
   # 'yourhost.com' is a placeholder for the host running your AceDB server
   my $db = Ace->connect(-host => 'yourhost.com') || die 'Connection failed: ', Ace->error;
   my $iterator = $db->fetch_many(-query => qq(find Paper where Abstract));
   while (my $obj = $iterator->next) {
       # grab info from the object
       my @genes = $obj->Gene;
       # ... etc ...
       print join(' ', @genes), "\n";
   }

How can I find out how many genes contain expression patterns generated with a specific method?

(For example, by in situ hybridization?)

Type the following command into the "Advanced Search" box under the menu. This line searches for Expr_pattern objects that contain all three types of method data. If each '&' is replaced by '|', the command will instead search for Expr_pattern objects with In_situ OR Antibody OR Reporter_gene data. To search for a single method, keep only that condition (e.g. Type="In_situ").

find Expr_pattern Type="In_situ" & Type="Antibody" & Type="Reporter_gene"

You may change the words following the same syntax to search for other objects.

How can I download the alignments of EST sequences to genomic sequences?

You can extract it from the GFF files that we provide with every release of WormBase. For more information on GFF files see:


Basically, we release one GFF file per chromosome and this contains the coordinates and details of most features that we can map onto chromosome base pair coordinates.

These files are accessible from the main WormBase page (see the Feature table links on the right) and should also be on the WormBase and Sanger Institute FTP sites.

You will need to extract only a subset of these files, i.e. lines that match the pattern 'BLAT_EST_'. This is very easy to do if you have access to a UNIX/Linux system (use the 'grep' command).

E.g. here are two sample lines from the Chromosome II file (these will probably wrap around your screen):

CHROMOSOME_II BLAT_EST_BEST similarity 5754433 5755008 100 . . Target "Sequence:yk776e12.5" 21 596
CHROMOSOME_II BLAT_EST_OTHER similarity 5755968 5755971 98.4 . . Target "Sequence:yk4g4.5" 116 119

Within these lines are details of the chromosome coordinates, the BLAT score, the matching sequence name, and the coordinates within the matching sequence.
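If you would rather not use grep, the same extraction can be sketched in Python. The example below assumes the GFF files are tab-delimited with nine columns (sequence, source, feature, start, end, score, strand, frame, attributes) and that the last column has the Target "Sequence:name" start end form shown in the sample lines; the function name parse_blat_est is made up for this illustration.

```python
import re

# Matches the attribute column of a BLAT EST line, e.g.
#   Target "Sequence:yk776e12.5" 21 596
TARGET_RE = re.compile(r'Target\s+"Sequence:([^"]+)"\s+(\d+)\s+(\d+)')


def parse_blat_est(line):
    """Return a dict for a BLAT_EST_* GFF line, or None for any other line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9 or not cols[1].startswith("BLAT_EST_"):
        return None
    match = TARGET_RE.search(cols[8])
    if match is None:
        return None
    return {
        "chrom": cols[0],
        "source": cols[1],
        "start": int(cols[3]),       # chromosome coordinates
        "end": int(cols[4]),
        "score": float(cols[5]),     # BLAT score
        "est": match.group(1),       # matching sequence name
        "est_start": int(match.group(2)),  # coordinates within the EST
        "est_end": int(match.group(3)),
    }


def blat_est_hits(lines):
    """Yield parsed hits from an iterable of GFF lines (e.g. an open file)."""
    for line in lines:
        hit = parse_blat_est(line)
        if hit is not None:
            yield hit
```

Passing an open chromosome GFF file to blat_est_hits would then yield one dictionary per EST alignment, which can be written out in whatever format you need.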

How can I retrieve timestamps of Acedb from the command line, and do I have to use Perl?

We use AcePerl to retrieve some timestamp information...this is done via an AQL query.

E.g. if you wanted to find the timestamp of a tag in a particular object, belonging to a particular class, you could do:

my $aql_query = "select s,s->$tag.node_session from s in object(\"$class\",\"$object\")";
my @aql = $db->aql($aql_query);
my $timestamp = $aql[0]->[1];

How can I search for pseudogenes in Wormbase?

AQL queries can take a long time for this. A quicker alternative is to use the query page on the WormBase website.

From More search -> Advanced search at http://www.wormbase.org/db/searches/query

In Query Acedb, type in

find sequence *; pseudogene

You should get a result of pseudogene objects within a couple of seconds.

Where can I find a list of classes and subclasses for Acedb?

You can find a list of Acedb classes by first clicking on the More Searches link on the upper right corner of the WormBase home page. From here, select the WormBase Class Browser, which will bring you to a searchable drop-down menu of all the Acedb classes.

For performing queries, it is helpful to know the data model for each of the classes that you would like to search. The data models can be accessed from this same page by typing the class of interest into the search box and then selecting "Model" from the drop-down menu. This will lead you to a Tree Display that diagrams how data for a particular class is represented in Acedb.

Also, from the More Searches link, you can access the Advanced AQL Search, which has further documentation and examples for querying the database.

How can I retrieve the gi numbers only for a list of entries having the GO term selected?

At present, you can retrieve GenBank identifiers (i.e. AAMxxxxx, AAKxxxxx, AAFxxxxx, etc.) for CDSs that are associated with a particular GO term by performing an AQL query. Here are the steps:

   1) At the top right corner of the WormBase homepage, click on the More Searches link.

   2) Under the general heading, select the Advanced AQL Search link.

   3) Type the following query into the box:
   select a, b, c[1] from a in class CDS,
   b in a->go_term,
   c in a->protein_id where b = "GO:0003700"

   4) You should get back a three-column table listing each CDS, the GO term you selected, and a Genbank ID.

If you are interested, the rationale for the query can be better understood by looking at our data model for CDSs, which is at http://www.wormbase.org/db/misc/etree?name=%3FCDS&class=Model;expand=Visible#Visible. The above query searches the CDS class: in the attribute go_term, which we have restricted to "GO:0003700", and in the attribute protein_id, for the unique text ID that is the database identifier. The [1] after the letter c in the query indicates that the search will retrieve the information in column 1 (the text column) of protein_id, since the sequence column is considered column 0.

How can I download C. briggsae 3' UTRs in bulk?

We don't really have a strictly empirical set of 3' UTRs (3' flanking sequences taken from cDNA). However, what you probably really want are predicted 3' UTR regions. You can get those by going to the Genome Dumper:


selecting the species "C. briggsae", and then filling in the options for the download that you want. The main stumbling block is that the list you need for briggsae genomic sequences is rather long. However, I've tried typing this:


into the window "Type in a list of sequence or chromosome names", and that seems to successfully prompt the Genome Dumper to search through all available C. briggsae genomic contigs.

Also, you should pick the "Integrated ('hybrid') briggsae gene set", and select some reasonable value (e.g., 1000 bp) for the length of the 3' flank sequences that you want.

How can I find the coding sequences of alleles for particular gene(s) having SNPs from C. elegans?

For instance, if you want to find the SNP sequences for the H39E23.1a gene, you can use the following AQL query:

select a->predicted_gene, a, a->flanking_sequences[1],
a->flanking_sequences[2], a->substitution from a in class allele where
a->predicted_gene = "H39E23.1a" and a->method = "snp"

The output of the query (in text mode) looks like this:

H39E23.1a snp_AH10.2 tgaaaaaaactaatttttaatgtga tcttggccacaattgacctagtttg [A/G]
H39E23.1a snp_AH10.3 ctgaacaactgaaaaaggaaagaaa agggaaaaagttcgaccacaaaaaa [G/A]

Here the first column is the gene name, second is the allele name, third and fourth are sequences flanking the allele and the last one is the actual allele sequence change. You can modify the query to retrieve information for genes that you're interested in.
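To turn one row of this output into the two alternative sequences around the SNP, you can join the flanks with each base in turn. The Python sketch below assumes the last column always has the bracketed two-allele form shown above (e.g. [A/G]); the function name is hypothetical, and which of the two bases corresponds to the reference strain is not stated by the query output, so the sketch simply returns them in the order given.

```python
def allele_sequences(flank1, flank2, substitution):
    """Build both versions of the sequence around a SNP.

    substitution is expected in the form "[A/G]"; the result is a tuple
    (flank1 + first base + flank2, flank1 + second base + flank2).
    """
    text = substitution.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"unexpected substitution format: {substitution!r}")
    alleles = text[1:-1].split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles, found: {substitution!r}")
    first, second = alleles
    return flank1 + first + flank2, flank1 + second + flank2


if __name__ == "__main__":
    # Values taken from the snp_AH10.2 row in the example output above.
    seq_a, seq_b = allele_sequences(
        "tgaaaaaaactaatttttaatgtga",
        "tcttggccacaattgacctagtttg",
        "[A/G]",
    )
    print(seq_a)
    print(seq_b)
```

The two returned sequences differ only at the position immediately after the first flank.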

How can I download the spliced and non-spliced regions for all C. elegans or C. briggsae genes?

You can download spliced/unspliced sequences for a list of genes using the Batch Gene tool (http://www.wormbase.org/db/searches/info_dump). Paste a list of the genes you're interested in into the search box and select the Spliced and Unspliced check boxes in the Sequence field. If you output the data as text, you'll be able to save it to your hard disk.

To get the list of C. briggsae genes (so that you can paste it into the search box), you can use the following query: select a from a in class cds where a->species like "*briggsae*".

How do I find all genes with transmembrane or signalp domains?

Go to the WB Query page and enter this query.

'find Wormpep where Feature AND NEXT = "signalp"; follow Corresponding_CDS; follow Gene'

Substitute "tmhmm" for "signalp" to get genes with transmembrane domains.

Alternatively, all genes with transmembrane domains are automatically assigned the GO term GO:0016021.

WormMart questions

Is there a way for a large number of genes to get not just the alleles, but also the actual mutation (when known)?

Here's what you need to do:

  • Open WormMart
  • Click Filters in the left menu, then expand the Other Annotation filter and under Annotated with select [Sequence] Flanking Sequence
  • Upload your file of GeneIDs to the Specified identifier of type field.
  • Hit Count then Results to check your file is being read. nb. if you have chosen WS176 you should get n/80791.
  • Click Attributes in the left menu and under Identification select Variation (Name), Method, Variation Type (merged) and Mutation Type. Under Affects select Gene (WB Gene ID) (merged), Gene (CGC name) and under Description select Sense, Sense Text, Splice site, Splice Site Text and Frameshift
  • Hit Results

--Tuli 04:42, 6 July 2007 (EDT)

How can I retrieve 1.5 kb promoter region upstream of a bunch of genes?

  • Open WormMart at http://www.wormbase.org/biomart/martview/
  • Click Filters in the left menu, then expand the "Identification" filter, tick "[Gene] ID(s) of Type", select "[Gene] Any Name", upload (using "Browse") or type in a list of genes
  • Hit Count to check your file is being read correctly
  • Click Attributes in the left menu, tick "Gene Sequences", expand "Sequence Type", tick "Flank (Gene Coding Region)" (for upstream of translation start site), expand "Flanking Regions", tick "Upstream flank", type 1500 in the box
  • Hit Results
  • Results can be exported or e-mailed
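WormMart works out the upstream flank for you, but if you ever need to do the same arithmetic yourself from gene coordinates, it can be sketched as follows. This is an illustration only: it assumes 1-based inclusive coordinates and a "+"/"-" strand symbol, and the function name and the optional chrom_length parameter are invented for the example.

```python
def upstream_region(start, end, strand, flank=1500, chrom_length=None):
    """Return (region_start, region_end) of the upstream flank, or None if empty.

    start/end are the 1-based inclusive coordinates of the coding region.
    On the "+" strand the flank lies before start; on the "-" strand it lies
    after end (and should be reverse-complemented when the sequence is read).
    """
    if strand == "+":
        region_start = max(1, start - flank)   # do not run off the chromosome start
        region_end = start - 1
    elif strand == "-":
        region_start = end + 1
        region_end = end + flank
        if chrom_length is not None:
            region_end = min(region_end, chrom_length)  # clamp at chromosome end
    else:
        raise ValueError(f"strand must be '+' or '-', not {strand!r}")
    if region_end < region_start:
        return None  # gene sits at the very edge of the chromosome
    return region_start, region_end


if __name__ == "__main__":
    print(upstream_region(5000, 6000, "+"))
    print(upstream_region(5000, 6000, "-"))
```

For a 1500 bp flank the returned region is 1500 bp long unless it is truncated by a chromosome end.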

How can I find all genes expressed somewhere with a particular GO term?

(E.G., Find all genes expressed in the Vulva that have signal transducer activity?)

First you need to identify the exact ontology identifiers that correspond to your query. Search for each term in the main search box on the home page with the appropriate category selected.

   vulva  -  WBbt:0006748
   signal transducer activity - GO:0004871 

Armed with these you can start your WormMart query.

  • Select which version of the database you are interested in (current release or a recent frozen one)
  • Select 'Gene' DATASET
  • In the left panel click 'Filters'. This will give an expandable list of filters in the main window.
  • Expand 'Annotation'
  • Check the ' [Annotation] IDs of Type: ' box and select '[Function] GO term ID' from the dropdown menu
  • put the GO term found earlier (GO:0004871) in to the box (or upload from file). (NOTE:You must include the GO: part)

This will find all genes annotated with GO term GO:0004871. Hit Count to see the results so far (111 at the time of writing). Now we will add a second dataset to cross-reference this result with.

  • Back in the left panel, click 'Dataset' to get a drop down menu of other datasets.
  • Select 'Expression Pattern'
  • In the left panel, under 'Dataset' there is another 'Filters' option. Click to get a similar list as above.
  • Select ' Expressed in '
  • Check ' Specified identifiers of type ' box and choose ' Anatomy Term [eg WBbt:0006748]' from drop down list.
  • Enter the Anatomy term found earlier (or upload from file). (NOTE:You must include the WBbt: part)

This completes the querying part, you now need to select what information you want about the genes that the search finds.

  • Click the 'Attributes' section in the left panel and expand the boxes as you need to select output categories of the Gene.
  • Click the 'Attributes' under 'Datasets' section to select attributes to do with the expression pattern.

This link goes to the completed query. You can click on the relevant sections as described above to change any of GO or Anatomy terms and output data. You may need to click 'Results' to see the output of the query.

How do I pull out all operon details and the names of genes contained in an Operon?

You can retrieve this data through WormMart

  • Select which version of the database you are interested in (current release or a recent frozen one)
  • Select 'Gene' DATASET
  • In the left panel click 'Filters'. This will give an expandable list of filters in the main window.

Select these Filters:

[Gene] Species : Caenorhabditis elegans
[Gene] Status : Live
[Location] : Operon:Only (Annotation Tab - Limit to Entries Annotated with:)

  • In the left panel click 'Attributes'
  • In the right panel select these Attributes:

Gene WB ID [IDs tab]
Operon [Location tab]
Operon Start [Location tab]
Operon End [Location tab]

This will give you a table like:

Gene WB ID	Gene Public Name  Operon	Operon Start (bp)  Operon End (bp)
WBGene00000001	aap-1	          CEOP1906	5106224	           5111008
WBGene00000037	ace-3	          CEOP2632	14197942	   14210076
WBGene00000038	ace-4	          CEOP2632	14197942	   14210076
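Once the table has been exported, listing the genes contained in each operon is a simple grouping step. The Python sketch below assumes a tab-separated export whose header row uses the column names shown above ("Gene Public Name" and "Operon"); the function name genes_by_operon is made up for this example.

```python
import csv
import io
from collections import defaultdict


def genes_by_operon(tsv_text):
    """Group gene public names by operon from a tab-separated WormMart export."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    operons = defaultdict(list)
    for row in reader:
        operons[row["Operon"]].append(row["Gene Public Name"])
    return dict(operons)


if __name__ == "__main__":
    export = (
        "Gene WB ID\tGene Public Name\tOperon\tOperon Start (bp)\tOperon End (bp)\n"
        "WBGene00000001\taap-1\tCEOP1906\t5106224\t5111008\n"
        "WBGene00000037\tace-3\tCEOP2632\t14197942\t14210076\n"
        "WBGene00000038\tace-4\tCEOP2632\t14197942\t14210076\n"
    )
    for operon, genes in genes_by_operon(export).items():
        print(operon, ", ".join(genes))
```

To read a saved export instead, pass the file contents (for example from open(path).read()) to genes_by_operon.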

How do you retrieve all the protein sequences of genes within Operons?

You can retrieve this data through WormMart

  • Select which version of the database you are interested in (current release or a recent frozen one)
  • Select 'Gene' DATASET
  • In the left panel click 'Filters'. This will give an expandable list of filters in the main window.

Select these Filters:

[Gene] Species : Caenorhabditis elegans
[Gene] Status : Live
[Location] : Operon:Only (Annotation Tab - Limit to Entries Annotated with:)

  • In the left panel click 'Attributes'
  • In the right hand window click 'Gene Sequences'

Select These Attributes:

Sequence Type:
Header Attributes
 Gene WB ID
 Gene Public Name
 WB Wormpep ID

  • Click on 'Results'

This should give you a count of ~2800 and results like:

> WBGene00000814|csn-2|WP:CE27562
etc. etc.
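If you want to process the downloaded file further, the header lines can be split back into their three attributes. The following Python sketch assumes headers have exactly the pipe-separated form shown above (Gene WB ID, Gene Public Name, WB Wormpep ID with a WP: prefix); the function name is hypothetical.

```python
def parse_header(header):
    """Split a FASTA header like '> WBGene00000814|csn-2|WP:CE27562'.

    Returns (gene_id, public_name, wormpep_id) with the 'WP:' prefix removed.
    """
    fields = header.lstrip(">").strip().split("|")
    if len(fields) != 3:
        raise ValueError(f"expected 3 pipe-separated fields: {header!r}")
    gene_id, public_name, wormpep_id = fields
    if wormpep_id.startswith("WP:"):
        wormpep_id = wormpep_id[len("WP:"):]
    return gene_id, public_name, wormpep_id


if __name__ == "__main__":
    print(parse_header("> WBGene00000814|csn-2|WP:CE27562"))
```

If you selected different header attributes in WormMart, the number and order of fields will differ and the sketch would need adjusting.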

How do I retrieve a list of transcription factors which when mutated or targeted by RNAi cause embryonic lethal phenotype?

  • Open WormMart
  • Select Database, e.g. "WormBase Release 188"
  • Select Dataset - "Phenotype"
  • Click on "Filter" link on the left and then on the "+" next to "Phenotype Annotation"
  • Select "Phenotype Inc. Descendents" checkbox and select "embryonic_lethal" from the pull down menu

If you click "Count" button at this point, you should see the number of entries that are annotated with this phenotype (this is not necessary).

Now add a second dataset:

  • Click on second "Dataset" link on the left
  • Choose Additional Dataset - "Gene"
  • Click on "Filter" link on the left (for the second dataset) and then on the "+" next to "Annotation"
  • Select "[Annotation] IDs of Type" checkbox, select "[Function] GO Term ID" and enter GO:0003700 in the box below (corresponds to transcription factor activity)
  • Click on "Attributes" link on the left (for the second dataset) and then on the "+" next to "Function" and select "GO Term Info (merged)" checkbox (if you want to see GO annotations in addition to attributes selected by default, which you can change for each dataset through the Attributes dialog)
  • Press Results button and Export all results to File (also check Unique results only)

Here is what you should see.

How do I download/generate a file containing the unspliced transcripts like I see on the sequence pages of WormBase?

I would like to download the sequences that I see on the sequence summary pages eg.


To do this, replicate the following WormMart query.

Dataset:    CHOOSE DATABASE: WormBase WS195
                 CHOOSE DATASET: Gene

Filters leave as default and add (*)

       [Gene] Species : Caenorhabditis elegans

       [Gene] Status : Live

        * Annotation: [Transcript] Type: Coding

Attributes: select [Gene Sequences] at top of page

        Sequence Type: Unspliced (Transcript)

        Header Attributes: Whatever the user requires

This should give you a count in the region of 22,000 objects and yield ~27,000 sequence objects in your file.

 If you have a specific list of genes you want sequence data for, you can upload a file of IDs.
e.g. WBGene IDs file format is:


Go back to your WormMart session and, on the Filters tab, select ([Annotation] IDs of Type:) and upload your file.

WormBase will provide a pre-computed file under the sequence directory on the ftp site:


Which GFF source and feature (method) should I use?

The terms feature and method are used interchangeably. See GFF_source_methods.