UserGuide:Revised Nomenclature


Overview

Species

In WormBase, species are referred to by their Linnaean binomial name (e.g. Caenorhabditis elegans or C. elegans).

The following species have their gene annotation manually curated.

  • C. elegans
  • C. brenneri
  • C. briggsae
  • C. japonica
  • C. remanei
  • Brugia malayi

All other species in WormBase have their gene annotation imported from the authors or are predicted.

Genomes, assemblies, clones and contigs

Reference genomes in WormBase are given version names. For example, C. elegans has the version names WBcel215 (an old version) and WBcel235 (the current version). A. suum has two assemblies from different groups in WormBase, and these have the version names AscSuum_1.0 and ASU_2.0.

The genomes of most species in WormBase are incompletely assembled and remain as contigs of various sizes. Only C. elegans and C. briggsae have been assembled into chromosomes. The chromosomes of C. elegans have the names:

  • CHROMOSOME_I
  • CHROMOSOME_II
  • CHROMOSOME_III
  • CHROMOSOME_IV
  • CHROMOSOME_V
  • CHROMOSOME_X
  • CHROMOSOME_MtDNA

These may be abbreviated to the chromosome letter (I, II, III, IV, V, X, MtDNA).
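The mapping between the full and abbreviated chromosome names can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and are not part of any WormBase tool.

```python
# Chromosome names exactly as listed above for C. elegans.
CELEGANS_CHROMOSOMES = (
    "CHROMOSOME_I", "CHROMOSOME_II", "CHROMOSOME_III", "CHROMOSOME_IV",
    "CHROMOSOME_V", "CHROMOSOME_X", "CHROMOSOME_MtDNA",
)
_PREFIX = "CHROMOSOME_"


def abbreviate_chromosome(full_name: str) -> str:
    """Return the short form, e.g. 'CHROMOSOME_IV' -> 'IV'."""
    if full_name not in CELEGANS_CHROMOSOMES:
        raise ValueError(f"not a C. elegans chromosome name: {full_name!r}")
    return full_name[len(_PREFIX):]


def expand_chromosome(short_name: str) -> str:
    """Return the full form, e.g. 'X' -> 'CHROMOSOME_X'."""
    full_name = _PREFIX + short_name
    if full_name not in CELEGANS_CHROMOSOMES:
        raise ValueError(f"not a C. elegans chromosome letter: {short_name!r}")
    return full_name
```

Checking against the fixed list means that names from other species, such as the C. briggsae names, are rejected rather than silently converted.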

The C. elegans chromosomes are composed of tiling paths of clones from the original sequencing project. These clones have names like 'B0001', 'R12C12', 'VF38E11R', 'Y48E1B', 'ZK6', as given by the various groups that produced the clones.

The chromosomes of C. briggsae have the names:

  • chrI
  • chrI_random
  • chrII
  • chrIII
  • chrIII_random
  • chrIV
  • chrIV_random
  • chrV
  • chrV_random
  • chrX
  • chrX_random
  • chrun

The C. briggsae '*_random' chromosomes are those where the clones are known to belong to a chromosome, but their order and position in the chromosome is unknown. The 'chrun' chromosome is a set of clones whose chromosome is not known.

CDS and Transcript

A CDS (coding sequence structure) is the part of a gene locus that codes for a protein product.

In C. elegans these are guaranteed to start with an AUG (or another legal initiation codon) and to end with a STOP codon, with no internal STOP codons. The one exception is scbp-2, which has a non-standard initiation codon.

This may not always be the case in other curated WormBase species' CDSs.

CDSs are the only part of a gene locus that is manually curated. The protein product, the Transcript and the Gene span are all constructed automatically from the CDS structure and available transcript evidence.

A CDS is named in C. elegans after the clone it is created on followed by a dot and the next available number for naming objects on the clone. For example 'AC3.3' is the third CDS made on the clone 'AC3'.

EST and mRNA evidence is then used to extend the CDS structure to model the 5'UTR and the 3'UTR of the expected mature mRNA Transcript of the CDS.

If there is evidence for one or more isoforms of a CDS at a locus, then they are distinguished by giving them letters after their name. For example, if there is evidence for a different structure of CDS at the locus of the 'AC3.3' CDS, then the existing CDS will have its Sequence Name changed to 'AC3.3a' and the new one will have the Sequence Name 'AC3.3b'. There is nothing special about CDS names ending in 'a': they are not necessarily longer, better annotated or more important than the other isoforms at that locus. They are simply the first structure that was created. Transcripts of the CDS isoforms will have the same Sequence Names as the CDSs ('AC3.3a' and 'AC3.3b', for example).

If there is evidence for alternative splicing in the 5'UTR or the 3'UTR of a Transcript object, then Transcript isoforms will automatically be created. These Transcript isoforms are distinguished by adding a dot and numbers after the Sequence Name of the CDS. For example, if 'AC3.3a' has alternate splicing in a UTR giving rise to two Transcript isoforms, the CDS structure will be unaffected and it will retain its name, but the two Transcript isoforms will be created and will have the names 'AC3.3a.1' and 'AC3.3a.2'.
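The naming scheme for C. elegans CDSs and Transcripts (clone, dot, number, optional isoform letter, optional dot and transcript number) can be sketched as a small parser. The class and function names are hypothetical. The pattern also assumes that clone names are purely alphanumeric and that the isoform suffix is a single lowercase letter, which matches the examples in this guide but is not stated as a rule.

```python
import re
from typing import NamedTuple, Optional


class SequenceName(NamedTuple):
    clone: str                 # e.g. 'AC3'
    number: int                # e.g. 3
    isoform: Optional[str]     # e.g. 'a', or None when there are no CDS isoforms
    transcript: Optional[int]  # e.g. 1, or None for a CDS-level name

    @property
    def gene_sequence_name(self) -> str:
        """Sequence Name of the Gene: the name without the isoform letter."""
        return f"{self.clone}.{self.number}"

    @property
    def cds_name(self) -> str:
        """Sequence Name of the CDS: the gene name plus any isoform letter."""
        return self.gene_sequence_name + (self.isoform or "")


# Assumed shape: <clone>.<number>[isoform letter][.<transcript number>]
_PATTERN = re.compile(
    r"^(?P<clone>[A-Za-z0-9]+)\.(?P<number>\d+)"
    r"(?P<isoform>[a-z])?(?:\.(?P<transcript>\d+))?$"
)


def parse_sequence_name(name: str) -> SequenceName:
    m = _PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognised sequence name: {name!r}")
    transcript = int(m["transcript"]) if m["transcript"] else None
    return SequenceName(m["clone"], int(m["number"]), m["isoform"], transcript)
```

For example, `parse_sequence_name("AC3.3a.2")` separates the clone 'AC3', the CDS 'AC3.3a' and the transcript number 2, while `gene_sequence_name` recovers 'AC3.3'.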

Genes and gene classes

When a new locus is implicitly created by creating a new CDS or non-coding RNA transcript structure at a location on the genome, that locus has a Gene created which is assigned a WormBase Gene ID like 'WBGene00000024'. The Gene ID uniquely refers to this locus (with all of its CDS and Transcript structures) in the WormBase database. It can be used in publications to identify this gene, but it is not a very human-friendly way of referring to a Gene and is prone to copying mistakes. As an alternative to the Gene ID, users may refer to a Gene by the Sequence Name. The Sequence Name is the name that was given to the CDS or ncRNA Transcript at that locus, but without the letters at the end that distinguish isoforms. For example 'AC3.3'. Use of either the Gene ID ('WBGene00000024') or the Sequence Name ('AC3.3') is acceptable in publications.

An alternative to using the Gene ID or Sequence Name is to use a Gene Name. These are composed of the Class of the gene followed by a hyphen and a number, for example 'abu-1'. About 9000 coding genes have currently been assigned a Gene Name and there are currently about 2500 Gene Classes.

Gene Name nomenclature is controlled by WormBase. See below for instructions on how to propose a new Gene Name.

When a C. elegans gene that has a Gene Name (for example 'tra-1') has a homolog in another species, the homolog has a Gene Name constructed from the C. elegans Gene Name with a three-letter species prefix, like 'Cbr-tra-1' in C. briggsae.

Species               Prefix
Brugia malayi         Bma-
C. brenneri           Cbn-
C. briggsae           Cbr-
C. japonica           Cjp-
C. remanei            Cre-
Onchocerca volvulus   Ovo-

(Also see Other Nematodes section).
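As a rough illustration of how the species prefixes above combine with a C. elegans Gene Name, the following Python sketch builds and splits prefixed names. The dictionary only contains the prefixes listed in this section, and the function names are hypothetical.

```python
# Species prefixes as listed above (without the trailing hyphen).
SPECIES_PREFIXES = {
    "Brugia malayi": "Bma",
    "C. brenneri": "Cbn",
    "C. briggsae": "Cbr",
    "C. japonica": "Cjp",
    "C. remanei": "Cre",
    "Onchocerca volvulus": "Ovo",
}


def homolog_gene_name(elegans_gene_name: str, species: str) -> str:
    """Build the homolog's Gene Name, e.g. ('tra-1', 'C. briggsae') -> 'Cbr-tra-1'."""
    return f"{SPECIES_PREFIXES[species]}-{elegans_gene_name}"


def split_species_prefix(gene_name: str):
    """Return (prefix, base name); prefix is None when no known prefix is present."""
    head, sep, rest = gene_name.partition("-")
    if sep and head in SPECIES_PREFIXES.values() and "-" in rest:
        return head, rest
    return None, gene_name
```

The extra check that the remainder still contains a hyphen keeps an unprefixed name such as 'tra-1' from being split.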

Proteins

Protein sequences are automatically translated from the CDS. Each sequence is automatically assigned a Protein ID, like 'WP:CE05133'. Each protein sequence has one unique Protein ID. This means that when two identical CDSs from different genes are translated to give the same protein sequence, their protein products will have the same Protein ID. It also means that when any CDS is changed in the course of manual curation to have a different structure, its protein product will have a new Protein ID.

The consequences of this are that any protein sequence can be unambiguously referred to, but the ID of the protein product of a CDS may change between WormBase releases.

A second way of referring to a protein is to use its Protein Name. This is composed of the Gene Name of the Gene of the CDS with the letters in uppercase. This is a way of referring to whatever the protein product is without specifying a particular sequence from the CDS. For example, the CDS 'AC3.3', whose Gene has a Gene Name of 'abu-1', currently produces the protein with a Protein ID of 'WP:CE05133', and its protein product will always be referred to by the Protein Name 'ABU-1'.

When a gene has CDS isoforms, each isoform will produce a different protein sequence. For example, the Gene 'tra-1' has two isoforms ('Y47D3A.6a' and 'Y47D3A.6b'). The protein products of these CDS isoforms have the Protein Names 'TRA-1, isoform a' and 'TRA-1, isoform b'.
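The rule for deriving a Protein Name from a Gene Name and a CDS name can be sketched as follows. The function name is hypothetical, and the sketch assumes C. elegans names without a species prefix and a single-letter isoform suffix on the CDS name.

```python
import re
from typing import Optional

# Assumed shape of an isoform suffix: '.<number><lowercase letter>' at the end.
_ISOFORM_SUFFIX = re.compile(r"\.\d+([a-z])$")


def protein_name(gene_name: str, cds_name: Optional[str] = None) -> str:
    """Uppercase the Gene Name and append the isoform letter of the CDS, if any."""
    name = gene_name.upper()
    if cds_name is not None:
        m = _ISOFORM_SUFFIX.search(cds_name)
        if m is not None:
            name += f", isoform {m.group(1)}"
    return name
```

With the examples from this section, `protein_name("abu-1", "AC3.3")` gives 'ABU-1' and `protein_name("tra-1", "Y47D3A.6a")` gives 'TRA-1, isoform a'.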

How to apply for new names

Genetic Nomenclature for Caenorhabditis elegans

Genetic nomenclature for Caenorhabditis elegans is supervised by <a href="http://wormbase.org/">WormBase</a> in collaboration with the <a href="http://www.cbs.umn.edu/CGC/">Caenorhabditis Genetics Center (CGC)</a>.

How to Register a New Gene Class or Gene Name

Investigators wishing to register new gene names for C. elegans should note the summary guidelines below and apply online via <a href="http://minerva.caltech.edu/~azurebrd/cgi-bin/forms/gene_name.cgi">WormBase</a> or by email application to

How to Register a New Laboratory and Receive Lab, Strain and Allele designations

Specific identifying codes (CGC designations) are assigned to each laboratory engaged in dedicated long-term genetic research on C. elegans. Each such laboratory is assigned a lab/strain code, for naming strains, and an allele code, for naming mutations and transgenes. These codes are listed at the <a href="http://www.cbs.umn.edu/cgc/gene-names">CGC</a>.

Investigators requiring new CGC designations should apply to <A HREF="mailto:tim.schedl@wormbase.org">Tim Schedl</A>

Summary Guidelines for Proposing New Gene Names

  • Gene names must conform to the standard format of 3 or 4 letters, hyphen, number.

  • Genes can be named on the basis of a mutant phenotype or on the basis of the predicted protein product or RNA product.

  • If a new gene clearly belongs in an existing gene class (of which more than 2000 now exist), then a new gene number will be assigned after consultation with the laboratory responsible for the gene class in question. Gene classes and the corresponding assigning laboratory for each gene class are listed at the <a href="http://www.cbs.umn.edu/cgc/gene-names">CGC</a>.

  • If the establishment of a new gene class name seems more appropriate, then an approval for this gene name must be obtained, preferably online via <a href="http://minerva.caltech.edu/~azurebrd/cgi-bin/forms/gene_name.cgi">WormBase</a> or by email application to .

  • Gene names based on homology with a previously named gene in another well-studied organism, such as Saccharomyces cerevisiae or Mus musculus, are often appropriate and desirable, especially where there is convincing orthology between genes.

  • Gene names and gene numbering schemes that conform to established nomenclature proposals for particular protein classes are desirable.

  • Gene names that are memorable, informative and simply explained are encouraged.

  • Gene names based solely on RNAi phenotypes or high-throughput analysis of gene expression or protein interactions are discouraged.

  • Gene names including c (for Caenorhabditis), ce (for C. elegans), n (for nematode) or w (for worm) are discouraged. C. elegans as the organism of origin can be specified with a prefix (Cel-) if desired.

  • New gene name classes can be assigned in confidence, prior to formal publication or disclosure in an abstract.
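The standard format in the first guideline (3 or 4 letters, hyphen, number) can be checked with a short regular expression. This is a deliberately strict sketch with hypothetical function names. It assumes lowercase letters, as in all the examples in this guide, and it rejects names that carry a species prefix such as Cel-.

```python
import re

# Standard format: 3 or 4 (assumed lowercase) letters, a hyphen, a number.
_STANDARD_GENE_NAME = re.compile(r"^(?P<gene_class>[a-z]{3,4})-(?P<number>\d+)$")


def is_standard_gene_name(name: str) -> bool:
    """True when the name has the plain '<class>-<number>' form."""
    return _STANDARD_GENE_NAME.match(name) is not None


def gene_class(name: str) -> str:
    """Return the gene class part of a standard gene name, e.g. 'abu' for 'abu-1'."""
    m = _STANDARD_GENE_NAME.match(name)
    if m is None:
        raise ValueError(f"not a standard gene name: {name!r}")
    return m["gene_class"]
```

A check like this only covers the format. Whether a proposed name is appropriate, memorable or already taken still has to be settled through the application process described above.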

Standard Genetic Nomenclature Recommendations

This summary is based on the original proposals for C. elegans nomenclature (Horvitz et al., 1979 Mol. Gen. Genet. 175: 129-133), plus additional recommendations that have been distributed in The Worm Breeder's Gazette or posted on <a href="http://www.wormbase.org">WormBase</a>.

Last edited by <a href="/resources/person/WBPerson2970" class="person-link" title="">Mary Ann Tuli</a> – 1 year ago

Genetic Loci

  • A Gene is a region that is expressed or a region that has been expressed and is now a Pseudogene.

  • A gene can be a Pseudogene, or can express one or more non-coding RNA genes (ncRNA) or protein-coding sequences (CDS).
  • All WormBase genes have a unique identifier like WBGene00006415.
    • This is guaranteed to consistently follow the gene throughout any changes that may be made to its structure.
    • When gene structures are split into two genes, the original gene ID will usually apply to the 5' gene and a new gene ID will be created for the other half.
  • All C. elegans WormBase genes also have a Sequence Name, which is derived from the cosmid, fosmid or YAC clone on which they reside, for instance F38H4.7, indicating it is on the cosmid F38H4, and there are at least 6 other genes on that cosmid.

Approved gene names

  • If a gene produces a protein that can be classified as a member of a family, the gene may also be assigned an Approved name consisting of three or four italicized letters, a hyphen, and an italicized Arabic number, e.g., unc-30 indicating that this is the 30th member of the unc gene family.

  • There are a few exceptions to this format, like the genes cln-3.1, cln-3.2, and cln-3.3 which all are equally similar to the human gene CLN3.
  • CGC gene names for non-elegans species in WormBase have the 3-letter species code prepended, like Cre-acl-5, Cbr-acl-5, Cbn-acl-5.
  • The gene name may on rare occasions be followed by an italicized Roman numeral, to indicate the linkage group on which the gene maps, e.g., dpy-5 I or let-37 X or mlc-3 III.
  • Assignment of gene family names is controlled by WormBase and requests for names should be made, before publication, <A HREF="http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/gene_name.cgi">via the form</A> or via email to: <A HREF="mailto:genenames@wormbase.org">genenames@wormbase.org</A>
  • For genes defined by mutation, the Approved gene names refer to the mutant phenotype originally detected or most easily scored e.g.
    • dumpy (dumpy) in the case of dpy-5
    • lethal (lethal) in the case of let-37.
  • For genes defined on the basis of sequence similarity or sequence features, the Approved gene name refers to the predicted protein product or RNA product e.g.

    • myosin light chain in the case of mlc-3,
    • superoxide dismutase in the case of sod-1,
    • NPHP (human kidney disease nephronophthisis gene) in the case of nph-4.
    • ribosomal RNA in the case of rrn-1.
  • Genes with related properties are usually given the same three-letter name and different numbers. For example, there are three known myosin light chain genes: mlc-1, mlc-2, mlc-3, and more than twenty different dumpy genes: dpy-1, dpy-2, dpy-3, and so on.

  • Genes can be given names corresponding to homologous named genes in other standard genetic organisms. e.g.
    • rnt-1 is the C. elegans ortholog of the Drosophila gene runt.
    • wrn-1 is the C. elegans ortholog of the human gene WRN1, responsible for Werner's syndrome.
  • Gene names that are memorable, informative and simply explained are encouraged.
  • Genes in a paralogous set related to a single named gene in another organism are sometimes given the same gene name and number, followed by a distinguishing decimal. e.g. four C. elegans genes homologous to SIR2 in S. cerevisiae have been given the names sir-2.1, sir-2.2, sir-2.3, sir-2.4.
  • Gene names based solely on RNAi phenotypes or high-throughput analysis of gene expression or protein interaction are discouraged.
  • Gene names including c (for Caenorhabditis), ce (for C. elegans), n (for nematode) or w (for worm) are discouraged. Instead, an optional prefix Cel- can be added to indicate the species origin.
  • A limited number of genes have been given temporary tag- names (tag = temporarily assigned gene name). These are genes for which deletion alleles have been generated by reverse genetic methods, but which have not yet been given more informative names based on sequence or mutant phenotype. When sufficient information becomes available, each tag name will be replaced by an appropriate standard 3-letter or 4-letter name.
  • A limited number of genes, named on the basis of sequence homology, have been given non-standard names ending with alphanumeric identifiers rather than with simple numbers, in order to make these names closer to the generally accepted names used in other organisms. e.g. eif-3.B, eif-3.C encode proteins of the conserved translation factor eIF3.

Approved Gene Name Conflicts

Approved Gene names that have been established in the published literature and databases should preferably not be changed. In cases where a gene has received multiple names, one name will be adopted as the main name for the gene. Other names will continue to be listed in databases. Whenever possible, name changes or the adoption of a single main name should be made with the approval of all laboratories concerned.

Homologous Genes

If a homolog of a known C. elegans gene is identified in a related species such as Caenorhabditis briggsae, this can be given the same gene name, preceded by three italic letters referring to the species, and a hyphen. For example, Cbr-tra-1 is the name for the C. briggsae homolog of the C. elegans gene tra-1. The C. elegans homolog of a gene identified and named in another organism can be distinguished by the same convention, using "Cel-" as an optional prefix. For example, Cel-snt-1 defines the C. elegans synaptotagmin gene.

Alleles and Mutations

  • Every mutation has a unique designation. Mutations are given names consisting of one or two italicized letters followed by an italicized Arabic number, e.g., e61 or mn138 or st5. The letter prefix refers to the laboratory of isolation, as registered with the <a href="http://www.cbs.umn.edu/cgc/lab-head">CGC</a>. There are currently more than 500 registered laboratories. For example, e refers (originally) to the MRC Laboratory of Molecular Biology (Cambridge, U.K.), (currently) to the laboratory of J. Hodgkin (University of Oxford), and st refers to the laboratory of R.H. Waterston (originally at Washington University, St. Louis, MO, currently at the University of Washington, Seattle).

  • When gene and mutation names are used together, the mutation name is included in parentheses after the gene name, e.g., dpy-5(e61), let-37(mn138). When unambiguous (e.g., if only one mutation is known for a given gene or if all work on a gene described in a publication used a single mutation cited in a Methods section), gene names are used in preference to mutation names (let-37 rather than mn138 or let-37(mn138)).
  • Optional suffixes indicating characteristics of a mutation can follow a mutation name. These are usually two nonitalicized letters, e.g., hc17ts, where ts stands for temperature-sensitive, or pk15te, where te stands for transposon-excision.
  • Mutations created by in vitro mutagenesis should receive standard allele names. For cases where a pre-existing genomic mutation is re-created by in vitro mutagenesis, it is still desirable to give the new mutation a new name.
  • The wild-type allele of a gene is defined as that present in the Bristol N2 strain, stored frozen at the CGC and other locations. Wild-type alleles can be designated by a plus sign immediately after the gene name, dpy-5+, or, more commonly, by including the plus sign in parentheses, dpy-5(+).
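The allele conventions above (a one- or two-letter laboratory prefix, a number, an optional two-letter suffix, and the gene(allele) form) can be sketched as two small parsers. The names `Mutation`, `parse_mutation` and `split_gene_allele` are hypothetical, and the patterns only cover the cases described in this section.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Mutation(NamedTuple):
    lab_prefix: str        # laboratory allele prefix, e.g. 'e' or 'mn'
    number: int            # e.g. 61
    suffix: Optional[str]  # optional descriptor such as 'ts' or 'te'


_MUTATION = re.compile(r"^(?P<lab>[a-z]{1,2})(?P<num>\d+)(?P<suffix>[a-z]{2})?$")
_GENE_ALLELE = re.compile(r"^(?P<gene>[^()\s]+)\((?P<allele>[^()]+)\)$")


def parse_mutation(name: str) -> Mutation:
    """Split a mutation name such as 'hc17ts' into prefix, number and suffix."""
    m = _MUTATION.match(name)
    if m is None:
        raise ValueError(f"unrecognised mutation name: {name!r}")
    return Mutation(m["lab"], int(m["num"]), m["suffix"])


def split_gene_allele(term: str) -> Tuple[str, str]:
    """Split 'dpy-5(e61)' into ('dpy-5', 'e61'); '(+)' denotes wild type."""
    m = _GENE_ALLELE.match(term)
    if m is None:
        raise ValueError(f"not of the form gene(allele): {term!r}")
    return m["gene"], m["allele"]
```

Because the laboratory prefix identifies the lab of isolation, `parse_mutation("mn138").lab_prefix` is the piece one would look up in the CGC list of registered laboratories.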

Gene Knockouts

  • Most gene knockouts constructed to date are small deletions (<5 kb) generated by transposon excision or by chemical mutagenesis. These are named as alleles, sometimes with the optional suffix te (transposon-excision) or ko (knockout).

    Example: zyx-1(gk190) is a 777 bp deletion in the zyx-1 gene.

  • Some knockouts have been made by insertion of a selectable marker, such as unc-119(+). These are named as alleles, with an optional descriptor defining the selected marker following the unique allele name, and preceded by a double colon. Example: jf61 = zhp-3(jf61::unc-119+)

  • Some of the small deletions generated by reverse genetic methods may remove parts of two adjacent genes. If only two genes appear to be affected, then the deletion is given a single allele name, but the genotype is written with both gene names coupled with an ampersand (&). Example: allele ok615 is a 1422 bp deletion of two adjacent genes, so it can be written rad-54&tag-157(ok615).

  • Deletions that affect more than two genes are named as Deficiencies (Df), as described in the Chromosomal Aberrations section.


Modifiers: Suppressors, Revertants and Enhancers

  • There is no special nomenclature for modifier mutations. Many extragenic suppressor loci are called sup (40 sup loci defined so far, with a wide variety of properties and mechanisms). An increasing number of more specific modifier gene classes have been established, such as smu (suppressor of mec and unc), smg (suppressor with morphogenetic effect on genitalia) and sel (suppressor/enhancer of lin-12).

  • Intragenic suppressors or modifiers are indicated by adding a second mutation name within parentheses; for example, unc-17(e245e2608) is an intragenic partial revertant of unc-17(e245).

  • Mutations known to be chromosomal rearrangements, rather than intragenic lesions, are named differently, as described in the Chromosomal Aberrations section.

Chromosomal Aberrations

  • Duplications (Dp), deficiencies (Df), inversions (In) and translocations (T) are known in C. elegans cytogenetics; these are given italicized names consisting of the laboratory mutation prefix, the relevant abbreviation, and a number, optionally followed by the affected linkage groups in parentheses (e.g., eT1(III;V), mnDp5(X;f), where f indicates a free duplication). Chromosomal balancers of unknown structure can be designated using the abbreviation C, e.g., mnC1(II).
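The structure of an aberration name (laboratory prefix, abbreviation, number, optional linkage groups) can be sketched as a parser. The class and function names are hypothetical, and the pattern assumes a one- or two-letter lowercase laboratory prefix, as described for mutation names.

```python
import re
from typing import NamedTuple, Tuple

# Abbreviations described in this section.
ABERRATION_KINDS = {
    "Dp": "duplication",
    "Df": "deficiency",
    "In": "inversion",
    "T": "translocation",
    "C": "balancer of unknown structure",
}


class Aberration(NamedTuple):
    lab_prefix: str
    kind: str
    number: int
    linkage_groups: Tuple[str, ...]  # empty when no parenthesis is given


_ABERRATION = re.compile(
    r"^(?P<lab>[a-z]{1,2})(?P<abbr>Dp|Df|In|T|C)(?P<num>\d+)"
    r"(?:\((?P<groups>[^()]*)\))?$"
)


def parse_aberration(name: str) -> Aberration:
    m = _ABERRATION.match(name)
    if m is None:
        raise ValueError(f"unrecognised aberration name: {name!r}")
    groups = m["groups"]
    linkage = tuple(g.strip() for g in groups.split(";")) if groups else ()
    return Aberration(m["lab"], ABERRATION_KINDS[m["abbr"]], int(m["num"]), linkage)
```

In the result for 'mnDp5(X;f)', the second linkage group is the literal 'f', which this section uses to mark a free duplication.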


DNA sequences

    • There are no specific recommendations for designating cloned sequences that are not similar to known genes. Most genomic clones have been provided by the C. elegans mapping/sequencing consortium (based at the <a href="http://wiki.wormbase.org/index.php?title=Cosmids/YACs">Wellcome Trust Sanger Institute, Cambridge, UK</a>, and the <a href="http://genome.wustl.edu/">Genome Sequencing Center, St. Louis, USA</a>). Cosmid clones generated by the consortium are named according to the vector, either pJB8 (initial letters B, C, D, E, R, M, ZC) or a Lorist vector (initial letters K, T, W, F, ZK). Phage clones (in Lambda 2001) are identified by the initial letters A, ZL, YSL. Some fosmid clones are identified by the initial letter H. Vancouver fosmid clones are identified by initial letters WRM.

    • YACs (yeast artificial chromosome clones) are identified by the initial letter Y, e.g., Y3D5. YAC subsequences may be given names derived from the initial YAC name. Example: subsequences derived from the YAC Y47H9 have been called Y47H9A, Y47H9B, Y47H9C. Note that physical clones corresponding to these subsequences are not available.
    • Genomic DNA clones that have not been generated by the consortium are usually designated by the laboratory strain designation (see Strains section), a # symbol and an isolation number, e.g., MT#JAL6.

    • Sequences that are predicted to be genes from sequence data alone are initially named by the consortium on the basis of the sequenced cosmid, plus a number. For example, the genes predicted for the cosmid T05G3 are called T05G3.1, T05G3.2, etc. (numbered in arbitrary order of definition). Such names can be superseded by standard 3-letter names when this becomes appropriate. Thus, R13F6.3 has been given the name srg-12 (for serpentine receptor, class gamma).

    • EST (Expressed Sequence Tag) clones historically received names with prefixes such as cm and yk, but the INSDC accession number is now preferentially used for any new EST data.



Transposons and Transposon Insertions

    • Types of C. elegans transposons are called Tc1, Tc2, etc., where each number represents a different family. Transposon names are not italicized except when included in a genotype. Different races of C. elegans have different distributions of these transposons, which result in polymorphic differences from the reference wild-type strain Bristol N2. These natural differences between races are given polymorphism names, as described below.

    • The endogenous transposons of C. elegans can be mobilized to generate new insertional mutations. In addition, foreign transposons such as Mos1 can be introduced by transformation, and then mobilized to create new insertions. All these newly generated transposon insertions can be named as simple mutations, with an optional suffix indicating the nature of the transposon. They are treated as alleles of named genes if they are located within the boundaries of a gene. Example: r293 is a Tc1 insertion in the gene unc-54. An optional descriptor can also be added after a double colon to indicate the nature of the insertion. Example: unc-54(r293::Tc1).

    • Note that such insertions may often be silent in terms of gene activity, for example if an insertion occurs within an intron and can be spliced out.

    • Newly generated transposon insertions, especially those located in apparently intergenic regions, may also be given Ti (transposon insertion) names. These consist of a prefix identifying the laboratory of origin, the two letters Ti, and a number, all italicized. Example: eTi13 is an insertion of a Mos transposon into an intergenic region on LGIII.

    • Transposon loci have ID names formed from 'WBTransposon' followed by a unique number, like WBTransposon00000623.
    • Their exon-like structure is curated as a Transposon_CDS object with a name like C29E6.6, formed from the name of the YAC, cosmid or other clone they are on followed by a number that distinguishes it from the other CDS-like objects on that clone.
    • Transposons and Transposon_CDS objects are not currently classed as genes in WormBase and so do not have a parent gene object. The WBTransposon object and its representation on the Genome Browser should be viewed as analogous to the WBGene object and how it is displayed.


Strains

    • A strain is a set of individuals of a particular genotype with the capacity to produce more individuals of the same genotype. Strains are given nonitalicized names consisting of two or three uppercase letters followed by a number. The strain letter prefixes refer to the laboratory of origin and are distinct from the mutation letter prefixes. Examples: CB1833 is a strain of genotype dpy-5(e61) unc-13(e51), originally constructed by S. Brenner at the MRC Laboratory of Molecular Biology (strain prefix CB, allele prefix e), and MT688 is a strain of genotype unc-32(e189) +/+ lin-12(n137) III; him-5(e1467) V, constructed in the laboratory of H.R. Horvitz at M.I.T. (strain prefix MT, allele prefix n).

    • Strain prefixes are listed at the <a href="http://www.cbs.umn.edu/cgc/lab-code">CGC</a>.

    • Strains can and should be preserved as frozen stocks at -70 °C or ideally in liquid nitrogen, in order to ensure long-term maintenance and to avoid drift or accumulation of modifier mutations.

    • Bacterial strain names employ the two or three letter Laboratory/Strain designation, followed by “b”. For example, CBb###. This facilitates distinguishing nematode strains from bacterial strains. Please provide full information on species and relevant genotype of the bacteria.
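The strain name format (two or three uppercase letters followed by a number, with a 'b' after the letters for bacterial strains) can be sketched as a parser. The names below are hypothetical, and 'CBb12' in the usage note is an invented example, not a real strain. Historical names that predate this format, such as N2, do not fit the pattern and are rejected by this sketch.

```python
import re
from typing import NamedTuple


class StrainName(NamedTuple):
    lab_prefix: str     # laboratory/strain designation, e.g. 'CB'
    number: int         # e.g. 1833
    is_bacterial: bool  # True when the prefix is followed by 'b'


_STRAIN = re.compile(r"^(?P<lab>[A-Z]{2,3})(?P<bact>b)?(?P<num>\d+)$")


def parse_strain_name(name: str) -> StrainName:
    m = _STRAIN.match(name)
    if m is None:
        raise ValueError(f"unrecognised strain name: {name!r}")
    return StrainName(m["lab"], int(m["num"]), m["bact"] is not None)
```

For instance, `parse_strain_name("MT688")` yields the prefix 'MT' and the number 688, while the invented bacterial name 'CBb12' would be flagged with `is_bacterial=True`.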



CDS

    • Coding Sequences (CDSs) are the only part of a Gene's structure that is manually curated in WormBase. The structure of the Gene and its transcripts are derived from the structure of their CDSs.

    • CDSs have a Sequence Name that is derived from the same Sequence Name as their parent Gene object, so the gene F38H4.7 has a CDS called F38H4.7.
    • The CDS specifies coding exons in the gene from the START (Methionine) codon up to (and including) the STOP codon.
    • Any gene can code for multiple proteins as a result of alternative splicing.
      • These isoforms have a name that is formed from the Sequence Name of the gene with a unique letter appended.
      • In the case of the gene bli-4 there are 6 known CDS isoforms, called K04F10.4a, K04F10.4b, K04F10.4c, K04F10.4d, K04F10.4e and K04F10.4f.
    • It is common to refer to isoforms in the literature using the Approved gene family name with a letter appended, for example pha-4a. However, this has no meaning within the WormBase database and a search for pha-4a in WormBase will not return anything. The correct name of this isoform is pha-4, isoform a.



Mitochondrial Genome

    The mitochondrial genotype of a worm can be expressed using the standard nomenclature, using M as the abbreviation for the mitochondrial linkage group. The mitochondrial genotype is written as the last element in the genotype, following the nuclear genotype. Heteroplasmic combinations, where mitochondria of different genotypes co-exist in the same cytoplasm, can be expressed using a double forward slash, //. For example: uaDf5//+.



RFLPs and SNPs

    • Polymorphic sites, which are mostly RFLPs (restriction fragment length polymorphisms) or SNPs (single nucleotide polymorphisms), are designated by an italic letter P and an italic number, preceded by the allele prefix for the laboratory responsible for identifying the site.

    Examples: stP17 and stP196 are RFLPs identified in the laboratory of R. H. Waterston, amP6 and amP15 are SNPs identified in the laboratory of K. Kornfeld.



Proteins

    • The protein product of a gene can be referred to by the relevant gene name, written in non-italic capitals, e.g., the protein encoded by unc-13 can be called UNC-13.

    • Where more than one protein product is predicted for a gene (usually as a result of alternative message processing), the different proteins are distinguished by adding 'isoform' and then the isoform letter derived from the isoform letter of the name of the WormBase CDS, e.g., the gene 'tra-1' has two CDS isoforms: 'Y47D3A.6a' and 'Y47D3A.6b' which give rise to the protein isoforms: 'TRA-1, isoform a' and 'TRA-1, isoform b'.
    • Mutant protein products can be named by the missense change, for example a mutant 'TRA-1, isoform a' protein with a Pro to Leu change at codon 79 would be written: 'TRA-1, isoform a (P79L)'.



Natural Copy Number Variants

    Dozens of independent natural isolates of C. elegans have been recovered, from multiple locations around the world. The genomes of some of these isolates contain large (>10 kb) deletions, duplications or insertions, relative to the reference wild-type strain, Bristol N2. Deletions are named with the prefix niDf (natural isolate deficiency) followed by a number. Duplications and insertions are named with the prefix niDp (natural isolate duplication or insertion), followed by a number. Numbers for niDf and niDp variants are assigned by application to:



Introgressed regions in near-isogenic lines (aka congenic lines)

    Genetic regions that have been introgressed from one natural isolate of C. elegans onto the background of a different natural isolate are named in a manner similar to that used for deficiencies (Df) and duplications (Dp). Each Introgressed Region is given an italicized name consisting of the relevant laboratory mutation prefix, the letters IR, and a number. Thus, a region from the X chromosome of Hawaiian strain CB4856 crossed onto a Bristol N2 background, and created in the Kruglyak lab (allele code qq) has been given the name qqIR1. Additional information about genetic map location and strain origin can be provided in an optional parenthesis. So this example could be more fully written as qqIR1(X, CB4856), with the implicit assumption that the strain background is Bristol N2. The strain background and the direction of introgression can also be specified, using the symbol >, with this example being written qqIR1(X, CB4856>N2).
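The three ways of writing an introgressed region (short name, name with origin, name with origin and direction) can be sketched as a small formatter. The function and parameter names are hypothetical.

```python
from typing import Optional


def format_introgressed_region(
    lab_prefix: str,
    number: int,
    chromosome: Optional[str] = None,
    donor: Optional[str] = None,
    recipient: Optional[str] = None,
) -> str:
    """Build an IR name, optionally with map location and strain origin."""
    name = f"{lab_prefix}IR{number}"
    if chromosome is None and donor is None:
        if recipient is not None:
            raise ValueError("a recipient strain needs a chromosome and a donor strain")
        return name
    if chromosome is None or donor is None:
        raise ValueError("chromosome and donor strain must be given together")
    # The '>' symbol records the direction of introgression (donor > background).
    origin = donor if recipient is None else f"{donor}>{recipient}"
    return f"{name}({chromosome}, {origin})"
```

Calling it with only the prefix and number gives 'qqIR1'. Adding the chromosome and donor strain gives 'qqIR1(X, CB4856)', and adding the recipient background gives 'qqIR1(X, CB4856>N2)'.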



Transgenes

    • Transformation of C. elegans with exogenous DNA by microinjection usually leads to the formation of a transmissible extrachromosomal array containing many copies of the introduced DNA. Sometimes chromosomal integration of the introduced DNA can occur, or an existing extrachromosomal array can be integrated after irradiation of a transgenic line.

    • Extrachromosomal arrays are given italicized names consisting of the laboratory allele prefix, the two letters Ex, and a number.

    • Integrated transgenes are designated by italicized names consisting of the laboratory allele prefix, the two letters Is, and a number. Single copy integrants, usually generated by the MosSCI or miniMos insertion techniques, are a subset of integrated transgenes and are designated by italicized names consisting of the laboratory allele prefix, the two letters Si, and a number.

    • Transgene designations Ex, Is and Si can optionally be followed by genotypic or molecular information describing the transgene, in square brackets. For example: eEx3, eIs2, or stEx5[sup-7(st5) unc-22(+)].
    • Gene fusions incorporated in transgenes that consist of a C. elegans gene or part thereof fused to a reporter such as lacZ or GFP are indicated by the C. elegans gene name followed by two colons and the reporter, all italicized: pes-1::lacZ, mab-9::GFP. No specific recommendations have been made for distinguishing between transcriptional and translational fusions.
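
    The three transgene designations can be told apart mechanically. The following is a minimal sketch, assuming a laboratory prefix of one to three lowercase letters; the pattern and the function name classify_transgene are hypothetical and do not describe any WormBase tool.

```python
import re

# Labels follow the three designations described in the bullets above.
TRANSGENE_KINDS = {
    "Ex": "extrachromosomal array",
    "Is": "integrated transgene",
    "Si": "single-copy integrant",
}

# Assumed shape: lab prefix, designation, number, optional [description].
TRANSGENE_PATTERN = re.compile(
    r"^(?P<prefix>[a-z]{1,3})(?P<kind>Ex|Is|Si)(?P<number>\d+)"
    r"(?:\[(?P<description>.*)\])?$"
)


def classify_transgene(name):
    """Return the parts of a transgene name, or None if it does not fit."""
    match = TRANSGENE_PATTERN.match(name)
    if match is None:
        return None
    return {
        "prefix": match["prefix"],
        "kind": TRANSGENE_KINDS[match["kind"]],
        "number": int(match["number"]),
        "description": match["description"],  # None when no brackets given
    }


print(classify_transgene("stEx5[sup-7(st5) unc-22(+)]"))
```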



    Genotypes

    • The genotype of an animal is specified by listing all known differences between its genotype and that of wild type, which is defined by convention as Bristol N2. Each such difference is assigned a unique name. The currently recognized types of difference, described at greater length elsewhere in these guidelines, are:

      • Simple mutations. Example: e2123.

      • New transposon insertions. Example: eTi13.

      • Sequence polymorphisms. Example: stP17.

      • Transgenes (extrachromosomal arrays). Example: stEx5.

      • Transgenes (chromosomally inserted). Example: mdIs18.

      • Chromosomal aberrations (duplications, deficiencies, inversions, translocations, and crossover suppressors). Examples: nDp17, uaDf5, hIn1, eT1, mnC1.

    • Where necessary, wild type sequence can be indicated using the symbol +.

    • Because every genetic "feature" (i.e., difference from Bristol N2) has a unique name, an animal's genotype is fully specified by listing all the named features that it carries. Example: e2123; mdIs18.

    • For clarity and convenience, additional information about genes, chromosomes, transgene contents, etc. can be added as described elsewhere in this document, to produce a more informative genotype. Example: pha-1(e2123ts) III; mdIs18[pha-1(+) unc-17::GFP]
    • Mutants carrying more than one mutation are designated by sequentially listing mutant genes or mutations according to the left-right (= up-down) order on the genetic map. Different linkage groups are separated by a semicolon and given in the order I, II, III, IV, V, X, f, M. I-V are the five autosomes, X is the X chromosome, f refers to free duplications or chromosomal fragments, and M is the mitochondrial genome. For example: dpy-5(e61) I; bli-2(e768) II; unc-32(e189) III.

    • Heterozygotes, with allelic differences between chromosomes, are designated by separating mutations on the two homologous chromosomes with a slash. Where unambiguous, wild-type alleles can be designated by a plus sign alone, or even omitted. For example, dpy-5(e61) unc-13(+)/dpy-5(+) unc-13(e51) I can also be written dpy-5 +/+ unc-13 or dpy-5/unc-13.
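
    The ordering rules for multiply mutant genotypes can be sketched as follows. This is an illustrative helper, not an official formatter: the function name format_genotype and the tuple layout are assumptions, and the map positions passed in the usage line are placeholders that encode relative order only, not real map coordinates.

```python
# Linkage-group order as stated in the guidelines.
LINKAGE_GROUP_ORDER = ["I", "II", "III", "IV", "V", "X", "f", "M"]


def format_genotype(mutations):
    """Format (gene, allele, linkage_group, map_position) tuples.

    Mutations are listed left to right within a linkage group, and the
    groups are separated by semicolons in the recommended order.
    """
    grouped = {}
    for gene, allele, group, position in mutations:
        grouped.setdefault(group, []).append((position, f"{gene}({allele})"))

    parts = []
    for group in LINKAGE_GROUP_ORDER:
        if group in grouped:
            names = " ".join(name for _, name in sorted(grouped[group]))
            parts.append(f"{names} {group}")
    return "; ".join(parts)


# Positions here are placeholders, used only to fix the left-right order.
print(format_genotype([
    ("unc-32", "e189", "III", 0),
    ("dpy-5", "e61", "I", 0),
    ("bli-2", "e768", "II", 0),
]))
```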


    RNA Molecules

    • Messenger RNA species can be written by using the protein product as a descriptor, for example TRA-1A mRNA, TRA-1B mRNA, in order to allow distinction between different splice variants.

    • Non-coding RNA species can be written using the gene name as a descriptor, for example lin-4 RNA. Small RNA species derived from mir genes (micro-RNAs) can be written miR-, followed by a number corresponding to the mir gene. Example: miR-2 for the RNA derived from mir-2.



    Phenotypes

    • Phenotypic characteristics can be described in words, e.g., dumpy animals or uncoordinated animals. If more convenient, a non-italicized 3-letter or 4-letter abbreviation, which usually corresponds to a gene class or gene name, may be used. The first letter of a phenotypic abbreviation is capitalized, e.g., Unc for uncoordinated, Dpy for dumpy. If necessary to distinguish among related but distinguishable phenotypes, the relevant gene number can be added, e.g., Unc-4 and Unc-13 to differentiate the distinct phenotypes produced by mutations in the two genes unc-4 and unc-13. WormBase maintains a standard set of defined phenotype descriptors (the <a href="http://wormbase.org/db/misc/phenotype">WormBase Phenotype Ontology</a>)

    • Abbreviations that do not correspond to a gene class or gene name can also be used, e.g., Muv for multiple vulval development, and Daf-c for dauer-formation-constitutive. Assignment of phenotype abbreviations not corresponding to a gene name is controlled by WormBase, and requests for names should be made before publication, via email to: <A HREF="mailto:genenames@wormbase.org">genenames@wormbase.org</A>
    • A common and accepted convention, when comparing a mutant with the wild-type, is to use the prefix non- to refer to the wild-type phenotypes, for example, non-Lin (= wild type cell lineage) or Dpy non-Unc (= wild type with respect to movement, but dumpy with respect to body shape).
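
    For abbreviations that do correspond to a gene name, the capitalization and non- conventions above can be sketched as below. The function name phenotype_abbreviation and its parameters are hypothetical, and the sketch does not cover abbreviations such as Muv or Daf-c that have no matching gene class.

```python
def phenotype_abbreviation(gene_name, with_number=False, wild_type=False):
    """Derive a phenotype abbreviation from a gene name such as 'unc-4'."""
    gene_class, _, number = gene_name.partition("-")
    # First letter capitalized, rest lowercase: 'unc' becomes 'Unc'.
    label = gene_class.capitalize()
    if with_number and number:
        # Keep the gene number to distinguish related phenotypes.
        label = f"{label}-{number}"
    # The prefix 'non-' marks the wild-type phenotype.
    return f"non-{label}" if wild_type else label


print(phenotype_abbreviation("unc-4"))
print(phenotype_abbreviation("unc-13", with_number=True))
print(phenotype_abbreviation("lin-12", wild_type=True))
```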


    RNAi Phenotypes

    • Animals in which an endogenous gene has been down-regulated by RNA interference (RNAi), after exposure to double-stranded RNA corresponding to that gene, can be referred to as mutants, using italicized RNAi as the mutation name. Example:

      mog-4(RNAi), C08F8.8(RNAi)
    • Phenotypes induced by RNAi can be named using conventional mutant phenotype descriptors, such as Unc, Muv, Fem. For high-throughput RNAi screens, which may detect only conspicuous phenotypes, the more general phenotype descriptors could be used (see the <a href="http://wormbase.org/db/misc/phenotype"> Phenotype Ontology</a>).



    Sources

    • All genetic data for C. elegans are summarized in WormBase (Bieri et al. 2007, Nucleic Acids Res. 35 Database Issue: D506-510; Harris et al. 2004, Nucleic Acids Res. 32 Database issue: D411-417).

    • Queries on recommended nomenclature for C. elegans should be addressed to: or to the curator for C. elegans Genetic Mapping and Genetic Nomenclature (Professor Tim Schedl, Department of Genetics Campus Box #8232, Washington University School of Medicine, 4566 Scott Ave., St. Louis, MO 63110): email tim.schedl@wormbase.org



    Gene Transcripts

    • The transcripts of a gene in WormBase are automatically derived by mapping any available cDNA or mRNA alignments onto the CDS model.

    • These gene transcripts will therefore often include the UTR exons surrounding the CDS.
    • If there are no available cDNA or mRNA transcripts, then the gene transcripts will have exactly the same structure as the CDS that they are modelled on.
    • Gene transcripts are named after the Sequence Name of the CDS used to create them, for example, F38H4.7 or K04F10.4a.
    • However if there is alternative splicing in the UTRs, which would not change the protein sequence, the alternatively-spliced transcripts are named with a digit appended, for example: K04F10.4a.1 and K04F10.4a.2.
    • If there are no isoforms of the coding gene, for example AC3.5, but there is alternative splicing in the UTRs, there will be multiple transcripts named AC3.5.1 and AC3.5.2, etc.
    • If there are no alternate UTR transcripts the single coding_transcript is named the same as the CDS and does not have the .1 appended, as in the case of K04F10.4f.
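
    The transcript naming rules above reduce to a simple case split, sketched here under the assumption that the number of UTR-only variants for a CDS is already known. The function name transcript_names is ours and does not refer to the actual WormBase build pipeline.

```python
def transcript_names(cds_name, utr_variant_count):
    """Return the transcript names derived from one CDS.

    utr_variant_count is the number of transcripts that share this CDS
    and differ only in their UTR splicing (at least 1).
    """
    if utr_variant_count < 1:
        raise ValueError("a CDS has at least one transcript")
    if utr_variant_count == 1:
        # A single transcript takes the CDS name, with no '.1' appended.
        return [cds_name]
    # Alternative UTR splicing: append a digit to the CDS name.
    return [f"{cds_name}.{i}" for i in range(1, utr_variant_count + 1)]


print(transcript_names("K04F10.4a", 2))
print(transcript_names("K04F10.4f", 1))
print(transcript_names("AC3.5", 2))
```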


    Operons

    • Groups of genes which are co-transcribed as operons are curated as Operon objects.

    • These have names formed from 'CEOP' followed by a value for the chromosome (1, 2, 3, 4, 5 or X) and a unique 3-digit number, for example CEOP5460. They are manually curated using evidence from SL2 trans-spliced leader sequence sites.
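
    As a sketch of how an operon name decomposes, assuming names follow exactly the form described above (the pattern and the function name parse_operon_name are illustrative, and names in the database that deviate from this form would simply not match):

```python
import re

# 'CEOP', one chromosome character, then a 3-digit serial number.
OPERON_PATTERN = re.compile(r"^CEOP(?P<chrom>[1-5X])(?P<serial>\d{3})$")

# Map the chromosome character to the usual chromosome letter.
CHROMOSOME_LETTERS = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V", "X": "X"}


def parse_operon_name(name):
    """Return (chromosome, serial number) for an operon name, or None."""
    match = OPERON_PATTERN.match(name)
    if match is None:
        return None
    return CHROMOSOME_LETTERS[match["chrom"]], int(match["serial"])


print(parse_operon_name("CEOP5460"))
```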


    Transcription Factors

    • Transcription factors in WormBase have ID names formed as 'WBTranscriptionFactor' followed by a unique number, like WBTranscriptionFactor000143.

    • A Transcription Factor can be formed from one or more constituent proteins, such as WBTranscriptionFactor000143, which is an AHA-1/CKY-1 heterodimer.
    • Although they are not strictly Transcription Factors, Polymerase II and Polymerase III are held as Transcription Factors (WBTranscriptionFactor000001 and WBTranscriptionFactor000002), as it is often convenient to treat them as such.
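
    The ID format can be illustrated with a pair of small helpers. The six-digit zero padding is inferred from the examples above rather than stated as a rule, and the function names are hypothetical.

```python
import re

# Assumption: the numeric part is zero-padded to six digits, as in the examples.
TF_ID_PATTERN = re.compile(r"^WBTranscriptionFactor(\d{6})$")


def format_tf_id(number):
    """Build a transcription factor ID from its number."""
    return f"WBTranscriptionFactor{number:06d}"


def parse_tf_id(tf_id):
    """Return the numeric part of a transcription factor ID, or None."""
    match = TF_ID_PATTERN.match(tf_id)
    return int(match.group(1)) if match else None


print(format_tf_id(143))
print(parse_tf_id("WBTranscriptionFactor000001"))
```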


    New methods for genome engineering (TALENs, CRISPR-Cas9, etc.) are increasingly being applied to C. elegans. These entail some additional recommendations to the standard Genetic Nomenclature Guidelines, as described below. The aim is to provide compact and unambiguous ways of describing and referring to engineered changes to endogenous loci, as distinct from transgenic constructs that are inserted elsewhere in the genome.

    • Each engineered modification to an endogenous locus (point mutations, deletions, insertions or combinations thereof) should receive a unique allele designation, using the standard allele designation of the originating laboratory. For example: bus-50(e5000).
    • Optional brackets can be employed to provide additional information. Example: bus-50(e5000[T110E]) (an engineered missense mutation).

    • An engineered fusion of GFP to the C-terminus of BUS-50 would be: bus-50(e5001[bus-50::gfp]).

    • As a shorter and more convenient form, and where unambiguous, this could be referred to as: bus-50::gfp. Such abbreviations should be clearly defined where first used in a paper.

    • An engineered insertion of GFP plus the unc-119(+) selectable marker, flanked by loxP sites, would be: bus-50(e5002[bus-50::gfp + loxP unc-119(+) loxP]).

    • Each additional engineering of the endogenous locus requires a new allele number. In the example of bus-50(e5002), following Cre recombinase-mediated removal of unc-119(+) so that a single loxP site remains, the new genotype would be bus-50(e5003[bus-50::gfp + loxP]) or bus-50(e5003) for short.

    • Engineered insertions in apparent intergenic regions are given standard Is insertion names, for example eIs2002. Optional descriptors can include the nature of the insertion, e.g., [unc-119::gfp], and the position in the genome, e.g., [III:2992500], to give eIs2002[unc-119::gfp] or eIs2002[unc-119::gfp, III:2992500].

    • Engineered changes to existing Is (or Si) insertions should receive new Is numbers using the originating lab’s prefix. The original Is insertion can be indicated in brackets with a preceding asterisk (*), in order to allow searches for all derivatives from a given insertion.

      For example, an engineered change from GFP to mCherry in eIs2002 might be named as ozIs909, or ozIs909[unc-119::mCherry *eIs2002].
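
    The bracket conventions for engineered loci can be sketched as plain string formatting. Both function names below are hypothetical, and the sketch only assembles names; it does not check that an allele number is new or that a prefix belongs to a given laboratory.

```python
def engineered_allele(gene, allele, description=None):
    """Format an engineered allele, with an optional bracketed description."""
    if description is None:
        return f"{gene}({allele})"
    return f"{gene}({allele}[{description}])"


def derived_insertion(new_name, description, original):
    """Format an insertion derived from an existing Is or Si insertion.

    The original insertion follows an asterisk inside the brackets, so
    that all derivatives of one insertion can be found by searching.
    """
    return f"{new_name}[{description} *{original}]"


print(engineered_allele("bus-50", "e5000", "T110E"))
print(engineered_allele("bus-50", "e5001", "bus-50::gfp"))
print(derived_insertion("ozIs909", "unc-119::mCherry", "eIs2002"))
```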



    Other Nematodes

    Research and genomic analysis of non-C. elegans species is increasing rapidly. An important mission of WormBase is to make available information for each species listed in the Overview section, within the database structure developed for C. elegans. For these organisms, gene naming will also be supervised by WormBase, in order to maximize consistency with C. elegans. It is recommended that nomenclature in general should follow the principles used for C. elegans, as far as possible. Gene name proposals and queries should be made online via <a href="http://minerva.caltech.edu/~azurebrd/cgi-bin/forms/gene_name.cgi">WormBase</a> or sent to <A HREF="mailto:genenames@wormbase.org">genenames@wormbase.org</A>

    Species prefixes

    In order to unambiguously specify the nematode species-of-origin, an optional 3-letter standard prefix and hyphen can be added to the gene name. Examples: the C. briggsae and Pristionchus pacificus orthologs of C. elegans tra-1 are called Cbr-tra-1 and Ppa-tra-1, respectively. WormBase coordinates the species prefix designations, to avoid the use of the same designation for more than one species; contact <A HREF="mailto:genenames@wormbase.org">genenames@wormbase.org</A> for prefix proposals.

    Prefixes so far used include:

    Species                          Prefix
    C. elegans                       Cel-
    C. briggsae                      Cbr-
    C. remanei                       Cre-
    C. brenneri                      Cbn-
    C. japonica                      Cpj-
    Heterorhabditis bacteriophora    Hba-
    Oscheius tipulae                 Oti-
    Pristionchus pacificus           Ppa-

    Gene naming: Homologous genes

    Genes predicted from whole genome sequences in other nematode species will, in many cases, have identifiable close homologs in C. elegans, for which approved names already exist. In these cases, the same name should be used as in C. elegans, with the relevant species identifier.

    Possible scenarios:

    • One-to-one: Where one gene in C. elegans corresponds to a single gene in another nematode species, ortholog naming can be applied automatically. Example: thoc-1 in C. elegans has a C. briggsae ortholog, Cbr-thoc-1.
    • One-to-many: Where one gene in C. elegans is related to multiple genes (paralogs) in another nematode species, these paralogs can be named using additional decimal numbers. Example: thoc-3 in C. elegans has two C. briggsae paralogs, Cbr-thoc-3.1 and Cbr-thoc-3.2.
    • Many-to-one: Where multiple genes exist in C. elegans, but only a single gene in another nematode species, it is recommended that either the most closely similar, or the lowest numbered C. elegans gene, be used to name the single gene, as appropriate.
    • Many-to-many: Where multiple closely related genes can be identified in both species, but the phylogenetic relationships of the two sets are complex, new gene numbers can be assigned to the set of genes in the other nematode species, after consultation with <A HREF="mailto:genenames@wormbase.org">genenames@wormbase.org</A>

    In cases where a standard gene name has not yet been assigned in C. elegans, the gene can be referred to using the cosmid.number identifier for the C. elegans gene, preceded by a species prefix. Example: the ortholog of C. elegans W01B11.3 in Heterorhabditis bacteriophora can be referred to as Hba-W01B11.3. However, in such cases it will usually be both feasible and desirable to assign a standard name to the C. elegans gene as well, at the same time.
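
    The one-to-one and one-to-many scenarios can be sketched as follows. This covers only the two mechanical cases; the many-to-one and many-to-many cases need a curator's judgment. The function name name_orthologs is an assumption, and the species prefix is passed in without the trailing hyphen.

```python
def name_orthologs(prefix, elegans_name, copy_count):
    """Name the gene or genes in another species matching one C. elegans gene.

    prefix: 3-letter species prefix without the hyphen, e.g. 'Cbr'.
    elegans_name: the C. elegans gene name or cosmid.number identifier.
    copy_count: number of matching genes found in the other species.
    """
    if copy_count < 1:
        raise ValueError("at least one matching gene is required")
    base = f"{prefix}-{elegans_name}"
    if copy_count == 1:
        # One-to-one: the C. elegans name with the species prefix.
        return [base]
    # One-to-many: paralogs receive additional decimal numbers.
    return [f"{base}.{i}" for i in range(1, copy_count + 1)]


print(name_orthologs("Cbr", "thoc-1", 1))
print(name_orthologs("Cbr", "thoc-3", 2))
print(name_orthologs("Hba", "W01B11.3", 1))
```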

    Gene naming: Non-homologous genes

    It is expected that many genes in other nematode species will lack obvious close homologs in C. elegans, because of loss or substantial divergence during the evolution of C. elegans. These genes can be given new gene numbers, if they belong to an identifiable named class in C. elegans, or else new gene name classes can be established for them. In either case, assignment of an approved name should be made after consultation with <A HREF="mailto:genenames@wormbase.org">genenames@wormbase.org</A>

    Gene naming: Forward genetics

    A significant amount of mutation-based forward genetic analysis is being pursued in nematodes other than C. elegans, in particular using other species of Caenorhabditis (C. briggsae, C. remanei, C. brenneri and others), as well as species of Oscheius and Pristionchus. It is expected that most, but not all, of the mutationally-defined genes discovered in these species will prove to have orthologs with equivalent or similar function in C. elegans, and hence that standard genetic names will have been approved already. Several situations can arise:

    • In cases where the molecular identity is known and orthology is obvious, it is recommended that the C. elegans name be used, with the appropriate species identifier prefix. Example: Ppa-mab-5 is the Pristionchus pacificus ortholog of C. elegans mab-5.

    • In cases where the molecular identity is not initially known, but the mutant phenotype corresponds to a known C. elegans mutant phenotype, the following is recommended to reduce gene renaming as much as possible: (a) new mutations should be mapped as precisely as possible, (b) all relevant complementation tests should be done, and (c) if the genome assembly for that species is sufficiently accurate, potential C. elegans orthologs should be tested by sequencing or RNAi. If the gene still seems to be novel, then a new gene class name can be established, following consultation with <A HREF="mailto:genenames@wormbase.org">genenames@wormbase.org</A> in order to ensure that the new name is available and appropriate. An appropriate name should describe the phenotype yet be distinct from the corresponding C. elegans name. If at the point when the gene is molecularly identified it is found to be conserved in C. elegans, then the name would typically revert to the C. elegans name, but this would be determined on a case-by-case basis.
    • In cases where the molecular identity is unknown and the mutant phenotype does not correspond to a known C. elegans mutant phenotype, a new gene class name can be established, following consultation with <A HREF="mailto:genenames@wormbase.org">genenames@wormbase.org</A> in order to ensure that the new name is available and appropriate. Example: cov = Competence and/or centering Of Vulva abnormal.


