Difference between revisions of "UserGuide:Nomenclature"

Revision as of 17:27, 8 December 2008

Genetic Nomenclature for Caenorhabditis elegans

Genetic nomenclature for Caenorhabditis elegans is supervised by WormBase in collaboration with the Caenorhabditis Genetics Center (CGC).

How to Register a New Gene Class or Gene Name

Investigators wishing to register new gene names for C. elegans should note the summary guidelines below and apply online via WormBase or by email application to genenames@wormbase.org

How to Register a New Laboratory and Receive Lab, Strain and Allele designations

Specific identifying codes (CGC designations) are assigned to each laboratory engaged in dedicated long-term genetic research on C. elegans. Each such laboratory is assigned a lab/strain code, for naming strains, and an allele code, for naming mutations and transgenes. These codes are listed at the CGC and on the WormBase wiki.

Investigators requiring new CGC designations should apply to jonathan.hodgkin@bioch.ox.ac.uk

Summary Guidelines for Proposing New Gene Names

  1. Gene names must conform to the standard format of 3 or 4 letters, hyphen, number.
  2. Genes can be named on the basis of a mutant phenotype or on the basis of the predicted protein product or RNA product.
  3. If a new gene clearly belongs in an existing gene class (of which more than 2000 now exist), then a new gene number will be assigned after consultation with the laboratory responsible for the gene class in question. Gene classes and the corresponding assigning laboratory for each gene class are listed at the CGC and on the WormBase wiki.
  4. If the establishment of a new gene class name seems more appropriate, then an approval for this gene name must be obtained, preferably online via WormBase or by email application to genenames@wormbase.org
  5. Gene names based on homology with a previously named gene in another well-studied organism, such as Saccharomyces cerevisiae or Mus musculus, are often appropriate and desirable, especially where there is convincing orthology between genes.
  6. Gene names and gene numbering schemes that conform to established nomenclature proposals for particular protein classes are desirable.
  7. Gene names that are memorable, informative and simply explained are encouraged.
  8. Gene names based solely on RNAi phenotypes or high-throughput analysis of gene expression or protein interactions are discouraged.
  9. Gene names including c (for Caenorhabditis), ce (for C. elegans), n (for nematode) or w (for worm) are discouraged. C. elegans as the organism of origin can be specified with a prefix (Cel-) if desired.
  10. New gene name classes can be assigned in confidence, prior to formal publication or disclosure in an abstract.

Standard Genetic Nomenclature Recommendations

This summary is based on the original proposals for C. elegans nomenclature (Horvitz et al., 1979 Mol. Gen. Genet. 175: 129-133), plus additional recommendations that have been distributed in The Worm Breeder's Gazette or posted on WormBase.

Genetic Loci

  • Genes are given names consisting of three or four italicized letters, a hyphen, and an italicized Arabic number, e.g., dpy-5 or let-37 or mlc-3. The gene name may be followed by an italicized Roman numeral, to indicate the linkage group on which the gene maps, e.g., dpy-5 I or let-37 X or mlc-3 III.
  • For genes defined by mutation, the gene names refer to the mutant phenotype originally detected or most easily scored e.g.
    • dumpy (dumpy) in the case of dpy-5
    • lethal (lethal) in the case of let-37.
  • For genes defined on the basis of sequence similarity or sequence features, the gene name refers to the predicted protein product or RNA product e.g.
    • myosin light chain in the case of mlc-3,
    • superoxide dismutase in the case of sod-1,
    • NPHP (human kidney disease nephronophthisis gene) in the case of nph-4.
    • ribosomal RNA in the case of rrn-1.
  • Genes with related properties are usually given the same three-letter name and different numbers. For example, there are three known myosin light chain genes: mlc-1, mlc-2, mlc-3, and more than twenty different dumpy genes: dpy-1, dpy-2, dpy-3, and so on.
  • Genes can be given names corresponding to homologous named genes in other standard genetic organisms. e.g.
    • rnt-1 is the C. elegans ortholog of the Drosophila gene runt.
    • wrn-1 is the C. elegans ortholog of the human gene WRN1, responsible for Werner's syndrome.
  • Gene names that are memorable, informative and simply explained are encouraged.
  • Genes in a paralogous set related to a single named gene in another organism are sometimes given the same gene name and number, followed by a distinguishing decimal. e.g. four C. elegans genes homologous to SIR2 in S. cerevisiae have been given the names sir-2.1, sir-2.2, sir-2.3, sir-2.4.
  • Pseudogenes, for which there is good evidence that no functional product is ever generated, can be indicated by adding the optional italic suffix ps to the gene name, as in msp-48ps.
  • Gene names based solely on RNAi phenotypes or high-throughput analysis of gene expression or protein interaction are discouraged.
  • Gene names including c (for Caenorhabditis), ce (for C. elegans), n (for nematode) or w (for worm) are discouraged. Instead, an optional prefix Cel- can be added to indicate the species origin.
  • A limited number of genes have been given temporary tag- names (tag = temporarily assigned gene name). These are genes for which deletion alleles have been generated by reverse genetic methods, but which have not yet been given more informative names based on sequence or mutant phenotype. When sufficient information becomes available, each tag- name will be replaced by an appropriate standard 3-letter or 4-letter name.
  • A limited number of genes, named on the basis of sequence homology, have been given non-standard names ending with alphanumeric identifiers rather than with simple numbers, in order to make these names closer to the generally accepted names used in other organisms. e.g. eif-3.B, eif-3.C encode proteins of the conserved translation factor eIF3.
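
The standard forms described in this section can be checked mechanically. The following is a minimal sketch in Python, not an official WormBase validator; the names GENE_NAME and parse_gene_name are hypothetical. It assumes a lowercase gene class of three or four letters, and it deliberately rejects species prefixes and the non-standard alphanumeric names such as eif-3.B.

```python
import re

# Assumed pattern: 3-4 lowercase letters, hyphen, Arabic number,
# optional paralog decimal (sir-2.1), optional "ps" pseudogene suffix (msp-48ps).
GENE_NAME = re.compile(
    r"^(?P<gene_class>[a-z]{3,4})-(?P<number>[1-9][0-9]*)"
    r"(?:\.(?P<paralog>[1-9][0-9]*))?(?P<pseudogene>ps)?$"
)


def parse_gene_name(name):
    """Return the parts of a standard gene name, or None if it does not fit."""
    match = GENE_NAME.match(name)
    return match.groupdict() if match else None
```

For example, parse_gene_name("sir-2.1") reports gene class "sir", number "2" and paralog "1", while parse_gene_name("eif-3.B") returns None because that name is one of the permitted exceptions rather than a standard name.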

Gene Name Conflicts

Gene names that have been established in the published literature and databases should preferably not be changed. In cases where a gene has received multiple names, one name will be adopted as the main name for the gene. Other names will continue to be listed in databases. Whenever possible, name changes or the adoption of a single main name should be made with the approval of all laboratories concerned.

Homologous Genes

If a homolog of a known C. elegans gene is identified in a related species such as Caenorhabditis briggsae, this can be given the same gene name, preceded by three italic letters referring to the species, and a hyphen. For example, Cbr-tra-1 is the name for the C. briggsae homolog of the C. elegans gene tra-1. The C. elegans homolog of a gene identified and named in another organism can be distinguished by the same convention, using "Cel-" as an optional prefix. For example, Cel-snt-1 defines the C. elegans synaptotagmin gene.
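
The prefix convention can be handled with a small helper. This is only a sketch; split_species_prefix is a hypothetical name, and it assumes that a species prefix is always one uppercase letter followed by two lowercase letters, as in Cbr- and Cel-.

```python
def split_species_prefix(name):
    """Split 'Cbr-tra-1' into ('Cbr', 'tra-1'); names without a prefix give (None, name)."""
    head, sep, rest = name.partition("-")
    looks_like_prefix = (
        sep == "-"
        and len(head) == 3
        and head[0].isupper()
        and head[1:].islower()
        and "-" in rest  # what remains must still contain the gene-name hyphen
    )
    if looks_like_prefix:
        return head, rest
    return None, name
```

The final check keeps capitalized phenotype abbreviations such as Unc-13 from being mistaken for a prefixed gene name.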

Alleles and Mutations

  • Every mutation has a unique designation. Mutations are given names consisting of one or two italicized letters followed by an italicized Arabic number, e.g., e61 or mn138 or st5. The letter prefix refers to the laboratory of isolation, as registered with the CGC. There are currently more than 500 registered laboratories. For example, e refers (originally) to the MRC Laboratory of Molecular Biology (Cambridge, U.K.), (currently) to the laboratory of J. Hodgkin (University of Oxford), and st refers to the laboratory of R.H. Waterston (originally at Washington University, St. Louis, MO, currently at the University of Washington, Seattle).
  • When gene and mutation names are used together, the mutation name is included in parentheses after the gene name, e.g., dpy-5(e61), let-37(mn138). When unambiguous (e.g., if only one mutation is known for a given gene or if all work on a gene described in a publication used a single mutation cited in a Methods section), gene names are used in preference to mutation names (let-37 rather than mn138 or let-37(mn138)).
  • Optional suffixes indicating characteristics of a mutation can follow a mutation name. These are usually nonitalicized two-letter codes, e.g., hc17ts, where ts stands for temperature-sensitive, or pk15te, where te stands for transposon-excision.
  • Mutations created by in vitro mutagenesis should receive standard allele names. For cases where a pre-existing genomic mutation is re-created by in vitro mutagenesis, it is still desirable to give the new mutation a new name.
  • The wild-type allele of a gene is defined as that present in the Bristol N2 strain, stored frozen at the CGC and other locations. Wild-type alleles can be designated by a plus sign immediately after the gene name, dpy-5+, or, more commonly, by including the plus sign in parentheses, dpy-5(+).
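
The gene(allele) notation above can be taken apart with two regular expressions. This is an illustrative sketch only; ALLELE, GENE_ALLELE and parse_gene_allele are hypothetical names. It assumes a one- or two-letter laboratory prefix, as stated in this section, and an optional two-letter suffix such as ts or te.

```python
import re

# Assumed allele form: lab prefix (1-2 letters), number, optional 2-letter suffix.
ALLELE = re.compile(r"^(?P<lab>[a-z]{1,2})(?P<number>[1-9][0-9]*)(?P<suffix>[a-z]{2})?$")
# Gene name followed by one parenthesized allele, e.g. dpy-5(e61) or dpy-5(+).
GENE_ALLELE = re.compile(
    r"^(?P<gene>[a-z]{3,4}-[0-9]+(?:\.[0-9]+)?)\((?P<allele>[^()]+)\)$"
)


def parse_gene_allele(text):
    """Parse 'gene(allele)' into a dict, or return None if the text does not fit."""
    match = GENE_ALLELE.match(text)
    if match is None:
        return None
    gene, allele = match.group("gene"), match.group("allele")
    if allele == "+":  # wild-type allele, defined relative to Bristol N2
        return {"gene": gene, "wild_type": True}
    parts = ALLELE.match(allele)
    if parts is None:
        return None
    return {"gene": gene, "wild_type": False, **parts.groupdict()}
```

With this sketch, let-37(mn138) yields laboratory prefix "mn" and number "138", and dpy-5(+) is reported as wild type.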

Gene Knockouts

  • Most gene knockouts constructed to date are small deletions (<5 kb) generated by transposon excision or by chemical mutagenesis. These are named as alleles, sometimes with the optional suffix te (transposon-excision) or ko (knockout). Example: zyx-1(gk190) is a 777 bp deletion in the zyx-1 gene.
  • Some knockouts have been made by insertion of a selectable marker, such as unc-119(+). These are named as alleles, with an optional descriptor defining the selected marker following the unique allele name, and preceded by a double colon. Example: jf61 = zhp-3(jf61::unc-119+)
  • Some of the small deletions generated by reverse genetic methods may remove parts of two adjacent genes. If only two genes appear to be affected, then the deletion is given a single allele name, but the genotype is written with both gene names coupled with an ampersand (&). Example: allele ok615 is a 1422 bp deletion of two adjacent genes, so it can be written rad-54&tag-157(ok615).
  • Deletions that affect more than two genes are named as Deficiencies (Df), as described in the Chromosomal Aberrations section.
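
The knockout notations in this section (a double-colon descriptor, and two genes joined by an ampersand) can be read with one pattern. The sketch below is an assumption-laden illustration; KNOCKOUT and parse_knockout are hypothetical names, and the descriptor is assumed not to contain parentheses, as in the example unc-119+.

```python
import re

GENE = r"[a-z]{3,4}-[0-9]+(?:\.[0-9]+)?"
# One or two gene names joined by "&", then (allele) or (allele::descriptor).
KNOCKOUT = re.compile(
    rf"^(?P<genes>{GENE}(?:&{GENE})?)"
    r"\((?P<allele>[a-z]{1,2}[0-9]+)(?:::(?P<descriptor>[^()]+))?\)$"
)


def parse_knockout(text):
    """Return affected genes, allele name and optional descriptor, or None."""
    match = KNOCKOUT.match(text)
    if match is None:
        return None
    return {
        "genes": match.group("genes").split("&"),
        "allele": match.group("allele"),
        "descriptor": match.group("descriptor"),
    }
```

Applied to rad-54&tag-157(ok615), this returns both gene names with the single allele ok615. Deletions affecting more than two genes are outside its scope, since they are named as deficiencies.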

Modifiers: Suppressors, Revertants and Enhancers

  • There is no special nomenclature for modifier mutations. Many extragenic suppressor loci are called sup (40 sup loci defined so far, with a wide variety of properties and mechanisms). An increasing number of more specific modifier gene classes have been established, such as smu (suppressor of mec and unc), smg (suppressor with morphogenetic effect on genitalia) and sel (suppressor/enhancer of lin-12).
  • Intragenic suppressors or modifiers are indicated by adding a second mutation name within parentheses; for example, unc-17(e245e2608) is an intragenic partial revertant of unc-17(e245).
  • Mutations known to be chromosomal rearrangements, rather than intragenic lesions, are named differently, as described in the Chromosomal Aberrations section.

Chromosomal Aberrations

  • Duplications (Dp), deficiencies (Df), inversions (In) and translocations (T) are known in C. elegans cytogenetics; these are given italicized names consisting of the laboratory mutation prefix, the relevant abbreviation, and a number, optionally followed by the affected linkage groups in parentheses (e.g., eT1(III;V), mnDp5(X;f), where f indicates a free duplication). Chromosomal balancers of unknown structure can be designated using the abbreviation C, e.g., mnC1(II).
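
Aberration names follow a regular enough pattern to be decoded automatically. The following is a sketch under stated assumptions, not a WormBase tool; ABERRATION, KINDS and parse_aberration are hypothetical names, and the laboratory prefix is assumed to be one or two lowercase letters.

```python
import re

KINDS = {
    "Dp": "duplication",
    "Df": "deficiency",
    "In": "inversion",
    "T": "translocation",
    "C": "balancer",
}
# Lab prefix, rearrangement abbreviation, number, optional "(LG;LG)" part.
ABERRATION = re.compile(
    r"^(?P<lab>[a-z]{1,2})(?P<kind>Dp|Df|In|T|C)(?P<number>[1-9][0-9]*)"
    r"(?:\((?P<groups>[^()]+)\))?$"
)


def parse_aberration(name):
    """Decode a chromosomal aberration name, or return None if it does not fit."""
    match = ABERRATION.match(name)
    if match is None:
        return None
    groups = match.group("groups")
    return {
        "lab": match.group("lab"),
        "kind": KINDS[match.group("kind")],
        "number": int(match.group("number")),
        "linkage_groups": groups.split(";") if groups else [],
    }
```

Because a number must directly follow the abbreviation, a transposon insertion name such as eTi13 is not misread as a translocation.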

Transposons and Transposon Insertions

  • C. elegans transposons are called Tc1, Tc2, etc., where each number represents a different family. Transposon names are not italicized except when included in a genotype. Different races of C. elegans have different distributions of these transposons, which result in polymorphic differences from the reference wild-type strain Bristol N2. These natural differences between races are given polymorphism names, as described below.
  • The endogenous transposons of C. elegans can be mobilized to generate new insertional mutations. In addition, foreign transposons such as Mos1 can be introduced by transformation, and then mobilized to create new insertions. All these newly generated transposon insertions can be named as simple mutations, with an optional suffix indicating the nature of the transposon. They are treated as alleles of named genes if they are located within the boundaries of a gene. Example: r293 is a Tc1 insertion in the gene unc-54. An optional descriptor can also be added after a double colon to indicate the nature of the insertion. Example: unc-54(r293::Tc1).
  • Note that such insertions may often be silent in terms of gene activity, for example if an insertion occurs within an intron and can be spliced out.
  • Newly generated transposon insertions, especially those located in apparently intergenic regions, may also be given Ti (transposon insertion) names. These consist of a prefix identifying the laboratory of origin, the two letters Ti, and a number, all italicized. Example: eTi13 is an insertion of a Mos transposon into an intergenic region on LGIII.

RFLPs and SNPs

  • Polymorphic sites, which are mostly RFLPs (restriction fragment length polymorphisms) or SNPs (single nucleotide polymorphisms), are designated by an italic letter P and an italic number, preceded by the allele prefix for the laboratory responsible for identifying the site.

Examples: stP17 and stP196 are RFLPs identified in the laboratory of R. H. Waterston, amP9 and amP15 are SNPs identified in the laboratory of K. Kornfeld.

Natural Copy Number Variants

Dozens of independent natural isolates of C. elegans have been recovered, from multiple locations around the world. The genomes of some of these isolates contain large (>10 kb) deletions, duplications or insertions, relative to the reference wildtype strain, Bristol N2. Deletions are named with the prefix niDf (natural isolate deficiency) followed by a number. Duplications and insertions are named with the prefix niDp (natural isolate duplication or insertion), followed by a number. Numbers for niDf and niDp variants are assigned by application to: genenames@wormbase.org

Introgressed regions in near-isogenic lines (aka congenic lines)

Genetic regions that have been introgressed from one natural isolate of C. elegans onto the background of a different natural isolate are named in a manner similar to that used for deficiencies (Df) and duplications (Dp). Each Introgressed Region is given an italicized name consisting of the relevant laboratory mutation prefix, the letters IR, and a number. Thus, a region from the X chromosome of Hawaiian strain CB4856 crossed onto a Bristol N2 background, and created in the Kruglyak lab (allele code qq) has been given the name qqIR1. Additional information about genetic map location and strain origin can be provided in an optional parenthesis. So this example could be more fully written as qqIR1(X, CB4856), with the implicit assumption that the strain background is Bristol N2. The strain background and the direction of introgression can also be specified, using the symbol >, with this example being written qqIR1(X, CB4856>N2).
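
The short and long forms of an introgressed-region name can be parsed as follows. This is a hedged sketch; INTROGRESSION and parse_introgression are hypothetical names, and the pattern assumes strain names made of uppercase letters followed by digits, such as CB4856 and N2.

```python
import re

# Lab prefix + "IR" + number, optionally "(chromosome, donor)" or "(chromosome, donor>recipient)".
INTROGRESSION = re.compile(
    r"^(?P<lab>[a-z]{1,2})IR(?P<number>[1-9][0-9]*)"
    r"(?:\((?P<chromosome>I|II|III|IV|V|X),\s*(?P<donor>[A-Z]+[0-9]+)"
    r"(?:>(?P<recipient>[A-Z]+[0-9]+))?\))?$"
)


def parse_introgression(name):
    """Decode an introgressed-region name, or return None if it does not fit."""
    match = INTROGRESSION.match(name)
    if match is None:
        return None
    info = match.groupdict()
    # When a donor is given without a recipient, the background is taken to be Bristol N2.
    if info["donor"] is not None and info["recipient"] is None:
        info["recipient"] = "N2"
    return info
```

Both qqIR1(X, CB4856) and qqIR1(X, CB4856>N2) therefore decode to the same donor and recipient strains.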

Transgenes

  • Transformation of C. elegans with exogenous DNA by microinjection usually leads to the formation of a transmissible extrachromosomal array containing many copies of the introduced DNA. Sometimes chromosomal integration of the introduced DNA can occur, or an existing extrachromosomal array can be integrated after irradiation of a transgenic line.
  • Extrachromosomal arrays are given italicized names consisting of the laboratory allele prefix, the two letters Ex, and a number.
  • Integrated transgenes are designated by italicized names consisting of the laboratory allele prefix, the two letters Is, and a number.
  • Both Ex and Is can optionally be followed by genotypic or molecular information describing the transgene, in square brackets. For example, eEx3 or eIs2 or stEx5[sup-7(st5) unc-22(+)].
  • Gene fusions incorporated in transgenes that consist of a C. elegans gene or part thereof fused to a reporter such as lacZ or GFP are indicated by the C. elegans gene name followed by two colons and the reporter, all italicized: pes-1::lacZ, mab-9::GFP. No specific recommendations have been made for distinguishing between transcriptional and translational fusions.
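
Transgene names and their bracketed contents can be separated as in the sketch below. The names TRANSGENE, FUSION and parse_transgene are hypothetical, and the code assumes that the bracketed description contains no nested square brackets and that reporters are written as plain alphanumeric tokens such as GFP or lacZ.

```python
import re

# Lab prefix, Ex (extrachromosomal) or Is (integrated), number, optional [contents].
TRANSGENE = re.compile(
    r"^(?P<lab>[a-z]{1,2})(?P<kind>Ex|Is)(?P<number>[1-9][0-9]*)"
    r"(?:\[(?P<contents>[^\[\]]*)\])?$"
)
# Reporter fusions inside the contents, e.g. unc-17::GFP.
FUSION = re.compile(r"([a-z]{3,4}-[0-9]+)::([A-Za-z0-9]+)")


def parse_transgene(name):
    """Decode an Ex or Is transgene name, or return None if it does not fit."""
    match = TRANSGENE.match(name)
    if match is None:
        return None
    contents = match.group("contents")
    return {
        "lab": match.group("lab"),
        "integrated": match.group("kind") == "Is",
        "number": int(match.group("number")),
        "contents": contents,
        "fusions": FUSION.findall(contents) if contents else [],
    }
```

For mdIs18[pha-1(+) unc-17::GFP] this reports an integrated transgene carrying one reporter fusion, unc-17 fused to GFP.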

Genotypes

  • The genotype of an animal is specified by listing all known differences between its genotype and that of wild type, which is defined by convention as Bristol N2. Each such difference is assigned a unique name. The currently recognized types of difference, described at greater length elsewhere in these guidelines, are:
    • Simple mutations. Example: e2123.
    • New transposon insertions. Example: eTi13.
    • Sequence polymorphisms. Example: stP17.
    • Transgenes (extrachromosomal arrays). Example: stEx5.
    • Transgenes (chromosomally inserted). Example: mdIs18.
    • Chromosomal aberrations (duplications, deficiencies, inversions, translocations, and crossover suppressors). Examples: nDp17, uaDf5, hIn1, eT1, mnC1.
  • Where necessary, wild type sequence can be indicated using the symbol +.
  • Because every genetic "feature" (i.e., difference from Bristol N2) has a unique name, an animal's genotype is fully specified by listing all the named features that it carries. Example: e2123; mdIs18.

For clarity and convenience, additional information about genes, chromosomes, transgene contents, etc. can be added as described elsewhere in this document, to produce a more informative genotype. Example: pha-1(e2123ts) III; mdIs18[pha-1(+) unc-17::GFP]

  • Mutants carrying more than one mutation are designated by sequentially listing mutant genes or mutations according to the left-right (= up-down) order on the genetic map. Different linkage groups are separated by a semicolon and given in the order I, II, III, IV, V, X, f, M. I-V are the five autosomes, X is the X chromosome, f refers to free duplications or chromosomal fragments, and M is the mitochondrial genome. For example: dpy-5(e61) I; bli-2(e768) II; unc-32(e189) III.
  • Heterozygotes, with allelic differences between chromosomes, are designated by separating mutations on the two homologous chromosomes with a slash. Where unambiguous, wild-type alleles can be designated by a plus sign alone, or even omitted. For example, dpy-5(e61) unc-13(+)/dpy-5(+) unc-13(e51) I can also be written dpy-5 +/+ unc-13 or dpy-5/unc-13.
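
The linkage-group ordering rule lends itself to a short formatting helper. This is a minimal sketch with hypothetical names (LINKAGE_GROUP_ORDER, format_genotype). It assumes the caller supplies features that are already in left-to-right map order within each linkage group, because map positions are not available to the code.

```python
LINKAGE_GROUP_ORDER = ["I", "II", "III", "IV", "V", "X", "f", "M"]


def format_genotype(features):
    """Join (name, linkage_group) pairs into a genotype string.

    Linkage groups are emitted in the order I, II, III, IV, V, X, f, M and
    separated by semicolons; the order within a group is left as given.
    """
    grouped = {}
    for name, group in features:
        if group not in LINKAGE_GROUP_ORDER:
            raise ValueError(f"unknown linkage group: {group!r}")
        grouped.setdefault(group, []).append(name)
    return "; ".join(
        " ".join(grouped[group]) + " " + group
        for group in LINKAGE_GROUP_ORDER
        if group in grouped
    )


print(format_genotype([("unc-32(e189)", "III"), ("dpy-5(e61)", "I"), ("bli-2(e768)", "II")]))
# prints: dpy-5(e61) I; bli-2(e768) II; unc-32(e189) III
```

Heterozygous notation with slashes is not covered by this sketch.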

Mitochondrial Genome

The mitochondrial genotype of a worm can be expressed using the standard nomenclature, using M as the abbreviation for the mitochondrial linkage group. The mitochondrial genotype is written as the last element in the genotype, following the nuclear genotype. Heteroplasmic combinations, where mitochondria of different genotypes co-exist in the same cytoplasm, can be expressed using a double forward slash, //. For example: uaDf5//+.

DNA sequences

  • There are no specific recommendations for designating cloned sequences that are not similar to known genes. Most genomic clones have been provided by the C. elegans mapping/sequencing consortium (based at the Wellcome Trust Sanger Institute, Cambridge, UK, and the Genome Sequencing Center, St. Louis, USA). Cosmid clones generated by the consortium are named on the basis of the vector, either pJB8 (initial letters B, C, D, E, R, M, ZC) or a Lorist vector (initial letters K, T, W, F, ZK). Phage clones (in Lambda 2001) are identified by the initial letters A, ZL, YSL. Some fosmid clones are identified by the initial letter H. Vancouver fosmid clones are identified by initial letters WRM.
  • YACs (yeast artificial chromosome clones) are identified by the initial letter Y, e.g., Y3D5. YAC subsequences may be given names derived from the initial YAC name. Example: subsequences derived from the YAC Y47H9 have been called Y47H9A, Y47H9B, Y47H9C. Note that physical clones corresponding to these subsequences are not available.
  • Genomic DNA clones that have not been generated by the consortium are usually designated by the laboratory strain designation (see below), a # symbol and an isolation number, e.g., MT#JAL6.
  • Sequences that are predicted to be genes from sequence data alone are initially named by the consortium on the basis of the sequenced cosmid, plus a number. For example, the genes predicted for the cosmid T05G3 are called T05G3.1, T05G3.2, etc. (numbered in arbitrary order of definition). Such names can be superseded by standard 3-letter names when this becomes appropriate. Thus, R13F6.3 has been given the name srg-12 (for serpentine receptor, class gamma).
  • EST (Expressed Sequence Tag) clones have received names with prefixes such as cm and yk.
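
The clone-prefix conventions listed in this section can be captured in a lookup table. The sketch below is illustrative only; CLONE_PREFIXES, clone_type and parse_sequence_name are hypothetical names, and matching the longest prefix first is an assumption made here to resolve overlaps such as Y versus YSL and W versus WRM.

```python
import re

COSMID_PJB8 = "cosmid (pJB8 vector)"
COSMID_LORIST = "cosmid (Lorist vector)"
PHAGE = "phage (Lambda 2001)"

CLONE_PREFIXES = {
    "B": COSMID_PJB8, "C": COSMID_PJB8, "D": COSMID_PJB8, "E": COSMID_PJB8,
    "R": COSMID_PJB8, "M": COSMID_PJB8, "ZC": COSMID_PJB8,
    "K": COSMID_LORIST, "T": COSMID_LORIST, "W": COSMID_LORIST,
    "F": COSMID_LORIST, "ZK": COSMID_LORIST,
    "A": PHAGE, "ZL": PHAGE, "YSL": PHAGE,
    "H": "fosmid",
    "WRM": "fosmid (Vancouver)",
    "Y": "YAC",
}

# Predicted gene: clone name, a dot, and a number, e.g. T05G3.1.
SEQUENCE_NAME = re.compile(r"^(?P<clone>[A-Z][A-Z0-9]*)\.(?P<index>[1-9][0-9]*)$")


def clone_type(clone):
    """Guess the clone type from its initial letters (longest prefix wins)."""
    for prefix in sorted(CLONE_PREFIXES, key=len, reverse=True):
        if clone.startswith(prefix):
            return CLONE_PREFIXES[prefix]
    return None


def parse_sequence_name(name):
    """Split 'T05G3.1' into ('T05G3', 1), or return None if it does not fit."""
    match = SEQUENCE_NAME.match(name)
    if match is None:
        return None
    return match.group("clone"), int(match.group("index"))
```

For example, parse_sequence_name("R13F6.3") gives the clone R13F6 and index 3, and clone_type("R13F6") identifies a pJB8 cosmid.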

Proteins

  • The protein product of a gene can be referred to by the relevant gene name, written in non-italic capitals, e.g., the protein encoded by unc-13 can be called UNC-13. Where more than one protein product is predicted for a gene (usually as a result of alternative message processing), the different proteins are distinguished by additional capital letters, e.g., TRA-1A, TRA-1B.
  • Mutant protein products can be named by the missense change, for example a mutant TRA-1A protein with a Pro to Leu change at codon 79 would be written: TRA-1A (P79L).
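
Protein names can be derived from gene names directly. The helpers below are a small sketch with hypothetical names (protein_name, mutant_protein); they assume single-letter amino acid codes for the missense change, as in the P79L example.

```python
def protein_name(gene, isoform=""):
    """Return the protein name for a gene, e.g. 'unc-13' -> 'UNC-13'.

    An optional isoform letter distinguishes multiple products ('tra-1', 'A' -> 'TRA-1A').
    """
    return gene.upper() + isoform.upper()


def mutant_protein(protein, wild_type_residue, position, mutant_residue):
    """Format a missense mutant, e.g. ('TRA-1A', 'P', 79, 'L') -> 'TRA-1A (P79L)'."""
    return f"{protein} ({wild_type_residue}{position}{mutant_residue})"


print(mutant_protein(protein_name("tra-1", "A"), "P", 79, "L"))
# prints: TRA-1A (P79L)
```

Italic and non-italic styling cannot be expressed in a plain string, so that part of the convention is left to the typesetting step.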

RNA Molecules

  • Messenger RNA species can be written by using the protein product as a descriptor, for example TRA-1A mRNA, TRA-1B mRNA, in order to allow distinction between different splice variants.
  • Non-coding RNA species can be written using the gene name as a descriptor, for example lin-4 RNA. Small RNA species derived from mir genes (micro-RNAs) can be written miR-, followed by a number corresponding to the mir gene. Example: miR-2 for the RNA derived from mir-2.

Phenotypes

  • Phenotypic characteristics can be described in words, e.g., dumpy animals or uncoordinated animals. If more convenient, a nonitalicized 3-letter or 4-letter abbreviation, which usually corresponds to a gene name, may be used. The first letter of a phenotypic abbreviation is capitalized, e.g., Unc for uncoordinated, Dpy for dumpy. If necessary to distinguish among related but distinguishable phenotypes, the relevant gene number can be added, e.g., Unc-4 and Unc-13 to differentiate the distinct phenotypes produced by mutations in the two genes unc-4 and unc-13. Abbreviations that do not correspond to gene names can also be used, e.g., Muv for multiple vulval development, and Daf-c for dauer-formation-constitutive. WormBase maintains a standard set of defined phenotype descriptors (the WormBase Phenotype Ontology).
  • A common and accepted convention, when comparing a mutant with the wild-type, is to use the prefix non- to refer to the wild-type phenotypes, for example, non-Lin (= wild type cell lineage) or Dpy non-Unc (= wild type with respect to movement, but dumpy with respect to body shape).

RNAi Phenotypes

  • Animals in which an endogenous gene has been down-regulated by RNA interference (RNAi), after exposure to double-stranded RNA corresponding to that gene, can be referred to as mutants, using italicized RNAi as the mutation name. Example: mog-4(RNAi).
  • Phenotypes induced by RNAi can be named using conventional mutant phenotype descriptors, such as Unc, Muv, Fem. For high-throughput RNAi screens, which may detect only conspicuous phenotypes, the more general phenotype descriptors could be used (see the WormBase Phenotype Ontology).

Strains

  • A strain is a set of individuals of a particular genotype with the capacity to produce more individuals of the same genotype. Strains are given nonitalicized names consisting of two or three uppercase letters followed by a number. The strain letter prefixes refer to the laboratory of origin and are distinct from the mutation letter prefixes. Examples: CB1833 is a strain of genotype dpy-5(e61) unc-13(e51), originally constructed by S. Brenner at the MRC Laboratory of Molecular Biology (strain prefix CB, allele prefix e), and MT688 is a strain of genotype unc-32(e189) +/+ lin-12(n137) III; him-5(e1467) V, constructed in the laboratory of H.R. Horvitz at M.I.T. (strain prefix MT, allele prefix n).
  • Strain prefixes are listed at the CGC.
  • Strains can and should be preserved as frozen stocks at –70° C or ideally in liquid nitrogen, in order to ensure long-term maintenance and to avoid drift or accumulation of modifier mutations.
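
Strain names can be checked against the format given above. This is a sketch only; STRAIN and parse_strain are hypothetical names, and the pattern follows the stated rule of two or three uppercase letters followed by a number.

```python
import re

# Two or three uppercase letters (laboratory strain prefix) followed by a number.
STRAIN = re.compile(r"^(?P<lab>[A-Z]{2,3})(?P<number>[1-9][0-9]*)$")


def parse_strain(name):
    """Split 'CB1833' into ('CB', 1833), or return None if the name does not fit."""
    match = STRAIN.match(name)
    if match is None:
        return None
    return match.group("lab"), int(match.group("number"))
```

Note that the reference strain name N2 has a single-letter prefix, so it does not fit this pattern and would need to be handled as a special case.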

Sources

  • All genetic data for C. elegans are summarized in WormBase (Bieri et al. 2007, Nucleic Acids Res. 35 Database issue: D506-510; Harris et al. 2004, Nucleic Acids Res. 32 Database issue: D411-417).

  • Queries on recommended nomenclature for C. elegans should be addressed to genenames@wormbase.org or to the curator for C. elegans Genetic Mapping and Genetic Nomenclature (Dr Jonathan Hodgkin, Genetics Unit, Department of Biochemistry, University of Oxford, UK): jonathan.hodgkin@bioch.ox.ac.uk