WS230

Release notes for WS230

New release of WormBase WS230

WS230 was built by klh
-===================================================================================-
The WS230 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES) with the following files:
     - G_SPECIES.WS230.genomic.fa.gz                - Unmasked genomic DNA
     - G_SPECIES.WS230.genomic_masked.fa.gz         - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.WS230.genomic_softmasked.fa.gz     - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.WS230.protein.fa.gz                - Current live protein set
     - G_SPECIES.WS230.cds_transcripts.fa.gz        - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.WS230.ncrna_transcripts.fa.gz      - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.WS230.intergenic_sequences.fa.gz   - DNA sequence between pairs of adjacent genes
     - G_SPECIES.WS230.annotations.gff[2|3].gz      - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.WS230.ests.fa.gz                   - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.WS230.best_blastp_hits.txt.gz      - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
     - G_SPECIES.WS230.*pep_package.tar.gz          - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.WS230.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST &/or cDNA
        - G_SPECIES.WS230.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS230.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS230.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS230.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.WS230.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.WS230.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS230.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS230          - the latest database schema (also in above database files)
     - WS230-WS229.dbcomp   - log file reporting differences from the last release
     - *Non_C_elegans_BLASTX/          - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - comparative analysis files
     - compara.WS230.tar.bz2     - gene-tree and alignment GFF files
     - wormpep_clw.WS230.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO files for the phenotype, GO and anatomy ontologies, plus their gene association files
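
The relationship between the genomic_softmasked and genomic_masked files
listed above can be illustrated with a short Python sketch. The function
name is hypothetical, and the sketch assumes only what the listing states:
soft-masked repeats are lower-cased and hard-masked repeats are replaced
with Ns. It is not the pipeline WormBase uses to produce these files.

```python
def hard_mask(softmasked_seq: str) -> str:
    """Turn a soft-masked sequence into a hard-masked one.

    Every lower-case character (a soft-masked repeat base) becomes 'N';
    upper-case characters are passed through unchanged.
    """
    return "".join("N" if base.islower() else base for base in softmasked_seq)


# Small made-up sequence, not taken from any WormBase file.
print(hard_mask("ACGTacgtAC"))  # ACGTNNNNAC
```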


Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule


C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C. elegans genes 47519)
------------------------------------------
Molecular_info              45866 (96.5%)
Concise_description         5920 (12.5%)
Reference                   14319 (30.1%)
WormBase_approved_Gene_name 26591 (56%)
RNAi_result                 24689 (52%)
Microarray_results          23987 (50.5%)
SAGE_transcript             19201 (40.4%)


C. elegans Wormpep data set:
----------------------------

There are 25634 CDSs, from 20517 protein-coding genes

The 25634 sequences contain 33922650 base pairs in total.

Modified entries      136
Deleted entries       72
New entries           159
Reappeared entries    9

Net change  +96
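
The net change figure is consistent with adding new and reappeared entries
and subtracting deleted ones (modified entries do not alter the total).
A minimal check in Python, with a hypothetical function name:

```python
def net_change(new: int, reappeared: int, deleted: int) -> int:
    """Net change in the number of entries between two releases."""
    return new + reappeared - deleted


# Figures from the WS230 Wormpep summary above.
print(net_change(new=159, reappeared=9, deleted=72))  # 96
```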

C. elegans Genome sequence composition:
----------------------------

        WS230           WS229           change
----------------------------------------------
a       32367418        32367418          +0
c       17780787        17780787          +0
g       17756985        17756985          +0
t       32367086        32367086          +0
n       0               0                 +0
-       0               0                 +0

Total   100272276       100272276         +0
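
A composition table like the one above can be recomputed from a genomic
FASTA file with a few lines of Python. This is a sketch, not the script
used in the build: the function name is hypothetical, and it assumes that
bases are counted case-insensitively so that soft-masked (lower-case)
bases fall into the same a/c/g/t/n/- categories.

```python
import gzip
from collections import Counter


def base_composition(fasta_path):
    """Count each sequence character in a FASTA file (plain or gzipped).

    Header lines (starting with '>') are skipped. Characters are
    lower-cased before counting, which is an assumption of this sketch.
    """
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    counts = Counter()
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                continue
            counts.update(line.strip().lower())
    return counts
```

Calling base_composition() on a downloaded G_SPECIES.WS230.genomic.fa.gz
file and summing the returned counts should give the "total" line for
that species, provided the counting assumption above holds.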


Pristionchus pacificus Genome sequence composition:
----------------------------
 172773083 total
 a 43813958
 c 32811034
 g 32828589
 t 43810996
 - 0
 n 19508506


Caenorhabditis remanei Genome sequence composition:
----------------------------
 145500347 total
 a 42927857
 c 26293828
 g 26276020
 t 42923178
 - 0
 n 7079464


Caenorhabditis japonica Genome sequence composition:
----------------------------
 166565019 total
 a 46865690
 c 30244493
 g 30234317
 t 46807519
 - 0
 n 12413000


Caenorhabditis briggsae Genome sequence composition:
----------------------------
 108419768 total
 a 32984239
 c 19684682
 g 19693545
 t 33054090
 - 62600
 n 2940612


Caenorhabditis brenneri Genome sequence composition:
----------------------------
 190421492 total
 a 52222485
 c 32837458
 g 32882838
 t 52164077
 - 0
 n 20314634




Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei Gene count 32431 (Coding 31471)
japonica Gene count 29962 (Coding 29962)
briggsae Gene count 23027 (Coding 21936)
brenneri Gene count 32257 (Coding 30667)
---------------------------------------------




-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               229 (0.9%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4982 (20.6%)     Some, but not all exon bases are covered by transcript evidence
Predicted             19006 (78.5%)     No transcriptional evidence at all
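
The three confidence levels used in these Protein Stats tables can be
expressed as a simple rule on how many exon bases have transcript
evidence. The Python sketch below uses hypothetical names and assumes the
per-CDS counts of exon bases and supported exon bases are already known;
how WormBase derives those counts is not shown here.

```python
def confidence_status(exon_bases_total: int, exon_bases_supported: int) -> str:
    """Classify a CDS by how much of its exon sequence has transcript evidence."""
    if exon_bases_total <= 0:
        raise ValueError("a CDS must have at least one exon base")
    if not 0 <= exon_bases_supported <= exon_bases_total:
        raise ValueError("supported bases must lie between 0 and the total")
    if exon_bases_supported == exon_bases_total:
        return "Confirmed"            # every base of every exon is covered
    if exon_bases_supported > 0:
        return "Partially_confirmed"  # some, but not all, exon bases are covered
    return "Predicted"                # no transcript evidence at all
```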



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name   3182




-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               956 (3.0%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5662 (18.0%)     Some, but not all exon bases are covered by transcript evidence
Predicted             24858 (79.0%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name   5835




-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               176 (0.5%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     578 (1.6%)      Some, but not all exon bases are covered by transcript evidence
Predicted             35351 (97.9%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name   4813




-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed                54 (0.2%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     859 (3.9%)      Some, but not all exon bases are covered by transcript evidence
Predicted             21048 (95.8%)     No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  21662 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name   5902




-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1510 (4.9%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5638 (18.4%)     Some, but not all exon bases are covered by transcript evidence
Predicted             23522 (76.7%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name   3379




-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             12288 (47.9%)     Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11295 (44.1%)     Some, but not all exon bases are covered by transcript evidence
Predicted              2051 (8.0%)      No transcriptional evidence at all



Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id            25621 (99.9%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name  24983


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1390                |
| Genes in Operons    3634                |
---------------------------------------------


GO Annotation Stats WS230
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections:  295059

Genes Stats:
----------------
Genes with GO_term connections         95542  
           IEA GO_code present         89478  
       non-IEA GO_code present         6060  

Source of the mapping data             
Source: *RNAi (GFF mapping overlaps)   25979  
        *citace                        2483  
        *Inherited (motif & phenotype) 15097  

GO_terms Stats:
---------------
Total No. GO_terms                     30589  
GO_terms connected to Genes            3502  
GO annotations connected with IEA      1812  
GO annotations connected with non-IEA  1676  
   Breakdown  IC  - 6    IDA - 476  ISS - 148
              IEP - 11   IGI - 140  IMP - 795
              IPI - 79   NAS - 1    ND  - 1
              RCA - 0    TAS - 18


-===================================================================================-

Useful Stats:
---------

Genes with Sequence and WormBase-approved Gene names
WS230 48094 (24983 elegans / 5902 briggsae / 5835 remanei / 4813 japonica / 3379 brenneri / 3182 pristionchus)


-===================================================================================-



New Data:
---------

RNASeq expression:

The expression levels of all genes, CDS isoforms and Transcripts are
now calculated from aligned RNASeq reads, using the program
'Cufflinks'. The expression values are given as FPKM (Fragments
Per Kilobase of transcript per Million reads). The FPKM expression
values for each known RNASeq project are given, grouped by the Life
Stage of the experiment. FPKM values for the genes continue to be
available from the SPELL interface.
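
As a reminder of what the FPKM unit means, the definition can be written
out directly. The Python sketch below shows the textbook normalisation
only: Cufflinks itself estimates fragment assignments and transcript
lengths with a statistical model, so its reported values will not in
general equal this naive calculation. The function and parameter names
are hypothetical.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    Equivalent to fragments / (length in kb) / (total fragments in millions).
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Made-up numbers: 500 fragments on a 2 kb transcript in a 10 million fragment library.
print(fpkm(500, 2000, 10_000_000))  # 25.0
```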



Mass spectrometry peptides:

63,000 new mass spectrometry peptides from the MacCoss laboratory have
been added and aligned to the genome as part of the data imported from
the modENCODE project. This increases the number of mass spectrometry
peptides we hold by 30%. These data are used to give supporting
evidence to curated CDS structures. 


Sequence Ontology:

WormBase's ACeDB representation of the Sequence Ontology
(http://www.sequenceontology.org) has been updated to version
2.4.4. We now capture the entire sequence_feature branch of the SO, no
longer restricting to the SOFA subset. AceDB model changes have
accompanied this update (see below).


New species
-----------

This release includes the genome sequence and preliminary annotation of
Caenorhabditis species 5. Sequencing and assembly were performed by
Sujai Kumar and Mark Blaxter at the Institute of Evolutionary Biology,
University of Edinburgh; preliminary gene models provided by Erich
Schwarz, Division of Biology, California Institute of Technology;
additional automatic annotation by WormBase.

Caenorhabditis sp. 5 is a member of the Elegans group of species,
which morphologically resemble C. elegans itself and are closely
grouped with it evolutionarily (Kiontke et al., 2011). It is a
gonochoristic species, requiring mating between males and females for
reproduction, and the closest outgroup to the interfertile pair of
hermaphroditic C. briggsae and male-female C. sp. 9. Its geographic
distribution is remarkably confined to East Asia, being commonly found
in China and northern Vietnam, particularly in habitats with moist
decaying vegetation. C. sp. 5 shows strikingly high molecular
diversity, assayed by SNP variability (Wang et al., 2010); however,
its codon usage patterns are similar to those of C. elegans 
(Cutter et al., 2008). 
 
References:

Cutter, A.D., Wasmuth, J.D., and Washington, N.L. (2008). Patterns of
molecular evolution in Caenorhabditis preclude ancient origins of
selfing. Genetics 178, 2093-2104. 

Kiontke, K.C., Felix, M.A., Ailion, M., Rockman, M.V., Braendle, C.,
Penigault, J.B., and Fitch, D.H. (2011). A phylogeny and molecular
barcodes for Caenorhabditis, with numerous new species from rotting
fruits. BMC Evol. Biol. 11, 339. 

Wang, G.X., Ren, S., Ren, Y., Ai, H., and Cutter,
A.D. (2010). Extremely high molecular diversity within the East Asian
nematode Caenorhabditis sp. 5. Mol. Ecol. 19, 5022-5029. 

Known problems
--------------

The genomic sequences of Caenorhabditis sp. 7 and sp. 9 are
cross-contaminated with DNA from one another, and thus need to be
resequenced. Caltech plans to sequence sp. 9 in early 2012 as it is
of high value due to its closeness to C. briggsae.

We include the sequence and annotation of the C. sp. 9 genome for
reference, but users should approach the data with caution. 
 

Proposed Changes / Forthcoming Data:
-------------------------------------


Model Changes:
------------------------------------

Changes to support native GFFv3 dumping in AceDB:

- Redesign of the ?SO_term class to support the 
- Extension of the ?Method class

More information and a human-readable diff can be found here:
http://wiki.wormbase.org/index.php/WS229_Models.wrm


For more info mail help@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
        e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
        this directory. You will need approximately 2-3 Gb of disk space.

3. Obtain and install a suitable acedb binary for your system
        (available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
        type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
        using xace.
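
Steps 1 and 2 above can also be scripted. The following Python sketch
uses only the standard library; the function name and the example paths
are hypothetical, and it assumes the database.*.tar.gz files have already
been downloaded into a single directory. Only unpack archives obtained
from a trusted source.

```python
import glob
import tarfile
from pathlib import Path


def unpack_release(archive_dir, target_dir):
    """Extract every database.*.tar.gz archive in archive_dir into target_dir.

    Returns the sorted list of archive paths that were unpacked.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # step 1: create the directory
    archives = sorted(glob.glob(str(Path(archive_dir) / "database.*.tar.gz")))
    for archive in archives:
        # step 2: uncompress and untar in one pass
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
    return archives
```

For example, unpack_release("/users/yourname/downloads",
"/users/yourname/wormbase") would leave a database directory that can
then be opened with xace as described in step 4.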

____________  END _____________

Patches

OMIM URL fix

The OMIM database object links to the wrong URL. This [patch] corrects it.