WS227

Release Letter

New release of WormBase WS227

WS227 was built by Michael Paulini
-===================================================================================-
The WS227 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES) with the following files:
     - G_SPECIES.WS227.genomic.fa.gz                - Unmasked genomic DNA
     - G_SPECIES.WS227.genomic_masked.fa.gz         - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.WS227.genomic_softmasked.fa.gz     - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.WS227.protein.fa.gz                - Current live protein set
     - G_SPECIES.WS227.cds_transcripts.fa.gz        - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.WS227.ncrna_transcripts.fa.gz      - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.WS227.intergenic_sequences.fa.gz   - DNA sequence between pairs of adjacent genes
     - G_SPECIES.WS227.annotations.gff[2|3].gz      - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.WS227.ests.fa.gz                   - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.WS227.best_blastp_hits.txt.gz      - Best blastp matches to human, fly, yeast, and non-WormBase UniProt proteins
     - G_SPECIES.WS227.*pep_package.tar.gz          - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.WS227.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST and/or cDNA
        - G_SPECIES.WS227.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS227.geneIDs.txt.gz                      - List of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS227.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS227.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.WS227.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.WS227.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS227.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS227          - the latest database schema (also in above database files)
     - WS227-WS226.dbcomp        - log file reporting differences from the last release
     - *Non_C_elegans_BLASTX/          - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - comparative analysis files
     - compara.WS227.tar.bz2     - gene-tree and alignment GFF files
     - wormpep_clw.WS227.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO files for the phenotype, GO and anatomy ontologies, with their gene association files
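
The difference between the soft-masked and hard-masked genomic files listed above can be illustrated with a short Python sketch. It assumes that both files mask the same repeat regions, which this letter does not state; hard_mask is a hypothetical helper and the sequence is made up.

```python
def hard_mask(softmasked_seq: str) -> str:
    """Turn a soft-masked sequence into a hard-masked one.

    Lower-cased (soft-masked) bases become 'N'; everything else is kept.
    """
    return "".join("N" if base.islower() else base for base in softmasked_seq)


# Made-up example sequence, not WormBase data.
example = "ACGTacgtACGT"
print(hard_mask(example))  # prints ACGTNNNNACGT
```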


Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule




C. elegans Synchronisation with GenBank / EMBL:
------------------------------------

No synchronisation issues


C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C. elegans genes 47408)
------------------------------------------
Molecular_info              45748 (96.5%)
Concise_description         5808 (12.3%)
Reference                   14208 (30%)
WormBase_approved_Gene_name 26299 (55.5%)
RNAi_result                 24641 (52%)
Microarray_results          23973 (50.6%)
SAGE_transcript             19163 (40.4%)


C. elegans Wormpep data set:
----------------------------

There are 25244 CDSs, from 20470 protein-coding genes

The 25244 sequences contain 11,068,632 base pairs in total.

Modified entries      58
Deleted entries       23
New entries           96
Reappeared entries    1

Net change  +74
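
The net change appears to follow from the entry counts above. A minimal Python check, assuming that new and reappeared entries add to the set, deleted entries are subtracted, and modified entries leave the count unchanged (the letter does not spell out the formula):

```python
# Entry counts copied from the Wormpep table above.
new_entries = 96
reappeared_entries = 1
deleted_entries = 23

# Assumed formula: modified entries do not change the number of sequences.
net_change = new_entries + reappeared_entries - deleted_entries
print(f"Net change {net_change:+d}")  # prints Net change +74
```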


C. elegans Genome sequence composition:
----------------------------

       	WS227       	WS226      	change
----------------------------------------------
a    	32367418	32367418	  +0
c    	17780787	17780787	  +0
g    	17756985	17756985	  +0
t    	32367086	32367086	  +0
n    	0       	0       	  +0
-    	0       	0       	  +0

Total	100272276	100272276	  +0
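
As a cross-check on the table above, the per-base counts can be summed and a GC fraction derived. This is a small Python sketch using only the WS227 column; the GC figure is a derived value that does not appear in the letter itself.

```python
# WS227 base counts copied from the composition table above.
ws227_counts = {
    "a": 32367418,
    "c": 17780787,
    "g": 17756985,
    "t": 32367086,
    "n": 0,
    "-": 0,
}

total = sum(ws227_counts.values())
gc_fraction = (ws227_counts["g"] + ws227_counts["c"]) / total

print(total)                              # prints 100272276
print(f"GC content: {gc_fraction:.1%}")   # prints GC content: 35.4%
```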


Pristionchus pacificus Genome sequence composition:
----------------------------
 172773083 total
 a 43813958
 c 32811034
 g 32828589
 t 43810996
 - 0
 n 19508506


Caenorhabditis remanei Genome sequence composition:
----------------------------
 145500347 total
 a 42927857
 c 26293828
 g 26276020
 t 42923178
 - 0
 n 7079464


Caenorhabditis japonica Genome sequence composition:
----------------------------
 166565019 total
 a 46865690
 c 30244493
 g 30234317
 t 46807519
 - 0
 n 12413000


Caenorhabditis briggsae Genome sequence composition:
----------------------------
 108419768 total
 a 32984239
 c 19684682
 g 19693545
 t 33054090
 - 0
 n 3003212


Caenorhabditis brenneri Genome sequence composition:
----------------------------
 190421492 total
 a 52222485
 c 32837458
 g 32882838
 t 52164077
 - 0
 n 20314634




Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei Gene count 32431 (Coding 31471)
japonica Gene count 29962 (Coding 29962)
briggsae Gene count 23050 (Coding 21962)
brenneri Gene count 32257 (Coding 30667)
---------------------------------------------




-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               229 (0.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4982 (20.6%)	Some, but not all exon bases are covered by transcript evidence
Predicted             19006 (78.5%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name   3153




-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               956 (3.0%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5662 (18.0%)	Some, but not all exon bases are covered by transcript evidence
Predicted             24858 (79.0%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name   5655




-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               176 (0.5%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     578 (1.6%)	Some, but not all exon bases are covered by transcript evidence
Predicted             35351 (97.9%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name   4652




-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed                53 (0.2%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     853 (3.9%)	Some, but not all exon bases are covered by transcript evidence
Predicted             21080 (95.9%)	No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  21682 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name   5705




-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1510 (4.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5638 (18.4%)	Some, but not all exon bases are covered by transcript evidence
Predicted             23522 (76.7%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name   3236




-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             12052 (47.7%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11172 (44.3%)	Some, but not all exon bases are covered by transcript evidence
Predicted              2020 (8.0%)	No transcriptional evidence at all
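
The percentages in the table above can be reproduced from the counts. A short Python sketch, assuming each percentage is the count divided by the sum of the three categories (which equals the 25244 CDSs reported in the Wormpep section):

```python
# Counts copied from the C. elegans status table above.
status_counts = {
    "Confirmed": 12052,
    "Partially_confirmed": 11172,
    "Predicted": 2020,
}

total_cds = sum(status_counts.values())
percentages = {
    status: round(100 * count / total_cds, 1)
    for status, count in status_counts.items()
}

print(total_cds)  # prints 25244
for status, pct in percentages.items():
    print(f"{status:<20} {pct}%")
```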



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  24444 (96.8%)

Status of entries: Protein_IDs in EMBL
--------------------------------------
Protein_id            24594 (97.4%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name  24684


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1253                |
| Genes in Operons    3348                |
---------------------------------------------


GO Annotation Stats WS227
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections:  293205

Genes Stats:
----------------
Genes with GO_term connections         95463  
           IEA GO_code present         89531  
       non-IEA GO_code present         5928  

Source of the mapping data             
Source: *RNAi (GFF mapping overlaps)   25387  
        *citace                        2249  
        *Inherited (motif & phenotype) 15069  

GO_terms Stats:
---------------
Total No. GO_terms                     30542  
GO_terms connected to Genes            3321  
GO annotations connected with IEA      1766  
GO annotations connected with non-IEA  1544  
   Breakdown  IC - 3   IDA - 419   ISS - 141 
             IEP - 10   IGI - 132   IMP - 740 
             IPI - 74  NAS - 2     ND  - 1  
             RCA - 0   TAS - 21   


-===================================================================================-

Useful Stats:
---------

Genes with Sequence and WormBase-approved Gene names
WS227 47085 (24684 elegans / 5705 briggsae / 5655 remanei / 4652 japonica / 3236 brenneri / 3153 pristionchus)


-===================================================================================-



New Data:
---------
* Introns confirmed by RNASeq based on C.elegans and C.briggsae SRA data (as GFF tracks and ACeDB features)
* Caenorhabditis species 11 gene models have been updated with Augustus models based on RNASeq from Erich Schwartz
* imported H.sapiens ortholog predictions from TreeFam
* imported orthologs/paralogs from PantherDB
* added eggNOG homology_groups from Peer Bork's lab
* C.japonica gene models based on ESTs/cDNAs/RNASeq added (based on Erich Schwartz RNASeq predictions)


Genome sequence updates:
-----------------------
* removed contaminated C.brenneri contigs based on feedback from INSDC (now in sync with INSDC)
* C.japonica assembly update to WashU version 7.0.1

New Fixes:
----------


Known Problems:
---------------
* It has been discovered that C. species 7 & 9 have contamination in their assemblies. Please proceed with caution if using this data.


Other Changes:
--------------
* removed outdated Inparanoid homology_groups, as they have been superseded by Inparanoid orthologs
* removed outdated KOG homology_groups, as they have been superseded by eggNOG
* multiple genome alignments changed to 11 species (C.elegans, C.brenneri, C.sp11, C.remanei, C.briggsae, C.japonica, C.angaria, P.pacificus, M.hapla, B.malayi and T.spiralis)

Proposed Changes / Forthcoming Data:
-------------------------------------


Model Changes:
------------------------------------
* changes to the Transgene model
* changes to the Gene model to allow curation for human disease relevance
* new Rearrangement type Introgression
* Sequence checksums
* Splice_confirmation hash changes to allow for RNASeq and Mass-spec data
* Sequence collections to represent assemblies
* changes to the Homology_type hash to allow more eggNOG types
* changes to the Feature class to allow quantification/scoring
* cleanups in the Movie and Gene class
* addition of an Affiliation to the Person class
* addition of a Fax number to the Address hash
* addition of Rearrangements to the Molecule class

For more info mail help@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
	e.g. /users/yourname/wormbase

2. Uncompress and untar all of the database.*.tar.gz files into
	this directory. You will need approximately 2-3 GB of disk space.

3. Obtain and install a suitable acedb binary for your system
	(available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
	type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
	using xace.
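
Steps 1 and 2 above can also be scripted. The following is a minimal Python sketch, not an official WormBase tool: unpack_release is a hypothetical helper, and it assumes the database.*.tar.gz files have already been downloaded into a local directory.

```python
import glob
import os
import tarfile


def unpack_release(download_dir: str, target_dir: str) -> list:
    """Extract every database.*.tar.gz in download_dir into target_dir.

    Returns the list of archives that were unpacked, in sorted order.
    """
    os.makedirs(target_dir, exist_ok=True)  # step 1: create the directory
    archives = sorted(glob.glob(os.path.join(download_dir, "database.*.tar.gz")))
    for archive in archives:
        # step 2: uncompress and untar each archive into the target directory
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives


# Example call (both paths are placeholders):
#   unpack_release("/users/yourname/downloads", "/users/yourname/wormbase")
# Afterwards, continue with steps 3 and 4 (install acedb, then run xace).
```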

____________  END _____________

Optional Patches

eggNOG protein-gene connections

All proteins of the eggNOG database have connections to all WormBase genes of their cluster in WS227. As that drowns out the other member databases (InParanoid/TreeFam/Panther/...), there is a patch to remove these connections for better visualisation: ftp://ftp.sanger.ac.uk/pub2/wormbase/releases/WS227/patches/patch1_eggNOG.ace.gz