WS225

New release of WormBase WS225, Wormpep225 and Wormrna225 Tue Apr  5 11:57:03 BST 2011


WS225 was built by klh
-===================================================================================-
The WS225 build directory includes:
genomes DIR              -  contains a sub dir for each WormBase species with sequence, gff, and agp data
        genomes/b_malayi:        - genome_feature_tables/	sequences/
        genomes/c_brenneri:      - genome_feature_tables/	sequences/
        genomes/c_briggsae:      - genome_feature_tables/	sequences/
        genomes/c_elegans:       - annotation/  genome_feature_tables/  sequences/
        genomes/c_japonica:      - genome_feature_tables/	sequences/
        genomes/c_remanei:       - genome_feature_tables/	sequences/
        genomes/h_bacteriophora: - genome_feature_tables/	sequences/
        genomes/h_contortus:     - genome_feature_tables/	sequences/
        genomes/m_hapla:         - genome_feature_tables/	sequences/
        genomes/m_incognita:     - sequences/
        genomes/p_pacificus:     - genome_feature_tables/	sequences/
          *annotation/                    - contains additional annotations
      i) confirmed_genes.WS225.gz  - DNA sequences of all genes confirmed by EST &/or cDNA
     ii) cDNA2orf.WS225.gz         - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
    iii) geneIDs.WS225.gz          - list of all current gene identifiers with CGC & molecular names (when known)
     iv) PCR_product2gene.WS225.gz - Mappings between PCR products and overlapping Genes
      v) oligo_mapping.gz           - V 
          *genome_feature_tables/         - contains the main .gff files and supplementary .gff data
          *sequences/                     - contains dna/      protein/  rna/  sub dirs
            sequences/protein           - WormBase protein set for species + history etc.
     vi) wormpep225.tar.gz         - full Wormpep distribution corresponding to WS225
    vii) wormrna225.tar.gz         - latest WormRNA release containing non-coding RNAs in the genome
   viii) best_blastp_hits_species.WS225.gz  - for each C. elegans WormPep protein, lists the best blastp match to
                        human, fly, yeast, C. briggsae, and SwissProt & TrEMBL proteins.
            sequences/dna               - WormBase dna data: genomic sequence (raw, soft_masked, masked), agp
     ix) intergenic_sequences.dna.gz
            sequences/rna               - WormBase rna gene data.
acedb DIR                -  Everything needed to generate a local copy of the primary database
      x) database.WS225.*.tar.gz   - compressed acedb database for new release
     xi) models.wrm.WS225          - the latest database schema (also in above database files)
    xii) WS225-WS224.dbcomp   - log file reporting differences from the last release
          *Non_C_elegans_BLASTX/          - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - compara.tar.bz2 wormpep217_clw.sql.bz2
ONTOLOGY DIR             - gene_associations, obo files (phenotype, GO, anatomy) and their association files


Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule




C. elegans Synchronisation with GenBank / EMBL:
------------------------------------

No synchronisation issues


C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C. elegans genes 47390)
------------------------------------------
Molecular_info              45720 (96.5%)
Concise_description         5741 (12.1%)
Reference                   14138 (29.8%)
WormBase_approved_Gene_name 26143 (55.2%)
RNAi_result                 24630 (52.0%)
Microarray_results          22114 (46.7%)
SAGE_transcript             19135 (40.4%)
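
The percentages in this table are each count expressed as a share of the live gene total. A minimal sketch of that calculation follows; the helper name `percent_of_live` is illustrative, and the figures are copied from the table above.

```python
# Counts copied from the gene data set table above.
live_genes = 47390
counts = {
    "Molecular_info": 45720,
    "Concise_description": 5741,
    "Reference": 14138,
    "WormBase_approved_Gene_name": 26143,
    "RNAi_result": 24630,
    "Microarray_results": 22114,
    "SAGE_transcript": 19135,
}


def percent_of_live(count, total=live_genes):
    """Share of live genes carrying this data type, rounded to one decimal place."""
    return round(100 * count / total, 1)


for name, count in counts.items():
    print(f"{name:<28}{count} ({percent_of_live(count)}%)")
```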


C. elegans Wormpep data set:
----------------------------

There are 20431 CDS in autoace, 25030 when counting 4599 alternate splice forms.

The 25030 sequences contain 10,991,449 amino acid residues in total.

Modified entries      40
Deleted entries       17
New entries           37
Reappeared entries    2

Net change  +22
The difference between the total CDSs of this build (25030) and the last build (25010) does not equal the net change (22).
Please investigate!
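
The warning above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the build's actual check; the variable names are illustrative and the figures are copied from this section. It assumes the net change is new plus reappeared minus deleted entries, which matches the +22 reported.

```python
# Figures copied from the Wormpep summary above.
new_entries = 37
reappeared_entries = 2
deleted_entries = 17

total_this_build = 25030   # WS225
total_last_build = 25010   # WS224

# Assumed definition of net change (consistent with the reported +22).
net_change = new_entries + reappeared_entries - deleted_entries
observed_difference = total_this_build - total_last_build

print(f"net change          {net_change:+d}")
print(f"observed difference {observed_difference:+d}")
if net_change != observed_difference:
    print(f"mismatch of {net_change - observed_difference} entries")
```

Modified entries are left out because they change sequences rather than the number of entries.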


C. elegans Genome sequence composition:
----------------------------

       	WS225       	WS224      	change
----------------------------------------------
a    	32367418	32367418	  +0
c    	17780787	17780787	  +0
g    	17756985	17756985	  +0
t    	32367086	32367086	  +0
n    	0       	0       	  +0
-    	0       	0       	  +0

Total	100272276	100272276	  +0
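
The Total row is the sum of the per-base counts, and the same counts give the GC content of the assembly. A short sketch, using the WS225 column above (the variable names are illustrative):

```python
# Base counts copied from the WS225 column of the table above.
base_counts = {
    "a": 32367418,
    "c": 17780787,
    "g": 17756985,
    "t": 32367086,
    "n": 0,
    "-": 0,
}

total = sum(base_counts.values())
gc_fraction = (base_counts["g"] + base_counts["c"]) / total

print(f"Total bases: {total}")
print(f"GC content:  {gc_fraction:.2%}")
```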


Pristionchus pacificus Genome sequence composition:
----------------------------
 172773083 total
 a 43813958
 c 32811034
 g 32828589
 t 43810996
 - 0
 n 19508506


Caenorhabditis remanei Genome sequence composition:
----------------------------
 145500347 total
 a 42927857
 c 26293828
 g 26276020
 t 42923178
 - 0
 n 7079464


Caenorhabditis japonica Genome sequence composition:
----------------------------
 163282347 total
 a 39053092
 c 25603225
 g 25576971
 t 39126103
 - 0
 n 33922956


Caenorhabditis briggsae Genome sequence composition:
----------------------------
 108419768 total
 a 32984239
 c 19684682
 g 19693545
 t 33054090
 - 0
 n 3003212


Caenorhabditis brenneri Genome sequence composition:
----------------------------
 190487923 total
 a 52239259
 c 32853644
 g 32897666
 t 52181360
 - 0
 n 20315994




Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei Gene count 32431 (Coding 31471)
japonica Gene count 27177 (Coding 25870)
briggsae Gene count 23051 (Coding 21963)
brenneri Gene count 32259 (Coding 30670)
---------------------------------------------




-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               229 (0.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4982 (20.6%)	Some, but not all exon bases are covered by transcript evidence
Predicted             19006 (78.5%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name   3085




-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               956 (3.0%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5662 (18.0%)	Some, but not all exon bases are covered by transcript evidence
Predicted             24858 (79.0%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name   5517




-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1182 (4.6%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4974 (19.2%)	Some, but not all exon bases are covered by transcript evidence
Predicted             19714 (76.2%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name   4848




-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed                53 (0.2%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     854 (3.9%)	Some, but not all exon bases are covered by transcript evidence
Predicted             21080 (95.9%)	No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  21683 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name   5563




-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1512 (4.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5635 (18.4%)	Some, but not all exon bases are covered by transcript evidence
Predicted             23526 (76.7%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name   3136




-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             11526 (46.0%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11320 (45.2%)	Some, but not all exon bases are covered by transcript evidence
Predicted              2184 (8.7%)	No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  24691 (98.6%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id            24846 (99.3%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name  24521


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1287                |
| Genes in Operons    3336                |
---------------------------------------------


GO Annotation Stats WS225
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections:  251965

Genes Stats:
----------------
Genes with GO_term connections         86464  
           IEA GO_code present         80577  
       non-IEA GO_code present         5883  

Source of the mapping data             
Source: *RNAi (GFF mapping overlaps)   24514  
        *citace                        2288  
        *Inherited (motif & phenotype) 15098  

GO_terms Stats:
---------------
Total No. GO_terms                     30490  
GO_terms connected to Genes            3368  
GO annotations connected with IEA      1859  
GO annotations connected with non-IEA  1502  
   Breakdown  IC - 3   IDA - 392   ISS - 137 
             IEP - 9   IGI - 124   IMP - 741 
             IPI - 73  NAS - 1     ND  - 1  
             RCA - 0   TAS - 20   


-===================================================================================-

Useful Stats:
---------

Genes with Sequence and CGC name
WS225 46670 (24521 elegans / 5563 briggsae / 5517 remanei / 4848 japonica / 3136 brenneri / 3085 pristionchus)
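
The WS225 figure is the sum of the per-species counts in the brackets. A quick check, with the numbers copied from the line above and an illustrative dictionary name:

```python
# Per-species counts of genes with sequence and CGC name, copied from above.
named_genes = {
    "elegans": 24521,
    "briggsae": 5563,
    "remanei": 5517,
    "japonica": 4848,
    "brenneri": 3136,
    "pristionchus": 3085,
}

total_named = sum(named_genes.values())
print(f"WS225 {total_named}")
```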


-===================================================================================-



New Data:
---------

New genome: Trichinella spiralis

The newly released Trichinella spiralis genome and associated gene set (Mitreva et al., Nature
Genetics 43, 228-235 (2011)) have been imported into WormBase from the public nucleotide archive
(ENA accession ABIR00000000). Orthology to the genes of C. elegans, C. briggsae, C. remanei,
C. brenneri, C. japonica and P. pacificus has been derived using Compara analysis. T. spiralis genes
have been assigned provisional CGC names based on their C. elegans orthologs. GERP sequence
conservation and PECAN whole genome alignments have also been added to the available comparative
data.


Genome sequence updates:
-----------------------


New Fixes:
----------


Known Problems:
---------------


Other Changes:
--------------

Proposed Changes / Forthcoming Data:
-------------------------------------


Model Changes:
------------------------------------

WS225 sees the introduction of the WBProcess model. 


For more info mail hinxton@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
	e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
	this directory. You will need approximately 2-3 GB of disk space.

3. Obtain and install a suitable acedb binary for your system
	(available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
	type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
	using xace.
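
Steps 1 and 2 could be scripted roughly as below. This is a sketch only: the function name `unpack_release` and its arguments are placeholders, and it assumes the database.*.tar.gz files have already been downloaded into a local directory. Obtaining acedb and launching xace (steps 3 to 5) remain manual.

```python
import tarfile
from pathlib import Path


def unpack_release(download_dir, target_dir, pattern="database.*.tar.gz"):
    """Extract every matching tarball from download_dir into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)        # step 1: create the directory
    archives = sorted(Path(download_dir).glob(pattern))
    for archive in archives:                         # step 2: unpack each archive
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
    return [archive.name for archive in archives]
```

For example, calling `unpack_release("/users/yourname/downloads", "/users/yourname/wormbase")` would leave the database in the directory that is then passed to xace in step 4. The downloads path is an assumption; use wherever the release files were saved.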

____________  END _____________

Patches

BLAT

Acedb:

These patches address a bug that caused a small number of BLAT alignments to be missing from the initial GFF dumps, which in turn led to a slight under-calling of the number of confirmed CDSs.

The corresponding Wormpep and GFF files were patched on the FTP site on 11th April 2011 at 15:30 GMT.

After applying the patches, the breakdown is as follows:

-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             11820 (47.2%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11026 (44.1%)	Some, but not all exon bases are covered by transcript evidence
Predicted              2184 (8.7%)	No transcriptional evidence at all

MicroArray Results

Acedb patch: ftp://ftp.sanger.ac.uk/pub2/wormbase/WS225/acedb/patches/microarray_mappings.ace.gz

Microarray results whose probes targeted only UTRs were connected to the transcripts but not to the respective genes. This patch propagates those connections to the genes.
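
The effect of this patch can be pictured with a toy example. Everything here is hypothetical: the identifiers, the two dictionaries and the `propagate_to_genes` helper are invented for illustration and do not reflect the acedb data model or the contents of the .ace patch file.

```python
def propagate_to_genes(result_to_transcripts, transcript_to_gene):
    """Return, for each microarray result, the genes reached via its transcripts."""
    result_to_genes = {}
    for result, transcripts in result_to_transcripts.items():
        genes = {transcript_to_gene[t] for t in transcripts if t in transcript_to_gene}
        result_to_genes[result] = sorted(genes)
    return result_to_genes


# Hypothetical identifiers, for illustration only.
result_to_transcripts = {"probe_A": ["tx1.1", "tx1.2"], "probe_B": ["tx2.1"]}
transcript_to_gene = {"tx1.1": "gene1", "tx1.2": "gene1", "tx2.1": "gene2"}

print(propagate_to_genes(result_to_transcripts, transcript_to_gene))
```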