WS234

Release Notes

New release of WormBase WS234

WS234 was built by Paul Davis
-===================================================================================-
The WS234 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES) with the following files:
     - G_SPECIES.WS234.genomic.fa.gz                - Unmasked genomic DNA
     - G_SPECIES.WS234.genomic_masked.fa.gz         - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.WS234.genomic_softmasked.fa.gz     - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.WS234.protein.fa.gz                - Current live protein set
     - G_SPECIES.WS234.cds_transcripts.fa.gz        - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.WS234.ncrna_transcripts.fa.gz      - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.WS234.intergenic_sequences.fa.gz   - DNA sequence between pairs of adjacent genes
     - G_SPECIES.WS234.annotations.gff[2|3].gz      - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.WS234.ests.fa.gz                   - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.WS234.best_blastp_hits.txt.gz      - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
     - G_SPECIES.WS234.*pep_package.tar.gz          - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.WS234.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST and/or cDNA
        - G_SPECIES.WS234.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS234.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS234.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS234.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.WS234.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.WS234.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS234.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS234          - the latest database schema (also in above database files)
     - WS234-WS233.dbcomp   - log file reporting differences from the last release
     - *Non_C_elegans_BLASTX/          - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - comparative analysis files
     - compara.WS234.tar.bz2     - gene-tree and alignment GFF files
     - wormpep_clw.WS234.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO files for the phenotype, GO and anatomy ontologies, plus the corresponding gene association files
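
The naming pattern in the listing above can be sketched in a few lines of Python. This is only an illustration: the helper name species_file and the example value "c_elegans" for G_SPECIES are assumptions, not something these notes define, so check the actual sub directory names in the build directory.

```python
from pathlib import PurePosixPath

RELEASE = "WS234"


def species_file(g_species, suffix, release=RELEASE, annotation=False):
    """Build the relative path species/<G_SPECIES>/[annotation/]<G_SPECIES>.<release>.<suffix>.

    g_species is the per-species sub directory name (hypothetical example:
    "c_elegans"); suffix is the part after the release tag, e.g. "genomic.fa.gz".
    """
    base = PurePosixPath("species") / g_species
    if annotation:
        # Files listed under annotation/ live one level deeper.
        base = base / "annotation"
    return base / f"{g_species}.{release}.{suffix}"


# "c_elegans" is an assumed directory name used only for illustration.
print(species_file("c_elegans", "genomic.fa.gz"))
print(species_file("c_elegans", "PCR_product2gene.txt.gz", annotation=True))
```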


Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule



C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C. elegans genes 48240)
------------------------------------------
Molecular_info              46605 (96.6%)
Concise_description         5984  (12.4%)
Human_disease_relevance     92    (0.2%)
Reference                   17243 (35.7%)
WormBase_approved_Gene_name 27151 (56.3%)
RNAi_result                 25027 (51.9%)
Microarray_results          24095 (49.9%)
SAGE_transcript             19220 (39.8%)


C. elegans Wormpep data set:
----------------------------

There are 26041 CDSs, from 20537 protein-coding genes

The 26041 sequences contain 34594587 base pairs in total.

Modified entries      30
Deleted entries       66
New entries           96
Reappeared entries    3

Net change  +33
The difference (30) between the total CDSs of this build (26041) and the last build (26011) does not equal the net change (33).
Please investigate!
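
The mismatch flagged above can be reproduced with a short check. The formula used here (new - deleted + reappeared) is an inference from the reported figures, since it yields the stated +33; the notes do not say how the build script computes the net change, and the function name is hypothetical.

```python
def net_change(new, deleted, reappeared):
    """Net change in entries, assuming new - deleted + reappeared (inferred)."""
    return new - deleted + reappeared


# Figures copied from the Wormpep summary in these notes.
reported = net_change(new=96, deleted=66, reappeared=3)
observed = 26041 - 26011  # total CDSs in WS234 minus total in WS233

print(f"net change from entry counts: {reported:+d}")
print(f"difference in CDS totals:     {observed:+d}")
print(f"unexplained discrepancy:      {reported - observed:+d}")
```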


C. elegans Genome sequence composition:
----------------------------

       	WS234       	WS233      	change
----------------------------------------------
a    	32367418	32367418	  +0
c    	17780787	17780787	  +0
g    	17756985	17756985	  +0
t    	32367086	32367086	  +0
n    	0       	0       	  +0
-    	0       	0       	  +0

Total	100272276	100272276	  +0
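
As a quick sanity check on the composition table, the per-base counts can be summed and compared with the reported total, and the GC fraction derived from them. This is a minimal sketch using the WS234 C. elegans figures above; the function name gc_fraction is an assumption for illustration.

```python
# Base counts for C. elegans WS234, copied from the table above.
counts = {"a": 32367418, "c": 17780787, "g": 17756985, "t": 32367086, "n": 0, "-": 0}
reported_total = 100272276


def gc_fraction(base_counts):
    """Fraction of called bases (a, c, g, t) that are g or c."""
    called = sum(base_counts[b] for b in "acgt")
    return (base_counts["g"] + base_counts["c"]) / called


total = sum(counts.values())
print("counts sum to reported total:", total == reported_total)
print(f"GC fraction: {gc_fraction(counts):.4f}")
```

The same function applies to the other species blocks; because n and - are excluded from the denominator, genomes with many unresolved bases are compared on called bases only.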


Pristionchus pacificus Genome sequence composition:
----------------------------
 172773083 total
 a 43813958
 c 32811034
 g 32828589
 t 43810996
 - 0
 n 19508506


Caenorhabditis remanei Genome sequence composition:
----------------------------
 145500347 total
 a 42927857
 c 26293828
 g 26276020
 t 42923178
 - 0
 n 7079464


Caenorhabditis japonica Genome sequence composition:
----------------------------
 166565019 total
 a 46865690
 c 30244493
 g 30234317
 t 46807519
 - 0
 n 12413000


Caenorhabditis briggsae Genome sequence composition:
----------------------------
 108419768 total
 a 32984239
 c 19684682
 g 19693545
 t 33054090
 - 62600
 n 2940612


Caenorhabditis brenneri Genome sequence composition:
----------------------------
 190421492 total
 a 52222485
 c 32837458
 g 32882838
 t 52164077
 - 0
 n 20314634




Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei Gene count 32414 (Coding 31445)
japonica Gene count 29964 (Coding 29964)
briggsae Gene count 23026 (Coding 21936)
brenneri Gene count 32360 (Coding 30770)
---------------------------------------------




-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               229 (0.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4982 (20.6%)	Some, but not all exon bases are covered by transcript evidence
Predicted             19006 (78.5%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name   3272




-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               961 (3.1%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5658 (18.0%)	Some, but not all exon bases are covered by transcript evidence
Predicted             24831 (79.0%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name   6045




-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1637 (4.5%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5200 (14.4%)	Some, but not all exon bases are covered by transcript evidence
Predicted             29197 (81.0%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name   5002




-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed                54 (0.2%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     859 (3.9%)	Some, but not all exon bases are covered by transcript evidence
Predicted             21048 (95.8%)	No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  21662 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name   6135




-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1565 (5.1%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5669 (18.4%)	Some, but not all exon bases are covered by transcript evidence
Predicted             23545 (76.5%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name   3535




-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             12476 (47.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11494 (44.1%)	Some, but not all exon bases are covered by transcript evidence
Predicted              2071 (8.0%)	No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  25921 (99.5%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id            26029 (100.0%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name  25561


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1390                |
| Genes in Operons    3632                |
---------------------------------------------


GO Annotation Stats WS234
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections:  264984

Genes Stats:
----------------
Genes with GO_term connections         91253  
           IEA GO_code present         84952  
       non-IEA GO_code present         6297  

Source of the mapping data             
Source: *RNAi (GFF mapping overlaps)   26727  
        *citace                        2790  
        *Inherited (motif & phenotype) 15238  

GO_terms Stats:
---------------
Total No. GO_terms                     30642  
GO_terms connected to Genes            3642  
GO annotations connected with IEA      1855  
GO annotations connected with non-IEA  1762  
   Breakdown  IC - 6   IDA - 523   ISS - 151 
             IEP - 11   IGI - 145   IMP - 822 
             IPI - 83  NAS - 1     ND  - 1  
             RCA - 0   TAS - 17   


-===================================================================================-

Useful Stats:
---------

Genes with Sequence and WormBase-approved Gene names
WS234 49550 (25561 elegans / 6135 briggsae / 6045 remanei / 5002 japonica / 3535 brenneri / 3272 pristionchus)


-===================================================================================-



New Data:
---------
* OMIM Human Disease data derived from Human protein orthology.

Disease associations have been promoted automatically to the gene level from the protein orthologs that are calculated during this build.

This allows human disease associations to be displayed and accessed on a large scale while manual curation of this data ramps up.


Genome sequence updates:
-----------------------


New Fixes:
----------


Known Problems:
---------------


Other Changes:
--------------

Proposed Changes / Forthcoming Data:
-------------------------------------

For the WS235 release of WormBase, we plan to correct 77 indel errors in
the C. elegans genome that affect the structure of coding genes.

Full details are in the previous release letter; the corrections are based on new data from the following projects:


- Weber KP et al. (2010) PLoS One "Whole genome sequencing highlights
  genetic changes associated with laboratory ...."

PMID 21085631

- Doitsidou M et al. (2010) PLoS One "C. elegans mutant identification
  with a one-step whole-genome-sequencing and ...."

PMID 21079745

- McGrath PT et al. (2011) Nature "Parallel evolution of domesticated
  Caenorhabditis species targets pheromone ...."

PMID 21849976


Model Changes:
------------------------------------
WS234 models

* This cycle we see 3 simple model changes.

?Variation ?Laboratory

Remove the XREF connection between these 2 classes because it causes performance issues.
The 2-way connection can still be made, but not for very large projects.

?Transgene

Add a Public_name field.

?Picture

Add a Unique species field.

More information and a human-readable diff can be found here:
http://wiki.wormbase.org/index.php/WS234_Models.wrm


For more info mail help@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
	e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
	this directory. You will need approximately 2-3 GB of disk space.

3. Obtain and install a suitable acedb binary for your system
	(available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
	type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
	using xace.
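
Steps 1 and 2 above can be scripted. The following is a minimal sketch using only the Python standard library; the function name unpack_release and the download location are assumptions, and it presumes the database.*.tar.gz files have already been fetched from the acedb directory of the release. Installing acedb and running xace (steps 3 and 4) are not covered.

```python
import tarfile
from pathlib import Path


def unpack_release(download_dir, target_dir):
    """Extract every database.*.tar.gz found in download_dir into target_dir.

    Returns the names of the archives that were unpacked, in sorted order.
    """
    download_dir = Path(download_dir)
    target_dir = Path(target_dir)
    # Step 1: create the directory that will hold the local copy.
    target_dir.mkdir(parents=True, exist_ok=True)

    unpacked = []
    # Step 2: unpack each archive into that directory.
    for archive in sorted(download_dir.glob("database.*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
        unpacked.append(archive.name)
    return unpacked


# Example call (both paths are placeholders):
# unpack_release("/users/yourname/downloads", "/users/yourname/wormbase")
```

After this runs, the target directory is the one to pass to xace in step 4.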

____________  END _____________