WS221

New release of WormBase WS221, Wormpep221 and Wormrna221 Tue Nov 23 12:32:16 GMT 2010

WS221 was built by klh
-===================================================================================-
The WS221 build directory includes:
genomes DIR              -  contains a sub dir for each WormBase species with sequence, gff, and agp data
        genomes/b_malayi:        - genome_feature_tables/	sequences/
        genomes/c_brenneri:      - genome_feature_tables/	sequences/
        genomes/c_briggsae:      - genome_feature_tables/	sequences/
        genomes/c_elegans:       - annotation/  genome_feature_tables/  sequences/
        genomes/c_japonica:      - genome_feature_tables/	sequences/
        genomes/c_remanei:       - genome_feature_tables/	sequences/
        genomes/h_bacteriophora: - genome_feature_tables/	sequences/
        genomes/h_contortus:     - genome_feature_tables/	sequences/
        genomes/m_hapla:         - genome_feature_tables/	sequences/
        genomes/m_incognita:     - sequences/
        genomes/p_pacificus:     - genome_feature_tables/	sequences/
          *annotation/                    - contains additional annotations
      i) confirmed_genes.WS221.gz  - DNA sequences of all genes confirmed by EST &/or cDNA
     ii) cDNA2orf.WS221.gz         - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
    iii) geneIDs.WS221.gz          - list of all current gene identifiers with CGC & molecular names (when known)
     iv) PCR_product2gene.WS221.gz - Mappings between PCR products and overlapping Genes
      v) oligo_mapping.gz           - V 
          *genome_feature_tables/         - contains the main .gff files and supplementary .gff data
          *sequences/                     - contains dna/      protein/  rna/  sub dirs
            sequences/protein           - WormBase protein set for species + history etc.
     vi) wormpep221.tar.gz         - full Wormpep distribution corresponding to WS221
    vii) wormrna221.tar.gz         - latest WormRNA release containing non-coding RNAs in the genome
   viii) best_blastp_hits_species.WS221.gz  - for each C. elegans WormPep protein, lists Best blastp match to
                        human, fly, yeast, C. briggsae, and SwissProt & TrEMBL proteins.
            sequences/dna               - WormBase DNA data: genomic sequence (raw, soft_masked, masked), agp
     ix) intergenic_sequences.dna.gz
            sequences/rna               - WormBase rna gene data.
acedb DIR                -  Everything needed to generate a local copy of the primary database
      x) database.WS221.*.tar.gz   - compressed acedb database for new release
     xi) models.wrm.WS221          - the latest database schema (also in above database files)
    xii) WS221-WS220.dbcomp   - log file reporting difference from last release
          *Non_C_elegans_BLASTX/          - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - compara.tar.bz2 wormpep217_clw.sql.bz2
ONTOLOGY DIR             - gene_associations, obo files (phenotype, GO, anatomy) and their association files
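
The file names in the listing above follow a pattern based on the release number. As an illustration only, the sketch below builds some of those names for a given release. The function name release_files and the dictionary keys are hypothetical; the name patterns are copied from the listing, and it assumes other releases use the same patterns.

```python
def release_files(release: int) -> dict[str, str]:
    """Build some per-release file names, following the WS221 listing.

    Hypothetical helper: it assumes other releases use the same patterns.
    """
    ws = f"WS{release}"
    return {
        "confirmed_genes": f"confirmed_genes.{ws}.gz",
        "cDNA2orf": f"cDNA2orf.{ws}.gz",
        "geneIDs": f"geneIDs.{ws}.gz",
        "PCR_product2gene": f"PCR_product2gene.{ws}.gz",
        "wormpep": f"wormpep{release}.tar.gz",
        "wormrna": f"wormrna{release}.tar.gz",
        "best_blastp_hits": f"best_blastp_hits_species.{ws}.gz",
        "models": f"models.wrm.{ws}",
        # The comparison log names the new release first, then the previous one.
        "dbcomp": f"{ws}-WS{release - 1}.dbcomp",
    }


for name in release_files(221).values():
    print(name)
```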


Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule




C. elegans Synchronisation with GenBank / EMBL:
------------------------------------

No synchronisation issues


C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C.elegans genes 47378)
------------------------------------------
Molecular_info              45617 (96.3%)
Concise_description         5719 (12.1%)
Reference                   14129 (29.8%)
WormBase_approved_Gene_name 26090 (55.1%)
RNAi_result                 24606 (51.9%)
Microarray_results          22095 (46.6%)
SAGE_transcript             19163 (40.4%)


C. elegans 

Wormpep data set:
----------------------------

This release of Wormpep is derived from 24890 CDSs (from 20416
protein-coding genes).

The 24890 sequences contain 10,941,455 amino acid residues in total.

Modified entries      48
Deleted entries       24
New entries           72
Reappeared entries    0

Net change  +48
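
The net change follows from the entry counts above: new plus reappeared entries, minus deleted entries. Modified entries do not alter the total. A minimal sketch of that arithmetic (the function name net_change is hypothetical):

```python
def net_change(new: int, deleted: int, reappeared: int = 0) -> int:
    """Net change in the number of Wormpep entries between two releases."""
    return new + reappeared - deleted


# WS221 counts taken from the table above.
print(f"{net_change(new=72, deleted=24, reappeared=0):+d}")  # prints +48
```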

C. elegans Genome sequence composition:
----------------------------

       	WS221       	WS220      	change
----------------------------------------------
a    	32367418	32367418	  +0
c    	17780787	17780787	  +0
g    	17756985	17756985	  +0
t    	32367086	32367086	  +0
n    	0       	0       	  +0
-    	0       	0       	  +0

Total	100272276	100272276	  +0
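
As a worked example of using the composition table, the sketch below computes the GC fraction of the C. elegans genome from the WS221 base counts. The function name gc_fraction is hypothetical, and excluding n and - from the denominator is an assumption (both are zero for C. elegans, so it makes no difference here).

```python
def gc_fraction(a: int, c: int, g: int, t: int) -> float:
    """Fraction of called bases that are G or C (n and - are excluded)."""
    called = a + c + g + t
    return (c + g) / called


# WS221 C. elegans base counts from the table above.
elegans = {"a": 32367418, "c": 17780787, "g": 17756985, "t": 32367086}

assert sum(elegans.values()) == 100272276  # matches the Total row
print(f"{gc_fraction(**elegans):.2%}")  # prints 35.44%
```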


Pristionchus pacificus Genome sequence composition:
----------------------------
 172773083 total
 a 43813958
 c 32811034
 g 32828589
 t 43810996
 - 0
 n 19508506


Caenorhabditis remanei Genome sequence composition:
----------------------------
 145500347 total
 a 42927857
 c 26293828
 g 26276020
 t 42923178
 - 0
 n 7079464


Caenorhabditis japonica Genome sequence composition:
----------------------------
 163282347 total
 a 39053092
 c 25603225
 g 25576971
 t 39126103
 - 0
 n 33922956


Caenorhabditis briggsae Genome sequence composition:
----------------------------
 108478630 total
 a 33004189
 c 19675861
 g 19707411
 t 33049803
 - 0
 n 3041366


Caenorhabditis brenneri Genome sequence composition:
----------------------------
 190426650 total
 a 52223359
 c 32838518
 g 32883836
 t 52164943
 - 0
 n 20315994




Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (CDS 24217)
remanei Gene count 32431 (CDS 31476)
heterorhabditis Gene count 0 (CDS 0)
japonica Gene count 27177 (CDS 25870)
briggsae Gene count 23038 (CDS 21991)
brenneri Gene count 32265 (CDS 30707)
---------------------------------------------




-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               229 (0.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4982 (20.6%)	Some, but not all exon bases are covered by transcript evidence
Predicted             19006 (78.5%)	No transcriptional evidence at all
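
The percentages in these confidence tables are each category's share of the species' CDS total. A minimal sketch using the P. pacificus counts above (the function name confidence_percentages is hypothetical, and rounding to one decimal place is inferred from the table):

```python
def confidence_percentages(counts: dict[str, int]) -> dict[str, float]:
    """Return each status as a percentage of all entries, to one decimal."""
    total = sum(counts.values())
    return {status: round(100 * n / total, 1) for status, n in counts.items()}


# P. pacificus counts from the table above.
pristionchus = {
    "Confirmed": 229,
    "Partially_confirmed": 4982,
    "Predicted": 19006,
}

assert sum(pristionchus.values()) == 24217  # the CDS count reported for pristionchus
for status, pct in confidence_percentages(pristionchus).items():
    print(f"{status:<20} {pct}%")
```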



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name   3264




-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               956 (3.0%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5662 (18.0%)	Some, but not all exon bases are covered by transcript evidence
Predicted             24858 (79.0%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name   5472




-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1182 (4.6%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4974 (19.2%)	Some, but not all exon bases are covered by transcript evidence
Predicted             19714 (76.2%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name   4804




-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed                52 (0.2%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     856 (3.9%)	Some, but not all exon bases are covered by transcript evidence
Predicted             21083 (95.9%)	No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  21703 (98.7%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name   5512




-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1517 (4.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5638 (18.4%)	Some, but not all exon bases are covered by transcript evidence
Predicted             23550 (76.7%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name   3100




-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             11703 (47.0%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11012 (44.2%)	Some, but not all exon bases are covered by transcript evidence
Predicted              2175 (8.7%)	No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  24696 (99.2%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id            24696 (99.2%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name  24452


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1288                |
| Genes in Operons    3342                |
---------------------------------------------


GO Annotation Stats WS221
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections:  262947

Genes Stats:
----------------
Genes with GO_term connections         87115  
           IEA GO_code present         81115  
       non-IEA GO_code present         5996  

Source of the mapping data             
Source: *RNAi (GFF mapping overlaps)   25788  
        *citace                        2201  
        *Inherited (motif & phenotype) 14545  

GO_terms Stats:
---------------
Total No. GO_terms                     30474  
GO_terms connected to Genes            3258  
GO annotations connected with IEA      1826  
GO annotations connected with non-IEA  1427  
   Breakdown  IC - 2   IDA - 353   ISS - 130 
             IEP - 9   IGI - 116   IMP - 726 
             IPI - 68  NAS - 1     ND  - 1  
             RCA - 0   TAS - 20   


-===================================================================================-

Useful Stats:
---------

Genes with Sequence and CGC name
WS221 46604 (24452 elegans / 5512 briggsae / 5472 remanei / 4804 japonica / 3100 brenneri / 3264 pristionchus)
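
The WS221 total is the sum of the per-species counts. A small consistency check, using the numbers from the line above:

```python
# Genes with Sequence and CGC name, per species, as listed for WS221.
named_genes = {
    "elegans": 24452,
    "briggsae": 5512,
    "remanei": 5472,
    "japonica": 4804,
    "brenneri": 3100,
    "pristionchus": 3264,
}

total = sum(named_genes.values())
print(total)  # prints 46604
```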


-===================================================================================-



New Data:
---------

Caenorhabditis elegans:

The Gerstein Lab (as part of the modENCODE project) has determined the
probable source (parent) gene for over 1,000 pseudogenes. These have
been added to this Build as Paralog links between the parent coding
gene and the pseudogenes. 


Pristionchus pacificus:

This build includes a new genome assembly and set of gene predictions
for P. pacificus. These were provided by the Sommer group at the MPI
for Developmental Biology in Tuebingen, Germany (see Borchert et al.,
PMID:20237107).

Gene identifiers were preserved where appropriate by identifying
corresponding predictions in the old and new gene sets. In addition,
WormBase-curated gene structures were projected across to the new
assembly.  


Genome sequence updates:
-----------------------

Pristionchus pacificus:

See New Data, above.


New Fixes:
----------


Known Problems:
---------------


Other Changes:
--------------


Proposed Changes / Forthcoming Data:
-------------------------------------

WS221 includes ~21000 unmapped 3' UTR features from the modENCODE
project. Mappings to the genome will be included in WS222.


WS221 Model Changes:
--------------------

New Class


 //////////////////////////////////////////////////////////////////////////////////////////
 //
 // Transcription_factor  class This describes a type of transcription factor (and Pol II)
 // Gary W. 14/10/2010
 //
 //////////////////////////////////////////////////////////////////////////////////////////

 ?Transcription_factor  Name UNIQUE Text
                        Position_matrix  ?Position_Matrix XREF Transcription_factor #Evidence
                        Product_of ?Gene XREF Transcription_factor
                        Remark Text #Evidence
                        Binding_site ?Feature XREF Transcription_factor



Gene Class Modifications

                       Transcription_factor ?Transcription_factor XREF Product_of

Feature Class Modifications

         Score Float           // this would be a log score as indicated by the analysis used in gff dump
         Transcription_factor UNIQUE ?Transcription_factor XREF Binding_site

Position_Matrix Class Modifications

                  Transcription_factor  UNIQUE ?Transcription_factor XREF Position_matrix #Evidence //gw3


Strain Class Modifications

<       Made_by Text
---
>       Made_by ?Person //updated from Text

>       Other_name ?Text #Evidence
        Contact ?Person
          Wild_isolate // to identify wild isolates - since genotype is free text
          Isolation GPS UNIQUE Float UNIQUE Float // Latitude +/-DD.DDDDD Longitude +/-DD.DDDDD
                  Place UNIQUE ?Text
                  Landscape UNIQUE Urban_garden
                                    Wild_forest
                                    Wild_grassland
                                    Agricultural_land
                                    Oasis
                                    Rural_garden
                  Substrate UNIQUE ?Text // e.g. snail, soil, apple, rotting fruit
                  Associated_organisms ?Species
                  Life_stage ?Life_stage
                  Log_size_of_population UNIQUE Float
                  Sampled_by Text // the person who found the worms
                  Isolated_by ?Person // the person who isolated the worms from the sample
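
To illustrate the new Isolation tags, the sketch below models part of a strain isolation record in Python. It is not WormBase code. The class name, field names and validation are hypothetical; the Landscape values and the signed decimal-degree format (+/-DD.DDDDD) are taken from the model above, and the example coordinates are made up.

```python
from dataclasses import dataclass

# Allowed Landscape values, copied from the Strain model above.
LANDSCAPES = {
    "Urban_garden",
    "Wild_forest",
    "Wild_grassland",
    "Agricultural_land",
    "Oasis",
    "Rural_garden",
}


@dataclass
class Isolation:
    latitude: float   # decimal degrees, north positive
    longitude: float  # decimal degrees, east positive
    landscape: str

    def __post_init__(self) -> None:
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")
        if self.landscape not in LANDSCAPES:
            raise ValueError(f"unknown landscape: {self.landscape}")

    def gps(self) -> str:
        """Format as signed decimal degrees with five decimal places."""
        return f"{self.latitude:+.5f} {self.longitude:+.5f}"


# Made-up coordinates, for illustration only.
record = Isolation(latitude=48.5, longitude=9.05, landscape="Wild_forest")
print(record.gps())  # prints +48.50000 +9.05000
```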



Proposed Model Changes for WS221:
---------------------------------


?Variation tag removal

Due to a change in how we process CGH alleles there is no longer any need to calculate the 5' and 3' gap
(the gap between the CGH_deleted_probes and Flanking_sequences).

We therefore propose to remove FivePrimeGap & ThreePrimeGap from the ?Variation model.

< FivePrimeGap
< ThreePrimeGap


For more info mail help@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
	e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
	this directory. You will need approximately 2-3 GB of disk space.

3. Obtain and install a suitable acedb binary for your system
	(available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
	type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
	using xace.
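
Steps 1 and 2 above can also be scripted. The following is a minimal sketch using only the Python standard library. The function name unpack_release and both directory arguments are hypothetical, and it assumes the database.*.tar.gz files have already been downloaded into a single directory. Steps 3 to 5 (installing acedb and running xace) are not covered.

```python
import glob
import os
import tarfile


def unpack_release(download_dir: str, target_dir: str) -> list[str]:
    """Extract every database.*.tar.gz in download_dir into target_dir.

    Returns the list of archives that were unpacked.
    """
    os.makedirs(target_dir, exist_ok=True)  # step 1: create the directory
    pattern = os.path.join(download_dir, "database.*.tar.gz")
    archives = sorted(glob.glob(pattern))
    for archive in archives:  # step 2: unpack each archive in turn
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives


# Example call (paths are placeholders):
#   unpack_release("/users/yourname/downloads", "/users/yourname/wormbase")
```

Only extract archives obtained from a source you trust, since extractall writes whatever paths the archive contains.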

____________  END _____________