WS226

New release of WormBase WS226

WS226 was built by gw3
-===================================================================================-
The WS226 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES) with the following files:
     - G_SPECIES.WS226.genomic.fa.gz                - Unmasked genomic DNA
     - G_SPECIES.WS226.genomic_masked.fa.gz         - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.WS226.genomic_softmasked.fa.gz     - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.WS226.protein.fa.gz                - Current live protein set
     - G_SPECIES.WS226.cds_transcripts.fa.gz        - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.WS226.ncrna_transcripts.fa.gz      - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.WS226.intergenic_sequences.fa.gz   - DNA sequence between pairs of adjacent genes
     - G_SPECIES.WS226.annotations.gff[2|3].gz      - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.WS226.ests.fa.gz                   - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.WS226.best_blastp_hits.txt.gz      - Best blastp matches to human, fly, yeast, and non-WormBase UniProt proteins
     - G_SPECIES.WS226.*pep_package.tar.gz          - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.WS226.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST and/or cDNA
        - G_SPECIES.WS226.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS226.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS226.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS226.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.WS226.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.WS226.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS226.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS226          - the latest database schema (also in above database files)
     - WS226-WS225.dbcomp   - log file reporting difference from last release
     - *Non_C_elegans_BLASTX/          - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - comparative analysis files
     - compara.WS226.tar.bz2     - gene-tree and alignment GFF files
     - wormpep_clw.WS226.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO files for the phenotype, GO and anatomy ontologies, and their gene association files
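
The file names in the species/ directory follow a regular pattern, so paths can be assembled programmatically. The sketch below is illustrative only: the helper names release_file and species_path are hypothetical, and the G_SPECIES value 'c_elegans' is assumed from the G_species example given in the ?Species model notes.

```python
def release_file(g_species, release, kind, ext="fa.gz"):
    """Build a name such as 'c_elegans.WS226.genomic.fa.gz' (hypothetical helper)."""
    return f"{g_species}.{release}.{kind}.{ext}"


def species_path(g_species, release, kind, ext="fa.gz"):
    """Prefix the file name with the per-species sub directory under species/."""
    return f"species/{g_species}/{release_file(g_species, release, kind, ext)}"


print(species_path("c_elegans", "WS226", "genomic"))
print(species_path("c_elegans", "WS226", "annotations", ext="gff3.gz"))
```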


Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule




C. elegans Synchronisation with GenBank / EMBL:
------------------------------------

No synchronisation issues


C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C. elegans genes 47395)
------------------------------------------
Molecular_info              45728 (96.5%)
Concise_description         5777 (12.2%)
Reference                   14160 (29.9%)
WormBase_approved_Gene_name 26208 (55.3%)
RNAi_result                 24628 (52.0%)
Microarray_results          23962 (50.6%)
SAGE_transcript             19144 (40.4%)


C. elegans Wormpep data set:
----------------------------

There are 20439 CDS in autoace, 25171 when counting 4732 alternate splice forms.

The 25171 sequences contain 11,044,670 base pairs in total.

Modified entries      86
Deleted entries       90
New entries           231
Reappeared entries    3

Net change  +144
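
The Wormpep figures above can be cross-checked with simple arithmetic. This is only a sanity check on the reported numbers; the variable names are illustrative and not part of any WormBase tool.

```python
cds = 20439
alternate_splice_forms = 4732
total_sequences = cds + alternate_splice_forms

new_entries = 231
reappeared_entries = 3
deleted_entries = 90
# Modified entries (86) change content but not the number of entries,
# so they are left out of the net change.
net_change = new_entries + reappeared_entries - deleted_entries

print(total_sequences, net_change)  # 25171 144
```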



C. elegans Genome sequence composition:
----------------------------

       	WS226       	WS225      	change
----------------------------------------------
a    	32367418	32367418	  +0
c    	17780787	17780787	  +0
g    	17756985	17756985	  +0
t    	32367086	32367086	  +0
n    	0       	0       	  +0
-    	0       	0       	  +0

Total	100272276	100272276	  +0
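
A composition table like the one above can be recomputed from a genomic FASTA file. The following is a minimal sketch, not the script used for the build: the function name base_composition is hypothetical, and folding all residues to lower case is an assumption made so that soft-masked and unmasked files give the same counts.

```python
import gzip
from collections import Counter


def base_composition(fasta_gz_path):
    """Count residues in a gzipped FASTA file, skipping header lines.

    Residues are lower-cased before counting (an assumption, see above).
    """
    counts = Counter()
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # FASTA header, not sequence
            counts.update(line.strip().lower())
    return counts
```

It could be called as base_composition("c_elegans.WS226.genomic.fa.gz"), assuming that file has been downloaded from the species/ directory; the per-base counts should then sum to the reported total.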


Pristionchus pacificus Genome sequence composition:
----------------------------
 172773083 total
 a 43813958
 c 32811034
 g 32828589
 t 43810996
 - 0
 n 19508506


Caenorhabditis remanei Genome sequence composition:
----------------------------
 145500347 total
 a 42927857
 c 26293828
 g 26276020
 t 42923178
 - 0
 n 7079464


Caenorhabditis japonica Genome sequence composition:
----------------------------
 163282347 total
 a 39053092
 c 25603225
 g 25576971
 t 39126103
 - 0
 n 33922956


Caenorhabditis briggsae Genome sequence composition:
----------------------------
 108419768 total
 a 32984239
 c 19684682
 g 19693545
 t 33054090
 - 0
 n 3003212


Caenorhabditis brenneri Genome sequence composition:
----------------------------
 190484472 total
 a 52238342
 c 32852873
 g 32896829
 t 52180434
 - 0
 n 20315994




Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei Gene count 32431 (Coding 31471)
japonica Gene count 27177 (Coding 25870)
briggsae Gene count 23050 (Coding 21963)
brenneri Gene count 32259 (Coding 30669)
---------------------------------------------




-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               229 (0.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4982 (20.6%)	Some, but not all exon bases are covered by transcript evidence
Predicted             19006 (78.5%)	No transcriptional evidence at all
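
The three confidence levels used in the Protein Stats tables can be expressed as a simple rule on exon coverage. This is a sketch based only on the descriptions given in the tables; the function name and its two arguments are hypothetical and do not reflect how the build pipeline stores transcript evidence.

```python
def confidence_status(exon_bases, covered_bases):
    """Classify a CDS by how many of its exon bases have transcript evidence."""
    if exon_bases <= 0:
        raise ValueError("exon_bases must be positive")
    if not 0 <= covered_bases <= exon_bases:
        raise ValueError("covered_bases must lie between 0 and exon_bases")
    if covered_bases == exon_bases:
        return "Confirmed"            # every base of every exon is covered
    if covered_bases == 0:
        return "Predicted"            # no transcript evidence at all
    return "Partially_confirmed"      # some, but not all, bases covered


print(confidence_status(900, 900))
print(confidence_status(900, 450))
print(confidence_status(900, 0))
```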



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name   3110




-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               956 (3.0%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5662 (18.0%)	Some, but not all exon bases are covered by transcript evidence
Predicted             24858 (79.0%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name   5574




-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1182 (4.6%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4974 (19.2%)	Some, but not all exon bases are covered by transcript evidence
Predicted             19714 (76.2%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name   4897




-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed                53 (0.2%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     854 (3.9%)	Some, but not all exon bases are covered by transcript evidence
Predicted             21080 (95.9%)	No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  21683 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name   5619




-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1510 (4.9%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5638 (18.4%)	Some, but not all exon bases are covered by transcript evidence
Predicted             23524 (76.7%)	No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name   3175




-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             11994 (47.7%)	Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11171 (44.4%)	Some, but not all exon bases are covered by transcript evidence
Predicted              2006 (8.0%)	No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  24665 (98.0%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id            25002 (99.3%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name  24588


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1253                |
| Genes in Operons    3350                |
---------------------------------------------


GO Annotation Stats WS226
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections:  252614

Genes Stats:
----------------
Genes with GO_term connections         87536  
           IEA GO_code present         81607  
       non-IEA GO_code present         5925  

Source of the mapping data             
Source: *RNAi (GFF mapping overlaps)   24596  
        *citace                        2326  
        *Inherited (motif & phenotype) 15067  

GO_terms Stats:
---------------
Total No. GO_terms                     30521  
GO_terms connected to Genes            3433  
GO annotations connected with IEA      1888  
GO annotations connected with non-IEA  1536  
   Breakdown  IC - 3   IDA - 405   ISS - 137 
             IEP - 9   IGI - 127   IMP - 756 
             IPI - 75  NAS - 2     ND  - 1  
             RCA - 0   TAS - 20   


-===================================================================================-

Useful Stats:
---------

Genes with Sequence and WormBase-approved Gene names
WS226 46963 (24588 elegans / 5619 briggsae / 5574 remanei / 4897 japonica / 3175 brenneri / 3110 pristionchus)


-===================================================================================-



New Data:
---------

New gene model predictions
--------------------------

The set of Aggregate Integrated gene models submitted to the modENCODE
DCC by LaDeana Hillier (Waterston Lab) in March 2010 and released by
the modENCODE DCC on 23rd May 2011 has been added to WS226.


New SNP data
------------

WGS SNP data from the Jarriault lab (PMID 20610404).
This dataset consists of nearly 6000 SNPs from 3 C. elegans strains.
They have been assigned the allele prefix "snx-".
Nearly half of these SNPs affect predicted CDSs.


New Caenorhabditis species added to WormBase:
---------------------------------------------

Draft genomes of Caenorhabditis species 7, 9 and 11, sequenced by the
Genome Institute at Washington University, were added to WormBase
and will be made available through GBrowse and BLAST/BLAT. A draft
gene set based on ab initio Augustus predictions is provided, and will
be replaced in one of the next WormBase releases by an RNASeq-based
gene set.
All data for the three new Caenorhabditis species should be considered
DRAFT quality, and have not been submitted to public nucleotide
repositories.

New parasitic nematode Strongyloides ratti:
-------------------------------------------

A draft genome of S. ratti has been made available to the public by the
parasite genomics group of the Wellcome Trust Sanger Institute. It was
incorporated into WormBase together with draft EST/cDNA-guided
Augustus gene predictions and will be accessible through GBrowse as
well as BLAST/BLAT. A revised gene set is being prepared by the
parasite genomics group and will replace the draft set as soon as it is
available. In addition, a revised genomic assembly is in production.
The data available through WormBase should be considered of DRAFT quality.

Caenorhabditis brenneri assembly sync:
--------------------------------------

The C. brenneri assembly and gene set have been synced with
ENA/DDBJ/GenBank, and any changes will be submitted regularly, in line
with C. elegans, C. briggsae and C. remanei. Two contigs not available
through the public repositories will be retired in WS227.


Genome sequence updates:
-----------------------

None

New Fixes:
----------

None

Known Problems:
---------------

None

Other Changes:
--------------

None

Proposed Changes / Forthcoming Data:
-------------------------------------

* gene updates for S. ratti and Caenorhabditis species 7, 9 and 11
* Caenorhabditis species 5 genome



?Transgene model clean up
-------------------------

1. Remove Supporting_data Movie

2. Replace the following tags under Reporter_product with one tag
called Reporter ?Text
GFP
YFP
CFP
Venus
DsRed
mCherry
RFP
LacZ
Other_reporter ?Text

    * Keep Gene ?Gene as is

3. Remove all sub tags (Gamma_ray, X_ray etc.) under Integrated_by.
Reformat this tag as
Integrated_by UNIQUE ?Text


?Rearrangement
--------------

Addition of Introgression to Type hash.

This will allow me to more fully curate introgressed regions e.g.
qqR1(X,CB4856>N2)

?Person
-------

Addition of Affiliation Text // used for storing institute affiliations that aren't in their Address

#Address
--------

Fax Text
Make fax not unique


?Homology_group
---------------

Remove the HOPS and RIO tags, as we don't actually have the data in the database, and add some eggNOG-specific types.

Homology_group
               Group_type  UNIQUE COG    COG_type #Homology_type
                                         COG_code #COG_codes
                                  eggNOG eggNOG_type #Homology_type
                                         eggNOG_code #COG_codes

Homology_type KOG
               TWOG
               FOG
               LSE
               NOG   //eggNOG - standard cluster
               euNOG //eggNOG - eukaryote cluster
               meNOG //eggNOG - metazoan cluster

?Sequence
---------
Add a Checksum tag to the Sequence class:

?Sequence Checksum MD5 Text        // Checksum of upper-cased sequence
                   CRC64 Text      // Checksum of upper-cased sequence
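
As an illustration of the MD5 tag, a checksum of the upper-cased sequence could be computed as follows. This is a sketch under stated assumptions: the function name sequence_md5 is hypothetical, and stripping whitespace before hashing is an assumption not stated in the model. CRC64 is omitted because the Python standard library has no CRC64 implementation.

```python
import hashlib


def sequence_md5(sequence):
    """MD5 hex digest of the upper-cased sequence.

    Whitespace (e.g. line breaks from a FASTA record) is removed first;
    that normalisation is an assumption made for this example.
    """
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


# Soft-masked and unmasked copies of the same sequence give the same checksum.
print(sequence_md5("acgtACGT") == sequence_md5("ACGTACGT"))
```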


?Molecule
---------
Rearrangement added to the Affects_phenotype_of list of objects


?Molecule    Name    ?Text
    Affects_phenotype_of Rearrangement ?Rearrangement ?Phenotype #Ev


?Gene
-----

Human_disease_relevance "Text" #Evidence

Used for storing computationally derived OMIM data from publications. This moves the data out of its current storage place in the Concise_description, where it was breaking the flow (readability) of the description.


?Species
--------

Removal of unused tags and the addition of other name storage tags.

?Species Other_name ?Text
     NCBITaxonomyID UNIQUE Int
     Short_name UNIQUE Text   // e.g. 'C. elegans'
     G_species UNIQUE Text    // e.g. 'c_elegans'
     Properties ?Database_properties // descriptions of sequences and acedb information
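
The two example values for Short_name and G_species suggest how both could be derived from a binomial name. The sketch below only reproduces those two examples; the function names are hypothetical, and it is an assumption that every species follows this pattern (unnamed species such as "Caenorhabditis species 7" clearly would not).

```python
def short_name(binomial):
    """'Caenorhabditis elegans' -> 'C. elegans' (assumed pattern)."""
    genus, species = binomial.split()
    return f"{genus[0].upper()}. {species}"


def g_species(binomial):
    """'Caenorhabditis elegans' -> 'c_elegans' (assumed pattern)."""
    genus, species = binomial.split()
    return f"{genus[0].lower()}_{species.lower()}"


print(short_name("Caenorhabditis elegans"))  # C. elegans
print(g_species("Caenorhabditis elegans"))   # c_elegans
```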



New class: Database_properties - the final model has not been decided but this was the initial proposal
-------------------------------------------------------------------------------------------------------

Describes information associated with a Species in the acedb database

?Database_properties Title Text
             Species UNIQUE ?Species
             Strain UNIQUE ?Strain
             Assembly UNIQUE ?Sequence_collection
             Sequences Chromosome_prefix UNIQUE Text // these are all copied from Species.pm
                   Pep_prefix UNIQUE Text
                   Pepdir_prefix UNIQUE Text
                   CDS_regex UNIQUE Text
                   Seq_name_regex UNIQUE Text
                   CDS_regex_noend UNIQUE Text
                   Wormpep_prefix UNIQUE Text
                   Assembly_type UNIQUE chromosome
                            contig
                   Seq_db UNIQUE Text
                   Wormpep_files Text
                   Upload_db_name Text
                   Mitochondrion ?Sequence


New class: Sequence_collection
------------------------------

Holds a collection of sequences and some descriptions

?Sequence_collection Origin    Name ?Text // name that the author gave this collection
                Species UNIQUE ?Species
                Strain UNIQUE ?Strain
                Laboratory ?Laboratory
                Evidence #Evidence
                Database ?Database ?Database_field ?Accession_number
             History    First_WS_release Int // first WormBase release this assembly was used
                Latest_WS_release Int // latest release where it was used
                Supercedes    UNIQUE ?Sequence_collection
                Superceded_by UNIQUE ?Sequence_collection
             Remark    Text
             Status    UNIQUE Live #Evidence
                       Dead #Evidence
             Sequences    ?Sequence


#Splice_confirmation
--------------------

Added: RNASeq ?Analysis Int Mass_spec ?Mass_spec_peptide

#Splice_confirmation RNASeq ?Analysis Int
             Mass_spec ?Mass_spec_peptide



?Feature
--------

Defined_by_analysis ?Analysis Int - to hold the number of observations from an analysis


Model Changes:
------------------------------------

WS226 models

The models file has been re-tagged in line with the new release schedule.

Two changes to the ?Strain class.

Introduction of an Other_name tag to hold old names associated with a species
(pre-classification names, etc.).

Removal of an XREF to protein.

Added ?Strain/?Species to a minimal set of classes to help
the new website architecture

    * ?Clone
    * ?Condition
    * ?Expr_profile
    * ?Expression_cluster
    * ?Feature_data
    * ?Microarray_experiment
    * ?PCR_product
    * ?SAGE_experiment
    * ?Transcription_factor



For more info mail help@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
	e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
	this directory. You will need approximately 2-3 GB of disk space.

3. Obtain and install a suitable acedb binary for your system
	(available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
	type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
	using xace.
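
Steps 1 and 2 above can also be scripted. The following is a minimal sketch, assuming the database.*.tar.gz files have already been downloaded into a local directory; the function name unpack_release and both directory arguments are hypothetical.

```python
import glob
import os
import tarfile


def unpack_release(download_dir, target_dir):
    """Extract every database.*.tar.gz found in download_dir into target_dir."""
    os.makedirs(target_dir, exist_ok=True)  # step 1: create the directory
    pattern = os.path.join(download_dir, "database.*.tar.gz")
    archives = sorted(glob.glob(pattern))
    for archive in archives:
        # Step 2: unpack and untar. Only extract archives from a trusted source.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives
```

After unpack_release("/path/to/downloads", "/users/yourname/wormbase") has finished, the database can be opened with xace as described in step 4.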

____________  END _____________