WS233


Release notes for WS233


New release of WormBase WS233

WS233 was built by mt3
-===================================================================================-
The WS233 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES) with the following files:
     - G_SPECIES.WS233.genomic.fa.gz                - Unmasked genomic DNA
     - G_SPECIES.WS233.genomic_masked.fa.gz         - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.WS233.genomic_softmasked.fa.gz     - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.WS233.protein.fa.gz                - Current live protein set
     - G_SPECIES.WS233.cds_transcripts.fa.gz        - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.WS233.ncrna_transcripts.fa.gz      - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.WS233.intergenic_sequences.fa.gz   - DNA sequence between pairs of adjacent genes
     - G_SPECIES.WS233.annotations.gff[2|3].gz      - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.WS233.ests.fa.gz                   - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.WS233.best_blastp_hits.txt.gz      - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
     - G_SPECIES.WS233.*pep_package.tar.gz          - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.WS233.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST and/or cDNA
        - G_SPECIES.WS233.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS233.geneIDs.txt.gz                      - List of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS233.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS233.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.WS233.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.WS233.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS233.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS233          - the latest database schema (also in above database files)
     - WS233-WS232.dbcomp   - log file reporting differences from the last release
     - *Non_C_elegans_BLASTX/          - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - comparative analysis files
     - compara.WS233.tar.bz2     - gene-tree and alignment GFF files
     - wormpep_clw.WS233.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO files for the phenotype, GO and anatomy ontologies, with their gene association files
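
The file naming pattern in the species/ listing above can be sketched as a small helper. This is only an illustration: the function name species_file and the G_SPECIES value "c_elegans" are assumptions, not taken from these notes.

```python
# Sketch: build per-species file names following the
# G_SPECIES.WS233.<kind>.<extension> pattern listed above.
RELEASE = "WS233"


def species_file(g_species: str, kind: str, ext: str = "fa.gz",
                 release: str = RELEASE) -> str:
    """Return a file name such as G_SPECIES.WS233.genomic.fa.gz."""
    return f"{g_species}.{release}.{kind}.{ext}"


# "c_elegans" is an assumed G_SPECIES value used for illustration only.
fasta_names = [species_file("c_elegans", kind)
               for kind in ("genomic", "genomic_masked", "protein")]
gff3 = species_file("c_elegans", "annotations", ext="gff3.gz")
```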


Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule



C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C. elegans genes 47559)
------------------------------------------
Molecular_info              45917 (96.5%)
Concise_description         5982  (12.6%)
Human_disease_relevance     88    (0.2%)
Reference                   17219 (36.2%)
WormBase_approved_Gene_name 26890 (56.5%)
RNAi_result                 24881 (52.3%)
Microarray_results          23981 (50.4%)
SAGE_transcript             19223 (40.4%)


C. elegans

Wormpep data set:
----------------------------

There are 26011 CDSs, from 20554 protein-coding genes

The 26011 sequences contain 34548564 base pairs in total.

Modified entries      12
Deleted entries       21
New entries           45
Reappeared entries    4

Net change  +28
The difference (24) between the total number of CDSs in this build (26011) and in the last build (25987) does not equal the net change (28).
Please investigate!
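
The consistency check reported above can be reproduced with a few lines of arithmetic. This sketch assumes that new and reappeared entries add to the total, that deleted entries subtract from it, and that modified entries leave it unchanged.

```python
# Figures copied from the Wormpep summary above.
new, reappeared, deleted = 45, 4, 21
this_build, last_build = 26011, 25987

# Assumed rule: net change = additions - deletions.
net_change = new + reappeared - deleted
observed_difference = this_build - last_build
discrepancy = net_change - observed_difference

print(net_change, observed_difference, discrepancy)  # prints: 28 24 4
```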


C. elegans Genome sequence composition:
----------------------------

        WS233           WS232           change
----------------------------------------------
a       32367418        32367418          +0
c       17780787        17780787          +0
g       17756985        17756985          +0
t       32367086        32367086          +0
n       0               0                 +0
-       0               0                 +0

Total   100272276       100272276         +0
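
As a worked check of the table above, the base counts can be summed and a G+C fraction derived. The G+C fraction is not reported in these notes; it is computed here only as an illustration.

```python
# Base counts for WS233 copied from the composition table above.
counts = {"a": 32367418, "c": 17780787, "g": 17756985,
          "t": 32367086, "n": 0, "-": 0}

total = sum(counts.values())

# Fraction of G and C bases over all bases in the genome.
gc_fraction = (counts["g"] + counts["c"]) / total

print(total)                    # should match the Total row of the table
print(f"{gc_fraction:.2%}")     # G+C content as a percentage
```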


Pristionchus pacificus Genome sequence composition:
----------------------------
 172773083 total
 a 43813958
 c 32811034
 g 32828589
 t 43810996
 - 0
 n 19508506


Caenorhabditis remanei Genome sequence composition:
----------------------------
 145500347 total
 a 42927857
 c 26293828
 g 26276020
 t 42923178
 - 0
 n 7079464


Caenorhabditis japonica Genome sequence composition:
----------------------------
 166565019 total
 a 46865690
 c 30244493
 g 30234317
 t 46807519
 - 0
 n 12413000


Caenorhabditis briggsae Genome sequence composition:
----------------------------
 108419768 total
 a 32984239
 c 19684682
 g 19693545
 t 33054090
 - 62600
 n 2940612


Caenorhabditis brenneri Genome sequence composition:
----------------------------
 190421492 total
 a 52222485
 c 32837458
 g 32882838
 t 52164077
 - 0
 n 20314634




Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei Gene count 32414 (Coding 31445)
japonica Gene count 29964 (Coding 29964)
briggsae Gene count 23027 (Coding 21936)
brenneri Gene count 32362 (Coding 30667)
---------------------------------------------




-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               229 (0.9%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4982 (20.6%)     Some, but not all exon bases are covered by transcript evidence
Predicted             19006 (78.5%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name   3259




-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               961 (3.1%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5658 (18.0%)     Some, but not all exon bases are covered by transcript evidence
Predicted             24831 (79.0%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name   6018




-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1637 (4.5%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5200 (14.4%)     Some, but not all exon bases are covered by transcript evidence
Predicted             29197 (81.0%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name   4973




-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed                54 (0.2%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     859 (3.9%)      Some, but not all exon bases are covered by transcript evidence
Predicted             21048 (95.8%)     No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  21662 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name   6101




-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1510 (4.9%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5638 (18.4%)     Some, but not all exon bases are covered by transcript evidence
Predicted             23522 (76.7%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name   3513




-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             12443 (47.8%)     Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11492 (44.2%)     Some, but not all exon bases are covered by transcript evidence
Predicted              2076 (8.0%)      No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  25947 (99.8%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id            25999 (100.0%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name  25293


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1390                |
| Genes in Operons    3634                |
---------------------------------------------


GO Annotation Stats WS233
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections:  263501

Genes Stats:
----------------
Genes with GO_term connections         90767
           IEA GO_code present         84543
       non-IEA GO_code present         6220

Source of the mapping data
Source: *RNAi (GFF mapping overlaps)   26630
        *citace                        2561
        *Inherited (motif & phenotype) 15134

GO_terms Stats:
---------------
Total No. GO_terms                     30611
GO_terms connected to Genes            3586
GO annotations connected with IEA      1842
GO annotations connected with non-IEA  1727
   Breakdown  IC - 6   IDA - 498   ISS - 151
             IEP - 11   IGI - 147   IMP - 810
             IPI - 83  NAS - 1     ND  - 1
             RCA - 0   TAS - 18


-===================================================================================-

Useful Stats:
---------

Genes with Sequence and WormBase-approved Gene names
WS233 49157 (25293 elegans / 6101 briggsae / 6018 remanei / 4973 japonica / 3513 brenneri / 3259 pristionchus)


-===================================================================================-



New Data:
---------

Transcription Factor Binding sites

The modENCODE Transcription Factor data has been added as 321,212 new
Feature objects with the Method tag 'TF_binding_site_region'.  This
data was derived from the Snyder Lab project to map the binding sites
of a selection of transcription factors in various life-stages and
conditions using ChIP-Seq.

The region where the transcription factor binds has been called by
modENCODE using peak-finding software on the ChIP-Seq results. The
binding sites are therefore currently known only approximately,
typically to within a region of about 200 bases or larger.

These data have a GFF line with the "type" column containing the SO
term 'TF_binding_site' and with the name of the transcription factor
and the ID of the transcription factor object in the database, for
example:


CHROMOSOME_III  TF_binding_site_region  TF_binding_site 14696   15618   .       +       .       Feature "WBsf401679" ; TF_ID "WBTranscriptionFactor000025" ; TF_name "DAF-16"

Where binding sites are known exactly, the Feature object's Method tag
will be changed to 'TF_binding_site'. There are some existing exactly
known sites, from other projects, having Feature objects with the
Method tag 'TF_binding_site'.
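
A minimal sketch of reading such a GFF line is given here. The function name parse_gff2_line is hypothetical, and the parser splits on any whitespace so that it also accepts the example as displayed in these notes; real GFF files are tab-delimited.

```python
import re

GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase")


def parse_gff2_line(line: str) -> dict:
    """Split one GFF2 line into its 8 fixed columns plus attributes."""
    fields = line.strip().split(None, 8)
    record = dict(zip(GFF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes look like: Tag "value" ; Tag "value" ; ...
    attributes = fields[8] if len(fields) > 8 else ""
    record["attributes"] = dict(re.findall(r'(\w+)\s+"([^"]*)"', attributes))
    return record


example = ('CHROMOSOME_III\tTF_binding_site_region\tTF_binding_site\t'
           '14696\t15618\t.\t+\t.\t'
           'Feature "WBsf401679" ; TF_ID "WBTranscriptionFactor000025" ; '
           'TF_name "DAF-16"')
site = parse_gff2_line(example)
region_length = site["end"] - site["start"] + 1
```

With the example line, site["attributes"]["TF_name"] gives the transcription factor name and region_length gives the size of the called region in bases.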




Genome Sequence error sites


Features marking known genome sequence error locations that have
already been corrected by changing the C. elegans reference genome
sequence now have the Method tag 'Corrected_genome_sequence_error'.

These have a GFF line with the "type" column containing the SO term
'base_call_error_correction', for example:

CHROMOSOME_III  RNASeq  base_call_error_correction      661017  661017  .       +       .       Feature "WBsf047774"


We have been working on marking possible genome sequence errors in the
C. elegans reference genome using data from the following projects:


- Weber KP et al. (2010) PLoS One "Whole genome sequencing highlights
  genetic changes associated with laboratory ...."

PMID 21085631

- Doitsidou M et al. (2010) PLoS One "C. elegans mutant identification
  with a one-step whole-genome-sequencing and ...."

PMID 21079745

- McGrath PT et al. (2011) Nature "Parallel evolution of domesticated
  Caenorhabditis species targets pheromone ...."

PMID 21849976

This data has been used to create Feature objects with a Method tag of
'Genome_sequence_error', taking the total of these locations in the
C. elegans database to 2,428.

These data have a GFF line with the "type" column containing the SO
term 'possible_base_call_error', for example:


CHROMOSOME_III  RNASeq  possible_base_call_error        38559   38560   .       +       .       Feature "WBsf268625"
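
The two kinds of error-site features can be told apart by the "type" column alone. The following sketch tallies them from an iterable of GFF lines; the function name count_error_features is hypothetical, and the two sample lines are the examples quoted in these notes.

```python
from collections import Counter

# SO terms used in the "type" column for the two kinds of error site.
ERROR_TYPES = {"base_call_error_correction", "possible_base_call_error"}


def count_error_features(lines):
    """Count GFF lines whose type column is one of ERROR_TYPES."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 8)
        if len(fields) >= 3 and fields[2] in ERROR_TYPES:
            counts[fields[2]] += 1
    return counts


sample = [
    'CHROMOSOME_III\tRNASeq\tbase_call_error_correction\t661017\t661017'
    '\t.\t+\t.\tFeature "WBsf047774"',
    'CHROMOSOME_III\tRNASeq\tpossible_base_call_error\t38559\t38560'
    '\t.\t+\t.\tFeature "WBsf268625"',
]
tally = count_error_features(sample)
```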


No changes have been made yet to the C. elegans reference genome
sequence to correct these locations. We intend to correct the sites
that can be shown to influence coding genes in a future release of
WormBase. There are 77 indel errors in the C. elegans genome that
affect the structure of a coding gene.

There are two possible error sites on the C. elegans mitochondrial
sequence.  As the WormBase consortium does not own the mitochondrial
sequence, we will mark these sites with a Feature object with the
Method tag 'Genome_sequence_error', but we will not be changing this
sequence.


Genome sequence updates:
-----------------------

None this release

New Fixes:
----------


Known Problems:
---------------


Other Changes:
--------------

The life-cycle data in all objects has been changed from being held as
the name of the life-cycle stage to the ID number of the life-cycle
stage.

For example: 'L1 larva' has been changed to be held as the ID
'WBls:0000024' in all objects that refer to it.
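
Scripts that still hold life-cycle stage names may need to translate them to IDs. A minimal sketch follows, assuming a lookup table maintained by the user; only the 'L1 larva' entry comes from these notes, and the function name to_life_stage_id is hypothetical.

```python
# Only this one mapping is stated in these release notes; any further
# entries would have to be taken from the life-stage ontology itself.
LIFE_STAGE_IDS = {"L1 larva": "WBls:0000024"}


def to_life_stage_id(value: str) -> str:
    """Return the life-stage ID for a known name, else the value unchanged."""
    return LIFE_STAGE_IDS.get(value, value)
```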


Model Changes:
------------------------------------

In this release there are changes to a number of classes:

?Strain
?RNAi
?Life_stage
#Interactor_info
?Interaction
?Transgene
?Rearrangement
?PCR_product
?Anatomy_term
?Paper
?Process

More information and a human-readable diff can be found here:
http://wiki.wormbase.org/index.php/WS232_Models.wrm

For more info mail help@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
        e.g. /users/yourname/wormbase

2. Uncompress and untar all of the database.*.tar.gz files into
        this directory. You will need approximately 2-3 GB of disk space.

3. Obtain and install a suitable acedb binary for your system
        (available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
        type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
        using xace.
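
Steps 1 and 2 above can also be scripted. The following is a sketch using only the Python standard library; the function name unpack_release is hypothetical, and the archives are assumed to have been downloaded already from a trusted source.

```python
import glob
import os
import tarfile


def unpack_release(download_dir: str, target_dir: str) -> list:
    """Extract every database.*.tar.gz in download_dir into target_dir."""
    os.makedirs(target_dir, exist_ok=True)  # step 1: create the directory
    pattern = os.path.join(download_dir, "database.*.tar.gz")
    archives = sorted(glob.glob(pattern))
    for archive in archives:                # step 2: uncompress and untar
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives


# Example call (paths are placeholders):
# unpack_release(".", "/users/yourname/wormbase")
```

The function returns the list of archives it unpacked, so an empty list indicates that no database.*.tar.gz files were found in the download directory.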

____________  END _____________