Difference between revisions of "WS233"

 +
__TOC__
 +
 +
== Release notes for WS233 ==
 
<pre>
 
<pre>
New release of WormBase WS232
 
  
 +
New release of WormBase WS233
  
 
+
WS233 was built by mt3
WS232 was built by Michael Paulini [michael.paulini@wormbase.org]
-===================================================================================-
The WS232 build directory includes:
species/ DIR             - contains a sub dir for each WormBase species (G_SPECIES) with the following files:
    - G_SPECIES.WS232.genomic.fa.gz               - Unmasked genomic DNA
    - G_SPECIES.WS232.genomic_masked.fa.gz        - Hard-masked (repeats replaced with Ns) genomic DNA
    - G_SPECIES.WS232.genomic_softmasked.fa.gz    - Soft-masked (repeats lower-cased) genomic DNA
    - G_SPECIES.WS232.protein.fa.gz               - Current live protein set
    - G_SPECIES.WS232.cds_transcripts.fa.gz       - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
    - G_SPECIES.WS232.ncrna_transcripts.fa.gz     - Spliced cDNA sequence for non-coding RNA transcripts
    - G_SPECIES.WS232.intergenic_sequences.fa.gz  - DNA sequence between pairs of adjacent genes
    - G_SPECIES.WS232.annotations.gff[2|3].gz     - Sequence features in either GFF2 or GFF3 format
    - G_SPECIES.WS232.ests.fa.gz                  - ESTs and mRNA sequences extracted from the public databases
    - G_SPECIES.WS232.best_blastp_hits.txt.gz     - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
    - G_SPECIES.WS232.pep_package.tar.gz          - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
    - annotation/                   - contains additional annotations:
        - G_SPECIES.WS232.confirmed_genes.txt.gz             - DNA sequences of all genes confirmed by EST &/or cDNA
        - G_SPECIES.WS232.cDNA2orf.txt.gz                    - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS232.geneIDs.txt.gz                     - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS232.PCR_product2gene.txt.gz            - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS232.oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.WS232.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles
        - G_SPECIES.WS232.SRA_gene_expression.tar.gz         - Tables of gene expression values computed from SRA RNASeq data
acedb DIR                - Everything needed to generate a local copy of the primary database
    - database.WS232.*.tar.gz   - compressed acedb database for new release
    - models.wrm.WS232          - the latest database schema (also in above database files)
    - WS232-WS231.dbcomp        - log file reporting difference from last release
    - *Non_C_elegans_BLASTX/    - This directory contains the blastx data for non-elegans species
                                  (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - comparative analysis files
    - compara.WS232.tar.bz2     - gene-tree and alignment GFF files
    - wormpep_clw.WS232.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - gene_associations, obo files for (phenotype GO anatomy) and associated association files
+
-===================================================================================-
 +
The WS233 build directory includes:
 +
species/ DIR             - contains a sub dir for each WormBase species (G_SPECIES) with the following files:
 +
    - G_SPECIES.WS233.genomic.fa.gz               - Unmasked genomic DNA
 +
    - G_SPECIES.WS233.genomic_masked.fa.gz         - Hard-masked (repeats replaced with Ns) genomic DNA
 +
    - G_SPECIES.WS233.genomic_softmasked.fa.gz     - Soft-masked (repeats lower-cased) genomic DNA
 +
    - G_SPECIES.WS233.protein.fa.gz               - Current live protein set
 +
    - G_SPECIES.WS233.cds_transcripts.fa.gz       - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
 +
    - G_SPECIES.WS233.ncrna_transcripts.fa.gz     - Spliced cDNA sequence for non-coding RNA transcripts
 +
    - G_SPECIES.WS233.intergenic_sequences.fa.gz   - DNA sequence between pairs of adjacent genes
 +
    - G_SPECIES.WS233.annotations.gff[2|3].gz     - Sequence features in either GFF2 or GFF3 format
 +
    - G_SPECIES.WS233.ests.fa.gz                   - ESTs and mRNA sequences extracted from the public databases
 +
    - G_SPECIES.WS233.best_blastp_hits.txt.gz     - Best blastp matches to human, fly, yeast, and non-WormBase UniProt proteins
 +
    - G_SPECIES.WS233.*pep_package.tar.gz         - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
 +
    - annotation/                   - contains additional annotations:
 +
        - G_SPECIES.WS233.confirmed_genes.txt.gz             - DNA sequences of all genes confirmed by EST &/or cDNA
 +
        - G_SPECIES.WS233.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
 +
        - G_SPECIES.WS233.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
 +
        - G_SPECIES.WS233.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
 +
        - G_SPECIES.WS233.*oligo_mapping.txt.gz               - Oligo array mapping files
 +
        - G_SPECIES.WS233.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles
 +
        - G_SPECIES.WS233.SRA_gene_expression.tar.gz         - Tables of gene expression values computed from SRA RNASeq data
 +
acedb DIR               - Everything needed to generate a local copy of the primary database
 +
    - database.WS233.*.tar.gz   - compressed acedb database for new release
 +
    - models.wrm.WS233          - the latest database schema (also in above database files)
 +
    - WS233-WS232.dbcomp   - log file reporting difference from last release
 +
    - *Non_C_elegans_BLASTX/         - This directory contains the blastx data for non-elegans species
 +
                                                    (reduces the size of the main database)
 +
COMPARATIVE_ANALYSIS DIR - comparative analysis files
 +
    - compara.WS233.tar.bz2     - gene-tree and alignment GFF files
 +
    - wormpep_clw.WS233.sql.bz2 - ClustalW protein multiple alignments
 +
ONTOLOGY DIR             - gene_associations, obo files for (phenotype GO anatomy) and associated association files
  
  
 
Release notes on the web:
 
 
+
-------------------------
 
 
 
http://www.wormbase.org/wiki/index.php/Release_Schedule
 
  
 
C. elegans Synchronisation with GenBank / EMBL:
 
 
 
No synchronisation issues
 
  
  
 
C. elegans Chromosomal Changes:
 
 
+
--------------------
 
 
 
There are no changes to the chromosome sequences in this release.
 
  
  
C. elegans Gene data set (Live C. elegans genes 47556)
+
C. elegans Gene data set (Live C. elegans genes 47559)
 
+
------------------------------------------
 
+
Molecular_info             45917 (96.5%)
Molecular_info              45911 (96.5%)
Concise_description         5971  (12.6%)
Reference                   17204 (36.2%)
WormBase_approved_Gene_name 26801 (56.4%)
RNAi_result                 24690 (51.9%)
Microarray_results          23990 (50.4%)
SAGE_transcript             19220 (40.4%)
+
Concise_description         5982  (12.6%)
 +
Human_disease_relevance    88    (0.2%)
 +
Reference                   17219 (36.2%)
 +
WormBase_approved_Gene_name 26890 (56.5%)
 +
RNAi_result                 24881 (52.3%)
 +
Microarray_results         23981 (50.4%)
 +
SAGE_transcript             19223 (40.4%)
  
  
 
C. elegans
 
 
  
 
Wormpep data set:
 
 +
----------------------------
  
 +
There are 26011 CDSs, from 20554 protein-coding genes
  
There are 25987 CDSs, from 20553 protein-coding genes
+
The 26011 sequences contain 34548564 base pairs in total.
  
 +
Modified entries      12
 +
Deleted entries      21
 +
New entries          45
 +
Reappeared entries    4
  
The 25987 sequences contain 34451067 base pairs in total.
+
Net change  +28
 
+
The difference (24) between the total CDSs of this build (26011) and the last build (25987) does not equal the net change (+28).
 
+
Please investigate!
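
The mismatch reported above can be reproduced by hand. The following is a minimal sketch, assuming the net change is computed as new plus reappeared minus deleted entries (an assumption that agrees with the +28 reported here; modified entries are assumed not to alter the number of CDSs). The variable names are illustrative only.

```python
# Entry counts reported for the WS233 Wormpep data set.
new_entries = 45
reappeared_entries = 4
deleted_entries = 21

# Assumed formula: modified entries do not change the number of CDSs.
net_change = new_entries + reappeared_entries - deleted_entries

# Difference between the CDS totals of this build and the last build.
cds_difference = 26011 - 25987

print(f"net change:     {net_change:+d}")
print(f"CDS difference: {cds_difference:+d}")
print(f"unexplained:    {net_change - cds_difference:+d}")
```

Under that assumption, four entries remain unaccounted for, which is the discrepancy the build log flags.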
Modified entries      77
Deleted entries       58
New entries           197
Reappeared entries    7
 
 
 
 
 
Net change +146
 
  
  
 
C. elegans Genome sequence composition:
 
 +
----------------------------
  
 +
        WS233          WS232          change
 +
----------------------------------------------
 +
a      32367418        32367418          +0
 +
c      17780787        17780787          +0
 +
g      17756985        17756985          +0
 +
t      32367086        32367086          +0
 +
n      0              0                +0
 +
-      0              0                +0
  
WS232 WS231 change
+
Total   100272276       100272276         +0
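
The Total row is the sum of the per-base rows. The following is a small sketch that checks this for the WS233 column and derives the GC fraction from the same counts; the dictionary layout and variable names are illustrative assumptions, not part of the build output.

```python
# Per-base counts for the C. elegans genome, WS233 column of the table above.
counts = {
    "a": 32367418,
    "c": 17780787,
    "g": 17756985,
    "t": 32367086,
    "n": 0,
    "-": 0,
}

# The reported total should equal the sum of all rows.
total = sum(counts.values())

# GC fraction computed from the same counts.
gc_fraction = (counts["g"] + counts["c"]) / total

print(total)
print(f"GC fraction: {gc_fraction:.4f}")
```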
 
 
 
 
a      32367418        32367418          +0
c      17780787        17780787          +0
g      17756985        17756985          +0
t      32367086        32367086          +0
n      0               0                 +0
-      0               0                 +0
 
 
 
 
 
Total 100272276 100272276 +0
 
  
  
 
Pristionchus pacificus Genome sequence composition:
 
 
+
----------------------------
 
+
172773083 total
+
a 43813958
 +
c 32811034
 +
g 32828589
 +
t 43810996
 +
- 0
 +
n 19508506
  
  
 
Caenorhabditis remanei Genome sequence composition:
 
 
+
----------------------------
 
+
145500347 total
+
a 42927857
 +
c 26293828
 +
g 26276020
 +
t 42923178
 +
- 0
 +
n 7079464
  
  
 
Caenorhabditis japonica Genome sequence composition:
 
 
+
----------------------------
 
+
166565019 total
+
a 46865690
 +
c 30244493
 +
g 30234317
 +
t 46807519
 +
- 0
 +
n 12413000
  
  
 
Caenorhabditis briggsae Genome sequence composition:
 
 
+
----------------------------
 
+
108419768 total
+
a 32984239
 +
c 19684682
 +
g 19693545
 +
t 33054090
 +
- 62600
 +
n 2940612
  
  
 
Caenorhabditis brenneri Genome sequence composition:
 
 +
----------------------------
 +
190421492 total
 +
a 52222485
 +
c 32837458
 +
g 32882838
 +
t 52164077
 +
- 0
 +
n 20314634
  
  
 
  
  
 
Tier II Gene counts
 
 
+
---------------------------------------------
 
+
pristionchus Gene count 24216 (Coding 24216)
pristionchus Gene count 24216 (Coding 24216)
remanei Gene count 32414 (Coding 31445)
japonica Gene count 29962 (Coding 29962)
briggsae Gene count 23027 (Coding 21936)
+
remanei Gene count 32414 (Coding 31445)
 
+
japonica Gene count 29964 (Coding 29964)
 
+
briggsae Gene count 23027 (Coding 21936)
brenneri Gene count 32331 (Coding 30667)
+
brenneri Gene count 32362 (Coding 30667)
 +
---------------------------------------------
  
  
  
  
 +
-------------------------------------------------
 
Pristionchus pacificus Protein Stats:
 
 
+
-------------------------------------------------
 
 
 
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
 
 +
-------------------------------------------------
 +
Confirmed              229 (0.9%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
 +
Partially_confirmed    4982 (20.6%)    Some, but not all exon bases are covered by transcript evidence
 +
Predicted            19006 (78.5%)    No transcriptional evidence at all
  
 
 
  
  
 
Gene <-> CDS,Transcript,Pseudogene connections
 
 
+
----------------------------------------------
 
+
Pristionchus pacificus entries with WormBase-approved Gene name   3259
Pristionchus pacificus entries with WormBase-approved Gene name 3248
 
  
  
  
  
 +
-------------------------------------------------
 
Caenorhabditis remanei Protein Stats:
 
 
+
-------------------------------------------------
 
 
 
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
 
 +
-------------------------------------------------
 +
Confirmed              961 (3.1%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
 +
Partially_confirmed    5658 (18.0%)    Some, but not all exon bases are covered by transcript evidence
 +
Predicted            24831 (79.0%)    No transcriptional evidence at all
  
 
 
  
  
 
Gene <-> CDS,Transcript,Pseudogene connections
 
 
+
----------------------------------------------
 
+
Caenorhabditis remanei entries with WormBase-approved Gene name   6018
Caenorhabditis remanei entries with WormBase-approved Gene name 5982
 
  
  
  
  
 +
-------------------------------------------------
 
Caenorhabditis japonica Protein Stats:
 
 
+
-------------------------------------------------
 
 
 
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
 
 +
-------------------------------------------------
 +
Confirmed              1637 (4.5%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
 +
Partially_confirmed    5200 (14.4%)    Some, but not all exon bases are covered by transcript evidence
 +
Predicted            29197 (81.0%)    No transcriptional evidence at all
  
 
Confirmed               176 (0.5%)     Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     578 (1.6%)     Some, but not all exon bases are covered by transcript evidence
Predicted             35351 (97.9%)    No transcriptional evidence at all
 
  
  
 
Gene <-> CDS,Transcript,Pseudogene connections
 
 
+
----------------------------------------------
 
+
Caenorhabditis japonica entries with WormBase-approved Gene name   4973
Caenorhabditis japonica entries with WormBase-approved Gene name 4945
 
  
  
  
  
 +
-------------------------------------------------
 
Caenorhabditis briggsae Protein Stats:
 
 
+
-------------------------------------------------
 
 
 
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
 
 +
-------------------------------------------------
 +
Confirmed                54 (0.2%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
 +
Partially_confirmed    859 (3.9%)      Some, but not all exon bases are covered by transcript evidence
 +
Predicted            21048 (95.8%)    No transcriptional evidence at all
  
 
 
  
  
 
Status of entries: Protein Accessions
 
 
+
-------------------------------------
 
+
UniProtKB accessions 21662 (98.6%)
 
 
 
  
 
Gene <-> CDS,Transcript,Pseudogene connections
 
 
+
----------------------------------------------
 
+
Caenorhabditis briggsae entries with WormBase-approved Gene name   6101
Caenorhabditis briggsae entries with WormBase-approved Gene name 6064
 
  
  
  
  
 +
-------------------------------------------------
 
Caenorhabditis brenneri Protein Stats:
 
 
+
-------------------------------------------------
 
 
 
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
 
 +
-------------------------------------------------
 +
Confirmed              1510 (4.9%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
 +
Partially_confirmed    5638 (18.4%)    Some, but not all exon bases are covered by transcript evidence
 +
Predicted            23522 (76.7%)    No transcriptional evidence at all
  
 
 
  
  
 
Gene <-> CDS,Transcript,Pseudogene connections
 
 
+
----------------------------------------------
 
+
Caenorhabditis brenneri entries with WormBase-approved Gene name   3513
Caenorhabditis brenneri entries with WormBase-approved Gene name 3484
 
  
  
  
  
 +
-------------------------------------------------
 
Caenorhabditis elegans Protein Stats:
 
 
+
-------------------------------------------------
 
 
 
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
 
 +
-------------------------------------------------
 +
Confirmed            12443 (47.8%)    Every base of every exon has transcription evidence (mRNA, EST etc.)
 +
Partially_confirmed  11492 (44.2%)    Some, but not all exon bases are covered by transcript evidence
 +
Predicted              2076 (8.0%)      No transcriptional evidence at all
  
 
Confirmed             12440 (47.9%)    Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11472 (44.1%)    Some, but not all exon bases are covered by transcript evidence
Predicted              2075 (8.0%)     No transcriptional evidence at all
 
  
  
 
Status of entries: Protein Accessions
 
 
+
-------------------------------------
 
+
UniProtKB accessions 25947 (99.8%)
UniProtKB accessions 25814 (99.3%)
 
 
 
  
 
Status of entries: Protein_ID's in EMBL
 
 
+
---------------------------------------
 
+
Protein_id           25999 (100.0%)
Protein_id 25975 (100.0%)
 
 
 
  
 
Gene <-> CDS,Transcript,Pseudogene connections
 
 
+
----------------------------------------------
 
+
Caenorhabditis elegans entries with WormBase-approved Gene name 25293
Caenorhabditis elegans entries with WormBase-approved Gene name 25201
 
  
  
 
C. elegans Operons Stats
 
 
+
---------------------------------------------
 
 
 
Description: These exist as closely spaced gene clusters similar to bacterial operons
 
 +
---------------------------------------------
 +
| Live Operons        1390                |
 +
| Genes in Operons    3634                |
 +
---------------------------------------------
  
  
+
GO Annotation Stats WS233
 
+
--------------------------------------
 
 
 
 
 
 
 
GO Annotation Stats WS232
 
 
 
  
 
GO_codes - used for assigning evidence
 
 
+
--------------------------------------
 
+
IC Inferred by Curator
IC Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND No Biological Data available
RCA Inferred from Reviewed Computational Analysis
+
IDA Inferred from Direct Assay
 
+
IEA Inferred from Electronic Annotation
 
+
IEP Inferred from Expression Pattern
 +
IGI Inferred from Genetic Interaction
 +
IMP Inferred from Mutant Phenotype
 +
IPI Inferred from Physical Interaction
 +
ISS Inferred from Sequence (or Structural) Similarity
 +
NAS Non-traceable Author Statement
 +
ND No Biological Data available
 +
RCA Inferred from Reviewed Computational Analysis
 
TAS Traceable Author Statement
 
 +
------------------------------------------------
  
 
+
Total number of Gene::GO connections: 263501
Total number of Gene::GO connections: 290838
 
 
 
  
 
Genes Stats:
 
 
+
----------------
 
+
Genes with GO_term connections         90767
Genes with GO_term connections 96609
+
          IEA GO_code present         84543
IEA GO_code present 91223
+
      non-IEA GO_code present         6220
non-IEA GO_code present 5382
 
 
 
  
 
Source of the mapping data
 
Source: *RNAi (GFF mapping overlaps) 20107
+
Source: *RNAi (GFF mapping overlaps)   26630
*citace 2560
+
        *citace                       2561
*Inherited (motif & phenotype) 15076
+
        *Inherited (motif & phenotype) 15134
 
 
  
 
GO_terms Stats:
 
 
+
---------------
 
+
Total No. GO_terms                     30611
Total No. GO_terms 30606
+
GO_terms connected to Genes           3586
GO_terms connected to Genes 3572
+
GO annotations connected with IEA     1842
GO annotations connected with IEA 1833
+
GO annotations connected with non-IEA 1727
GO annotations connected with non-IEA 1722
+
  Breakdown IC - 6   IDA - 498  ISS - 151
Breakdown IC - 6 IDA - 501 ISS - 148 IEP - 11 IGI - 148 IMP - 803 IPI - 84 NAS - 1 ND - 1
+
            IEP - 11   IGI - 147  IMP - 810
RCA - 0 TAS - 18
+
            IPI - 83  NAS - 1     ND - 1
 +
            RCA - 0   TAS - 18
  
  
 
-===================================================================================-
 
 
  
 
Useful Stats:
 
 +
---------
  
 +
Genes with Sequence and WormBase-approved Gene names
 +
WS233 49157 (25293 elegans / 6101 briggsae / 6018 remanei / 4973 japonica / 3513 brenneri / 3259 pristionchus)
  
Genes with Sequence and WormBase-approved Gene names
WS232 48924 (25201 elegans / 6064 briggsae / 5982 remanei / 4945 japonica / 3484 brenneri / 3248 pristionchus)
 
  
 +
-===================================================================================-
  
 
  
  
 
New Data:
 
 +
---------
  
 +
Transcription Factor Binding sites
  
New Transcriptionally Active Region Features
+
The modENCODE Transcription Factor data has been added as 321,212 new
 +
Feature objects with the Method tag 'TF_binding_site_region'.  This
 +
data was derived from the Snyder Lab project to map the binding sites
 +
of a selection of transcription factors in various life-stages and
 +
conditions using ChIP-Seq.
  
 +
The region where the transcription factor binds has been called by
 +
modENCODE using peak-finding software on the ChIP-Seq results; the
 +
binding sites are therefore currently known only approximately,
 +
typically in a region of about 200 bases, or larger.
  
The Tiling Array data specifying Transcriptionally Active Regions (TARs) from David Miller's lab: http://intermine.modencode.org/query/experiment.do?experiment=Identification+of+tissue+and+stage-specific+transcribed+sequences+with+expression+profile+maps has been added to the database as Feature_data objects with a GFF source of 'TranscriptionallyActiveRegion'.
+
These data have a GFF line with the "type" column containing the SO
 +
term 'TF_binding_site' and with the name of the transcription factor
 +
and the ID of the transcription factor object in the database, for
 +
example:
  
  
Genome sequence updates:
+
CHROMOSOME_III  TF_binding_site_region  TF_binding_site 14696  15618  .      +      .      Feature "WBsf401679" ; TF_ID "WBTranscriptionFactor000025" ; TF_name "DAF-16"
  
 +
Where binding sites are known exactly, the Feature object's Method tag
 +
will be changed to 'TF_binding_site'. There are some existing exactly
 +
known sites, from other projects, having Feature objects with the
 +
Method tag 'TF_binding_site'.
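
The GFF line shown above can be split into its columns and attributes with a few lines of code. The following is a minimal Python sketch, not a full GFF parser: it assumes the first eight columns contain no internal whitespace and that attributes follow the Key "value" ; Key "value" layout of the example. The function name parse_gff2_line is illustrative only.

```python
import re

def parse_gff2_line(line):
    """Parse one GFF2-style line into a dict (sketch, not a full parser)."""
    # Split on runs of whitespace into the nine GFF columns; the ninth
    # (attribute) column keeps its internal spaces.
    (seqid, source, ftype, start, end,
     score, strand, phase, attrs) = line.strip().split(None, 8)
    # Attributes look like: Key "value" ; Key "value"
    attributes = dict(re.findall(r'(\w+)\s+"([^"]*)"', attrs))
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end), "strand": strand,
        "attributes": attributes,
    }

# The DAF-16 example line from the release notes.
example = ('CHROMOSOME_III TF_binding_site_region TF_binding_site 14696 15618 '
           '. + . Feature "WBsf401679" ; TF_ID "WBTranscriptionFactor000025" ; '
           'TF_name "DAF-16"')
record = parse_gff2_line(example)
print(record["type"], record["start"], record["end"],
      record["attributes"]["TF_name"])
```

Filtering on the source column ('TF_binding_site_region' versus 'TF_binding_site') would then separate approximately known regions from exactly known sites.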
  
C.angaria update
 
  
  
The C.angaria assembly has been updated to the latest genome draft available from the Schwarz lab at CalTech. It also includes a new set of gene predictions provided by CalTech, and can be downloaded from ftp://ftp.wormbase.org/pub/wormbase/releases/WS232/species/c_angaria
 
  
 +
Genome Sequence error sites
  
New Fixes:
 
  
 +
Features marking known genome sequence error locations that have
 +
already been corrected by changing the C. elegans reference genome
 +
sequence now have the Method tag 'Corrected_genome_sequence_error'.
  
Known Problems:
+
These have a GFF line with the "type" column containing the SO term
 +
'base_call_error_correction', for example:
  
 +
CHROMOSOME_III  RNASeq  base_call_error_correction      661017  661017  .      +      .      Feature "WBsf047774"
  
Other Changes:
 
  
 +
We have been working on marking possible genome sequence errors in the
 +
C. elegans reference genome using data from the following projects:
  
C.elegans genetic map frozen
 
  
 +
- Weber KP et al. (2010) PLoS One "Whole genome sequencing highlights
 +
  genetic changes associated with laboratory ...."
  
The genetic map of C. elegans has changed very little in the past three years in terms of recombinational map distances and marker gene locations. It is therefore being frozen, from WS232 onward. In future, new genetic loci, deficiencies and duplications will continue to be added to the genetic map, but these will simply be interpolated into the existing map.
+
PMID 21085631
  
 +
- Doitsidou M et al. (2010) PLoS One "C. elegans mutant identification
 +
  with a one-step whole-genome-sequencing and ...."
  
Renaming of naturally occurring variation data
+
PMID 21079745
  
 +
- McGrath PT et al. (2011) Nature "Parallel evolution of domesticated
 +
  Caenorhabditis species targets pheromone ...."
  
From WS232 on, all naturally occurring variation data objects will be identified by their WBVariationID. Two datasets are currently named in accordance with this policy in WS232:
+
PMID 21849976
  
 +
These data have been used to create Feature objects with a Method tag of
 +
'Genome_sequence_error', taking the total of these locations in the
 +
C. elegans database to 2,428.
  
1) PMID 21849976 McGrath PT et al. (2011) Nature "Parallel evolution of domesticated Caenorhabditis species targets pheromone ...."
+
These data have a GFF line with the "type" column containing the SO
 +
term 'possible_base_call_error', for example:
  
  
2) Cutter/Stein - four distinct C.elegans strains vs. N2.
+
CHROMOSOME_III  RNASeq  possible_base_call_error        38559  38560  .       +      .       Feature "WBsf268625"
  
  
Proposed Changes / Forthcoming Data:
+
No changes have been made yet to the C. elegans reference genome
 +
sequence to correct these locations. We intend to correct the sites
 +
that can be shown to influence coding genes in a future release of
 +
WormBase. There are 77 indel errors in the C. elegans genome that
 +
affect the structure of a coding gene.
  
 +
There are two possible error sites on the C. elegans mitochondrial
 +
sequence.  As the WormBase consortium does not own the mitochondrial
 +
sequence, we will mark these sites with a Feature object with the
 +
Method tag 'Genome_sequence_error', but we will not be changing this
 +
sequence.
  
Renaming of all naturally occurring variations already held in WormBase will be complete in WS233.
 
  
 +
Genome sequence updates:
 +
-----------------------
  
Model Changes:
+
None this release
  
 +
New Fixes:
 +
----------
  
1.) ?Interaction and #Interactor_info
 
  
 +
Known Problems:
 +
---------------
  
Interactor_info
 
  
 +
Other Changes:
 +
--------------
  
Remove: Antibody Remark
+
Life-cycle stages in all objects are now held as the ID of the
 +
life-cycle stage rather than as its name.
  
 +
For example: 'L1 larva' has been changed to be held as the ID
 +
'WBls:0000024' in all objects that refer to it.
  
Rename: Antibody_info -> Antibody
 
  
 +
Model Changes:
 +
------------------------------------
  
?Interaction Add: Antibody_remark ?Text
+
In this release there are changes to a number of classes:
  
 +
?Strain
 +
?RNAi
 +
?Life_stage
 +
#Interactor_info
 +
?Interaction
 +
?Transgene
 +
?Rearrangement
 +
?PCR_product
 +
?Anatomy_term
 +
?Paper
 +
?Process
  
+
More information and a human readable diff can be found here:
 +
http://wiki.wormbase.org/index.php/WS232_Models.wrm
  
 
+
For more info mail help@wormbase.org
+
-===================================================================================-
  
  
 
Quick installation guide for UNIX/Linux systems
 
 +
-----------------------------------------------
  
 +
1. Create a new directory to contain your copy of WormBase,
 +
        e.g. /users/yourname/wormbase
  
+
2. Unpack and untar all of the database.*.tar.gz files into
 
+
        this directory. You will need approximately 2-3 GB of disk space.
 
 
 
 
  
+
3. Obtain and install a suitable acedb binary for your system
 +
        (available from www.acedb.org).
  
+
4. Use the acedb 'xace' program to open your database, e.g.
 +
        type 'xace /users/yourname/wormbase' at the command prompt.
  
 +
5. See the acedb website for more information about acedb and
 +
        using xace.
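
Steps 1 and 2 can also be scripted. The following is a minimal Python sketch, assuming the database.*.tar.gz files have already been downloaded to a local directory; the function name unpack_release and both directory arguments are illustrative and are not part of WormBase or acedb.

```python
import glob
import os
import tarfile

def unpack_release(download_dir, target_dir):
    """Unpack every database.*.tar.gz in download_dir into target_dir."""
    # Step 1: create the directory that will hold the local copy.
    os.makedirs(target_dir, exist_ok=True)
    # Step 2: unpack and untar each database archive into that directory.
    archives = sorted(glob.glob(os.path.join(download_dir, "database.*.tar.gz")))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives

# Hypothetical usage; both paths are placeholders:
# unpack_release("/users/yourname/downloads", "/users/yourname/wormbase")
```

Steps 3 to 5 (installing an acedb binary and opening the database with xace) are unchanged.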
  
 +
____________  END _____________
  
__________END___________
 
 
</pre>
 
</pre>

Latest revision as of 14:17, 13 September 2012

Release notes for WS233


New release of WormBase WS233

WS233 was built by mt3
-===================================================================================-
The WS233 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES) with the following files:
     - G_SPECIES.WS233.genomic.fa.gz                - Unmasked genomic DNA
     - G_SPECIES.WS233.genomic_masked.fa.gz         - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.WS233.genomic_softmasked.fa.gz     - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.WS233.protein.fa.gz                - Current live protein set
     - G_SPECIES.WS233.cds_transcripts.fa.gz        - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.WS233.ncrna_transcripts.fa.gz      - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.WS233.intergenic_sequences.fa.gz   - DNA sequence between pairs of adjacent genes
     - G_SPECIES.WS233.annotations.gff[2|3].gz      - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.WS233.ests.fa.gz                   - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.WS233.best_blastp_hits.txt.gz      - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
     - G_SPECIES.WS233.*pep_package.tar.gz          - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.WS233.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST &/or cDNA
        - G_SPECIES.WS233.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS233.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS233.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS233.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.WS233.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.WS233.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS233.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS233          - the latest database schema (also in above database files)
     - WS233-WS232.dbcomp   - log file reporting difference from last release
     - *Non_C_elegans_BLASTX/          - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - comparative analysis files
     - compara.WS233.tar.bz2     - gene-tree and alignment GFF files
     - wormpep_clw.WS233.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO files for the phenotype, GO and anatomy ontologies, and their gene association files
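
The per-species naming pattern above can be sketched in Python. This is only an illustration: the helper name species_files and the example value "c_elegans" for G_SPECIES are assumptions, not names taken from the build, and only a subset of the listed suffixes is included.

```python
# Hypothetical helper: builds file names following the
# G_SPECIES.<release>.<suffix> pattern from the listing above.
# The suffixes are copied from the species/ listing; the function
# name and the example species prefix are assumptions.
SUFFIXES = [
    "genomic.fa.gz",
    "genomic_masked.fa.gz",
    "genomic_softmasked.fa.gz",
    "protein.fa.gz",
    "cds_transcripts.fa.gz",
    "ncrna_transcripts.fa.gz",
    "intergenic_sequences.fa.gz",
    "ests.fa.gz",
    "best_blastp_hits.txt.gz",
]


def species_files(g_species, release="WS233"):
    """Return the expected file names for one species and release."""
    return [f"{g_species}.{release}.{suffix}" for suffix in SUFFIXES]


# "c_elegans" is an assumed example of a G_SPECIES value.
files = species_files("c_elegans")
```

The actual G_SPECIES prefixes should be read from the species/ directory itself rather than guessed.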


Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule



C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C. elegans genes 47559)
------------------------------------------
Molecular_info              45917 (96.5%)
Concise_description         5982  (12.6%)
Human_disease_relevance     88    (0.2%)
Reference                   17219 (36.2%)
WormBase_approved_Gene_name 26890 (56.5%)
RNAi_result                 24881 (52.3%)
Microarray_results          23981 (50.4%)
SAGE_transcript             19223 (40.4%)


C. elegans

Wormpep data set:
----------------------------

There are 26011 CDSs, from 20554 protein-coding genes

The 26011 sequences contain 34548564 base pairs in total.

Modified entries      12
Deleted entries       21
New entries           45
Reappeared entries    4

Net change  +28
The difference (24) between the total number of CDSs in this build (26011) and the last build (25987) does not equal the net change (28).
Please investigate!
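
The mismatch reported above can be reproduced with a few lines of Python. The sketch assumes that the net change is computed as new + reappeared - deleted, which matches the reported value of 28; the build scripts may compute it differently.

```python
# Figures copied from the Wormpep summary above.
deleted, new, reappeared = 21, 45, 4
current_total, previous_total = 26011, 25987

# Assumption: net change = new + reappeared - deleted.
net_change = new + reappeared - deleted

# Difference actually observed between the two builds.
observed_change = current_total - previous_total

# Number of entries left unexplained by the summary counts.
discrepancy = net_change - observed_change
```

Under this assumption four entries are unaccounted for, which is the difference the warning asks to have investigated.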


C. elegans Genome sequence composition:
----------------------------

        WS233           WS232           change
----------------------------------------------
a       32367418        32367418          +0
c       17780787        17780787          +0
g       17756985        17756985          +0
t       32367086        32367086          +0
n       0               0                 +0
-       0               0                 +0

Total   100272276       100272276         +0
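
As a quick consistency check on the table above, the base counts can be summed and a GC fraction derived in Python. This is a minimal sketch; the variable names are illustrative only.

```python
# Base counts for WS233 copied from the composition table above.
composition = {
    "a": 32367418,
    "c": 17780787,
    "g": 17756985,
    "t": 32367086,
    "n": 0,
    "-": 0,
}

# The sum of all counts should equal the reported total.
total = sum(composition.values())

# GC fraction over the called bases (a, c, g, t).
called = sum(composition[base] for base in "acgt")
gc_fraction = (composition["g"] + composition["c"]) / called
```

The same calculation applies to the Tier II species, where the n and - counts are non-zero and should be excluded from the denominator as done here.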


Pristionchus pacificus Genome sequence composition:
----------------------------
 172773083 total
 a 43813958
 c 32811034
 g 32828589
 t 43810996
 - 0
 n 19508506


Caenorhabditis remanei Genome sequence composition:
----------------------------
 145500347 total
 a 42927857
 c 26293828
 g 26276020
 t 42923178
 - 0
 n 7079464


Caenorhabditis japonica Genome sequence composition:
----------------------------
 166565019 total
 a 46865690
 c 30244493
 g 30234317
 t 46807519
 - 0
 n 12413000


Caenorhabditis briggsae Genome sequence composition:
----------------------------
 108419768 total
 a 32984239
 c 19684682
 g 19693545
 t 33054090
 - 62600
 n 2940612


Caenorhabditis brenneri Genome sequence composition:
----------------------------
 190421492 total
 a 52222485
 c 32837458
 g 32882838
 t 52164077
 - 0
 n 20314634




Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei Gene count 32414 (Coding 31445)
japonica Gene count 29964 (Coding 29964)
briggsae Gene count 23027 (Coding 21936)
brenneri Gene count 32362 (Coding 30667)
---------------------------------------------




-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               229 (0.9%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    4982 (20.6%)     Some, but not all exon bases are covered by transcript evidence
Predicted             19006 (78.5%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name   3259




-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               961 (3.1%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5658 (18.0%)     Some, but not all exon bases are covered by transcript evidence
Predicted             24831 (79.0%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name   6018




-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1637 (4.5%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5200 (14.4%)     Some, but not all exon bases are covered by transcript evidence
Predicted             29197 (81.0%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name   4973




-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed                54 (0.2%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed     859 (3.9%)      Some, but not all exon bases are covered by transcript evidence
Predicted             21048 (95.8%)     No transcriptional evidence at all



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  21662 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name   6101




-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              1510 (4.9%)      Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    5638 (18.4%)     Some, but not all exon bases are covered by transcript evidence
Predicted             23522 (76.7%)     No transcriptional evidence at all



Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name   3513




-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             12443 (47.8%)     Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   11492 (44.2%)     Some, but not all exon bases are covered by transcript evidence
Predicted              2076 (8.0%)      No transcriptional evidence at all
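
The percentages in the table above follow from the counts, assuming each is the count divided by the sum of the three status classes and rounded to one decimal place. A short Python sketch under that assumption:

```python
# C. elegans counts copied from the table above.
counts = {
    "Confirmed": 12443,
    "Partially_confirmed": 11492,
    "Predicted": 2076,
}

# The three classes together account for all CDSs.
total = sum(counts.values())

# Percentage of each class, rounded to one decimal place.
percentages = {
    status: round(100 * n / total, 1) for status, n in counts.items()
}
```

The total obtained this way agrees with the 26011 CDSs reported in the Wormpep section.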



Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  25947 (99.8%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id            25999 (100.0%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name  25293


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1390                |
| Genes in Operons    3634                |
---------------------------------------------


GO Annotation Stats WS233
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections:  263501

Genes Stats:
----------------
Genes with GO_term connections         90767
           IEA GO_code present         84543
       non-IEA GO_code present         6220

Source of the mapping data
Source: *RNAi (GFF mapping overlaps)   26630
        *citace                        2561
        *Inherited (motif & phenotype) 15134

GO_terms Stats:
---------------
Total No. GO_terms                     30611
GO_terms connected to Genes            3586
GO annotations connected with IEA      1842
GO annotations connected with non-IEA  1727
   Breakdown  IC - 6   IDA - 498   ISS - 151
             IEP - 11   IGI - 147   IMP - 810
             IPI - 83  NAS - 1     ND  - 1
             RCA - 0   TAS - 18


-===================================================================================-

Useful Stats:
---------

Genes with Sequence and WormBase-approved Gene names
WS233 49157 (25293 elegans / 6101 briggsae / 6018 remanei / 4973 japonica / 3513 brenneri / 3259 pristionchus)


-===================================================================================-



New Data:
---------

Transcription Factor Binding sites

The modENCODE Transcription Factor data has been added as 321,212 new
Feature objects with the Method tag 'TF_binding_site_region'.  This
data was derived from the Snyder Lab project to map the binding sites
of a selection of transcription factors in various life-stages and
conditions using ChIP-Seq.

The region where the transcription factor binds has been called by
modENCODE using peak-finding software on the ChIP-Seq results. The
binding sites are therefore currently known only approximately,
typically to within a region of about 200 bases or larger.

Each of these features has a GFF line whose "type" column contains the
SO term 'TF_binding_site' and whose attributes give the name of the
transcription factor and the ID of the transcription factor object in
the database, for example:


CHROMOSOME_III  TF_binding_site_region  TF_binding_site 14696   15618   .       +       .       Feature "WBsf401679" ; TF_ID "WBTranscriptionFactor000025" ; TF_name "DAF-16"

Where binding sites are known exactly, the Feature object's Method tag
will be changed to 'TF_binding_site'. There are some existing exactly
known sites, from other projects, having Feature objects with the
Method tag 'TF_binding_site'.
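
A GFF line of the form shown above can be read with a small Python function. This is a sketch only: the function name parse_gff2_line is hypothetical, it assumes nine whitespace-separated columns with 'key "value"' attribute pairs separated by ';', and it does not handle attribute values that themselves contain a ';'.

```python
# Example line copied from the release notes above.
SAMPLE = (
    'CHROMOSOME_III  TF_binding_site_region  TF_binding_site 14696   15618'
    '   .       +       .       Feature "WBsf401679" ; '
    'TF_ID "WBTranscriptionFactor000025" ; TF_name "DAF-16"'
)


def parse_gff2_line(line):
    """Split one GFF2 line into its columns and attribute pairs."""
    # The first eight columns contain no spaces; the ninth is the rest.
    fields = line.rstrip("\n").split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 GFF columns, got %d" % len(fields))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = fields

    attributes = {}
    for part in attr_text.split(";"):
        part = part.strip()
        if not part:
            continue
        # Each attribute is a key followed by a quoted value.
        key, _, value = part.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


record = parse_gff2_line(SAMPLE)
# Length of the called binding region in bases (inclusive coordinates).
region_length = record["end"] - record["start"] + 1
```

For the example line the source column holds 'TF_binding_site_region' and the type column holds 'TF_binding_site', so both can be used to select these features.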




Genome Sequence error sites


Features marking known genome sequence error locations that have
already been corrected by changing the C. elegans reference genome
sequence now have the Method tag 'Corrected_genome_sequence_error'.

These have a GFF line with the "type" column containing the SO term
'base_call_error_correction', for example:

CHROMOSOME_III  RNASeq  base_call_error_correction      661017  661017  .       +       .       Feature "WBsf047774"


We have been working on marking possible genome sequence errors in the
C. elegans reference genome using data from the following projects:


- Weber KP et al. (2010) PLoS One "Whole genome sequencing highlights
  genetic changes associated with laboratory ...."

PMID 21085631

- Doitsidou M et al. (2010) PLoS One "C. elegans mutant identification
  with a one-step whole-genome-sequencing and ...."

PMID 21079745

- McGrath PT et al. (2011) Nature "Parallel evolution of domesticated
  Caenorhabditis species targets pheromone ...."

PMID 21849976

These data have been used to create Feature objects with a Method tag of
'Genome_sequence_error', taking the total of these locations in the
C. elegans database to 2,428.

These data have a GFF line with the "type" column containing the SO
term 'possible_base_call_error', for example:


CHROMOSOME_III  RNASeq  possible_base_call_error        38559   38560   .       +       .       Feature "WBsf268625"


No changes have been made yet to the C. elegans reference genome
sequence to correct these locations. We intend to correct the sites
that can be shown to influence coding genes in a future release of
WormBase. There are 77 indel errors in the C. elegans genome that
affect the structure of a coding gene.

There are two possible error sites on the C. elegans mitochondrial
sequence.  As the WormBase consortium does not own the mitochondrial
sequence, we will mark these sites with a Feature object with the
Method tag 'Genome_sequence_error', but we will not be changing this
sequence.
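
In the example GFF lines for sequence error sites the source column reads 'RNASeq', so the Method tag of the Feature object has to be inferred from the SO term in the "type" column. A minimal Python sketch of that lookup, using only the two pairings described in this section (the names SO_TYPE_TO_METHOD and method_for_type are hypothetical):

```python
# SO term in the GFF "type" column -> Method tag of the Feature object,
# as described in this section of the release notes.
SO_TYPE_TO_METHOD = {
    "base_call_error_correction": "Corrected_genome_sequence_error",
    "possible_base_call_error": "Genome_sequence_error",
}


def method_for_type(so_type):
    """Return the Method tag for a sequence-error SO term, or None."""
    return SO_TYPE_TO_METHOD.get(so_type)
```

Any other SO term returns None, so the lookup can be applied to every line of a GFF file without raising an error.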


Genome sequence updates:
-----------------------

None this release

New Fixes:
----------


Known Problems:
---------------


Other Changes:
--------------

The life-cycle data in all objects has been changed from being held as
the name of the life-cycle stage to the ID number of the life-cycle
stage.

For example: 'L1 larva' has been changed to be held as the ID
'WBls:0000024' in all objects that refer to it.
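
Scripts that still hold life-cycle stages by name will need a conversion step. The Python sketch below shows one possible shape for it; the table contains only the single pairing given above, and the names LIFE_STAGE_IDS and to_life_stage_id are hypothetical. A complete table would have to be taken from the life-stage objects in the database.

```python
# Name -> ID table. Only the example from the release notes is known
# here; other stages must be filled in from the database.
LIFE_STAGE_IDS = {
    "L1 larva": "WBls:0000024",
}


def to_life_stage_id(value):
    """Return the life-stage ID for a stage name or an existing ID."""
    if value.startswith("WBls:"):
        # Already an ID, so leave it unchanged.
        return value
    # Raises KeyError for names missing from the table.
    return LIFE_STAGE_IDS[value]
```

Raising an error for unknown names, rather than passing them through, makes it harder for old-style names to slip into new data unnoticed.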


Model Changes:
------------------------------------

In this release there are changes to a number of classes:

?Strain
?RNAi
?Life_stage
#Interactor_info
?Interaction
?Transgene
?Rearrangement
?PCR_product
?Anatomy_term
?Paper
?Process

More information and a human-readable diff can be found here:
http://wiki.wormbase.org/index.php/WS232_Models.wrm

For more info mail help@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
        e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
        this directory. You will need approximately 2-3 Gb of disk space.

3. Obtain and install a suitable acedb binary for your system
        (available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
        type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
        using xace.
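
Steps 1 and 2 above can also be scripted. The following Python sketch uses only the standard library; the function name unpack_release is hypothetical, and it assumes the database.*.tar.gz files have already been downloaded into a local directory from a trusted source.

```python
import glob
import os
import tarfile


def unpack_release(archive_dir, target_dir):
    """Unpack every database.*.tar.gz in archive_dir into target_dir."""
    # Step 1: create the directory that will hold the database.
    os.makedirs(target_dir, exist_ok=True)

    # Step 2: unpack each archive into that directory.
    pattern = os.path.join(archive_dir, "database.*.tar.gz")
    archives = sorted(glob.glob(pattern))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)

    # Return the list of archives that were unpacked.
    return archives
```

For example, unpack_release("/users/yourname/downloads", "/users/yourname/wormbase") would leave the database ready for step 4, where the downloads path is an assumed location.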

____________  END _____________