WS231

From WormBaseWiki
Revision as of 15:51, 29 March 2012 by Gwilliams
New release of WormBase WS231

WS231 was built by gw3
-===================================================================================-
The WS231 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES) with the following files:
     - G_SPECIES.WS231.genomic.fa.gz                - Unmasked genomic DNA
     - G_SPECIES.WS231.genomic_masked.fa.gz         - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.WS231.genomic_softmasked.fa.gz     - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.WS231.protein.fa.gz                - Current live protein set
     - G_SPECIES.WS231.cds_transcripts.fa.gz        - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.WS231.ncrna_transcripts.fa.gz      - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.WS231.intergenic_sequences.fa.gz   - DNA sequence between pairs of adjacent genes
     - G_SPECIES.WS231.annotations.gff[2|3].gz      - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.WS231.ests.fa.gz                   - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.WS231.best_blastp_hits.txt.gz      - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
     - G_SPECIES.WS231.*pep_package.tar.gz          - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.WS231.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST &/or cDNA
        - G_SPECIES.WS231.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS231.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS231.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS231.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.WS231.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.WS231.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS231.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS231          - the latest database schema (also in above database files)
     - WS231-WS230.dbcomp   - log file reporting difference from last release
     - *Non_C_elegans_BLASTX/          - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
COMPARATIVE_ANALYSIS DIR - comparative analysis files
     - compara.WS231.tar.bz2     - gene-tree and alignment GFF files
     - wormpep_clw.WS231.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO ontology files (phenotype, GO, anatomy) and their gene association files
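
The difference between the soft-masked and hard-masked genomic files listed above can be illustrated with a short Python sketch. It assumes only what the listing states (repeats are lower-cased in the soft-masked file and replaced with Ns in the hard-masked file). Whether the distributed hard-masked file is produced exactly this way is an assumption, and the function names are illustrative only.

```python
import gzip


def hard_mask(sequence: str) -> str:
    """Replace every lower-case (soft-masked) base with N; leave other characters alone."""
    return "".join("N" if base.islower() else base for base in sequence)


def hard_mask_fasta(lines):
    """Yield FASTA lines with sequence lines hard-masked; header lines pass through unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else hard_mask(line)


def hard_mask_file(in_path: str, out_path: str) -> None:
    """Read a gzipped soft-masked FASTA file and write a gzipped hard-masked copy."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in hard_mask_fasta(src):
            dst.write(line + "\n")
```

For example, hard_mask_file("G_SPECIES.WS231.genomic_softmasked.fa.gz", "my_masked.fa.gz") would write a hard-masked copy, where G_SPECIES stands for the species prefix used in the listing and the output name is arbitrary.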


Release notes on the web:

http://www.wormbase.org/wiki/index.php/Release_Schedule

C. elegans Synchronisation with GenBank / EMBL:
No synchronisation issues

C. elegans Chromosomal Changes:
There are no changes to the chromosome sequences in this release.

C. elegans Gene data set (Live C. elegans genes 47528)
Molecular_info               45878 (96.5%)
Concise_description           5952 (12.5%)
Reference                    14424 (30.3%)
WormBase_approved_Gene_name  26726 (56.2%)
RNAi_result                  24687 (51.9%)
Microarray_results           23985 (50.5%)
SAGE_transcript              19204 (40.4%)

C. elegans Wormpep data set:
There are 25848 CDSs, from 20513 protein-coding genes
The 25848 sequences contain 34300239 base pairs in total.

Modified entries      167
Deleted entries       108
New entries           322
Reappeared entries      3
Net change           +217

C. elegans Genome sequence composition:
          WS231        WS230        change
a         32367418     32367418     +0
c         17780787     17780787     +0
g         17756985     17756985     +0
t         32367086     32367086     +0
n         0            0            +0
-         0            0            +0
Total     100272276    100272276    +0

Pristionchus pacificus Genome sequence composition:
172773083 total
a 43813958
c 32811034
g 32828589
t 43810996
- 0
n 19508506

Caenorhabditis remanei Genome sequence composition:
145500347 total
a 42927857
c 26293828
g 26276020
t 42923178
- 0
n 7079464

Caenorhabditis japonica Genome sequence composition:
166565019 total
a 46865690
c 30244493
g 30234317
t 46807519
- 0
n 12413000

Caenorhabditis briggsae Genome sequence composition:
108419768 total
a 32984239
c 19684682
g 19693545
t 33054090
- 62600
n 2940612

Caenorhabditis brenneri Genome sequence composition:
190421492 total
a 52222485
c 32837458
g 32882838
t 52164077
- 0
n 20314634

Tier II Gene counts
pristionchus Gene count 24216 (Coding 24216)
remanei      Gene count 32431 (Coding 31471)
japonica     Gene count 29962 (Coding 29962)
briggsae     Gene count 23027 (Coding 21936)
brenneri     Genes count 32284 (Coding 30667)
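
The genome composition figures above can be cross-checked with a few lines of Python. This is a minimal sketch: the counts are copied from this release letter, while the helper name gc_fraction and the choice to compute GC content over called bases only (a, c, g, t, ignoring n and -) are assumptions made for illustration.

```python
def gc_fraction(counts: dict) -> float:
    """Return (g + c) / (a + c + g + t), ignoring 'n' and '-' positions."""
    called = sum(counts[base] for base in "acgt")
    return (counts["g"] + counts["c"]) / called


# Base counts as reported in this release letter.
c_elegans = {"a": 32367418, "c": 17780787, "g": 17756985,
             "t": 32367086, "n": 0, "-": 0}
p_pacificus = {"a": 43813958, "c": 32811034, "g": 32828589,
               "t": 43810996, "n": 19508506, "-": 0}

# The per-base counts should add up to the reported totals.
c_elegans_total = sum(c_elegans.values())
p_pacificus_total = sum(p_pacificus.values())

print(f"C. elegans total {c_elegans_total}, GC {gc_fraction(c_elegans):.4f}")
print(f"P. pacificus total {p_pacificus_total}, GC {gc_fraction(p_pacificus):.4f}")
```

Excluding n from the denominator matters for the Tier II assemblies, where a sizeable share of positions are uncalled.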

Pristionchus pacificus Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed              229 (0.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed   4982 (20.6%)  Some, but not all exon bases are covered by transcript evidence
  Predicted            19006 (78.5%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
  Pristionchus pacificus entries with WormBase-approved Gene name 3226

Caenorhabditis remanei Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed              956 (3.0%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed   5662 (18.0%)  Some, but not all exon bases are covered by transcript evidence
  Predicted            24858 (79.0%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
  Caenorhabditis remanei entries with WormBase-approved Gene name 5928

Caenorhabditis japonica Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed              176 (0.5%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed    578 (1.6%)   Some, but not all exon bases are covered by transcript evidence
  Predicted            35351 (97.9%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
  Caenorhabditis japonica entries with WormBase-approved Gene name 4899

Caenorhabditis briggsae Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed               54 (0.2%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed    859 (3.9%)   Some, but not all exon bases are covered by transcript evidence
  Predicted            21048 (95.8%)  No transcriptional evidence at all

Status of entries: Protein Accessions
  UniProtKB accessions 21662 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
  Caenorhabditis briggsae entries with WormBase-approved Gene name 6001

Caenorhabditis brenneri Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed             1510 (4.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed   5638 (18.4%)  Some, but not all exon bases are covered by transcript evidence
  Predicted            23522 (76.7%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
  Caenorhabditis brenneri entries with WormBase-approved Gene name 3443

Caenorhabditis elegans Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed            12388 (47.9%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed  11403 (44.1%)  Some, but not all exon bases are covered by transcript evidence
  Predicted             2057 (8.0%)   No transcriptional evidence at all

Status of entries: Protein Accessions
  UniProtKB accessions 25577 (99.0%)

Status of entries: Protein_ID's in EMBL
  Protein_id 25836 (100.0%)

Gene <-> CDS,Transcript,Pseudogene connections
  Caenorhabditis elegans entries with WormBase-approved Gene name 25124

C. elegans Operons Stats
Description: These exist as closely spaced gene clusters similar to bacterial operons
| Live Operons     | 1390 |
| Genes in Operons | 3634 |

GO Annotation Stats WS231
GO_codes - used for assigning evidence
  IC   Inferred by Curator
  IDA  Inferred from Direct Assay
  IEA  Inferred from Electronic Annotation
  IEP  Inferred from Expression Pattern
  IGI  Inferred from Genetic Interaction
  IMP  Inferred from Mutant Phenotype
  IPI  Inferred from Physical Interaction
  ISS  Inferred from Sequence (or Structural) Similarity
  NAS  Non-traceable Author Statement
  ND   No Biological Data available
  RCA  Inferred from Reviewed Computational Analysis
  TAS  Traceable Author Statement

Total number of Gene::GO connections: 295368

Genes Stats:
  Genes with GO_term connections   95547
    IEA GO_code present            89448
    non-IEA GO_code present         6095
  Source of the mapping data
    Source: *RNAi (GFF mapping overlaps)    26056
            *citace                          2541
            *Inherited (motif & phenotype)  15097

GO_terms Stats:
  Total No. GO_terms                      30601
  GO_terms connected to Genes              3563
  GO annotations connected with IEA        1817
  GO annotations connected with non-IEA    1730
    Breakdown  IC - 6  IDA - 498  ISS - 152  IEP - 11  IGI - 146  IMP - 811
               IPI - 85  NAS - 1  ND - 1  RCA - 0  TAS - 18
-===================================================================================-
Useful Stats:
Genes with Sequence and WormBase-approved Gene names
WS231 48621 (25124 elegans / 6001 briggsae / 5928 remanei / 4899 japonica / 3443 brenneri / 3226 pristionchus)
-===================================================================================-
New Data:
eggNOG
======
eggNOG has been updated to Version 3
http://eggnog.embl.de/version_3.0/
eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations), with functional categories (i.e. derived from the original COG/KOG categories).

Data from the Andy Fire lab
===========================
The Lamm et al. (2011) paper describes three techniques for generating RNASeq reads. As part of their work they find TSL sites and polyA sites. They also find reads that are part of the polysome fraction of RNAs - these are regions that are being actively translated as well as just transcribed.

The data was downloaded from GEO:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22410

Lamm AT, Stadler MR, Zhang H, Gent JI et al. "Multimodal RNA-seq using single-strand, double-strand, and CircLigase-based capture yields a refined and extended description of the C. elegans transcriptome." Genome Res 2011 Feb;21(2):265-75. PMID: 21177965

No of new SL1 features:          504
No of new SL2 features:           36
No of new polyA_site features: 10936
There is a new 'polysome' set of FeatureData objects.

modENCODE Aggregate CDS structures
==================================
We have updated the modENCODE Aggregate CDS structures to version "AG1110.v1201"

Heterorhabditis bacteriophora
=============================
The WormBase H.bacteriophora data (genome assembly and gene structure annotation) has been updated to the latest (February 2012) INSDC submission of the H.bacteriophora annotation project.

BLAST databases
===============
The following blast databases were updated to the latest versions for this Build:
  gadfly
  trembl
  swissprot
  wormpep

Genome sequence updates:
New Fixes:
Known Problems:
Other Changes:
We are changing the GFF source for the nGASP prediction sets for WS231 (to clearly disambiguate them from other runs of the same programs, which we also dump). The old->new source map is:
  AUGUSTUS => nGASP_AUGUSTUS
  FGENESH  => nGASP_FGENESH
  mGENE    => nGASP_mGENE
  nGASP    => nGASP_jigsaw

Proposed Changes / Forthcoming Data:
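
The nGASP source renaming listed under Other Changes can be expressed as a lookup table. The Python sketch below is one possible way to bring the source column (column 2) of nGASP lines taken from a pre-WS231 GFF file in line with the new names. The mapping itself is taken from this release letter, while the function name and the assumption that the input is tab-delimited GFF are illustrative only. It should not be run on WS231 or later files, where an unprefixed source such as AUGUSTUS may refer to a different run of the same program.

```python
# Old -> new GFF source names for the nGASP prediction sets (from this release letter).
SOURCE_MAP = {
    "AUGUSTUS": "nGASP_AUGUSTUS",
    "FGENESH": "nGASP_FGENESH",
    "mGENE": "nGASP_mGENE",
    "nGASP": "nGASP_jigsaw",
}


def rename_source(line: str, source_map: dict = SOURCE_MAP) -> str:
    """Return a GFF line with its source column renamed if it matches an old name."""
    # Comment lines, directives, and blank lines are passed through untouched.
    if line.startswith("#") or not line.strip():
        return line
    fields = line.split("\t")
    if len(fields) > 1 and fields[1] in source_map:
        fields[1] = source_map[fields[1]]
    return "\t".join(fields)
```

The same dictionary can also be used the other way round, to update filters in existing scripts that select features by the old source names.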
Model Changes:
WS231 models
This cycle we see 6 model changes.
  i)   Unified Interaction model - major re-working of the Interaction class
       http://wiki.wormbase.org/index.php/WormBase_Model:Interaction
  ii)  Legacy data incorporation (Variation Phenotype_info)
  iii) New Homology type nemNOG
  iv)  Re-homing of Rescued_by_Transgene from Variation -> Phenotype_info
  v)   Removal of Accession_number XREF from the Homology_group
  vi)  ?Transposon, Corresponding_CDS and Corresponding_Pseudogene

More information and a human readable diff can be found here:
http://wiki.wormbase.org/index.php/WS231_Models.wrm

For more info mail help@wormbase.org
-===================================================================================-
Quick installation guide for UNIX/Linux systems
1. Create a new directory to contain your copy of WormBase,
   e.g. /users/yourname/wormbase
2. Unpack and untar all of the database.*.tar.gz files into this directory.
   You will need approximately 2-3 Gb of disk space.
3. Obtain and install a suitable acedb binary for your system
   (available from www.acedb.org).
4. Use the acedb 'xace' program to open your database,
   e.g. type 'xace /users/yourname/wormbase' at the command prompt.
5. See the acedb website for more information about acedb and using xace.

____________ END _____________ <\pre>