WS228
Release Notes
New release of WormBase WS228

WS228 was built by Paul Davis
-===================================================================================-
The WS228 build directory includes:

species/ DIR - contains a sub dir for each WormBase species (G_SPECIES) with the following files:
    - G_SPECIES.WS228.genomic.fa.gz              - Unmasked genomic DNA
    - G_SPECIES.WS228.genomic_masked.fa.gz       - Hard-masked (repeats replaced with Ns) genomic DNA
    - G_SPECIES.WS228.genomic_softmasked.fa.gz   - Soft-masked (repeats lower-cased) genomic DNA
    - G_SPECIES.WS228.protein.fa.gz              - Current live protein set
    - G_SPECIES.WS228.cds_transcripts.fa.gz      - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
    - G_SPECIES.WS228.ncrna_transcripts.fa.gz    - Spliced cDNA sequence for non-coding RNA transcripts
    - G_SPECIES.WS228.intergenic_sequences.fa.gz - DNA sequence between pairs of adjacent genes
    - G_SPECIES.WS228.annotations.gff[2|3].gz    - Sequence features in either GFF2 or GFF3 format
    - G_SPECIES.WS228.ests.fa.gz                 - ESTs and mRNA sequences extracted from the public databases
    - G_SPECIES.WS228.best_blastp_hits.txt.gz    - Best blastp matches to human, fly, yeast, and non-WormBase UniProt proteins
    - G_SPECIES.WS228.*pep_package.tar.gz        - Latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
    - annotation/ - contains additional annotations:
        - G_SPECIES.WS228.confirmed_genes.txt.gz             - DNA sequences of all genes confirmed by EST and/or cDNA
        - G_SPECIES.WS228.cDNA2orf.txt.gz                    - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS228.geneIDs.txt.gz                     - List of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS228.PCR_product2gene.txt.gz            - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS228.*oligo_mapping.txt.gz              - Oligo array mapping files
        - G_SPECIES.WS228.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles
        - G_SPECIES.WS228.SRA_gene_expression.tar.gz         - Tables of gene expression values computed from SRA RNASeq data

acedb DIR - Everything needed to generate a local copy of the primary database
    - database.WS228.*.tar.gz  - Compressed acedb database for new release
    - models.wrm.WS228         - The latest database schema (also in above database files)
    - WS228-WS227.dbcomp       - Log file reporting differences from last release
    - *Non_C_elegans_BLASTX/   - This directory contains the blastx data for non-elegans species
                                 (reduces the size of the main database)

COMPARATIVE_ANALYSIS DIR - comparative analysis files
    - compara.WS228.tar.bz2     - Gene-tree and alignment GFF files
    - wormpep_clw.WS228.sql.bz2 - ClustalW protein multiple alignments

ONTOLOGY DIR - gene_associations, obo files for (phenotype GO anatomy) and associated association files


Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule


C. elegans Synchronisation with GenBank / EMBL:
------------------------------------
No synchronisation issues


C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C. elegans genes 47455)
------------------------------------------
Molecular_info               45796 (96.5%)
Concise_description           5849 (12.3%)
Reference                    14229 (30%)
WormBase_approved_Gene_name  26336 (55.5%)
RNAi_result                  24670 (52%)
Microarray_results           23978 (50.5%)
SAGE_transcript              19180 (40.4%)


C. elegans Wormpep data set:
----------------------------
There are 25391 CDSs, from 20484 protein-coding genes
The 25391 sequences contain base pairs in total.

Modified entries      266
Deleted entries       105
New entries           252
Reappeared entries      5
Net change           +152


C. elegans Genome sequence composition:
----------------------------
          WS228        WS227     change
----------------------------------------------
a      32367418     32367418         +0
c      17780787     17780787         +0
g      17756985     17756985         +0
t      32367086     32367086         +0
n             0            0         +0
-             0            0         +0

Total 100272276    100272276         +0


Genome sequence composition of the other species:
----------------------------
                                total          a          c          g          t    -          n
Pristionchus pacificus      172773083   43813958   32811034   32828589   43810996    0   19508506
Caenorhabditis remanei      145500347   42927857   26293828   26276020   42923178    0    7079464
Caenorhabditis japonica     166565019   46865690   30244493   30234317   46807519    0   12413000
Caenorhabditis briggsae     108419768   32984239   19684682   19693545   33054090    0    3003212
Caenorhabditis brenneri     190421492   52222485   32837458   32882838   52164077    0   20314634


Tier II Gene counts
---------------------------------------------
pristionchus  Gene count 24216 (Coding 24216)
remanei       Gene count 32431 (Coding 31471)
japonica      Gene count 29962 (Coding 29962)
briggsae      Gene count 23050 (Coding 21962)
brenneri      Gene count 32257 (Coding 30667)
---------------------------------------------


Protein Stats
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)

    Confirmed            Every base of every exon has transcription evidence (mRNA, EST etc.)
    Partially_confirmed  Some, but not all exon bases are covered by transcript evidence
    Predicted            No transcriptional evidence at all

-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Confirmed               229 (0.9%)
Partially_confirmed    4982 (20.6%)
Predicted             19006 (78.5%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name  3169

-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Confirmed               956 (3.0%)
Partially_confirmed    5662 (18.0%)
Predicted             24858 (79.0%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name  5681

-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Confirmed               176 (0.5%)
Partially_confirmed     578 (1.6%)
Predicted             35351 (97.9%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name  4682

-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Confirmed                53 (0.2%)
Partially_confirmed     853 (3.9%)
Predicted             21080 (95.9%)

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  21682 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name  5739

-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Confirmed              1510 (4.9%)
Partially_confirmed    5638 (18.4%)
Predicted             23522 (76.7%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name  3260

-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Confirmed             12126 (47.8%)
Partially_confirmed   11232 (44.2%)
Predicted              2033 (8.0%)

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions  24610 (96.9%)

Status of entries: Protein_IDs in EMBL
---------------------------------------
Protein_id            25168 (99.1%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name  24724


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1254                  |
| Genes in Operons    3354                  |
---------------------------------------------


GO Annotation Stats WS228
--------------------------------------
GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections: 294345

Genes Stats:
----------------
Genes with GO_term connections  95532
    IEA GO_code present         89490
    non-IEA GO_code present      6038

Source of the mapping data
Source:
    *RNAi (GFF mapping overlaps)      25676
    *citace                            2426
    *Inherited (motif & phenotype)    15087

GO_terms Stats:
---------------
Total No. GO_terms                      30550
GO_terms connected to Genes              3384
GO annotations connected with IEA        1766
GO annotations connected with non-IEA    1608
    Breakdown
        IC  - 3
        IDA - 435
        ISS - 147
        IEP - 10
        IGI - 137
        IMP - 777
        IPI - 77
        NAS - 2
        ND  - 1
        RCA - 0
        TAS - 18

-===================================================================================-

Useful Stats:
---------
Genes with Sequence and WormBase-approved Gene names
WS228 47255 (24724 elegans / 5739 briggsae / 5681 remanei / 4682 japonica / 3260 brenneri / 3169 pristionchus)

-===================================================================================-

New Data:
---------
Variation_data - Over 65,000 variations identified by the Itai Yanai lab were added to this release.
They were published in: Hryshkevich U, Hashimshony T, Yanai I. Core promoter T-blocks correlate
with gene expression levels in C. elegans. Genome Res. 2011. PMID 21367940.

Genome sequence updates:
-----------------------

New Fixes:
----------

Known Problems:
---------------

Other Changes:
--------------

Proposed Changes / Forthcoming Data:
-------------------------------------

Model Changes:
------------------------------------
WS228 models v 1.2

This cycle we see 5 model changes.

    #Homology type - COG - proposed by Michael
    ?CDS - Start_not_found <UNIQUE> int - proposed by Michael
    ?Transcript - u21RNA -> piRNA conversion - proposed by Gary
        Has implications for StLouis sequence database.
    ?Expression_cluster - proposed by Wen
    ?Variation - Readthrough Text #Evidence - proposed by Kevin, with a corresponding change in the #Molecular_change hash.

More information and a human-readable diff can be found here:
http://wiki.wormbase.org/index.php/WS228_Models.wrm

For more info mail hinxton@wormbase.org
-===================================================================================-

Quick installation guide for UNIX/Linux systems
-----------------------------------------------
1. Create a new directory to contain your copy of WormBase,
   e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
   this directory. You will need approximately 2-3 GB of disk space.

3. Obtain and install a suitable acedb binary for your system
   (available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
   type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
   using xace.

____________ END _____________
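Steps 1 and 2 of the quick installation guide can also be scripted. The following is only a minimal sketch using Python's standard tarfile module. The function name unpack_release and the example paths are hypothetical, and it assumes the database.*.tar.gz files have already been downloaded into a single local directory.

```python
import glob
import os
import tarfile


def unpack_release(archive_dir, target_dir, pattern="database.*.tar.gz"):
    """Extract every archive in archive_dir matching pattern into target_dir.

    Hypothetical helper, not part of WormBase or acedb.
    Returns the sorted list of archive paths that were unpacked.
    """
    # Step 1: create the directory that will hold the local database.
    os.makedirs(target_dir, exist_ok=True)

    # Step 2: unpack each database archive into that directory.
    archives = sorted(glob.glob(os.path.join(archive_dir, pattern)))
    for path in archives:
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(path, "r:gz") as tar:
            if hasattr(tarfile, "data_filter"):
                # Newer Python versions: refuse unsafe archive members.
                tar.extractall(target_dir, filter="data")
            else:
                tar.extractall(target_dir)
    return archives


if __name__ == "__main__":
    # Example paths only; adjust to where the release files were saved.
    unpacked = unpack_release("/users/yourname/downloads/WS228/acedb",
                              "/users/yourname/wormbase")
    print("Unpacked %d archive(s)" % len(unpacked))
```

Once the archives are unpacked, the database can be opened as described in step 4, e.g. 'xace /users/yourname/wormbase'.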
Bug Fixes
eggNOG
The eggNOG data missing from this release can be found here: ftp://ftp.sanger.ac.uk/pub2/wormbase/releases/WS228/patches/eggNOG.ace