WS234
Release Notes
New release of WormBase WS234

WS234 was built by Paul Davis
-===================================================================================-

The WS234 build directory includes:

species/ DIR - contains a sub dir for each WormBase species (G_SPECIES) with the following files:
  - G_SPECIES.WS234.genomic.fa.gz - Unmasked genomic DNA
  - G_SPECIES.WS234.genomic_masked.fa.gz - Hard-masked (repeats replaced with Ns) genomic DNA
  - G_SPECIES.WS234.genomic_softmasked.fa.gz - Soft-masked (repeats lower-cased) genomic DNA
  - G_SPECIES.WS234.protein.fa.gz - Current live protein set
  - G_SPECIES.WS234.cds_transcripts.fa.gz - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
  - G_SPECIES.WS234.ncrna_transcripts.fa.gz - Spliced cDNA sequence for non-coding RNA transcripts
  - G_SPECIES.WS234.intergenic_sequences.fa.gz - DNA sequence between pairs of adjacent genes
  - G_SPECIES.WS234.annotations.gff[2|3].gz - Sequence features in either GFF2 or GFF3 format
  - G_SPECIES.WS234.ests.fa.gz - ESTs and mRNA sequences extracted from the public databases
  - G_SPECIES.WS234.best_blastp_hits.txt.gz - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
  - G_SPECIES.WS234.*pep_package.tar.gz - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
  - annotation/ - contains additional annotations:
    - G_SPECIES.WS234.confirmed_genes.txt.gz - DNA sequences of all genes confirmed by EST &/or cDNA
    - G_SPECIES.WS234.cDNA2orf.txt.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
    - G_SPECIES.WS234.geneIDs.txt.gz - list of all current gene identifiers with CGC & molecular names (when known)
    - G_SPECIES.WS234.PCR_product2gene.txt.gz - Mappings between PCR products and overlapping Genes
    - G_SPECIES.WS234.*oligo_mapping.txt.gz - Oligo array mapping files
    - G_SPECIES.WS234.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles
    - G_SPECIES.WS234.SRA_gene_expression.tar.gz - Tables of gene expression
      values computed from SRA RNASeq data

acedb DIR - Everything needed to generate a local copy of the primary database
  - database.WS234.*.tar.gz - compressed acedb database for new release
  - models.wrm.WS234 - the latest database schema (also in above database files)
  - WS234-WS233.dbcomp - log file reporting differences from last release
  - *Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species (reduces the size of the main database)

COMPARATIVE_ANALYSIS DIR - comparative analysis files
  - compara.WS234.tar.bz2 - gene-tree and alignment GFF files
  - wormpep_clw.WS234.sql.bz2 - ClustalW protein multiple alignments

ONTOLOGY DIR - gene_associations, obo files for (phenotype GO anatomy) and associated association files

Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule

C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.

C. elegans Gene data set (Live C. elegans genes 48240)
------------------------------------------
Molecular_info               46605 (96.6%)
Concise_description           5984 (12.4%)
Human_disease_relevance         92 (0.2%)
Reference                    17243 (35.7%)
WormBase_approved_Gene_name  27151 (56.3%)
RNAi_result                  25027 (51.9%)
Microarray_results           24095 (49.9%)
SAGE_transcript              19220 (39.8%)

C. elegans Wormpep data set:
----------------------------
There are 26041 CDSs, from 20537 protein-coding genes
The 26041 sequences contain 34594587 base pairs in total.

Modified entries       30
Deleted entries        66
New entries            96
Reappeared entries      3
Net change            +33

The difference (30) between the total CDS's of this (26041) and the last build (26011) does not equal the net change 33
Please investigate!

C. elegans Genome sequence composition:
----------------------------
        WS234       WS233       change
----------------------------------------------
a       32367418    32367418    +0
c       17780787    17780787    +0
g       17756985    17756985    +0
t       32367086    32367086    +0
n       0           0           +0
-       0           0           +0

Total   100272276   100272276   +0

Pristionchus pacificus Genome sequence composition:
----------------------------
172773083 total
a 43813958
c 32811034
g 32828589
t 43810996
- 0
n 19508506

Caenorhabditis remanei Genome sequence composition:
----------------------------
145500347 total
a 42927857
c 26293828
g 26276020
t 42923178
- 0
n 7079464

Caenorhabditis japonica Genome sequence composition:
----------------------------
166565019 total
a 46865690
c 30244493
g 30234317
t 46807519
- 0
n 12413000

Caenorhabditis briggsae Genome sequence composition:
----------------------------
108419768 total
a 32984239
c 19684682
g 19693545
t 33054090
- 62600
n 2940612

Caenorhabditis brenneri Genome sequence composition:
----------------------------
190421492 total
a 52222485
c 32837458
g 32882838
t 52164077
- 0
n 20314634

Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei      Gene count 32414 (Coding 31445)
japonica     Gene count 29964 (Coding 29964)
briggsae     Gene count 23026 (Coding 21936)
brenneri     Gene count 32360 (Coding 30770)
---------------------------------------------

-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------

Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              229 (0.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   4982 (20.6%)  Some, but not all exon bases are covered by transcript evidence
Predicted            19006 (78.5%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name 3272

-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------

Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              961 (3.1%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5658 (18.0%)  Some, but not all exon bases are covered by transcript evidence
Predicted            24831 (79.0%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name 6045

-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------

Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             1637 (4.5%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5200 (14.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted            29197 (81.0%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name 5002

-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------

Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               54 (0.2%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    859 (3.9%)   Some, but not all exon bases are covered by transcript evidence
Predicted            21048 (95.8%)  No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 21662 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name 6135

-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------

Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             1565 (5.1%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5669 (18.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted            23545 (76.5%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name 3535

-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------

Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed            12476 (47.9%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  11494 (44.1%)  Some, but not all exon bases are covered by transcript evidence
Predicted             2071 (8.0%)   No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 25921 (99.5%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id 26029 (100.0%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name 25561

C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1390                  |
| Genes in Operons    3632                  |
---------------------------------------------

GO Annotation Stats WS234
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC   Inferred by Curator
IDA  Inferred from Direct Assay
IEA  Inferred from Electronic Annotation
IEP  Inferred from Expression Pattern
IGI  Inferred from Genetic Interaction
IMP  Inferred from Mutant Phenotype
IPI  Inferred from Physical Interaction
ISS  Inferred from Sequence (or Structural) Similarity
NAS  Non-traceable Author Statement
ND   No Biological Data available
RCA  Inferred from Reviewed Computational Analysis
TAS  Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections: 264984

Genes Stats:
----------------
Genes with GO_term connections  91253
  IEA GO_code present           84952
  non-IEA GO_code present        6297

Source of the mapping data
Source: *RNAi (GFF mapping overlaps)    26727
        *citace                          2790
        *Inherited (motif & phenotype)  15238

GO_terms Stats:
---------------
Total No. GO_terms                      30642
GO_terms connected to Genes              3642
GO annotations connected with IEA        1855
GO annotations connected with non-IEA    1762
  Breakdown
    IC  - 6
    IDA - 523
    ISS - 151
    IEP - 11
    IGI - 145
    IMP - 822
    IPI - 83
    NAS - 1
    ND  - 1
    RCA - 0
    TAS - 17

-===================================================================================-

Useful Stats:
---------
Genes with Sequence and WormBase-approved Gene names
WS234 49550 (25561 elegans / 6135 briggsae / 6045 remanei / 5002 japonica / 3535 brenneri / 3272 pristionchus)

-===================================================================================-

New Data:
---------
* OMIM Human Disease data derived from Human protein orthology.
  Data has been automatically promoted to the level of the gene from the protein orthologs that are calculated during this build.
  This will allow human disease associations to be displayed and accessed easily on a large scale while manual curation of this data ramps up.

Genome sequence updates:
-----------------------

New Fixes:
----------

Known Problems:
---------------

Other Changes:
--------------

Proposed Changes / Forthcoming Data:
-------------------------------------
We plan to correct 77 indel errors in the C. elegans genome that affect the structure of coding genes for the WS235 release of WormBase.
Full details can be pulled from the previous release letter, but this is new data from the following projects:
  - Weber KP et al. (2010) PLoS One "Whole genome sequencing highlights genetic changes associated with laboratory ...." PMID 21085631
  - Doitsidou M et al. (2010) PLoS One "C. elegans mutant identification with a one-step whole-genome-sequencing and ...." PMID 21079745
  - McGrath PT et al. (2011) Nature "Parallel evolution of domesticated Caenorhabditis species targets pheromone ...." PMID 21849976

Model Changes:
------------------------------------
WS234 models
* This cycle we see 3 simple model changes.

?Variation ?Laboratory
  Remove the XREF connection between these 2 classes, as it causes performance issues.
  The 2 way connection can still be made, but not for very large projects.

?Transgene
  Add a Public_name field.

?Picture
  Add a Unique species field.

More information and a human readable diff can be found here:
http://wiki.wormbase.org/index.php/WS234_Models.wrm

For more info mail help@wormbase.org
-===================================================================================-

Quick installation guide for UNIX/Linux systems
-----------------------------------------------
1. Create a new directory to contain your copy of WormBase,
   e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into this directory.
   You will need approximately 2-3 Gb of disk space.

3. Obtain and install a suitable acedb binary for your system
   (available from www.acedb.org).

4. Use the acedb 'xace' program to open your database,
   e.g. type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and using xace.

____________ END _____________
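
Steps 1 and 2 of the quick installation guide can also be scripted. The following is a minimal Python sketch, not part of the WormBase build tooling: the function name unpack_release and the download directory used in the usage note are hypothetical, and it assumes the database.*.tar.gz archives have already been downloaded into a single local directory.

```python
import glob
import os
import tarfile


def unpack_release(archive_dir, target_dir, pattern="database.*.tar.gz"):
    """Extract every archive in archive_dir matching pattern into target_dir.

    Returns the base names of the archives that were unpacked, in sorted order.
    """
    # Step 1 of the guide: create the directory that will hold the database.
    os.makedirs(target_dir, exist_ok=True)

    extracted = []
    # Step 2 of the guide: unpack and untar each database.*.tar.gz file.
    for path in sorted(glob.glob(os.path.join(archive_dir, pattern))):
        with tarfile.open(path, "r:gz") as archive:
            archive.extractall(target_dir)
        extracted.append(os.path.basename(path))
    return extracted
```

For example, unpack_release("/users/yourname/downloads", "/users/yourname/wormbase") would leave the unpacked database in the directory that step 4 passes to xace. Only extract archives obtained from a trusted source, since extraction writes whatever paths the archive contains.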