WS235
Release Notes
New release of WormBase WS235

WS235 was built by klh
-===================================================================================-

The WS235 build directory includes:

species/ DIR - contains a sub dir for each WormBase species (G_SPECIES) with the following files:
  - G_SPECIES.WS235.genomic.fa.gz - Unmasked genomic DNA
  - G_SPECIES.WS235.genomic_masked.fa.gz - Hard-masked (repeats replaced with Ns) genomic DNA
  - G_SPECIES.WS235.genomic_softmasked.fa.gz - Soft-masked (repeats lower-cased) genomic DNA
  - G_SPECIES.WS235.protein.fa.gz - Current live protein set
  - G_SPECIES.WS235.cds_transcripts.fa.gz - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
  - G_SPECIES.WS235.ncrna_transcripts.fa.gz - Spliced cDNA sequence for non-coding RNA transcripts
  - G_SPECIES.WS235.intergenic_sequences.fa.gz - DNA sequence between pairs of adjacent genes
  - G_SPECIES.WS235.annotations.gff[2|3].gz - Sequence features in either GFF2 or GFF3 format
  - G_SPECIES.WS235.ests.fa.gz - ESTs and mRNA sequences extracted from the public databases
  - G_SPECIES.WS235.best_blastp_hits.txt.gz - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
  - G_SPECIES.WS235.*pep_package.tar.gz - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
  - annotation/ - contains additional annotations:
    - G_SPECIES.WS235.confirmed_genes.txt.gz - DNA sequences of all genes confirmed by EST &/or cDNA
    - G_SPECIES.WS235.cDNA2orf.txt.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
    - G_SPECIES.WS235.geneIDs.txt.gz - list of all current gene identifiers with CGC & molecular names (when known)
    - G_SPECIES.WS235.PCR_product2gene.txt.gz - Mappings between PCR products and overlapping Genes
    - G_SPECIES.WS235.*oligo_mapping.txt.gz - Oligo array mapping files
    - G_SPECIES.WS235.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles
    - G_SPECIES.WS235.SRA_gene_expression.tar.gz - Tables of gene expression values
      computed from SRA RNASeq data

acedb DIR - Everything needed to generate a local copy of the primary database
  - database.WS235.*.tar.gz - compressed acedb database for new release
  - models.wrm.WS235 - the latest database schema (also in above database files)
  - WS235-WS234.dbcomp - log file reporting difference from last release
  - *Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species (reduces the size of the main database)

COMPARATIVE_ANALYSIS DIR - comparative analysis files
  - compara.WS235.tar.bz2 - gene-tree and alignment GFF files
  - wormpep_clw.WS235.sql.bz2 - ClustalW protein multiple alignments

ONTOLOGY DIR - gene_associations, obo files for (phenotype GO anatomy) and associated association files

Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule

C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.

C. elegans Gene data set (Live C. elegans genes 48231)
------------------------------------------
Molecular_info               46597 (96.6%)
Concise_description           5995 (12.4%)
Human_disease_relevance         92 (0.2%)
Reference                    17244 (35.8%)
WormBase_approved_Gene_name  27199 (56.4%)
RNAi_result                  24995 (51.8%)
Microarray_results           24082 (49.9%)
SAGE_transcript              19204 (39.8%)

C. elegans Wormpep data set:
----------------------------
There are 26107 CDSs, from 20532 protein-coding genes
The 26107 sequences contain 34723014 base pairs in total.

Modified entries      197
Deleted entries        54
New entries           120
Reappeared entries      6
Net change            +72

The difference (66) between the total CDSs of this (26107) and the last build (26041) does not equal the net change 72
Please investigate!

C. elegans Genome sequence composition:
----------------------------
        WS235        WS234        change
----------------------------------------------
a       32367475     32367418     +57
c       17780880     17780787     +93
g       17757087     17756985     +102
t       32367165     32367086     +79
n       0            0            +0
-       0            0            +0

Total   100272607    100272276    +331

Total number of bases has increased - please investigate!

Pristionchus pacificus Genome sequence composition:
----------------------------
172773083 total
a 43813958
c 32811034
g 32828589
t 43810996
- 0
n 19508506

Caenorhabditis remanei Genome sequence composition:
----------------------------
145500347 total
a 42927857
c 26293828
g 26276020
t 42923178
- 0
n 7079464

Caenorhabditis japonica Genome sequence composition:
----------------------------
166565019 total
a 46865690
c 30244493
g 30234317
t 46807519
- 0
n 12413000

Caenorhabditis briggsae Genome sequence composition:
----------------------------
108419768 total
a 32984239
c 19684682
g 19693545
t 33054090
- 62600
n 2940612

Caenorhabditis brenneri Genome sequence composition:
----------------------------
190421492 total
a 52222485
c 32837458
g 32882838
t 52164077
- 0
n 20314634

Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei      Gene count 32414 (Coding 31445)
japonica     Gene count 29964 (Coding 29964)
briggsae     Gene count 23026 (Coding 21936)
brenneri     Gene count 32360 (Coding 30770)
---------------------------------------------

-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             229 (0.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  4982 (20.6%)  Some, but not all exon bases are covered by transcript evidence
Predicted           19006 (78.5%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name 3287

-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             961 (3.1%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  5658 (18.0%)  Some, but not all exon bases are covered by transcript evidence
Predicted           24831 (79.0%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name 6088

-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed            1637 (4.5%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  5200 (14.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted           29197 (81.0%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name 5036

-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              54 (0.2%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   859 (3.9%)   Some, but not all exon bases are covered by transcript evidence
Predicted           21048 (95.8%)  No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 21662 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name 6178

-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed            1565 (5.1%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  5669 (18.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted           23545 (76.5%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name 3566

-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed           12499 (47.9%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed 11539 (44.2%)  Some, but not all exon bases are covered by transcript evidence
Predicted            2069 (7.9%)   No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 25961 (99.4%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id 26095 (100.0%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name 25609

C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons       1390                   |
| Genes in Operons   3633                   |
---------------------------------------------

GO Annotation Stats WS235
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC   Inferred by Curator
IDA  Inferred from Direct Assay
IEA  Inferred from Electronic Annotation
IEP  Inferred from Expression Pattern
IGI  Inferred from Genetic Interaction
IMP  Inferred from Mutant Phenotype
IPI  Inferred from Physical Interaction
ISS  Inferred from Sequence (or Structural) Similarity
NAS  Non-traceable Author Statement
ND   No Biological Data available
RCA  Inferred from Reviewed Computational Analysis
TAS  Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections: 266374

Genes Stats:
----------------
Genes with GO_term connections 91425
  IEA GO_code present          85097
  non-IEA GO_code present       6322

Source of the mapping data
Source: *RNAi (GFF mapping overlaps)   27534
        *citace                         2861
        *Inherited (motif & phenotype) 15436

GO_terms Stats:
---------------
Total No. GO_terms                    30670
GO_terms connected to Genes            3703
GO annotations connected with IEA      1903
GO annotations connected with non-IEA  1769
  Breakdown
    IC  - 6
    IDA - 530
    ISS - 148
    IEP - 10
    IGI - 149
    IMP - 821
    IPI - 86
    NAS - 1
    ND  - 1
    RCA - 0
    TAS - 16

-===================================================================================-

Useful Stats:
---------
Genes with Sequence and WormBase-approved Gene names
WS235 49764 (25609 elegans / 6178 briggsae / 6088 remanei / 5036 japonica / 3566 brenneri / 3287 pristionchus)

-===================================================================================-

New Genomes
-----------
This release includes the genome of the "eye worm" Loa loa, causative agent of Loa loa filariasis.
Genome assembly and annotation were performed by the Filarial worms Sequencing Project of the Broad Institute of Harvard and MIT (http://www.broadinstitute.org/).

New Data
--------
We have incorporated nearly 44,000 SNPs identified in a study by Andersen et al. This study characterized C. elegans genetic variation using high-throughput selective sequencing of a worldwide collection of 200 wild strains.

1. Andersen EC, Gerke JP, Shapiro JA, Crissman JR, Ghosh R, Bloom JS, Felix MA, Kruglyak L. Nat Genet. 2012 Jan 29;44(3):285-90. doi: 10.1038/ng.1050. "Chromosome-scale selective sweeps shape Caenorhabditis elegans genomic diversity."

Genome sequence updates:
-----------------------
Many changes have been made to the reference genome sequence of C. elegans, using data from three studies (see references below). This resulted in 558 Insertions, 230 Deletions, and 614 Substitutions. There were 87 improvements to gene structures that were enabled by the sequence corrections; most of these were changes to correct a poor structure near a frameshift. This work is fully described in the following article on the WormBase BLOG:

http://blog.wormbase.org/2012/11/c-elegans-genome-reference-sequence-changes

References:

1. McGrath PT, Xu Y, Ailion M, Garrison JL, Butcher RA, Bargmann CI. Nature. 2011 Aug 17;477(7364):321-5. doi: 10.1038/nature10378. "Parallel evolution of domesticated Caenorhabditis species targets pheromone receptor genes."
2. Weber KP, De S, Kozarewa I, Turner DJ, Babu MM, de Bono M. PLoS One. 2010 Nov 11;5(11):e13922. "Whole genome sequencing highlights genetic changes associated with laboratory domestication of C. elegans."
3. Doitsidou M, Poole RJ, Sarin S, Bigelow H, Hobert O. PLoS One. 2010 Nov 8;5(11):e15435. "C. elegans mutant identification with a one-step whole-genome-sequencing and SNP mapping strategy."
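
The C. elegans composition table earlier in these notes reports 331 more bases than WS234. A minimal Python sketch for recomputing the per-base changes from the two columns of that table follows. The counts are copied from the table; the names `WS235`, `WS234`, and `composition_change` are illustrative and not part of any WormBase tool.

```python
# Base counts copied from the "C. elegans Genome sequence composition" table.
WS235 = {"a": 32367475, "c": 17780880, "g": 17757087, "t": 32367165, "n": 0, "-": 0}
WS234 = {"a": 32367418, "c": 17780787, "g": 17756985, "t": 32367086, "n": 0, "-": 0}


def composition_change(new, old):
    """Return the per-base difference (new minus old) for every base in `new`."""
    return {base: new[base] - old[base] for base in new}


deltas = composition_change(WS235, WS234)
total_change = sum(WS235.values()) - sum(WS234.values())

for base, delta in deltas.items():
    print(f"{base} {delta:+d}")
print(f"Total {total_change:+d}")
```

The per-base differences should sum to the change in the table's "Total" row. This checks only the arithmetic of the table and does not by itself show which of the insertions, deletions, or substitutions account for the change.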

Model Changes:
------------------------------------
Model changes for this release are documented here:

http://wiki.wormbase.org/index.php/WS235_Models.wrm

For more info mail help@wormbase.org
-===================================================================================-

Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase, e.g. /users/yourname/wormbase
2. Unpack and untar all of the database.*.tar.gz files into this directory. You will need approximately 2-3 Gb of disk space.
3. Obtain and install a suitable acedb binary for your system (available from www.acedb.org).
4. Use the acedb 'xace' program to open your database, e.g. type 'xace /users/yourname/wormbase' at the command prompt.
5. See the acedb website for more information about acedb and using xace.

____________ END _____________
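
Steps 1 and 2 of the quick installation guide can be scripted. The following is a minimal Python sketch, assuming the database.*.tar.gz files have already been downloaded into a local directory. The function name `unpack_release` and its two arguments are hypothetical and not part of acedb or WormBase.

```python
import glob
import os
import tarfile


def unpack_release(archive_dir, target_dir):
    """Extract every database.*.tar.gz archive in archive_dir into target_dir.

    Returns the sorted list of archive paths that were unpacked.
    """
    # Step 1: create the directory that will hold the local copy of WormBase.
    os.makedirs(target_dir, exist_ok=True)

    # Step 2: decompress and untar each database archive into that directory.
    pattern = os.path.join(archive_dir, "database.*.tar.gz")
    archives = sorted(glob.glob(pattern))
    for archive in archives:
        with tarfile.open(archive, mode="r:gz") as tar:
            tar.extractall(target_dir)
    return archives
```

For example, `unpack_release("/users/yourname/downloads", "/users/yourname/wormbase")` would leave the unpacked database under /users/yourname/wormbase, where the downloads path is an assumed location. The database can then be opened with xace as described in step 4.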