Difference between revisions of "WS233"
Line 1: | Line 1: | ||
+ | __TOC__ | ||
+ | |||
+ | == Release notes for WS233 == | ||
<pre> | <pre> | ||
− | |||
+ | New release of WormBase WS233 | ||
− | + | WS233 was built by mt3 | |
− | + | -===================================================================================- | |
+ | The WS233 build directory includes: | ||
+ | species/ DIR - contains a sub dir for each WormBase species (G_SPECIES) with the following files: | ||
+ | - G_SPECIES.WS233.genomic.fa.gz - Unmasked genomic DNA | ||
+ | - G_SPECIES.WS233.genomic_masked.fa.gz - Hard-masked (repeats replaced with Ns) genomic DNA | ||
+ | - G_SPECIES.WS233.genomic_softmasked.fa.gz - Soft-masked (repeats lower-cased) genomic DNA | ||
+ | - G_SPECIES.WS233.protein.fa.gz - Current live protein set | ||
+ | - G_SPECIES.WS233.cds_transcripts.fa.gz - Spliced cDNA sequence for the CDS portion of protein-coding transcripts | ||
+ | - G_SPECIES.WS233.ncrna_transcripts.fa.gz - Spliced cDNA sequence for non-coding RNA transcripts | ||
+ | - G_SPECIES.WS233.intergenic_sequences.fa.gz - DNA sequence between pairs of adjacent genes | ||
+ | - G_SPECIES.WS233.annotations.gff[2|3].gz - Sequence features in either GFF2 or GFF3 format | ||
+ | - G_SPECIES.WS233.ests.fa.gz - ESTs and mRNA sequences extracted from the public databases | ||
+ | - G_SPECIES.WS233.best_blastp_hits.txt.gz - Best blastp matches to human, fly, yeast, and non-WormBase UniProt proteins | ||
+ | - G_SPECIES.WS233.*pep_package.tar.gz - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release) | ||
+ | - annotation/ - contains additional annotations: | ||
+ | - G_SPECIES.WS233.confirmed_genes.txt.gz - DNA sequences of all genes confirmed by EST and/or cDNA | ||
+ | - G_SPECIES.WS233.cDNA2orf.txt.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA) | ||
+ | - G_SPECIES.WS233.geneIDs.txt.gz - list of all current gene identifiers with CGC & molecular names (when known) | ||
+ | - G_SPECIES.WS233.PCR_product2gene.txt.gz - Mappings between PCR products and overlapping Genes | ||
+ | - G_SPECIES.WS233.*oligo_mapping.txt.gz - Oligo array mapping files | ||
+ | - G_SPECIES.WS233.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles | ||
+ | - G_SPECIES.WS233.SRA_gene_expression.tar.gz - Tables of gene expression values computed from SRA RNASeq data | ||
+ | acedb DIR - Everything needed to generate a local copy of the primary database | ||
+ | - database.WS233.*.tar.gz - compressed acedb database for new release | ||
+ | - models.wrm.WS233 - the latest database schema (also in above database files) | ||
+ | - WS233-WS232.dbcomp - log file reporting difference from last release | ||
+ | - *Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species | ||
+ | (reduces the size of the main database) | ||
+ | COMPARATIVE_ANALYSIS DIR - comparative analysis files | ||
+ | - compara.WS233.tar.bz2 - gene-tree and alignment GFF files | ||
+ | - wormpep_clw.WS233.sql.bz2 - ClustalW protein multiple alignments | ||
+ | ONTOLOGY DIR - OBO files for the phenotype, GO and anatomy ontologies, and their gene association files | ||
Release notes on the web: | Release notes on the web: | ||
− | + | ------------------------- | |
− | |||
http://www.wormbase.org/wiki/index.php/Release_Schedule | http://www.wormbase.org/wiki/index.php/Release_Schedule | ||
− | |||
− | |||
− | |||
− | |||
− | |||
C. elegans Chromosomal Changes: | C. elegans Chromosomal Changes: | ||
− | + | -------------------- | |
− | |||
There are no changes to the chromosome sequences in this release. | There are no changes to the chromosome sequences in this release. | ||
− | C. elegans Gene data set (Live C. elegans genes | + | C. elegans Gene data set (Live C. elegans genes 47559) |
− | + | ------------------------------------------ | |
− | + | Molecular_info 45917 (96.5%) | |
− | Molecular_info | + | Concise_description 5982 (12.6%) |
+ | Human_disease_relevance 88 (0.2%) | ||
+ | Reference 17219 (36.2%) | ||
+ | WormBase_approved_Gene_name 26890 (56.5%) | ||
+ | RNAi_result 24881 (52.3%) | ||
+ | Microarray_results 23981 (50.4%) | ||
+ | SAGE_transcript 19223 (40.4%) | ||
C. elegans | C. elegans | ||
− | |||
Wormpep data set: | Wormpep data set: | ||
+ | ---------------------------- | ||
+ | There are 26011 CDSs, from 20554 protein-coding genes | ||
− | + | The 26011 sequences contain 34548564 base pairs in total. | |
+ | Modified entries 12 | ||
+ | Deleted entries 21 | ||
+ | New entries 45 | ||
+ | Reappeared entries 4 | ||
− | The 25987 | + | Net change +28 |
− | + | The difference (24) between the total number of CDSs in this build (26011) and the last build (25987) does not equal the net change (+28). |
− | + | Please investigate! |
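The consistency check reported above can be reproduced with simple arithmetic. The sketch below assumes the net change is calculated as new plus reappeared minus deleted entries; that formula reproduces the reported +28 but is inferred from the figures, not taken from the build scripts, and the function name is illustrative.

```python
def wormpep_net_change(new: int, reappeared: int, deleted: int) -> int:
    """Net change in CDS entries (assumed formula: gains minus losses)."""
    return new + reappeared - deleted


# Figures reported for WS233 in the Wormpep table above.
net_change = wormpep_net_change(new=45, reappeared=4, deleted=21)
observed_difference = 26011 - 25987  # CDS total in WS233 minus WS232

print(net_change, observed_difference)
if net_change != observed_difference:
    gap = net_change - observed_difference
    print(f"Mismatch of {gap} entries between net change and CDS totals")
```

Modified entries are left out of the calculation because they do not change the number of CDSs.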
− | |||
− | |||
− | |||
− | |||
C. elegans Genome sequence composition: | C. elegans Genome sequence composition: | ||
+ | ---------------------------- | ||
+ | WS233 WS232 change | ||
+ | ---------------------------------------------- | ||
+ | a 32367418 32367418 +0 | ||
+ | c 17780787 17780787 +0 | ||
+ | g 17756985 17756985 +0 | ||
+ | t 32367086 32367086 +0 | ||
+ | n 0 0 +0 | ||
+ | - 0 0 +0 | ||
− | + | Total 100272276 100272276 +0 | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | Total 100272276 100272276 +0 | ||
Pristionchus pacificus Genome sequence composition: | Pristionchus pacificus Genome sequence composition: | ||
− | + | ---------------------------- | |
− | + | 172773083 total | |
− | 172773083 total a 43813958 c 32811034 g 32828589 t 43810996 - 0 n 19508506 | + | a 43813958 |
+ | c 32811034 | ||
+ | g 32828589 | ||
+ | t 43810996 | ||
+ | - 0 | ||
+ | n 19508506 | ||
Caenorhabditis remanei Genome sequence composition: | Caenorhabditis remanei Genome sequence composition: | ||
− | + | ---------------------------- | |
− | + | 145500347 total | |
− | 145500347 total a 42927857 c 26293828 g 26276020 t 42923178 - 0 n 7079464 | + | a 42927857 |
+ | c 26293828 | ||
+ | g 26276020 | ||
+ | t 42923178 | ||
+ | - 0 | ||
+ | n 7079464 | ||
Caenorhabditis japonica Genome sequence composition: | Caenorhabditis japonica Genome sequence composition: | ||
− | + | ---------------------------- | |
− | + | 166565019 total | |
− | 166565019 total a 46865690 c 30244493 g 30234317 t 46807519 - 0 n 12413000 | + | a 46865690 |
+ | c 30244493 | ||
+ | g 30234317 | ||
+ | t 46807519 | ||
+ | - 0 | ||
+ | n 12413000 | ||
Caenorhabditis briggsae Genome sequence composition: | Caenorhabditis briggsae Genome sequence composition: | ||
− | + | ---------------------------- | |
− | + | 108419768 total | |
− | 108419768 total a 32984239 c 19684682 g 19693545 t 33054090 - 62600 n 2940612 | + | a 32984239 |
+ | c 19684682 | ||
+ | g 19693545 | ||
+ | t 33054090 | ||
+ | - 62600 | ||
+ | n 2940612 | ||
Caenorhabditis brenneri Genome sequence composition: | Caenorhabditis brenneri Genome sequence composition: | ||
+ | ---------------------------- | ||
+ | 190421492 total | ||
+ | a 52222485 | ||
+ | c 32837458 | ||
+ | g 32882838 | ||
+ | t 52164077 | ||
+ | - 0 | ||
+ | n 20314634 | ||
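The composition tables above can be cross-checked against their reported totals, and a GC fraction derived from them. This is a minimal sketch: the counts are copied from the tables, while the dictionary layout and the function name are illustrative and not part of the WormBase build.

```python
# Base counts copied from the tables above:
# (a, c, g, t, gap "-", n, reported total)
COMPOSITION = {
    "C. elegans": (32367418, 17780787, 17756985, 32367086, 0, 0, 100272276),
    "P. pacificus": (43813958, 32811034, 32828589, 43810996, 0, 19508506, 172773083),
    "C. remanei": (42927857, 26293828, 26276020, 42923178, 0, 7079464, 145500347),
    "C. japonica": (46865690, 30244493, 30234317, 46807519, 0, 12413000, 166565019),
    "C. briggsae": (32984239, 19684682, 19693545, 33054090, 62600, 2940612, 108419768),
    "C. brenneri": (52222485, 32837458, 32882838, 52164077, 0, 20314634, 190421492),
}


def gc_fraction(a: int, c: int, g: int, t: int) -> float:
    """Fraction of called bases (a, c, g, t only) that are G or C."""
    return (c + g) / (a + c + g + t)


for species, (a, c, g, t, gap, n, total) in COMPOSITION.items():
    # Every table should add up to its own reported total.
    assert a + c + g + t + gap + n == total, species
    print(f"{species}: GC = {gc_fraction(a, c, g, t):.1%} of called bases")
```

Gap and n positions are excluded from the GC denominator so that assemblies with many unresolved bases remain comparable.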
− | |||
Tier II Gene counts | Tier II Gene counts | ||
− | + | --------------------------------------------- | |
− | + | pristionchus Gene count 24216 (Coding 24216) | |
− | pristionchus Gene count 24216 (Coding 24216) remanei Gene count 32414 (Coding 31445) japonica Gene count | + | remanei Gene count 32414 (Coding 31445) |
− | + | japonica Gene count 29964 (Coding 29964) | |
− | + | briggsae Gene count 23027 (Coding 21936) | |
− | brenneri Gene count | + | brenneri Gene count 32362 (Coding 30667) |
+ | --------------------------------------------- | ||
+ | ------------------------------------------------- | ||
Pristionchus pacificus Protein Stats: | Pristionchus pacificus Protein Stats: | ||
− | + | ------------------------------------------------- | |
− | |||
Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | ||
+ | ------------------------------------------------- | ||
+ | Confirmed 229 (0.9%) Every base of every exon has transcription evidence (mRNA, EST etc.) | ||
+ | Partially_confirmed 4982 (20.6%) Some, but not all exon bases are covered by transcript evidence | ||
+ | Predicted 19006 (78.5%) No transcriptional evidence at all | ||
− | |||
− | |||
Gene <-> CDS,Transcript,Pseudogene connections | Gene <-> CDS,Transcript,Pseudogene connections | ||
− | + | ---------------------------------------------- | |
− | + | Pristionchus pacificus entries with WormBase-approved Gene name 3259 | |
− | Pristionchus pacificus entries with WormBase-approved Gene name | ||
+ | ------------------------------------------------- | ||
Caenorhabditis remanei Protein Stats: | Caenorhabditis remanei Protein Stats: | ||
− | + | ------------------------------------------------- | |
− | |||
Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | ||
+ | ------------------------------------------------- | ||
+ | Confirmed 961 (3.1%) Every base of every exon has transcription evidence (mRNA, EST etc.) | ||
+ | Partially_confirmed 5658 (18.0%) Some, but not all exon bases are covered by transcript evidence | ||
+ | Predicted 24831 (79.0%) No transcriptional evidence at all | ||
− | |||
− | |||
Gene <-> CDS,Transcript,Pseudogene connections | Gene <-> CDS,Transcript,Pseudogene connections | ||
− | + | ---------------------------------------------- | |
− | + | Caenorhabditis remanei entries with WormBase-approved Gene name 6018 | |
− | Caenorhabditis remanei entries with WormBase-approved Gene name | ||
+ | ------------------------------------------------- | ||
Caenorhabditis japonica Protein Stats: | Caenorhabditis japonica Protein Stats: | ||
− | + | ------------------------------------------------- | |
− | |||
Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | ||
+ | ------------------------------------------------- | ||
+ | Confirmed 1637 (4.5%) Every base of every exon has transcription evidence (mRNA, EST etc.) | ||
+ | Partially_confirmed 5200 (14.4%) Some, but not all exon bases are covered by transcript evidence | ||
+ | Predicted 29197 (81.0%) No transcriptional evidence at all | ||
− | |||
− | |||
Gene <-> CDS,Transcript,Pseudogene connections | Gene <-> CDS,Transcript,Pseudogene connections | ||
− | + | ---------------------------------------------- | |
− | + | Caenorhabditis japonica entries with WormBase-approved Gene name 4973 | |
− | Caenorhabditis japonica entries with WormBase-approved Gene name | ||
+ | ------------------------------------------------- | ||
Caenorhabditis briggsae Protein Stats: | Caenorhabditis briggsae Protein Stats: | ||
− | + | ------------------------------------------------- | |
− | |||
Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | ||
+ | ------------------------------------------------- | ||
+ | Confirmed 54 (0.2%) Every base of every exon has transcription evidence (mRNA, EST etc.) | ||
+ | Partially_confirmed 859 (3.9%) Some, but not all exon bases are covered by transcript evidence | ||
+ | Predicted 21048 (95.8%) No transcriptional evidence at all | ||
− | |||
− | |||
Status of entries: Protein Accessions | Status of entries: Protein Accessions | ||
− | + | ------------------------------------- | |
− | + | UniProtKB accessions 21662 (98.6%) | |
− | UniProtKB accessions 21662 (98.6%) | ||
− | |||
Gene <-> CDS,Transcript,Pseudogene connections | Gene <-> CDS,Transcript,Pseudogene connections | ||
− | + | ---------------------------------------------- | |
− | + | Caenorhabditis briggsae entries with WormBase-approved Gene name 6101 | |
− | Caenorhabditis briggsae entries with WormBase-approved Gene name | ||
+ | ------------------------------------------------- | ||
Caenorhabditis brenneri Protein Stats: | Caenorhabditis brenneri Protein Stats: | ||
− | + | ------------------------------------------------- | |
− | |||
Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | ||
+ | ------------------------------------------------- | ||
+ | Confirmed 1510 (4.9%) Every base of every exon has transcription evidence (mRNA, EST etc.) | ||
+ | Partially_confirmed 5638 (18.4%) Some, but not all exon bases are covered by transcript evidence | ||
+ | Predicted 23522 (76.7%) No transcriptional evidence at all | ||
− | |||
− | |||
Gene <-> CDS,Transcript,Pseudogene connections | Gene <-> CDS,Transcript,Pseudogene connections | ||
− | + | ---------------------------------------------- | |
− | + | Caenorhabditis brenneri entries with WormBase-approved Gene name 3513 | |
− | Caenorhabditis brenneri entries with WormBase-approved Gene name | ||
+ | ------------------------------------------------- | ||
Caenorhabditis elegans Protein Stats: | Caenorhabditis elegans Protein Stats: | ||
− | + | ------------------------------------------------- | |
− | |||
Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | Status of entries: Confidence level of prediction (based on the amount of transcript evidence) | ||
+ | ------------------------------------------------- | ||
+ | Confirmed 12443 (47.8%) Every base of every exon has transcription evidence (mRNA, EST etc.) | ||
+ | Partially_confirmed 11492 (44.2%) Some, but not all exon bases are covered by transcript evidence | ||
+ | Predicted 2076 (8.0%) No transcriptional evidence at all | ||
− | |||
− | |||
Status of entries: Protein Accessions | Status of entries: Protein Accessions | ||
− | + | ------------------------------------- | |
− | + | UniProtKB accessions 25947 (99.8%) | |
− | UniProtKB accessions | ||
− | |||
Status of entries: Protein_ID's in EMBL | Status of entries: Protein_ID's in EMBL | ||
− | + | --------------------------------------- | |
− | + | Protein_id 25999 (100.0%) | |
− | Protein_id | ||
− | |||
Gene <-> CDS,Transcript,Pseudogene connections | Gene <-> CDS,Transcript,Pseudogene connections | ||
− | + | ---------------------------------------------- | |
− | + | Caenorhabditis elegans entries with WormBase-approved Gene name 25293 | |
− | Caenorhabditis elegans entries with WormBase-approved Gene name | ||
C. elegans Operons Stats | C. elegans Operons Stats | ||
− | + | --------------------------------------------- | |
− | |||
Description: These exist as closely spaced gene clusters similar to bacterial operons | Description: These exist as closely spaced gene clusters similar to bacterial operons | ||
+ | --------------------------------------------- | ||
+ | | Live Operons 1390 | | ||
+ | | Genes in Operons 3634 | | ||
+ | --------------------------------------------- | ||
− | + | GO Annotation Stats WS233 | |
− | + | -------------------------------------- | |
− | |||
− | |||
− | |||
− | |||
− | GO Annotation Stats | ||
− | |||
GO_codes - used for assigning evidence | GO_codes - used for assigning evidence | ||
− | + | -------------------------------------- | |
− | + | IC Inferred by Curator | |
− | IC Inferred by Curator IDA Inferred from Direct Assay IEA Inferred from Electronic Annotation IEP Inferred from Expression Pattern IGI Inferred from Genetic Interaction IMP Inferred from Mutant Phenotype IPI Inferred from Physical Interaction ISS Inferred from Sequence (or Structural) Similarity NAS Non-traceable Author Statement ND No Biological Data available RCA Inferred from Reviewed Computational Analysis | + | IDA Inferred from Direct Assay |
− | + | IEA Inferred from Electronic Annotation | |
− | + | IEP Inferred from Expression Pattern | |
+ | IGI Inferred from Genetic Interaction | ||
+ | IMP Inferred from Mutant Phenotype | ||
+ | IPI Inferred from Physical Interaction | ||
+ | ISS Inferred from Sequence (or Structural) Similarity | ||
+ | NAS Non-traceable Author Statement | ||
+ | ND No Biological Data available | ||
+ | RCA Inferred from Reviewed Computational Analysis | ||
TAS Traceable Author Statement | TAS Traceable Author Statement | ||
+ | ------------------------------------------------ | ||
− | + | Total number of Gene::GO connections: 263501 | |
− | Total number of Gene::GO connections: | ||
− | |||
Genes Stats: | Genes Stats: | ||
− | + | ---------------- | |
− | + | Genes with GO_term connections 90767 | |
− | Genes with GO_term connections | + | IEA GO_code present 84543 |
− | IEA GO_code present | + | non-IEA GO_code present 6220 |
− | non-IEA GO_code present | ||
− | |||
Source of the mapping data | Source of the mapping data | ||
− | Source: *RNAi (GFF mapping overlaps) | + | Source: *RNAi (GFF mapping overlaps) 26630 |
− | *citace | + | *citace 2561 |
− | *Inherited (motif & phenotype) | + | *Inherited (motif & phenotype) 15134 |
− | |||
GO_terms Stats: | GO_terms Stats: | ||
− | + | --------------- | |
− | + | Total No. GO_terms 30611 | |
− | Total No. GO_terms | + | GO_terms connected to Genes 3586 |
− | GO_terms connected to Genes | + | GO annotations connected with IEA 1842 |
− | GO annotations connected with IEA | + | GO annotations connected with non-IEA 1727 |
− | GO annotations connected with non-IEA | + | Breakdown IC - 6 IDA - 498 ISS - 151 |
− | Breakdown IC - 6 IDA - | + | IEP - 11 IGI - 147 IMP - 810 |
− | RCA - 0 TAS - 18 | + | IPI - 83 NAS - 1 ND - 1 |
+ | RCA - 0 TAS - 18 | ||
-===================================================================================- | -===================================================================================- | ||
− | |||
Useful Stats: | Useful Stats: | ||
+ | --------- | ||
+ | Genes with Sequence and WormBase-approved Gene names | ||
+ | WS233 49157 (25293 elegans / 6101 briggsae / 6018 remanei / 4973 japonica / 3513 brenneri / 3259 pristionchus) | ||
− | |||
+ | -===================================================================================- | ||
− | |||
New Data: | New Data: | ||
+ | --------- | ||
+ | Transcription Factor Binding sites | ||
− | + | The modENCODE Transcription Factor data has been added as 321,212 new | |
+ | Feature objects with the Method tag 'TF_binding_site_region'. This | ||
+ | data was derived from the Snyder Lab project to map the binding sites | ||
+ | of a selection of transcription factors in various life-stages and | ||
+ | conditions using ChIP-Seq. | ||
+ | The region where the transcription factor binds has been called by | ||
+ | modENCODE using peak-finding software on the ChIP-Seq results; the | ||
+ | binding sites are therefore currently known only approximately, | ||
+ | typically in a region of about 200 bases, or larger. | ||
− | + | These data have a GFF line with the "type" column containing the SO | |
+ | term 'TF_binding_site' and with the name of the transcription factor | ||
+ | and the ID of the transcription factor object in the database, for | ||
+ | example: | ||
− | + | CHROMOSOME_III TF_binding_site_region TF_binding_site 14696 15618 . + . Feature "WBsf401679" ; TF_ID "WBTranscriptionFactor000025" ; TF_name "DAF-16" | |
+ | Where binding sites are known exactly, the Feature object's Method tag | ||
+ | will be changed to 'TF_binding_site'. There are some existing exactly | ||
+ | known sites, from other projects, having Feature objects with the | ||
+ | Method tag 'TF_binding_site'. | ||
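The TF_binding_site example line above can be read with a few lines of Python. This sketch assumes the usual GFF2 layout of nine tab-separated columns (the tabs are not visible in the example as printed) and attributes written as tag "value" pairs separated by semicolons; the function and field names are illustrative.

```python
import re

# Matches GFF2 attribute pairs of the form: tag "value"
ATTRIBUTE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_tf_binding_line(line: str) -> dict:
    """Parse one GFF2 line; assumes nine tab-separated columns."""
    (seqid, source, ftype, start, end,
     score, strand, phase, attrs) = line.rstrip("\n").split("\t", 8)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTRIBUTE.findall(attrs)),
    }


# The example line from the release notes, rebuilt with explicit tabs.
example = "\t".join([
    "CHROMOSOME_III", "TF_binding_site_region", "TF_binding_site",
    "14696", "15618", ".", "+", ".",
    'Feature "WBsf401679" ; TF_ID "WBTranscriptionFactor000025" ; TF_name "DAF-16"',
])

record = parse_tf_binding_line(example)
print(record["attributes"]["TF_name"], record["end"] - record["start"] + 1)
```

Filtering on the second column ('TF_binding_site_region' versus 'TF_binding_site') would separate the approximate modENCODE regions from exactly known sites.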
− | |||
− | |||
+ | Genome Sequence error sites | ||
− | |||
+ | Features marking known genome sequence error locations that have | ||
+ | already been corrected by changing the C. elegans reference genome | ||
+ | sequence now have the Method tag 'Corrected_genome_sequence_error'. | ||
− | + | These have a GFF line with the "type" column containing the SO term | |
+ | 'base_call_error_correction', for example: | ||
+ | CHROMOSOME_III RNASeq base_call_error_correction 661017 661017 . + . Feature "WBsf047774" | ||
− | |||
+ | We have been working on marking possible genome sequence errors in the | ||
+ | C. elegans reference genome using data from the following projects: | ||
− | |||
+ | - Weber KP et al. (2010) PLoS One "Whole genome sequencing highlights | ||
+ | genetic changes associated with laboratory ...." | ||
− | + | PMID 21085631 | |
+ | - Doitsidou M et al. (2010) PLoS One "C. elegans mutant identification | ||
+ | with a one-step whole-genome-sequencing and ...." | ||
− | + | PMID 21079745 | |
+ | - McGrath PT et al. (2011) Nature "Parallel evolution of domesticated | ||
+ | Caenorhabditis species targets pheromone ...." | ||
− | + | PMID 21849976 | |
+ | This data has been used to create Feature objects with a Method tag of | ||
+ | 'Genome_sequence_error', taking the total of these locations in the | ||
+ | C. elegans database to 2,428. | ||
− | + | These data have a GFF line with the "type" column containing the SO | |
+ | term 'possible_base_call_error', for example: | ||
− | + | CHROMOSOME_III RNASeq possible_base_call_error 38559 38560 . + . Feature "WBsf268625" | |
− | + | No changes have been made yet to the C. elegans reference genome | |
+ | sequence to correct these locations. We intend to correct the sites | ||
+ | that can be shown to influence coding genes in a future release of | ||
+ | WormBase. There are 77 indel errors in the C. elegans genome that | ||
+ | affect the structure of a coding gene. | ||
+ | There are two possible error sites on the C. elegans mitochondrial | ||
+ | sequence. As the WormBase consortium does not own the mitochondrial | ||
+ | sequence, we will mark these sites with a Feature object with the | ||
+ | Method tag 'Genome_sequence_error', but we will not be changing this | ||
+ | sequence. | ||
− | |||
+ | Genome sequence updates: | ||
+ | ----------------------- | ||
− | + | None this release | |
+ | New Fixes: | ||
+ | ---------- | ||
− | |||
+ | Known Problems: | ||
+ | --------------- | ||
− | |||
+ | Other Changes: | ||
+ | -------------- | ||
− | + | The life-cycle data in all objects has been changed from being held as | |
+ | the name of the life-cycle stage to the ID number of the life-cycle | ||
+ | stage. | ||
+ | For example: 'L1 larva' has been changed to be held as the ID | ||
+ | 'WBls:0000024' in all objects that refer to it. | ||
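Scripts that still hold life-cycle stage names may need to translate them to IDs after this change. The sketch below is illustrative only: it contains just the one mapping quoted above, and the table and function names are hypothetical; a real table would have to be built from the Life_stage objects in the database.

```python
# Only the mapping quoted in the release notes is included here.
LIFE_STAGE_IDS = {"L1 larva": "WBls:0000024"}


def to_life_stage_id(value: str) -> str:
    """Return the WBls ID for a stage name; IDs are passed through unchanged."""
    if value.startswith("WBls:"):
        return value
    try:
        return LIFE_STAGE_IDS[value]
    except KeyError:
        raise KeyError(f"no life-stage ID known for {value!r}") from None


print(to_life_stage_id("L1 larva"))
```

Raising on unknown names, instead of returning them unchanged, makes it harder for an old-style name to slip through unnoticed.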
− | |||
+ | Model Changes: | ||
+ | ------------------------------------ | ||
− | + | In this release there are changes to a number of classes: | |
+ | ?Strain | ||
+ | ?RNAi | ||
+ | ?Life_stage | ||
+ | #Interactor_info | ||
+ | ?Interaction | ||
+ | ?Transgene | ||
+ | ?Rearrangement | ||
+ | ?PCR_product | ||
+ | ?Anatomy_term | ||
+ | ?Paper | ||
+ | ?Process | ||
− | More information and a human-readable diff can be found here: http://wiki.wormbase.org/index.php/WS232_Models.wrm | + | More information and a human-readable diff can be found here: |
+ | http://wiki.wormbase.org/index.php/WS232_Models.wrm | ||
− | + | For more info mail help@wormbase.org | |
− | For more info mail help@wormbase.org -===================================================================================- | + | -===================================================================================- |
Quick installation guide for UNIX/Linux systems | Quick installation guide for UNIX/Linux systems | ||
+ | ----------------------------------------------- | ||
+ | 1. Create a new directory to contain your copy of WormBase, | ||
+ | e.g. /users/yourname/wormbase | ||
− | + | 2. Unpack and untar all of the database.*.tar.gz files into | |
− | + | this directory. You will need approximately 2-3 GB of disk space. |
− | |||
− | |||
− | |||
− | + | 3. Obtain and install a suitable acedb binary for your system | |
+ | (available from www.acedb.org). | ||
− | + | 4. Use the acedb 'xace' program to open your database, e.g. | |
+ | type 'xace /users/yourname/wormbase' at the command prompt. | ||
+ | 5. See the acedb website for more information about acedb and | ||
+ | using xace. | ||
+ | ____________ END _____________ | ||
− | |||
</pre> | </pre> |
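Steps 1 and 2 of the quick installation guide can also be scripted. This is a minimal sketch using only the Python standard library; the function name and the idea of a separate download directory are assumptions, and it should only be run on archives obtained from a trusted source.

```python
import glob
import os
import tarfile


def unpack_release(download_dir: str, target_dir: str) -> list:
    """Unpack every database.*.tar.gz found in download_dir into target_dir."""
    # Step 1: create the directory that will hold the local copy.
    os.makedirs(target_dir, exist_ok=True)

    # Step 2: unpack each compressed acedb archive into that directory.
    archives = sorted(glob.glob(os.path.join(download_dir, "database.*.tar.gz")))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives
```

A call such as `unpack_release("/path/to/downloads", "/users/yourname/wormbase")` would leave the directory ready for steps 3 to 5, which still require an acedb binary and xace.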
Latest revision as of 14:17, 13 September 2012
Contents
Release notes for WS233
New release of WormBase WS233 WS233 was built by mt3 -===================================================================================- The WS233 build directory includes: species/ DIR - contains a sub dir for each WormBase species (G_SPECIES) with the following files: - G_SPECIES.WS233.genomic.fa.gz - Unmasked genomic DNA - G_SPECIES.WS233.genomic_masked.fa.gz - Hard-masked (repeats replaced with Ns) genomic DNA - G_SPECIES.WS233.genomic_softmasked.fa.gz - Soft-masked (repeats lower-cased) genomic DNA - G_SPECIES.WS233.protein.fa.gz - Current live protein set - G_SPECIES.WS233.cds_transcripts.fa.gz - Spliced cDNA sequence for the CDS portion of protein-coding transcripts - G_SPECIES.WS233.ncrna_transcripts.fa.gz - Spliced cDNA sequence for non-coding RNA transcripts - G_SPECIES.WS233.intergenic_sequences.fa.gz - DNA sequence between pairs of adjacent genes - G_SPECIES.WS233.annotations.gff[2|3].gz - Sequence features in either GFF2 or GFF3 format - G_SPECIES.WS233.ests.fa.gz - ESTs and mRNA sequences extracted from the public databases - G_SPECIES.WS233.best_blastp_hits.txt.gz - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins - G_SPECIES.WS233.*pep_package.tar.gz - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release) - annotation/ - contains additional annotations: - G_SPECIES.WS233.confirmed_genes.txt.gz - DNA sequences of all genes confirmed by EST &/or cDNA - G_SPECIES.WS233.cDNA2orf.txt.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA) - G_SPECIES.WS233.geneIDs.txtgz - list of all current gene identifiers with CGC & molecular names (when known) - G_SPECIES.WS233.PCR_product2gene.txt.gz - Mappings between PCR products and overlapping Genes - G_SPECIES.WS233.*oligo_mapping.txt.gz - Oligo array mapping files - G_SPECIES.WS233.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles - G_SPECIES.WS233.SRA_gene_expression.tar.gz - Tables of gene expression values 
computed from SRA RNASeq data acedb DIR - Everything needed to generate a local copy of the The Primary database - database.WS233.*.tar.gz - compressed acedb database for new release - models.wrm.WS233 - the latest database schema (also in above database files) - WS233-WS232.dbcomp - log file reporting difference from last release - *Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species (reduces the size of the main database) COMPARATIVE_ANALYSIS DIR - comparative analysis files - compara.WS233.tar.bz2 - gene-tree and alignment GFF files - wormpep_clw.WS233.sql.bz2 - ClustalW protein multiple alignments ONTOLOGY DIR - gene_associations, obo files for (phenotype GO anatomy) and associated association files Release notes on the web: ------------------------- http://www.wormbase.org/wiki/index.php/Release_Schedule C. elegans Chromosomal Changes: -------------------- There are no changes to the chromosome sequences in this release. C. elegans Gene data set (Live C. elegans genes 47559) ------------------------------------------ Molecular_info 45917 (96.5%) Concise_description 5982 (12.6%) Human_disease_relevance 88 (0.2%) Reference 17219 (36.2%) WormBase_approved_Gene_name 26890 (56.5%) RNAi_result 24881 (52.3%) Microarray_results 23981 (50.4%) SAGE_transcript 19223 (40.4%) C. elegans Wormpep data set: ---------------------------- There are 26011 CDSs, from 20554 protein-coding genes The 26011 sequences contain 34548564 base pairs in total. Modified entries 12 Deleted entries 21 New entries 45 Reappeared entries 4 Net change +28 The difference (24) between the total CDS's of this (26011) and the last build (25987) does not equal the net change 28 Please investigate! ! C. 
elegans Genome sequence composition: ---------------------------- WS233 WS232 change ---------------------------------------------- a 32367418 32367418 +0 c 17780787 17780787 +0 g 17756985 17756985 +0 t 32367086 32367086 +0 n 0 0 +0 - 0 0 +0 Total 100272276 100272276 +0 Pristionchus pacificus Genome sequence composition: ---------------------------- 172773083 total a 43813958 c 32811034 g 32828589 t 43810996 - 0 n 19508506 Caenorhabditis remanei Genome sequence composition: ---------------------------- 145500347 total a 42927857 c 26293828 g 26276020 t 42923178 - 0 n 7079464 Caenorhabditis japonica Genome sequence composition: ---------------------------- 166565019 total a 46865690 c 30244493 g 30234317 t 46807519 - 0 n 12413000 Caenorhabditis briggsae Genome sequence composition: ---------------------------- 108419768 total a 32984239 c 19684682 g 19693545 t 33054090 - 62600 n 2940612 Caenorhabditis brenneri Genome sequence composition: ---------------------------- 190421492 total a 52222485 c 32837458 g 32882838 t 52164077 - 0 n 20314634 Tier II Gene counts --------------------------------------------- pristionchus Gene count 24216 (Coding 24216) remanei Gene count 32414 (Coding 31445) japonica Gene count 29964 (Coding 29964) briggsae Gene count 23027 (Coding 21936) brenneri Gene count 32362 (Coding 30667) --------------------------------------------- ------------------------------------------------- Pristionchus pacificus Protein Stats: ------------------------------------------------- Status of entries: Confidence level of prediction (based on the amount of transcript evidence) ------------------------------------------------- Confirmed 229 (0.9%) Every base of every exon has transcription evidence (mRNA, EST etc.) 
Partially_confirmed 4982 (20.6%) Some, but not all exon bases are covered by transcript evidence Predicted 19006 (78.5%) No transcriptional evidence at all Gene <-> CDS,Transcript,Pseudogene connections ---------------------------------------------- Pristionchus pacificus entries with WormBase-approved Gene name 3259 ------------------------------------------------- Caenorhabditis remanei Protein Stats: ------------------------------------------------- Status of entries: Confidence level of prediction (based on the amount of transcript evidence) ------------------------------------------------- Confirmed 961 (3.1%) Every base of every exon has transcription evidence (mRNA, EST etc.) Partially_confirmed 5658 (18.0%) Some, but not all exon bases are covered by transcript evidence Predicted 24831 (79.0%) No transcriptional evidence at all Gene <-> CDS,Transcript,Pseudogene connections ---------------------------------------------- Caenorhabditis remanei entries with WormBase-approved Gene name 6018 ------------------------------------------------- Caenorhabditis japonica Protein Stats: ------------------------------------------------- Status of entries: Confidence level of prediction (based on the amount of transcript evidence) ------------------------------------------------- Confirmed 1637 (4.5%) Every base of every exon has transcription evidence (mRNA, EST etc.) 
Partially_confirmed 5200 (14.4%) Some, but not all exon bases are covered by transcript evidence Predicted 29197 (81.0%) No transcriptional evidence at all Gene <-> CDS,Transcript,Pseudogene connections ---------------------------------------------- Caenorhabditis japonica entries with WormBase-approved Gene name 4973 ------------------------------------------------- Caenorhabditis briggsae Protein Stats: ------------------------------------------------- Status of entries: Confidence level of prediction (based on the amount of transcript evidence) ------------------------------------------------- Confirmed 54 (0.2%) Every base of every exon has transcription evidence (mRNA, EST etc.) Partially_confirmed 859 (3.9%) Some, but not all exon bases are covered by transcript evidence Predicted 21048 (95.8%) No transcriptional evidence at all Status of entries: Protein Accessions ------------------------------------- UniProtKB accessions 21662 (98.6%) Gene <-> CDS,Transcript,Pseudogene connections ---------------------------------------------- Caenorhabditis briggsae entries with WormBase-approved Gene name 6101 ------------------------------------------------- Caenorhabditis brenneri Protein Stats: ------------------------------------------------- Status of entries: Confidence level of prediction (based on the amount of transcript evidence) ------------------------------------------------- Confirmed 1510 (4.9%) Every base of every exon has transcription evidence (mRNA, EST etc.) 
Partially_confirmed   5638 (18.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted            23522 (76.7%)  No transcriptional evidence at all


Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name   3513


-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------

Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed            12443 (47.8%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  11492 (44.2%)  Some, but not all exon bases are covered by transcript evidence
Predicted             2076 (8.0%)   No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 25947 (99.8%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id           25999 (100.0%)


Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name   25293


C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons        1390                  |
| Genes in Operons    3634                  |
---------------------------------------------


GO Annotation Stats WS233
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC   Inferred by Curator
IDA  Inferred from Direct Assay
IEA  Inferred from Electronic Annotation
IEP  Inferred from Expression Pattern
IGI  Inferred from Genetic Interaction
IMP  Inferred from Mutant Phenotype
IPI  Inferred from Physical Interaction
ISS  Inferred from Sequence (or Structural) Similarity
NAS  Non-traceable Author Statement
ND   No Biological Data available
RCA  Inferred from Reviewed Computational Analysis
TAS  Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections:  263501

Genes Stats:
----------------
Genes with GO_term connections         90767
           IEA GO_code present         84543
       non-IEA GO_code present          6220

Source of the mapping data
Source: *RNAi (GFF mapping overlaps)   26630
        *citace                         2561
        *Inherited (motif & phenotype) 15134

GO_terms Stats:
---------------
Total No. GO_terms                     30611
GO_terms connected to Genes             3586
GO annotations connected with IEA       1842
GO annotations connected with non-IEA   1727
   Breakdown  IC  - 6     IDA - 498
              ISS - 151   IEP - 11
              IGI - 147   IMP - 810
              IPI - 83    NAS - 1
              ND  - 1     RCA - 0
              TAS - 18

-===================================================================================-

Useful Stats:
---------

Genes with Sequence and WormBase-approved Gene names
WS233 49157 (25293 elegans / 6101 briggsae / 6018 remanei / 4973 japonica / 3513 brenneri / 3259 pristionchus)

-===================================================================================-

New Data:
---------

Transcription Factor Binding sites

The modENCODE Transcription Factor data has been added as 321,212 new Feature objects with
the Method tag 'TF_binding_site_region'. This data was derived from the Snyder Lab project
to map the binding sites of a selection of transcription factors in various life-stages and
conditions using ChIP-Seq.

The region where the transcription factor binds has been called by modENCODE using
peak-finding software on the ChIP-Seq results. The binding sites are therefore currently
known only approximately, typically in a region of about 200 bases, or larger.

These data have a GFF line with the "type" column containing the SO term 'TF_binding_site'
and with the name of the transcription factor and the ID of the transcription factor object
in the database, for example:

CHROMOSOME_III TF_binding_site_region TF_binding_site 14696 15618 . + . Feature "WBsf401679" ; TF_ID "WBTranscriptionFactor000025" ; TF_name "DAF-16"

Where binding sites are known exactly, the Feature object's Method tag will be changed to
'TF_binding_site'. There are some existing exactly known sites, from other projects, having
Feature objects with the Method tag 'TF_binding_site'.


Genome Sequence error sites

Features marking known genome sequence error locations that have already been corrected by
changing the C. elegans reference genome sequence now have the Method tag
'Corrected_genome_sequence_error'. These have a GFF line with the "type" column containing
the SO term 'base_call_error_correction', for example:

CHROMOSOME_III RNASeq base_call_error_correction 661017 661017 . + . Feature "WBsf047774"

We have been working on marking possible genome sequence errors in the C. elegans reference
genome using data from the following projects:

- Weber KP et al. (2010) PLoS One "Whole genome sequencing highlights genetic changes associated with laboratory ...." PMID 21085631
- Doitsidou M et al. (2010) PLoS One "C. elegans mutant identification with a one-step whole-genome-sequencing and ...." PMID 21079745
- McGrath PT et al. (2011) Nature "Parallel evolution of domesticated Caenorhabditis species targets pheromone ...." PMID 21849976

This data has been used to create Feature objects with a Method tag of
'Genome_sequence_error', taking the total of these locations in the C. elegans database to
2,428. These data have a GFF line with the "type" column containing the SO term
'possible_base_call_error', for example:

CHROMOSOME_III RNASeq possible_base_call_error 38559 38560 . + . Feature "WBsf268625"

No changes have been made yet to the C. elegans reference genome sequence to correct these
locations. We intend to correct the sites that can be shown to influence coding genes in a
future release of WormBase. There are 77 indel errors in the C. elegans genome that affect
the structure of a coding gene.

There are two possible error sites on the C. elegans mitochondrial sequence. As the
WormBase consortium does not own the mitochondrial sequence, we will mark these sites with a
Feature object with the Method tag 'Genome_sequence_error', but we will not be changing
this sequence.
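
As an illustration only, the example GFF lines above can be read with a few lines of Python.
This is a minimal sketch, not WormBase code: the function name parse_gff2_line is
hypothetical, it assumes the nine-column layout shown in the examples, and it assumes that
attribute values never contain a ';' character.

```python
def parse_gff2_line(line):
    """Parse one GFF2-style line into a dict (hypothetical helper, not a WormBase API)."""
    # Split on any whitespace into at most nine columns; the ninth column
    # (the attributes) keeps its internal spaces.
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = fields

    # Attributes look like: Feature "WBsf401679" ; TF_name "DAF-16"
    # Assumption: ';' only ever appears as the separator between attributes.
    attributes = {}
    for part in attr_text.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }
```

Under these assumptions, selecting the possible error sites from an annotations file comes
down to keeping the records whose "type" is 'possible_base_call_error' and reading the
Feature ID from record["attributes"]["Feature"].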

Genome sequence updates:
-----------------------
None this release


New Fixes:
----------


Known Problems:
---------------


Other Changes:
--------------
The life-cycle data in all objects has been changed from being held as the name of the
life-cycle stage to the ID number of the life-cycle stage.
For example: 'L1 larva' has been changed to be held as the ID 'WBls:0000024' in all objects
that refer to it.


Model Changes:
------------------------------------
In this release there are changes to a number of classes:

?Strain
?RNAi
?Life_stage
#Interactor_info
?Interaction
?Transgene
?Rearrangement
?PCR_product
?Anatomy_term
?Paper
?Process

More information and a human readable diff can be found here:
http://wiki.wormbase.org/index.php/WS232_Models.wrm

For more info mail help@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
   e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
   this directory. You will need approximately 2-3 Gb of disk space.

3. Obtain and install a suitable acedb binary for your system
   (available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g.
   type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
   using xace.

____________  END _____________