Difference between revisions of "WS220"
Latest revision as of 11:10, 21 December 2011
New release of WormBase WS220, Wormpep220 and Wormrna220 Mon Oct 25 13:22:36 BST 2010

WS220 was built by gw3
-===================================================================================-
The WS220 build directory includes:

genomes DIR - contains a sub dir for each WormBase species with sequence, gff, and agp data
  genomes/b_malayi:        - genome_feature_tables/ sequences/
  genomes/c_brenneri:      - genome_feature_tables/ sequences/
  genomes/c_briggsae:      - genome_feature_tables/ sequences/
  genomes/c_elegans:       - annotation/ genome_feature_tables/ sequences/
  genomes/c_japonica:      - genome_feature_tables/ sequences/
  genomes/c_remanei:       - genome_feature_tables/ sequences/
  genomes/h_bacteriophora: - genome_feature_tables/ sequences/
  genomes/h_contortus:     - genome_feature_tables/ sequences/
  genomes/m_hapla:         - genome_feature_tables/ sequences/
  genomes/m_incognita:     - sequences/
  genomes/p_pacificus:     - genome_feature_tables/ sequences/

*annotation/ - contains additional annotations
    i)    confirmed_genes.WS220.gz - DNA sequences of all genes confirmed by EST &/or cDNA
    ii)   cDNA2orf.WS220.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
    iii)  geneIDs.WS220.gz - list of all current gene identifiers with CGC & molecular names (when known)
    iv)   PCR_product2gene.WS220.gz - Mappings between PCR products and overlapping Genes
    v)    oligo_mapping.gz - V
*genome_feature_tables/ - contains the main .gff files and supplementary .gff data
*sequences/ - contains dna/ protein/ rna/ sub dirs
  sequences/protein - WormBase protein set for species + history etc.
    vi)   wormpep220.tar.gz - full Wormpep distribution corresponding to WS220
    vii)  wormrna220.tar.gz - latest WormRNA release containing non-coding RNA's in the genome
    viii) best_blastp_hits_species.WS220.gz - for each C. elegans WormPep protein, lists Best blastp match to human, fly, yeast, C. briggsae, and SwissProt & TrEMBL proteins.
  sequences/dna - WormBase dna data genomic sequence (raw, soft_masked masked), agp
    ix)   intergenic_sequences.dna.gz
  sequences/rna - WormBase rna gene data.

acedb DIR - Everything needed to generate a local copy of the Primary database
    x)    database.WS220.*.tar.gz - compressed acedb database for new release
    xi)   models.wrm.WS220 - the latest database schema (also in above database files)
    xii)  WS220-WS219.dbcomp - log file reporting difference from last release
*Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species (reduces the size of the main database)

COMPARATIVE_ANALYSIS DIR - compara.tar.bz2 wormpep217_clw.sql.bz2

ONTOLOGY DIR - gene_associations, obo files for (phenotype GO anatomy) and associated association files

Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule

C. elegans Synchronisation with GenBank / EMBL:
------------------------------------
No synchronisation issues

C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.

C. elegans Gene data set (Live C.elegans genes 47360)
------------------------------------------
Molecular_info               45681 (96.5%)
Concise_description           5704 (12%)
Reference                    14129 (29.8%)
WormBase_approved_Gene_name  26069 (55%)
RNAi_result                  24623 (52%)
Microarray_results           22109 (46.7%)
SAGE_transcript              19157 (40.4%)

C. elegans Wormpep data set:
----------------------------
There are 20405 CDS in autoace, 24842 when counting 4437 alternate splice forms.

The 24842 sequences contain 10,928,467 base pairs in total.

Modified entries       9
Deleted entries       18
New entries           34
Reappeared entries     1

Net change           +17

C. elegans Genome sequence composition:
----------------------------
            WS220       WS219   change
----------------------------------------------
a        32367418    32367418       +0
c        17780787    17780787       +0
g        17756985    17756985       +0
t        32367086    32367086       +0
n               0           0       +0
-               0           0       +0

Total   100272276   100272276       +0

Pristionchus pacificus Genome sequence composition:
----------------------------
169822619 total
a 41799168
c 31168435
g 31196239
t 41802890
- 0
n 23855887

Caenorhabditis remanei Genome sequence composition:
----------------------------
145500347 total
a 42927857
c 26293828
g 26276020
t 42923178
- 0
n 7079464

Caenorhabditis japonica Genome sequence composition:
----------------------------
163282347 total
a 39053092
c 25603225
g 25576971
t 39126103
- 0
n 33922956

Caenorhabditis briggsae Genome sequence composition:
----------------------------
108478630 total
a 33004189
c 19675861
g 19707411
t 33049803
- 0
n 3041366

Caenorhabditis brenneri Genome sequence composition:
----------------------------
190426650 total
a 52223359
c 32838518
g 32883836
t 52164943
- 0
n 20315994

Tier II Gene counts
---------------------------------------------
pristionchus    Gene count 29638 (Coding 29639)
remanei         Gene count 32431 (Coding 31476)
heterorhabditis Gene count 0     (Coding 0)
japonica        Gene count 27177 (Coding 25870)
briggsae        Gene count 23039 (Coding 21991)
brenneri        Gene count 32292 (Coding 30707)
---------------------------------------------

-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              425 (1.4%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5309 (17.9%)  Some, but not all exon bases are covered by transcript evidence
Predicted            23905 (80.7%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name 2797

-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              956 (3.0%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5662 (18.0%)  Some, but not all exon bases are covered by transcript evidence
Predicted            24858 (79.0%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name 5466

-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             1182 (4.6%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   4974 (19.2%)  Some, but not all exon bases are covered by transcript evidence
Predicted            19714 (76.2%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name 4797

-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               52 (0.2%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    856 (3.9%)   Some, but not all exon bases are covered by transcript evidence
Predicted            21083 (95.9%)  No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 21703 (98.7%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name 5504

-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             1517 (4.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5638 (18.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted            23550 (76.7%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name 3094

-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed            11663 (46.9%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  11004 (44.3%)  Some, but not all exon bases are covered by transcript evidence
Predicted             2175 (8.8%)   No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 24647 (99.2%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id 24647 (99.2%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name 24437

C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons     1288 |
| Genes in Operons 3342 |
---------------------------------------------

GO Annotation Stats WS220
--------------------------------------
GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------
Total number of Gene::GO connections: 266435

Genes Stats:
----------------
Genes with GO_term connections 89198
  IEA GO_code present     83214
  non-IEA GO_code present 5980

Source of the mapping data
Source: *RNAi (GFF mapping overlaps)   25242
        *citace                        2187
        *Inherited (motif & phenotype) 14528

GO_terms Stats:
---------------
Total No. GO_terms                    30471
GO_terms connected to Genes           3250
GO annotations connected with IEA     1819
GO annotations connected with non-IEA 1426

Breakdown
  IC  - 2
  IDA - 351
  ISS - 128
  IEP - 9
  IGI - 116
  IMP - 729
  IPI - 68
  NAS - 1
  ND  - 1
  RCA - 0
  TAS - 21

-===================================================================================-
Useful Stats:
---------
Genes with Sequence and CGC name WS220 46095 (24437 elegans / 5504 briggsae / 5466 remanei / 4797 japonica / 3094 brenneri / 2797 pristionchus)

-===================================================================================-
New Data:
---------
The Caenorhabditis sp.3 gene models and assembly were synched with the GenBank/ENA/DDBJ version, removing E. coli contaminated contigs.
There are 304 'HOT' regions from the modENCODE project. These are regions where there does not appear to be a concentration of transcription factor binding site motifs, but many transcription factors are found to bind to the region by ChIP-Seq experiments.

There are 7237 new ncRNA genes identified from the modENCODE project.

The Caenorhabditis species 3 (strain PS1010) genome and predicted genes are available with the other Tier III species on the FTP site. At the time of this database release the official Linnean name was not publicly available and so the files for this species are named by the abbreviation 'c_an'.

Genome sequence updates:
-----------------------

New Fixes:
----------

Known Problems:
---------------

Other Changes:
--------------

Proposed Changes / Forthcoming Data:
-------------------------------------
We expect to add 23,000 3' UTR features from the modENCODE project to release WS221. These are UTR spans as displayed on the UTRome project website; splicing is not accounted for in the feature.

Proposed model changes for WS221
-------------------------------
New Class

//////////////////////////////////////////////////////////////////////////////////////////
//
// Transcription_factor class This describes a type of transcription factor (and Pol II)
// Gary W. 14/10/2010
//
//////////////////////////////////////////////////////////////////////////////////////////

?Transcription_factor Name UNIQUE Text
                      Position_matrix ?Position_Matrix XREF Transcription_factor #Evidence
                      Product_of ?Gene XREF Transcription_factor
                      Remark Text #Evidence
                      Binding_site ?Feature XREF Transcription_factor

Gene Class Modifications
  Transcription_factor ?Transcription_factor XREF Product_of

Feature Class Modifications
  Score Float // this would be a log score as indicated by the analysis used in gff dump
  Transcription_factor UNIQUE ?Transcription_factor XREF Binding_site

Position_Matrix Class Modifications
  Transcription_factor UNIQUE ?Transcription_factor XREF Position_matrix #Evidence //gw3

Strain Class Modifications
< Made_by Text
---
> Made_by ?Person //updated from Text
> Other_name ?Text #Evidence
  Contact ?Person
  Wild_isolate // to identify wild isolates - since genotype is free text
  Isolation GPS UNIQUE Float UNIQUE Float // Latitude +/-DD.DDDDD Longitude +/-DD.DDDDD
            Place UNIQUE ?Text
            Landscape UNIQUE Urban_garden
                             Wild_forest
                             Wild_grassland
                             Agricultural_land
                             Oasis
                             Rural_garden
            Substrate UNIQUE ?Text // e.g. snail, soil, apple, rotting fruit
            Associated_organisms ?Species
            Life_stage ?Life_stage
            Log_size_of_population UNIQUE Float
            Sampled_by Text // the person who found the worms
            Isolated_by ?Person // the person who isolated the worms from the sample

Model Changes:
------------------------------------
?Strain class
Removed: Extended_genotype ?Variation
Comment: Removed this unused tag structure as data can be stored in current tags and the additional information/context can be resolved at the display level.

For more info mail help@wormbase.org
-===================================================================================-

Quick installation guide for UNIX/Linux systems
-----------------------------------------------
1. Create a new directory to contain your copy of WormBase, e.g. /users/yourname/wormbase
2. Unpack and untar all of the database.*.tar.gz files into this directory. You will need approximately 2-3 Gb of disk space.
3. Obtain and install a suitable acedb binary for your system (available from www.acedb.org).
4. Use the acedb 'xace' program to open your database, e.g. type 'xace /users/yourname/wormbase' at the command prompt.
5. See the acedb website for more information about acedb and using xace.

____________ END _____________
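
Steps 1 and 2 of the quick installation guide could be scripted roughly as follows. This is a minimal sketch, not part of the official release tooling: the function name unpack_release and both directory arguments are hypothetical, and it assumes the database.*.tar.gz files have already been downloaded into a single local directory.

```python
import tarfile
from pathlib import Path


def unpack_release(archive_dir, target_dir):
    """Extract every database.*.tar.gz archive in archive_dir into target_dir.

    Returns the archive file names that were unpacked, in sorted order.
    Both paths are supplied by the caller (hypothetical helper, not a
    WormBase or acedb API).
    """
    target = Path(target_dir)
    # Step 1: create the directory that will hold the local copy.
    target.mkdir(parents=True, exist_ok=True)

    # Step 2: unpack and untar each database archive into that directory.
    archives = sorted(Path(archive_dir).glob("database.*.tar.gz"))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
    return [archive.name for archive in archives]
```

For example, unpack_release("downloads", "/users/yourname/wormbase") would leave the extracted database in the directory that step 4 passes to xace. Installing the acedb binary (step 3) is not covered by this sketch.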