Difference between revisions of "WS222"
Revision as of 10:05, 22 December 2010
Release Letter
New release of WormBase WS222, Wormpep222 and Wormrna222 Mon Dec 20 12:07:08 GMT 2010

WS222 was built by Michael Paulini (michael.paulini@wormbase.org)
-===================================================================================-

The WS222 build directory includes:

genomes DIR - contains a sub dir for each WormBase species with sequence, gff, and agp data
  genomes/b_malayi:        - genome_feature_tables/ sequences/
  genomes/c_brenneri:      - genome_feature_tables/ sequences/
  genomes/c_briggsae:      - genome_feature_tables/ sequences/
  genomes/c_elegans:       - annotation/ genome_feature_tables/ sequences/
  genomes/c_japonica:      - genome_feature_tables/ sequences/
  genomes/c_remanei:       - genome_feature_tables/ sequences/
  genomes/h_bacteriophora: - genome_feature_tables/ sequences/
  genomes/h_contortus:     - genome_feature_tables/ sequences/
  genomes/m_hapla:         - genome_feature_tables/ sequences/
  genomes/m_incognita:     - sequences/
  genomes/p_pacificus:     - genome_feature_tables/ sequences/

*annotation/ - contains additional annotations
  i)    confirmed_genes.WS222.gz - DNA sequences of all genes confirmed by EST &/or cDNA
  ii)   cDNA2orf.WS222.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
  iii)  geneIDs.WS222.gz - list of all current gene identifiers with CGC & molecular names (when known)
  iv)   PCR_product2gene.WS222.gz - Mappings between PCR products and overlapping Genes
  v)    oligo_mapping.gz - V
*genome_feature_tables/ - contains the main .gff files and supplementary .gff data
*sequences/ - contains dna/ protein/ rna/ sub dirs

sequences/protein - WormBase protein set for species + history etc.
  vi)   wormpep222.tar.gz - full Wormpep distribution corresponding to WS222
  vii)  wormrna222.tar.gz - latest WormRNA release containing non-coding RNAs in the genome
  viii) best_blastp_hits_species.WS222.gz - for each C. elegans WormPep protein, lists Best blastp match to human, fly, yeast, C. briggsae, and SwissProt & TrEMBL proteins.
sequences/dna - WormBase dna data genomic sequence (raw, soft_masked masked), agp
  ix)   intergenic_sequences.dna.gz
sequences/rna - WormBase rna gene data.

acedb DIR - Everything needed to generate a local copy of the primary database
  x)    database.WS222.*.tar.gz - compressed acedb database for new release
  xi)   models.wrm.WS222 - the latest database schema (also in above database files)
  xii)  WS222-WS221.dbcomp - log file reporting difference from last release
  *Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species (reduces the size of the main database)

COMPARATIVE_ANALYSIS DIR - compara.tar.bz2 wormpep217_clw.sql.bz2

ONTOLOGY DIR - gene_associations, obo files for (phenotype GO anatomy) and associated association files

Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule

C. elegans Synchronisation with GenBank / EMBL:
------------------------------------
No synchronisation issues

C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.

C. elegans Gene data set (Live C.elegans genes 47376)
------------------------------------------
Molecular_info               45627 (96.3%)
Concise_description           5724 (12.1%)
Reference                    14134 (29.8%)
WormBase_approved_Gene_name  26094 (55.1%)
RNAi_result                  24602 (51.9%)
Microarray_results           22090 (46.6%)
SAGE_transcript              19164 (40.5%)

C. elegans Wormpep data set:
----------------------------
There are 20424 CDS in autoace, 24938 when counting 4514 alternate splice forms.
The 24938 sequences contain 10,954,721 base pairs in total.

Modified entries      61
Deleted entries       34
New entries           82
Reappeared entries     1
Net change           +49

The difference between the total CDSs of this build (24938) and the last build (24890) does not equal the net change (49). Please investigate!

C. elegans Genome sequence composition:
----------------------------
        WS222       WS221       change
----------------------------------------------
a       32367418    32367418    +0
c       17780787    17780787    +0
g       17756985    17756985    +0
t       32367086    32367086    +0
n       0           0           +0
-       0           0           +0
Total   100272276   100272276   +0

Pristionchus pacificus Genome sequence composition:
----------------------------
172773083 total
a 43813958
c 32811034
g 32828589
t 43810996
- 0
n 19508506

Caenorhabditis remanei Genome sequence composition:
----------------------------
145500347 total
a 42927857
c 26293828
g 26276020
t 42923178
- 0
n 7079464

Caenorhabditis japonica Genome sequence composition:
----------------------------
163282347 total
a 39053092
c 25603225
g 25576971
t 39126103
- 0
n 33922956

Caenorhabditis briggsae Genome sequence composition:
----------------------------
108478630 total
a 33004189
c 19675861
g 19707411
t 33049803
- 0
n 3041366

Caenorhabditis brenneri Genome sequence composition:
----------------------------
190487923 total
a 52239259
c 32853644
g 32897666
t 52181360
- 0
n 20315994

Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei      Gene count 32431 (Coding 31471)
japonica     Gene count 27177 (Coding 25870)
briggsae     Gene count 23038 (Coding 21967)
brenneri     Gene count 32259 (Coding 30670)
---------------------------------------------

-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              229 (0.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   4982 (20.6%)  Some, but not all exon bases are covered by transcript evidence
Predicted            19006 (78.5%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name 3069

-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              956 (3.0%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5662 (18.0%)  Some, but not all exon bases are covered by transcript evidence
Predicted            24858 (79.0%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name 5482

-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             1182 (4.6%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   4974 (19.2%)  Some, but not all exon bases are covered by transcript evidence
Predicted            19714 (76.2%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name 4816

-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               52 (0.2%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    856 (3.9%)   Some, but not all exon bases are covered by transcript evidence
Predicted            21083 (95.9%)  No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 21703 (98.7%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name 5527

-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             1512 (4.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5635 (18.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted            23526 (76.7%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name 3106

-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed            11743 (47.1%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  11020 (44.2%)  Some, but not all exon bases are covered by transcript evidence
Predicted             2175 (8.7%)   No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 24749 (99.2%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id 24749 (99.2%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name 24468

C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons      1288 |
| Genes in Operons  3342 |
---------------------------------------------

GO Annotation Stats WS222
--------------------------------------
GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------
Total number of Gene::GO connections: 259012

Genes Stats:
----------------
Genes with GO_term connections  86747
IEA GO_code present             80723
non-IEA GO_code present          6020

Source of the mapping data
Source:
*RNAi (GFF mapping overlaps)    25788
*citace                          2224
*Inherited (motif & phenotype)  14421

GO_terms Stats:
---------------
Total No. GO_terms                     30477
GO_terms connected to Genes             3287
GO annotations connected with IEA       1843
GO annotations connected with non-IEA   1439

Breakdown
IC  - 3
IDA - 356
ISS - 131
IEP - 9
IGI - 116
IMP - 732
IPI - 69
NAS - 1
ND  - 1
RCA - 0
TAS - 20

-===================================================================================-

Useful Stats:
---------
Genes with Sequence and CGC name
WS222 46468 (24468 elegans / 5527 briggsae / 5482 remanei / 4816 japonica / 3106 brenneri / 3069 pristionchus)

-===================================================================================-

New Data:
---------
includes ~21000 mapped C.elegans 3' UTR features from the modENCODE project.
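
The Wormpep discrepancy flagged earlier in this letter, and the C. elegans base composition totals, can be cross-checked with simple arithmetic. The sketch below uses only figures quoted in the letter. It assumes the net change is computed as new + reappeared - deleted entries (this formula is not stated in the letter, but it reproduces the reported +49); the variable names are illustrative.

```python
# Figures copied from the WS222 release letter.
wormpep = {"modified": 61, "deleted": 34, "new": 82, "reappeared": 1}
total_ws222 = 24938  # CDS count including alternate splice forms, this build
total_ws221 = 24890  # same count for the previous build

# Assumption: net change = entries gained minus entries lost.
net_change = wormpep["new"] + wormpep["reappeared"] - wormpep["deleted"]
# Difference actually observed between the two build totals.
observed = total_ws222 - total_ws221

print(f"net change {net_change:+d}, observed difference {observed:+d}")
if net_change != observed:
    print(f"unexplained entries: {net_change - observed}")

# C. elegans base composition: the per-base counts should sum to the total.
composition = {
    "a": 32367418,
    "c": 17780787,
    "g": 17756985,
    "t": 32367086,
    "n": 0,
    "-": 0,
}
reported_total = 100272276
composition_ok = sum(composition.values()) == reported_total
print("composition sums to reported total:", composition_ok)
```

The net change and the observed difference disagree by one entry, which matches the "Please investigate!" note; the base counts are consistent with the reported total.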

Genome sequence updates:
-----------------------

New Fixes:
----------

Known Problems:
---------------
3-prime UTR mapping data for the ACeDB database is provided as a patch:
ftp://ftp.sanger.ac.uk/pub2/wormbase/WS222/acedb/patches/feature_three_prime_UTR.ace.gz

Other Changes:
--------------

Proposed Changes / Forthcoming Data:
-------------------------------------
* Simple additions were made to the Transposon_family and Strain classes
* The Picture class was substantially re-worked
* Small additional changes that should have gone in before.

Additional information can be found here:
http://wiki.wormbase.org/index.php/WS223_Models.wrm

Model Changes:
------------------------------------
?Variation tag removal - Mary Ann Tuli
----------------------
Due to a change in how we process CGH alleles there is no longer any need to calculate the 5' and 3' gap (the gap between the CGH_deleted_probes and Flanking_sequences). I would therefore like to remove FivePrimeGap & ThreePrimeGap from the ?Variation model.

< FivePrimeGap
< ThreePrimeGap

For more info mail help@wormbase.org
-===================================================================================-

Quick installation guide for UNIX/Linux systems
-----------------------------------------------
1. Create a new directory to contain your copy of WormBase, e.g. /users/yourname/wormbase
2. Unpack and untar all of the database.*.tar.gz files into this directory. You will need approximately 2-3 Gb of disk space.
3. Obtain and install a suitable acedb binary for your system (available from www.acedb.org).
4. Use the acedb 'xace' program to open your database, e.g. type 'xace /users/yourname/wormbase' at the command prompt.
5. See the acedb website for more information about acedb and using xace.

____________ END _____________
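
Steps 1 and 2 of the quick installation guide can be scripted. The following is a minimal sketch, assuming the database.*.tar.gz archives have already been downloaded into a local directory; the function name `unpack_release` and both directory arguments are hypothetical and not part of any WormBase tooling.

```python
import glob
import os
import tarfile


def unpack_release(download_dir: str, target_dir: str) -> list[str]:
    """Extract every database.*.tar.gz archive found in download_dir into target_dir.

    Returns the list of archives that were unpacked, in sorted order.
    """
    # Step 1: create the directory that will hold the local copy of WormBase.
    os.makedirs(target_dir, exist_ok=True)

    # Step 2: unpack and untar all of the database archives into it.
    pattern = os.path.join(download_dir, "database.*.tar.gz")
    archives = sorted(glob.glob(pattern))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives
```

For example, `unpack_release(".", "/users/yourname/wormbase")` would prepare the directory that step 4 then opens with `xace /users/yourname/wormbase`. Only unpack archives obtained from a trusted source, since tar extraction writes whatever paths the archive contains.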
Patches
UTRome mappings
- ACeDB patch ftp://ftp.sanger.ac.uk/pub2/wormbase/WS222/acedb/patches/feature_three_prime_UTR.ace.gz
- GFFs and related data are already patched
UniProt XRefs
UniProt XRefs were missing from many Proteins and CDSes. UniProt is investigating why they went missing. In the meantime, the UniProt XRefs from WS221 were used; they are available at ftp://ftp.sanger.ac.uk/pub2/wormbase/WS222/acedb/patches/cds_embl_data.ace.gz
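
Both patches are distributed as gzip-compressed .ace files. The following is a minimal sketch of fetching and decompressing one with the Python standard library; the helper names `fetch_patch` and `decompress_patch` are hypothetical, and the FTP location is the one quoted above, which may no longer be reachable. Loading the resulting .ace file into a local database is done with the acedb tools and is not shown here.

```python
import gzip
import shutil
import urllib.request

# URL as given on this page; availability is an assumption.
PATCH_URL = (
    "ftp://ftp.sanger.ac.uk/pub2/wormbase/WS222/acedb/patches/"
    "cds_embl_data.ace.gz"
)


def fetch_patch(url: str, gz_path: str) -> str:
    """Download the compressed patch to gz_path and return that path."""
    urllib.request.urlretrieve(url, gz_path)
    return gz_path


def decompress_patch(gz_path: str, ace_path: str) -> str:
    """Decompress a .ace.gz patch into a plain .ace file and return its path."""
    with gzip.open(gz_path, "rb") as src, open(ace_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return ace_path
```

A typical call sequence would be `fetch_patch(PATCH_URL, "cds_embl_data.ace.gz")` followed by `decompress_patch("cds_embl_data.ace.gz", "cds_embl_data.ace")`.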
Other new data
WGS data
- 2722 SNPs identified through whole-genome sequencing (WGS) by the Gene Knockout Consortium have been submitted to dbSNP. The ss# identifiers are included in the WormBase records. The dbSNP records will be available in the next dbSNP build (Jan - Mar 2011).