Difference between revisions of "WS226"
From WormBaseWiki
Latest revision as of 11:15, 21 December 2011
New release of WormBase WS226

WS226 was built by gw3
-===================================================================================-

The WS226 build directory includes:

species/ DIR - contains a sub dir for each WormBase species (G_SPECIES) with the following files:
     - G_SPECIES.WS226.genomic.fa.gz - Unmasked genomic DNA
     - G_SPECIES.WS226.genomic_masked.fa.gz - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.WS226.genomic_softmasked.fa.gz - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.WS226.protein.fa.gz - Current live protein set
     - G_SPECIES.WS226.cds_transcripts.fa.gz - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.WS226.ncrna_transcripts.fa.gz - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.WS226.intergenic_sequences.fa.gz - DNA sequence between pairs of adjacent genes
     - G_SPECIES.WS226.annotations.gff[2|3].gz - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.WS226.ests.fa.gz - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.WS226.best_blastp_hits.txt.gz - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
     - G_SPECIES.WS226.*pep_package.tar.gz - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
     - annotation/ - contains additional annotations:
        - G_SPECIES.WS226.confirmed_genes.txt.gz - DNA sequences of all genes confirmed by EST &/or cDNA
        - G_SPECIES.WS226.cDNA2orf.txt.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.WS226.geneIDs.txt.gz - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.WS226.PCR_product2gene.txt.gz - Mappings between PCR products and overlapping Genes
        - G_SPECIES.WS226.*oligo_mapping.txt.gz - Oligo array mapping files
        - G_SPECIES.WS226.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles
        - G_SPECIES.WS226.SRA_gene_expression.tar.gz - Tables of gene expression
          values computed from SRA RNASeq data

acedb DIR - Everything needed to generate a local copy of the primary database
     - database.WS226.*.tar.gz - compressed acedb database for new release
     - models.wrm.WS226 - the latest database schema (also in above database files)
     - WS226-WS225.dbcomp - log file reporting difference from last release
     - *Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species
                                (reduces the size of the main database)

COMPARATIVE_ANALYSIS DIR - comparative analysis files
     - compara.WS226.tar.bz2 - gene-tree and alignment GFF files
     - wormpep_clw.WS226.sql.bz2 - ClustalW protein multiple alignments

ONTOLOGY DIR - gene_associations, obo files for (phenotype GO anatomy) and associated association files

Release notes on the web:
-------------------------
http://www.wormbase.org/wiki/index.php/Release_Schedule

C. elegans Synchronisation with GenBank / EMBL:
------------------------------------
No synchronisation issues

C. elegans Chromosomal Changes:
--------------------
There are no changes to the chromosome sequences in this release.

C. elegans Gene data set (Live C. elegans genes 47395)
------------------------------------------
Molecular_info                45728 (96.5%)
Concise_description            5777 (12.2%)
Reference                     14160 (29.9%)
WormBase_approved_Gene_name   26208 (55.3%)
RNAi_result                   24628 (52%)
Microarray_results            23962 (50.6%)
SAGE_transcript               19144 (40.4%)

C. elegans Wormpep data set:
----------------------------
There are 20439 CDS in autoace, 25171 when counting 4732 alternate splice forms.

The 25171 sequences contain 11,044,670 base pairs in total.

Modified entries       86
Deleted entries        90
New entries           231
Reappeared entries      3

Net change           +144

C. elegans Genome sequence composition:
----------------------------
           WS226       WS225      change
----------------------------------------------
a       32367418    32367418      +0
c       17780787    17780787      +0
g       17756985    17756985      +0
t       32367086    32367086      +0
n              0           0      +0
-              0           0      +0

Total  100272276   100272276      +0

Pristionchus pacificus Genome sequence composition:
----------------------------
172773083 total
a 43813958
c 32811034
g 32828589
t 43810996
- 0
n 19508506

Caenorhabditis remanei Genome sequence composition:
----------------------------
145500347 total
a 42927857
c 26293828
g 26276020
t 42923178
- 0
n 7079464

Caenorhabditis japonica Genome sequence composition:
----------------------------
163282347 total
a 39053092
c 25603225
g 25576971
t 39126103
- 0
n 33922956

Caenorhabditis briggsae Genome sequence composition:
----------------------------
108419768 total
a 32984239
c 19684682
g 19693545
t 33054090
- 0
n 3003212

Caenorhabditis brenneri Genome sequence composition:
----------------------------
190484472 total
a 52238342
c 32852873
g 32896829
t 52180434
- 0
n 20315994

Tier II Gene counts
---------------------------------------------
pristionchus Gene count 24216 (Coding 24216)
remanei      Gene count 32431 (Coding 31471)
japonica     Gene count 27177 (Coding 25870)
briggsae     Gene count 23050 (Coding 21963)
brenneri     Gene count 32259 (Coding 30669)
---------------------------------------------

-------------------------------------------------
Pristionchus pacificus Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              229 (0.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   4982 (20.6%)  Some, but not all exon bases are covered by transcript evidence
Predicted            19006 (78.5%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Pristionchus pacificus entries with WormBase-approved Gene name 3110

-------------------------------------------------
Caenorhabditis remanei Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed              956 (3.0%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5662 (18.0%)  Some, but not all exon bases are covered by transcript evidence
Predicted            24858 (79.0%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis remanei entries with WormBase-approved Gene name 5574

-------------------------------------------------
Caenorhabditis japonica Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             1182 (4.6%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   4974 (19.2%)  Some, but not all exon bases are covered by transcript evidence
Predicted            19714 (76.2%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis japonica entries with WormBase-approved Gene name 4897

-------------------------------------------------
Caenorhabditis briggsae Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed               53 (0.2%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    854 (3.9%)   Some, but not all exon bases are covered by transcript evidence
Predicted            21080 (95.9%)  No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 21683 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis briggsae entries with WormBase-approved Gene name 5619

-------------------------------------------------
Caenorhabditis brenneri Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed             1510 (4.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5638 (18.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted            23524 (76.7%)  No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis brenneri entries with WormBase-approved Gene name 3175

-------------------------------------------------
Caenorhabditis elegans Protein Stats:
-------------------------------------------------
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
-------------------------------------------------
Confirmed            11994 (47.7%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  11171 (44.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted             2006 (8.0%)   No transcriptional evidence at all

Status of entries: Protein Accessions
-------------------------------------
UniProtKB accessions 24665 (98.0%)

Status of entries: Protein_ID's in EMBL
---------------------------------------
Protein_id 25002 (99.3%)

Gene <-> CDS,Transcript,Pseudogene connections
----------------------------------------------
Caenorhabditis elegans entries with WormBase-approved Gene name 24588

C. elegans Operons Stats
---------------------------------------------
Description: These exist as closely spaced gene clusters similar to bacterial operons
---------------------------------------------
| Live Operons      1253 |
| Genes in Operons  3350 |
---------------------------------------------

GO Annotation Stats WS226
--------------------------------------

GO_codes - used for assigning evidence
--------------------------------------
IC  Inferred by Curator
IDA Inferred from Direct Assay
IEA Inferred from Electronic Annotation
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
ISS Inferred from Sequence (or Structural) Similarity
NAS Non-traceable Author Statement
ND  No Biological Data available
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
------------------------------------------------

Total number of Gene::GO connections: 252614

Genes Stats:
----------------
Genes with GO_term connections  87536
IEA GO_code present             81607
non-IEA GO_code present          5925

Source of the mapping data
Source: *RNAi (GFF mapping overlaps)    24596
        *citace                          2326
        *Inherited (motif & phenotype)  15067

GO_terms Stats:
---------------
Total No. GO_terms                       30521
GO_terms connected to Genes               3433
GO annotations connected with IEA         1888
GO annotations connected with non-IEA     1536

Breakdown
IC  - 3
IDA - 405
ISS - 137
IEP - 9
IGI - 127
IMP - 756
IPI - 75
NAS - 2
ND  - 1
RCA - 0
TAS - 20

-===================================================================================-

Useful Stats:
---------
Genes with Sequence and WormBase-approved Gene names
WS226 46963 (24588 elegans / 5619 briggsae / 5574 remanei / 4897 japonica / 3175 brenneri / 3110 pristionchus)

-===================================================================================-

New Data:
---------

New gene model predictions
--------------------------
The set of Aggregate Integrated gene models submitted to the modENCODE DCC by LaDeana Hillier
(Waterston Lab) in March 2010 and released by the modENCODE DCC on 23rd May 2011 has been added to WS226.

New SNP data
------------
WGS SNP data from the Jarriault lab (PMID 20610404).
This dataset consists of nearly 6000 SNPs from 3 elegans strains.
They have been assigned the allele prefix snx-.
Nearly half of these SNPs affect predicted CDSs.

New Caenorhabditis species added to WormBase:
---------------------------------------------
Draft genomes of Caenorhabditis species 7, 9 and 11, sequenced by the Genome Institute at Washington
University, were added to WormBase and will be made available through GBrowse and BLAST/BLAT.
A draft gene set based on ab initio Augustus predictions is provided, and will be replaced in one of
the next WormBase releases by an RNAseq-based gene set.
All data for the three new Caenorhabditis species should be considered DRAFT quality, and have not
been submitted to public nucleotide repositories.

New parasitic nematode Strongyloides ratti:
-------------------------------------------
A draft genome of S.ratti has been made available to the public by the parasite genomics group of the Wellcome Trust Sanger Institute.
It was incorporated into WormBase together with draft EST/cDNA-guided Augustus gene predictions and
will be accessible through GBrowse as well as BLAST/BLAT.
A revised gene set is being prepared by the parasite genomics group and will replace the draft set as
soon as it is available. In addition, a revised genomic assembly is in production.
The data available through WormBase should be considered of DRAFT quality.

Caenorhabditis brenneri assembly sync:
--------------------------------------
The C.brenneri assembly and gene set have been synced with ENA/DDBJ/GenBank, and any changes will be
submitted regularly in line with C.elegans, C.briggsae and C.remanei.
Two contigs not available through the public repositories will be retired in WS227.

Genome sequence updates:
-----------------------
None

New Fixes:
----------
None

Known Problems:
---------------
None

Other Changes:
--------------
None

Proposed Changes / Forthcoming Data:
-------------------------------------
* gene updates for S.ratti, Caenorhabditis species 7, 9 and 11
* Caenorhabditis species 5 genome

?Transgene model clean up
-------------------------
1. Remove Supporting_data Movie
2. Replace the following tags under Reporter_product with one tag called Reporter ?Text
     GFP
     YFP
     CFP
     Venus
     DsRed
     mCherry
     RFP
     LacZ
     Other_reporter ?Text
   * Keep Gene ?Gene as is
3. Remove all sub tags (Gamma_ray, X_ray etc.) under Integrated_by. Reformat this tag as
     Integrated_by UNIQUE ?Text

?Rearrangement
--------------
Addition of Introgression to Type hash.
This will allow me to more fully curate introgressed regions e.g. qqR1(X,CB4856>N2)

?Person
-------
Addition of
  Affiliation Text // used for storing institute affiliations that aren't in their Address

#Address
--------
Fax Text
Make fax not unique

?Homology_group
---------------
Remove the HOPS and RIO tags, as we don't actually have the data in the database, and add some eggNOG specific types.

Homology_group Group_type UNIQUE COG    COG_type #Homology_type
                                        COG_code #COG_codes
                                 eggNOG eggNOG_type #Homology_type
                                        eggNOG_code #COG_codes

Homology_type KOG
              TWOG
              FOG
              LSE
              NOG   //eggNOG - standard cluster
              euNOG //eggNOG - eukaryote cluster
              meNOG //eggNOG - metazoan cluster

?Sequence
---------
Add a Checksum tag to the Sequence class:

?Sequence Checksum MD5 Text   // Checksum of upper-cased sequence
                   CRC64 Text // Checksum of upper-cased sequence

?Molecule
---------
Rearrangement added to the Affects_phenotype_of list of objects

?Molecule Name ?Text
          Affects_phenotype_of Rearrangement ?Rearrangement ?Phenotype #Ev

?Gene
-----
Human_disease_relevance "Text" #Evidence

Used for storing computationally derived OMIM data from publications. This moves the data from its
current storage place in the Concise_description, where it was causing a break in the flow
(readability) of the description.

?Species
--------
Removal of unused tags and the addition of other name storage tags.

?Species Other_name ?Text
         NCBITaxonomyID UNIQUE Int
         Short_name UNIQUE Text // e.g. 'C. elegans'
         G_species UNIQUE Text // e.g. 'c_elegans'
         Properties ?Database_properties // descriptions of sequences and acedb information

New class: Database_properties - the final model has not been decided but this was the initial proposal
-------------------------------------------------------------------------------------------------------
Describes information associated with a Species in the acedb database

?Database_properties Title Text
                     Species UNIQUE ?Species
                     Strain UNIQUE ?Strain
                     Assembly UNIQUE ?Sequence_collection
                     Sequences Chromosome_prefix UNIQUE Text // these are all copied from Species.pm
                               Pep_prefix UNIQUE Text
                               Pepdir_prefix UNIQUE Text
                               CDS_regex UNIQUE Text
                               Seq_name_regex UNIQUE Text
                               CDS_regex_noend UNIQUE Text
                               Wormpep_prefix UNIQUE Text
                               Assembly_type UNIQUE chromosome
                                                    contig
                               Seq_db UNIQUE Text
                               Wormpep_files Text
                               Upload_db_name Text
                               Mitochondrion ?Sequence

New class: Sequence_collection
------------------------------
Holds a collection of sequences and some descriptions

?Sequence_collection Origin Name ?Text // name that the author gave this collection
                            Species UNIQUE ?Species
                            Strain UNIQUE ?Strain
                            Laboratory ?Laboratory
                            Evidence #Evidence
                            Database ?Database ?Database_field ?Accession_number
                     History First_WS_release Int // first WormBase release this assembly was used
                             Latest_WS_release Int // latest release where it was used
                             Supercedes UNIQUE ?Sequence_collection
                             Superceded_by UNIQUE ?Sequence_collection
                     Remark Text
                     Status UNIQUE Live #Evidence
                                   Dead #Evidence
                     Sequences ?Sequence

#Splice_confirmation
--------------------
Added:
  RNASeq ?Analysis Int
  Mass_spec ?Mass_spec_peptide

#Splice_confirmation RNASeq ?Analysis Int
                     Mass_spec ?Mass_spec_peptide

?Feature
--------
Defined_by_analysis ?Analysis Int - to hold the number of observations from an analysis

Model Changes:
------------------------------------
WS226 models

The models file has been re-tagged in line with the new release schedule.

Two changes to the ?Strain class.
Introduction of an Other_name tag to hold old names associated with a species (pre-classification names etc.).
Removal of an XREF to protein.

Added ?Strain/?Species to a minimal set of classes to help the new website architecture
* ?Clone
* ?Condition
* ?Expr_profile
* ?Expression_cluster
* ?Feature_data
* ?Microarray_experiment
* ?PCR_product
* ?SAGE_experiment
* ?Transcription_factor

For more info mail help@wormbase.org
-===================================================================================-

Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase, e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into this directory.
   You will need approximately 2-3 Gb of disk space.

3. Obtain and install a suitable acedb binary for your system (available from www.acedb.org).

4. Use the acedb 'xace' program to open your database, e.g. type
   'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and using xace.

____________ END _____________
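
Steps 1 and 2 of the quick installation guide can be scripted. The Python sketch below is only an illustration and is not part of the WormBase or acedb tooling: the function name unpack_release and both directory arguments are assumptions made for this example, and it presumes the database.*.tar.gz archives have already been downloaded. It does not install acedb or start xace (steps 3 and 4).

```python
import glob
import os
import tarfile


def unpack_release(download_dir, target_dir):
    """Unpack every database.*.tar.gz archive in download_dir into target_dir.

    Both arguments are hypothetical paths chosen by the user; the archive
    name pattern follows the acedb DIR listing in the release notes.
    Returns the sorted list of archives that were unpacked.
    """
    # Step 1: create the directory that will hold the local copy of WormBase.
    os.makedirs(target_dir, exist_ok=True)

    # Step 2: unpack and untar all of the database.*.tar.gz files.
    archives = sorted(glob.glob(os.path.join(download_dir, "database.*.tar.gz")))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives
```

With the archives saved in a download directory of your choice, calling unpack_release(download_dir, '/users/yourname/wormbase') leaves a directory that can then be opened with 'xace /users/yourname/wormbase' as described in step 4.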