WS229

Revision as of 16:52, 6 December 2011 by Mh6
New release of WormBase WS229

WS229 was built by Mary Ann Tuli
====================
The WS229 build directory includes:

species/ DIR - contains a sub dir for each WormBase species (G_SPECIES) with the following files:
  - G_SPECIES.WS229.genomic.fa.gz - Unmasked genomic DNA
  - G_SPECIES.WS229.genomic_masked.fa.gz - Hard-masked (repeats replaced with Ns) genomic DNA
  - G_SPECIES.WS229.genomic_softmasked.fa.gz - Soft-masked (repeats lower-cased) genomic DNA
  - G_SPECIES.WS229.protein.fa.gz - Current live protein set
  - G_SPECIES.WS229.cds_transcripts.fa.gz - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
  - G_SPECIES.WS229.ncrna_transcripts.fa.gz - Spliced cDNA sequence for non-coding RNA transcripts
  - G_SPECIES.WS229.intergenic_sequences.fa.gz - DNA sequence between pairs of adjacent genes
  - G_SPECIES.WS229.annotations.gff[2|3].gz - Sequence features in either GFF2 or GFF3 format
  - G_SPECIES.WS229.ests.fa.gz - ESTs and mRNA sequences extracted from the public databases
  - G_SPECIES.WS229.best_blastp_hits.txt.gz - Best blastp matches to human, fly, yeast, and non-WormBase UniProt proteins
  - G_SPECIES.WS229.*pep_package.tar.gz - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
  - annotation/ - contains additional annotations:
    - G_SPECIES.WS229.confirmed_genes.txt.gz - DNA sequences of all genes confirmed by EST &/or cDNA
    - G_SPECIES.WS229.cDNA2orf.txt.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
    - G_SPECIES.WS229.geneIDs.txt.gz - list of all current gene identifiers with CGC & molecular names (when known)
    - G_SPECIES.WS229.PCR_product2gene.txt.gz - Mappings between PCR products and overlapping Genes
    - G_SPECIES.WS229.*oligo_mapping.txt.gz - Oligo array mapping files
    - G_SPECIES.WS229.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles
    - G_SPECIES.WS229.SRA_gene_expression.tar.gz - Tables of gene expression values computed from SRA RNASeq data

acedb DIR - Everything needed to generate a local copy of the primary database
  - database.WS229.*.tar.gz - compressed acedb database for new release
  - models.wrm.WS229 - the latest database schema (also in above database files)
  - WS229-WS228.dbcomp - log file reporting difference from last release
  - *Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species (reduces the size of the main database)

COMPARATIVE_ANALYSIS DIR - comparative analysis files
  - compara.WS229.tar.bz2 - gene-tree and alignment GFF files
  - wormpep_clw.WS229.sql.bz2 - ClustalW protein multiple alignments

ONTOLOGY DIR - gene_associations and obo files for the phenotype, GO and anatomy ontologies, plus their association files

Release notes on the web:
http://www.wormbase.org/wiki/index.php/Release_Schedule

C. elegans Synchronisation with GenBank / EMBL:
No synchronisation issues

C. elegans Chromosomal Changes:
There are no changes to the chromosome sequences in this release.

C. elegans Gene data set (Live C. elegans genes 47512)
Molecular_info                45859 (96.5%)
Concise_description            5901 (12.4%)
Reference                     14295 (30.1%)
WormBase_approved_Gene_name   26474 (55.7%)
RNAi_result                   24684 (52%)
Microarray_results            23986 (50.5%)
SAGE_transcript               19199 (40.4%)

C. elegans Wormpep data set:
There are 25547 CDSs, from 20514 protein-coding genes
The 25547 sequences contain base pairs in total.

Modified entries      115
Deleted entries        78
New entries           234
Reappeared entries      6
Net change           +162

C. elegans Genome sequence composition:
        WS229        WS228        change
a       32367418     32367418     +0
c       17780787     17780787     +0
g       17756985     17756985     +0
t       32367086     32367086     +0
n       0            0            +0
-       0            0            +0
Total   100272276    100272276    +0

Pristionchus pacificus Genome sequence composition:
172773083 total
a 43813958
c 32811034
g 32828589
t 43810996
- 0
n 19508506

Caenorhabditis remanei Genome sequence composition:
145500347 total
a 42927857
c 26293828
g 26276020
t 42923178
- 0
n 7079464

Caenorhabditis japonica Genome sequence composition:
166565019 total
a 46865690
c 30244493
g 30234317
t 46807519
- 0
n 12413000

Caenorhabditis briggsae Genome sequence composition:
108419768 total
a 32984239
c 19684682
g 19693545
t 33054090
- 0
n 3003212

Caenorhabditis brenneri Genome sequence composition:
190421492 total
a 52222485
c 32837458
g 32882838
t 52164077
- 0
n 20314634

Tier II Gene counts
pristionchus   Gene count 24216 (Coding 24216)
remanei        Gene count 32431 (Coding 31471)
japonica       Gene count 29962 (Coding 29962)
briggsae       Gene count 23048 (Coding 21962)
brenneri       Gene count 32257 (Coding 30667)
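The composition tables above list raw base counts only. As a minimal sketch, the counts can be checked against the stated totals and turned into a GC fraction. The function names are hypothetical, and excluding n from the GC denominator is an assumption made for illustration, not something the release notes specify.

```python
# Base counts copied from the composition tables above.
COMPOSITION = {
    "C. elegans": {"a": 32367418, "c": 17780787, "g": 17756985,
                   "t": 32367086, "n": 0},
    "C. briggsae": {"a": 32984239, "c": 19684682, "g": 19693545,
                    "t": 33054090, "n": 3003212},
}


def total_bases(counts):
    """Sum of all listed base counts, including unknown bases (n)."""
    return sum(counts.values())


def gc_fraction(counts):
    """GC fraction over called bases only (n is excluded by assumption)."""
    called = counts["a"] + counts["c"] + counts["g"] + counts["t"]
    return (counts["g"] + counts["c"]) / called


if __name__ == "__main__":
    for species, counts in COMPOSITION.items():
        print(species, total_bases(counts), round(gc_fraction(counts), 4))
```

For the two species shown, the summed counts reproduce the totals given in the tables (100272276 and 108419768).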

Pristionchus pacificus Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed              229 (0.9%)    Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed   4982 (20.6%)   Some, but not all exon bases are covered by transcript evidence
  Predicted            19006 (78.5%)   No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
Pristionchus pacificus entries with WormBase-approved Gene name 3202

Caenorhabditis remanei Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed              956 (3.0%)    Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed   5662 (18.0%)   Some, but not all exon bases are covered by transcript evidence
  Predicted            24858 (79.0%)   No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
Caenorhabditis remanei entries with WormBase-approved Gene name 5741

Caenorhabditis japonica Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed              176 (0.5%)    Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed    578 (1.6%)    Some, but not all exon bases are covered by transcript evidence
  Predicted            35351 (97.9%)   No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
Caenorhabditis japonica entries with WormBase-approved Gene name 4735

Caenorhabditis briggsae Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed               53 (0.2%)    Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed    853 (3.9%)    Some, but not all exon bases are covered by transcript evidence
  Predicted            21080 (95.9%)   No transcriptional evidence at all

Status of entries: Protein Accessions
  UniProtKB accessions 21682 (98.6%)

Gene <-> CDS,Transcript,Pseudogene connections
Caenorhabditis briggsae entries with WormBase-approved Gene name 5801

Caenorhabditis brenneri Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed             1510 (4.9%)    Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed   5638 (18.4%)   Some, but not all exon bases are covered by transcript evidence
  Predicted            23522 (76.7%)   No transcriptional evidence at all

Gene <-> CDS,Transcript,Pseudogene connections
Caenorhabditis brenneri entries with WormBase-approved Gene name 3303

Caenorhabditis elegans Protein Stats:
Status of entries: Confidence level of prediction (based on the amount of transcript evidence)
  Confirmed            12227 (47.9%)   Every base of every exon has transcription evidence (mRNA, EST etc.)
  Partially_confirmed  11272 (44.1%)   Some, but not all exon bases are covered by transcript evidence
  Predicted             2048 (8.0%)    No transcriptional evidence at all

Status of entries: Protein Accessions
  UniProtKB accessions 24534 (96.0%)

Status of entries: Protein_ID's in EMBL
  Protein_id 25296 (99.0%)

Gene <-> CDS,Transcript,Pseudogene connections
Caenorhabditis elegans entries with WormBase-approved Gene name 24866

C. elegans Operons Stats
Description: These exist as closely spaced gene clusters similar to bacterial operons

Live Operons       1390
Genes in Operons   3634
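The Protein Stats sections above sort each entry into one of three confidence levels by how much of its exon sequence is covered by transcript evidence. The sketch below restates those three definitions as a function and recomputes the C. elegans percentages from the listed counts. The function names and the per-entry inputs (exon base counts) are hypothetical; the actual WormBase build pipeline is not described in these notes.

```python
def confidence_level(exon_bases_total, exon_bases_with_evidence):
    """Classify one prediction using the definitions in the Protein Stats tables.

    Both arguments are hypothetical per-entry inputs: the number of exon
    bases, and how many of them are covered by transcript evidence.
    """
    if exon_bases_total <= 0:
        raise ValueError("an entry must have at least one exon base")
    if not 0 <= exon_bases_with_evidence <= exon_bases_total:
        raise ValueError("covered bases must lie between 0 and the total")
    if exon_bases_with_evidence == exon_bases_total:
        return "Confirmed"            # every base of every exon is covered
    if exon_bases_with_evidence > 0:
        return "Partially_confirmed"  # some, but not all, bases are covered
    return "Predicted"                # no transcript evidence at all


def status_percentages(status_counts):
    """Percentage of entries per status, rounded to one decimal place."""
    total = sum(status_counts.values())
    return {status: round(100.0 * n / total, 1)
            for status, n in status_counts.items()}


if __name__ == "__main__":
    # C. elegans counts taken from the table above.
    elegans = {"Confirmed": 12227, "Partially_confirmed": 11272,
               "Predicted": 2048}
    print(status_percentages(elegans))
```

The three C. elegans counts sum to 25547, which matches the number of CDSs reported in the Wormpep data set.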
GO Annotation Stats WS229
GO_codes - used for assigning evidence
IC    Inferred by Curator
IDA   Inferred from Direct Assay
IEA   Inferred from Electronic Annotation
IEP   Inferred from Expression Pattern
IGI   Inferred from Genetic Interaction
IMP   Inferred from Mutant Phenotype
IPI   Inferred from Physical Interaction
ISS   Inferred from Sequence (or Structural) Similarity
NAS   Non-traceable Author Statement
ND    No Biological Data available
RCA   Inferred from Reviewed Computational Analysis
TAS   Traceable Author Statement

Total number of Gene::GO connections: 294649

Genes Stats:
Genes with GO_term connections   95546
  IEA GO_code present            89484
  non-IEA GO_code present         6058

Source of the mapping data
Source:
  *RNAi (GFF mapping overlaps)     25751
  *citace                           2456
  *Inherited (motif & phenotype)   15080

GO_terms Stats:
Total No. GO_terms                      30571
GO_terms connected to Genes              3438
GO annotations connected with IEA        1786
GO annotations connected with non-IEA    1640
  Breakdown
    IC  - 4
    IDA - 445
    ISS - 150
    IEP - 11
    IGI - 137
    IMP - 792
    IPI - 79
    NAS - 2
    ND  - 1
    RCA - 0
    TAS - 18

-===================================================================================-
Useful Stats:
Genes with Sequence and WormBase-approved Gene names
WS229 47648 (24866 elegans / 5801 briggsae / 5741 remanei / 4735 japonica / 3303 brenneri / 3202 pristionchus)

-===================================================================================-
New Data:
1) DNase I sites

7095 Features marking DNase I hypersensitive sites have been added to the database. These have a Method of "DNAseI_hypersensitive_site". They are from the paper:

  Genome-scale identification of Caenorhabditis elegans regulatory elements by tiling-array mapping of DNase I hypersensitive sites
  Shi B, Guo X, Wu T, Sheng S, Wang J, Skogerbo G, Zhu X, Chen R
  BMC Genomics 2009, 10:92 doi:10.1186/1471-2164-10-92 PMID 19243610
  http://www.biomedcentral.com/1471-2164/10/92

The hypersensitive sites found fall into three categories: those cut only by 240 U/ml, those cut only by 480 U/ml, and those cut by both.

2) Ascaris suum

A draft genome including a preliminary gene set, based on "Ascaris suum draft genome", Jex et al., doi:10.1038/nature10553, has been added. GFF3 and fasta files of the protein set and genome can be downloaded from ftp.wormbase.org, and the data can also be accessed through GBrowse and the blast server. Orthologous C. elegans genes are also annotated.

3) Heterorhabditis bacteriophora

An H. bacteriophora genome based on a draft assembly from WashU St. Louis has been added. It is available as fasta and GFF3 for download and is viewable through GBrowse and blast. A gene set is expected from the H. bacteriophora curation group for a future release.

4) Bursaphelenchus xylophilus

A draft genome of B. xylophilus based on "Genomic Insights into the Origin of Parasitism in the Emerging Plant Pathogen Bursaphelenchus xylophilus", Kikuchi et al., doi:10.1371/journal.ppat.1002219, has been added. It includes a gene set provided by the authors and is available as fasta and GFF3 files for download from ftp.wormbase.org, in addition to access through GBrowse and blast.

Other Changes:
1) Genome Sequence error sites

A set of errors in the reference sequence has been identified by comparing the N2 genome to the sister strain LSJ2 (which diverged from N2 before N2 was split up into lab-specific strains):

  Nature. 2011 Aug 17;477(7364):321-5. doi: 10.1038/nature10378.
  Parallel evolution of domesticated Caenorhabditis species targets pheromone receptor genes.
  McGrath PT, Xu Y, Ailion M, Garrison JL, Butcher RA, Bargmann CI.

As a result, 882 Features with the Method "Genome_sequence_error" have been created to mark the positions of the insertion or deletion errors. The reference sequence itself has not yet been changed, because changes to it cause difficulties for people who require a stable genome sequence coordinate system. There are 39 CDS models that will be changed as a consequence of these sequence errors, and 15 Pseudogenes that will be re-examined because of them.

2) GI numbers have been updated to use the Nov 2011 GI protein IDs.

Model Changes:
Additions:

1) ?Strain - new tags:
     Sample_history Text
     Date_first_frozen UNIQUE DateType

2) ?Gene/?CDS/?Pseudogene/?Transcript - new tag:
     RNASeq_FPKM ?Life_stage Float #Evidence

   Notes: FPKM is the cufflinks measure of RNASeq transcript abundance, "Fragments Per Kilobase of exon per Million fragments mapped". It is like RPKM, but corrects for the use of paired reads.

Modifications:

3) ?Sequence
     Checksum MD5 Text //checksums should only be created for an upper-cased sequence.
              CRC64 Text //checksums should only be created for an upper-cased sequence.
   becomes:
     Checksum MD5 UNIQUE Text //checksums should only be created for an upper-cased sequence.
              CRC64 UNIQUE Text //checksums should only be created for an upper-cased sequence.

4) ?Transcript - Cosmetic change to allow ACeDB code to dump the data stored in this tag correctly.
     Brief_identification UNIQUE Text // [020306 kj]
   becomes:
     Brief_identification UNIQUE ?Text

For more info mail worm@sanger.ac.uk
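The RNASeq_FPKM tag in the model changes above stores values in "Fragments Per Kilobase of exon per Million fragments mapped". The sketch below shows only the arithmetic implied by that name. It is not the cufflinks estimation procedure, which fits a statistical model, and the function and parameter names are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped.

    fragments              -- fragments assigned to this transcript
    exon_length_bp         -- summed exon length of the transcript, in bp
    total_mapped_fragments -- all mapped fragments in the library
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


if __name__ == "__main__":
    # 500 fragments on a 2 kb transcript in a library of 20 million fragments.
    print(fpkm(500, 2_000, 20_000_000))
```

With paired-end data a read pair counts as one fragment, which is the correction relative to RPKM mentioned in the notes.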
=======================
Quick installation guide for UNIX/Linux systems
1. Create a new directory to contain your copy of WormBase, e.g. /users/yourname/wormbase
2. Unpack and untar all of the database.*.tar.gz files into this directory. You will need approximately 2-3 Gb of disk space.
3. Obtain and install a suitable acedb binary for your system (available from www.acedb.org).
4. Use the acedb 'xace' program to open your database, e.g. type 'xace /users/yourname/wormbase' at the command prompt.
5. See the acedb website for more information about acedb and using xace.
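Steps 1 and 2 of the installation guide can be scripted. The following is a minimal sketch using only the Python standard library, assuming the database.*.tar.gz files have already been downloaded into a local directory. The function name and both directory arguments are hypothetical, and the sketch does not install acedb or launch xace.

```python
import glob
import os
import tarfile


def unpack_release(download_dir, target_dir):
    """Create target_dir and extract every database.*.tar.gz into it.

    Returns the list of archives that were unpacked, in sorted order.
    """
    os.makedirs(target_dir, exist_ok=True)
    pattern = os.path.join(download_dir, "database.*.tar.gz")
    archives = sorted(glob.glob(pattern))
    for archive in archives:
        # Only unpack archives obtained from a source you trust.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives


if __name__ == "__main__":
    # Both paths are placeholders; adjust them to your own system.
    unpacked = unpack_release("/users/yourname/downloads",
                              "/users/yourname/wormbase")
    print("unpacked", len(unpacked), "archives")
```

Once the archives are unpacked, the database can be opened as in step 4, e.g. 'xace /users/yourname/wormbase'.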