WS233

Revision as of 14:14, 13 September 2012 by Pdavis
New release of WormBase WS232



WS232 was built by Michael Paulini [michael.paulini@wormbase.org]
-===================================================================================-
The WS232 build directory includes:

species/ DIR - contains a sub dir for each WormBase species (G_SPECIES) with the following files:
  - G_SPECIES.WS232.genomic.fa.gz - Unmasked genomic DNA
  - G_SPECIES.WS232.genomic_masked.fa.gz - Hard-masked (repeats replaced with Ns) genomic DNA
  - G_SPECIES.WS232.genomic_softmasked.fa.gz - Soft-masked (repeats lower-cased) genomic DNA
  - G_SPECIES.WS232.protein.fa.gz - Current live protein set
  - G_SPECIES.WS232.cds_transcripts.fa.gz - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
  - G_SPECIES.WS232.ncrna_transcripts.fa.gz - Spliced cDNA sequence for non-coding RNA transcripts
  - G_SPECIES.WS232.intergenic_sequences.fa.gz - DNA sequence between pairs of adjacent genes
  - G_SPECIES.WS232.annotations.gff[2|3].gz - Sequence features in either GFF2 or GFF3 format
  - G_SPECIES.WS232.ests.fa.gz - ESTs and mRNA sequences extracted from the public databases
  - G_SPECIES.WS232.best_blastp_hits.txt.gz - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
  - G_SPECIES.WS232.pep_package.tar.gz - latest version of the [worm|brig|bren|rema|jap|ppa]pep package (if updated since last release)
  - annotation/ - contains additional annotations:
    - G_SPECIES.WS232.confirmed_genes.txt.gz - DNA sequences of all genes confirmed by EST &/or cDNA
    - G_SPECIES.WS232.cDNA2orf.txt.gz - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
    - G_SPECIES.WS232.geneIDs.txt.gz - list of all current gene identifiers with CGC & molecular names (when known)
    - G_SPECIES.WS232.PCR_product2gene.txt.gz - Mappings between PCR products and overlapping Genes
    - G_SPECIES.WS232.oligo_mapping.txt.gz - Oligo array mapping files
    - G_SPECIES.WS232.knockout_consortium_alleles.xml.gz - Table of Knockout Consortium alleles
    - G_SPECIES.WS232.SRA_gene_expression.tar.gz - Tables of gene expression values computed from SRA RNASeq data

acedb DIR - Everything needed to generate a local copy of the primary database
  - database.WS232.*.tar.gz - compressed acedb database for new release
  - models.wrm.WS232 - the latest database schema (also in above database files)
  - WS232-WS231.dbcomp - log file reporting difference from last release
  - *Non_C_elegans_BLASTX/ - This directory contains the blastx data for non-elegans species (reduces the size of the main database)

COMPARATIVE_ANALYSIS DIR - comparative analysis files
  - compara.WS232.tar.bz2 - gene-tree and alignment GFF files
  - wormpep_clw.WS232.sql.bz2 - ClustalW protein multiple alignments

ONTOLOGY DIR - gene_associations, obo files for (phenotype, GO, anatomy) and associated association files
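As a rough illustration of the naming scheme above, the sketch below expands the G_SPECIES placeholder into full file locations. The base URL follows the pattern of the C. angaria link quoted elsewhere in these notes. The idea that G_SPECIES expands to a directory-style name such as c_angaria, and the names RELEASE_BASE, FILE_SUFFIXES and species_files, are assumptions made for this example only.

```python
# Assumed base location, modelled on the C. angaria URL quoted in these notes.
RELEASE_BASE = "ftp://ftp.wormbase.org/pub/wormbase/releases"

# A subset of the per-species file suffixes from the listing above.
FILE_SUFFIXES = [
    "genomic.fa.gz",
    "genomic_masked.fa.gz",
    "genomic_softmasked.fa.gz",
    "protein.fa.gz",
    "cds_transcripts.fa.gz",
]


def species_files(species: str, release: str = "WS232") -> list[str]:
    """Return hypothetical full locations for one species in one release."""
    directory = f"{RELEASE_BASE}/{release}/species/{species}"
    return [f"{directory}/{species}.{release}.{suffix}" for suffix in FILE_SUFFIXES]


if __name__ == "__main__":
    for location in species_files("c_angaria"):
        print(location)
```

Whether every species directory actually carries every file should be checked against the FTP site itself; the listing only describes what a species directory may contain.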


Release notes on the web:


http://www.wormbase.org/wiki/index.php/Release_Schedule


C. elegans Synchronisation with GenBank / EMBL:


No synchronisation issues


C. elegans Chromosomal Changes:


There are no changes to the chromosome sequences in this release.


C. elegans Gene data set (Live C. elegans genes 47556)


Molecular_info               45911 (96.5%)
Concise_description           5971 (12.6%)
Reference                    17204 (36.2%)
WormBase_approved_Gene_name  26801 (56.4%)
RNAi_result                  24690 (51.9%)
Microarray_results           23990 (50.4%)
SAGE_transcript              19220 (40.4%)


C. elegans Wormpep data set:


There are 25987 CDSs, from 20553 protein-coding genes


The 25987 sequences contain 34451067 base pairs in total.


Modified entries      77
Deleted entries       58
New entries          197
Reappeared entries     7


Net change +146
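The net change appears to be new plus reappeared entries minus deleted entries, with modified entries leaving the count unchanged. That reading is an assumption, but it reproduces the figure above. The function name net_change is made up for this sketch.

```python
def net_change(new: int, reappeared: int, deleted: int) -> int:
    """Entries gained minus entries lost; modified entries are assumed not to count."""
    return new + reappeared - deleted


if __name__ == "__main__":
    # WS232 Wormpep figures from the table above.
    print(f"Net change {net_change(new=197, reappeared=7, deleted=58):+d}")
```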


C. elegans Genome sequence composition:


       WS232       WS231       change
a      32367418    32367418    +0
c      17780787    17780787    +0
g      17756985    17756985    +0
t      32367086    32367086    +0
n      0           0           +0
-      0           0           +0

Total  100272276   100272276   +0
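A small sketch of how the composition table can be sanity-checked and summarised: the per-base counts should add up to the stated total, and the G+C fraction follows directly from them. The names WS232_COUNTS and gc_fraction are illustrative, and taking the fraction over called bases only (a, c, g, t) is a choice made for this example.

```python
# C. elegans WS232 base counts, copied from the table above.
WS232_COUNTS = {
    "a": 32367418,
    "c": 17780787,
    "g": 17756985,
    "t": 32367086,
    "n": 0,
    "-": 0,
}


def gc_fraction(counts: dict[str, int]) -> float:
    """G+C as a fraction of called bases (a, c, g, t); n and - are ignored."""
    called = sum(counts[base] for base in "acgt")
    return (counts["g"] + counts["c"]) / called


if __name__ == "__main__":
    print("Total bases:", sum(WS232_COUNTS.values()))
    print(f"GC content: {100 * gc_fraction(WS232_COUNTS):.2f}%")
```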


Pristionchus pacificus Genome sequence composition:


Total  172773083
a       43813958
c       32811034
g       32828589
t       43810996
-              0
n       19508506


Caenorhabditis remanei Genome sequence composition:


Total  145500347
a       42927857
c       26293828
g       26276020
t       42923178
-              0
n        7079464


Caenorhabditis japonica Genome sequence composition:


Total  166565019
a       46865690
c       30244493
g       30234317
t       46807519
-              0
n       12413000


Caenorhabditis briggsae Genome sequence composition:


Total  108419768
a       32984239
c       19684682
g       19693545
t       33054090
-          62600
n        2940612


Caenorhabditis brenneri Genome sequence composition:


Total  190421492
a       52222485
c       32837458
g       32882838
t       52164077
-              0
n       20314634


Tier II Gene counts


pristionchus  Gene count 24216 (Coding 24216)
remanei       Gene count 32414 (Coding 31445)
japonica      Gene count 29962 (Coding 29962)
briggsae      Gene count 23027 (Coding 21936)
brenneri      Gene count 32331 (Coding 30667)




Pristionchus pacificus Protein Stats:


Status of entries: Confidence level of prediction (based on the amount of transcript evidence)


Confirmed              229  (0.9%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   4982 (20.6%)  Some, but not all exon bases are covered by transcript evidence
Predicted            19006 (78.5%)  No transcriptional evidence at all


Gene <-> CDS,Transcript,Pseudogene connections


Pristionchus pacificus entries with WormBase-approved Gene name 3248




Caenorhabditis remanei Protein Stats:


Status of entries: Confidence level of prediction (based on the amount of transcript evidence)


Confirmed              961  (3.1%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5658 (18.0%)  Some, but not all exon bases are covered by transcript evidence
Predicted            24831 (79.0%)  No transcriptional evidence at all


Gene <-> CDS,Transcript,Pseudogene connections


Caenorhabditis remanei entries with WormBase-approved Gene name 5982




Caenorhabditis japonica Protein Stats:


Status of entries: Confidence level of prediction (based on the amount of transcript evidence)


Confirmed              176  (0.5%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    578  (1.6%)  Some, but not all exon bases are covered by transcript evidence
Predicted            35351 (97.9%)  No transcriptional evidence at all


Gene <-> CDS,Transcript,Pseudogene connections


Caenorhabditis japonica entries with WormBase-approved Gene name 4945




Caenorhabditis briggsae Protein Stats:


Status of entries: Confidence level of prediction (based on the amount of transcript evidence)


Confirmed               54  (0.2%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed    859  (3.9%)  Some, but not all exon bases are covered by transcript evidence
Predicted            21048 (95.8%)  No transcriptional evidence at all


Status of entries: Protein Accessions


UniProtKB accessions 21662 (98.6%)


Gene <-> CDS,Transcript,Pseudogene connections


Caenorhabditis briggsae entries with WormBase-approved Gene name 6064




Caenorhabditis brenneri Protein Stats:


Status of entries: Confidence level of prediction (based on the amount of transcript evidence)


Confirmed             1510  (4.9%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed   5638 (18.4%)  Some, but not all exon bases are covered by transcript evidence
Predicted            23522 (76.7%)  No transcriptional evidence at all


Gene <-> CDS,Transcript,Pseudogene connections


Caenorhabditis brenneri entries with WormBase-approved Gene name 3484




Caenorhabditis elegans Protein Stats:


Status of entries: Confidence level of prediction (based on the amount of transcript evidence)


Confirmed            12440 (47.9%)  Every base of every exon has transcription evidence (mRNA, EST etc.)
Partially_confirmed  11472 (44.1%)  Some, but not all exon bases are covered by transcript evidence
Predicted             2075  (8.0%)  No transcriptional evidence at all
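The percentages in these status tables look like each category's share of the three categories combined. For C. elegans the three counts add up to the 25987 CDSs reported in the Wormpep section. A minimal sketch under that assumption follows; the function name status_percentages is made up for illustration.

```python
def status_percentages(counts: dict[str, int]) -> dict[str, str]:
    """Express each status count as a percentage of all entries, to one decimal place."""
    total = sum(counts.values())
    return {status: f"{100 * n / total:.1f}%" for status, n in counts.items()}


if __name__ == "__main__":
    # C. elegans WS232 counts from the table above.
    elegans = {"Confirmed": 12440, "Partially_confirmed": 11472, "Predicted": 2075}
    for status, share in status_percentages(elegans).items():
        print(f"{status:<20} {elegans[status]:>6} ({share})")
```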


Status of entries: Protein Accessions


UniProtKB accessions 25814 (99.3%)


Status of entries: Protein_ID's in EMBL


Protein_id 25975 (100.0%)


Gene <-> CDS,Transcript,Pseudogene connections


Caenorhabditis elegans entries with WormBase-approved Gene name 25201


C. elegans Operons Stats


Description: These exist as closely spaced gene clusters similar to bacterial operons


| Live Operons 1390 |


| Genes in Operons 3634 |


GO Annotation Stats WS232


GO_codes - used for assigning evidence


IC   Inferred by Curator
IDA  Inferred from Direct Assay
IEA  Inferred from Electronic Annotation
IEP  Inferred from Expression Pattern
IGI  Inferred from Genetic Interaction
IMP  Inferred from Mutant Phenotype
IPI  Inferred from Physical Interaction
ISS  Inferred from Sequence (or Structural) Similarity
NAS  Non-traceable Author Statement
ND   No Biological Data available
RCA  Inferred from Reviewed Computational Analysis
TAS  Traceable Author Statement


Total number of Gene::GO connections: 290838


Genes Stats:


Genes with GO_term connections 96609
IEA GO_code present 91223
non-IEA GO_code present 5382


Source of the mapping data
Source: *RNAi (GFF mapping overlaps) 20107
*citace 2560
*Inherited (motif & phenotype) 15076


GO_terms Stats:


Total No. GO_terms 30606
GO_terms connected to Genes 3572
GO annotations connected with IEA 1833
GO annotations connected with non-IEA 1722
Breakdown:
IC  - 6
IDA - 501
ISS - 148
IEP - 11
IGI - 148
IMP - 803
IPI - 84
NAS - 1
ND  - 1
RCA - 0
TAS - 18


-===================================================================================-


Useful Stats:


Genes with Sequence and WormBase-approved Gene names WS232 48924 (25201 elegans / 6064 briggsae / 5982 remanei / 4945 japonica / 3484 brenneri / 3248 pristionchus)


-===================================================================================-


New Data:


New Transcriptionally Active Region Features


The tiling array data specifying Transcriptionally Active Regions (TARs) from David Miller's lab (http://intermine.modencode.org/query/experiment.do?experiment=Identification+of+tissue+and+stage-specific+transcribed+sequences+with+expression+profile+maps) has been added to the database as Feature_data objects with a GFF source of 'TranscriptionallyActiveRegion'.
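Since the TAR features are identified by their GFF source, they can be picked out of an annotation file by looking at the second tab-separated column, which holds the source in both GFF2 and GFF3. The sketch below is one way to do that; the function name features_with_source is an assumption for this example, and the input is expected to be lines from an already decompressed GFF file.

```python
from typing import Iterable, Iterator

TAR_SOURCE = "TranscriptionallyActiveRegion"


def features_with_source(lines: Iterable[str], source: str = TAR_SOURCE) -> Iterator[list[str]]:
    """Yield the tab-separated fields of every GFF line whose source column matches."""
    for line in lines:
        # Skip blank lines and comment/pragma lines such as "##gff-version".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GFF has eight mandatory columns; the source is the second one.
        if len(fields) >= 8 and fields[1] == source:
            yield fields


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py annotations.gff2
    with open(sys.argv[1]) as handle:
        for fields in features_with_source(handle):
            print(fields[0], fields[3], fields[4])
```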


Genome sequence updates:


C. angaria update


The C. angaria assembly has been updated to the latest genome draft available from the Schwarz lab at Caltech. It also includes a new set of gene predictions provided by Caltech, and can be downloaded from ftp://ftp.wormbase.org/pub/wormbase/releases/WS232/species/c_angaria


New Fixes:


Known Problems:


Other Changes:


C. elegans genetic map frozen


The genetic map of C. elegans has changed very little in the past three years in terms of recombinational map distances and marker gene locations. It is therefore being frozen, from WS232 onward. In future, new genetic loci, deficiencies and duplications will continue to be added to the genetic map, but these will simply be interpolated into the existing map.


Renaming of naturally occurring variation data


From WS232 on, all naturally occurring variation data objects will be identified by their WBVariationID. Two datasets are currently named in accordance with this policy in WS232:


1) PMID 21849976 McGrath PT et al. (2011) Nature "Parallel evolution of domesticated Caenorhabditis species targets pheromone ...."


2) Cutter/Stein - four distinct C. elegans strains vs. N2.


Proposed Changes / Forthcoming Data:


Renaming of all naturally occurring variations already held in WormBase will be complete in WS233.


Model Changes:


1.) ?Interaction and #Interactor_info


#Interactor_info
  Remove: Antibody Remark
  Rename: Antibody_info -> Antibody

?Interaction
  Add: Antibody_remark ?Text


More information and a human-readable diff can be found here: http://wiki.wormbase.org/index.php/WS232_Models.wrm


For more info mail help@wormbase.org
-===================================================================================-


Quick installation guide for UNIX/Linux systems


    1. Create a new directory to contain your copy of WormBase, e.g. /users/yourname/wormbase

    2. Unpack and untar all of the database.*.tar.gz files into this directory. You will need approximately 2-3 GB of disk space.

    3. Obtain and install a suitable acedb binary for your system (available from www.acedb.org).

    4. Use the acedb 'xace' program to open your database, e.g. type 'xace /users/yourname/wormbase' at the command prompt.

    5. See the acedb website for more information about acedb and using xace.
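The directory creation and unpacking steps above could be scripted along the following lines. This is only a sketch: it assumes the database.*.tar.gz files have already been downloaded into a local directory, and the function name unpack_database and its arguments are made up for this example. Installing acedb and launching xace remain manual steps.

```python
import glob
import os
import tarfile


def unpack_database(archive_dir: str, target_dir: str) -> list[str]:
    """Extract every database.*.tar.gz found in archive_dir into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    archives = sorted(glob.glob(os.path.join(archive_dir, "database.*.tar.gz")))
    # Use the safer "data" extraction filter where this Python version provides it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir, **extra)
    return archives


if __name__ == "__main__":
    # Example paths only; adjust to wherever the release files were saved.
    done = unpack_database("/users/yourname/downloads", "/users/yourname/wormbase")
    print(f"Unpacked {len(done)} archive(s)")
```

Once the archives are unpacked, the database can be opened with xace as described in the steps above.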



__________END___________