WS284

New release of WormBase WS284

WS284 was built by Stavros Diamantakis

-==============================================================================-
-========= FTP site structure =================================================-
-==============================================================================-
The WS284 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES)
species/G_SPECIES DIR     -  contains a sub dir for each NCBI genome sequencing BioProject (BIOPROJECT) for the species, with the following files:
     - G_SPECIES.BIOPROJECT.WS284.genomic.fa.gz                  - Unmasked genomic DNA
     - G_SPECIES.BIOPROJECT.WS284.genomic_masked.fa.gz           - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.BIOPROJECT.WS284.genomic_softmasked.fa.gz       - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.BIOPROJECT.WS284.protein.fa.gz                  - Current live protein set
     - G_SPECIES.BIOPROJECT.WS284.CDS_transcripts.fa.gz          - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.BIOPROJECT.WS284.mRNA_transcripts.fa.gz         - Spliced cDNA sequence for the full-length (including UTRs) mRNA for transcripts
     - G_SPECIES.BIOPROJECT.WS284.ncrna_transcripts.fa.gz        - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.BIOPROJECT.WS284.pseudogenic_transcripts.fa.gz  - Spliced cDNA sequence for pseudogenic transcripts
     - G_SPECIES.BIOPROJECT.WS284.transposon_transcripts.fa.gz   - Spliced cDNA sequence for mRNAs and pseudogenes located in Transposons
     - G_SPECIES.BIOPROJECT.WS284.transposons.fa.gz              - DNA sequence of curated and predicted Transposons
     - G_SPECIES.BIOPROJECT.WS284.transposon_cds.pep.gz          - Protein sequence of curated CDSs associated with Transposons
     - G_SPECIES.BIOPROJECT.WS284.intergenic_sequences.fa.gz     - DNA sequence between pairs of adjacent genes
     - G_SPECIES.BIOPROJECT.WS284.annotations.gff[2|3].gz        - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.BIOPROJECT.WS284.protein_annotation.gff3.gz     - Sequence features in proteins in GFF3 format
     - G_SPECIES.BIOPROJECT.WS284.canonical_geneset.gtf.gz       - Genes, transcripts and CDSs in GTF (GFF2) format
     - G_SPECIES.BIOPROJECT.WS284.ests.fa.gz                     - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.BIOPROJECT.WS284.best_blastp_hits.txt.gz        - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
     - G_SPECIES.BIOPROJECT.WS284.*pep_package.tar.gz            - latest version of the [worm|brig|bren|rema|jap|ppa|brug]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.BIOPROJECT.WS284.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST &/or cDNA
        - G_SPECIES.BIOPROJECT.WS284.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.BIOPROJECT.WS284.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.BIOPROJECT.WS284.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.BIOPROJECT.WS284.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.BIOPROJECT.WS284.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.BIOPROJECT.WS284.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
        - G_SPECIES.BIOPROJECT.WS284.TSS.wig.tar.gz                      - Wiggle plot files of Transcription Start Sites from the papers WBPaper00042246, WBPaper00042529, WBPaper00042354
        - G_SPECIES.BIOPROJECT.WS284.repeats.fa.gz                       - Latest version of the repeat library for the genome, suitable for use with RepeatMasker
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS284.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS284          - the latest database schema (also in above database files)
     - WS284-WS283.dbcomp        - log file reporting difference from last release
     - *Non_C_elegans_BLASTX/     - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
MULTI_SPECIES DIR - miscellaneous files with data for multiple species
     - wormpep_clw.WS284.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO files for the phenotype, GO and anatomy ontologies, and the corresponding gene association files
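
The naming scheme above can be illustrated with a minimal Python sketch. The helper names `release_file_name` and `species_dir` are hypothetical and not part of any WormBase tool. The value `b_malayi` is an assumed form of G_SPECIES (check the FTP site for the exact directory name); PRJNA10729 is the Brugia malayi BioProject named in the assembly list of these notes.

```python
def release_file_name(g_species, bioproject, release, suffix):
    """Join the four dot-separated parts of a per-species release file name."""
    return ".".join([g_species, bioproject, release, suffix])


def species_dir(g_species, bioproject):
    """Relative directory for one species/BioProject pair, as laid out above."""
    return f"species/{g_species}/{bioproject}"


if __name__ == "__main__":
    # "b_malayi" is an assumed G_SPECIES value, used only for illustration.
    name = release_file_name("b_malayi", "PRJNA10729", "WS284", "genomic.fa.gz")
    print(f"{species_dir('b_malayi', 'PRJNA10729')}/{name}")
```

Keeping the release as a parameter means the same helper can be pointed at another release without editing the suffix list.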


Release notes on the web:

http://www.wormbase.org/about/release_schedule

-=====================================================================================-
-=========== C. elegans data summary =================================================-
-=====================================================================================-

Genome version

The version of the C. elegans reference genome included with this release is:
    Version Name: WBcel235
    INSDC accession: GCA_000002985.3
    UCSC name: ce11
This version has been present in WormBase since release WS235.

C. elegans gene data (49176 genes in total)
Protein-coding (19981 genes):
    Curated description          5098 (25.5%)
    Automated description       19842 (99.3%)
    Human disease association    3499 (17.5%)
    Approved Gene name          10613 (53.1%)
    Reference                   11537 (57.7%)
    RNAi results                18338 (91.8%)
    Microarray results          19774 (99.0%)
    Expression patterns         19414 (97.2%)
    Variations                  19977 (100.0%)
    Interaction data            16116 (80.7%)

Non-coding RNA and pseudogene (27670 genes):
    Curated description           219 (0.8%)
    Automated description        7162 (25.9%)
    Human disease association      29 (0.1%)
    Approved Gene name          16526 (59.7%)
    Reference                    5901 (21.3%)
    RNAi results                  861 (3.1%)
    Microarray results           2268 (8.2%)
    Expression patterns           896 (3.2%)
    Variations                  27634 (99.9%)
    Interaction data              943 (3.4%)

Uncloned (1525 genes):
    Curated description           779 (51.1%)
    Automated description         116 (7.6%)
    Human disease association      10 (0.7%)
    Approved Gene name           1525 (100.0%)
    Reference                    1122 (73.6%)
    RNAi results                    0 (0.0%)
    Microarray results              0 (0.0%)
    Expression patterns            18 (1.2%)
    Variations                   1185 (77.7%)
    Interaction data              140 (9.2%)

Wormpep data set:
There are 28556 CDSs, from 19978 protein-coding loci
The 28556 sequences contain 40753491 base pairs in total.

    Modified entries       12
    Deleted entries        15
    New entries            25
    Reappeared entries      0
    Net change            +10

C. elegans Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)

    Confirmed             23524 (82.4%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed    4045 (14.2%)  Some, but not all exon bases are covered by transcript evidence
    Predicted               987 (3.5%)   No coverage by mRNA/EST/RNASeq evidence

C. elegans Operons Stats

    Live Operons        1385
    Genes in Operons    3688

C. elegans GO annotation status
GO_codes - used for assigning evidence
    IBA  Inferred from Biological aspect of Ancestor
    IC   Inferred by Curator
    IDA  Inferred from Direct Assay
    IEA  Inferred from Electronic Annotation
    IEP  Inferred from Expression Pattern
    IGI  Inferred from Genetic Interaction
    IKR  Inferred from Key Residues
    IMP  Inferred from Mutant Phenotype
    IPI  Inferred from Physical Interaction
    IRD  Inferred from Rapid Divergence
    ISM  Inferred from Sequence Model
    ISO  Inferred from Sequence Orthology
    ISS  Inferred from Sequence (or Structural) Similarity
    NAS  Non-traceable Author Statement
    ND   No Biological Data available
    RCA  Inferred from Reviewed Computational Analysis
    TAS  Traceable Author Statement

Number of gene<->GO_term associations  133555

Breakdown by annotation provider:
    WormBase            18751
    UniProt             56189
    GO_Central          29928
    InterPro            21495
    GOC                  2632
    IntAct               2388
    RHEA                 1993
    SynGO                  36
    CACAO                  36
    ParkinsonsUK-UCL       30
    MGI                    29
    BHF-UCL                23
    HGNC-UCL               14
    CAFA                    7
    ARUK-UCL                2
    HGNC                    1
    FlyBase                 1

Breakdown by evidence code:
    IEA                 73131
        Interpro2GO     21496
        Other           51635
    non-IEA             60424
        EXP                 1
        HDA               386
        HEP               319
        IBA             29915
        IC                116
        IDA              7790
        IEP               170
        IGI              4672
        IKR                 5
        IMP             10169
        IPI              4232
        ISM                 9
        ISO                 1
        ISS              1867
        NAS               180
        ND                406
        RCA                13
        TAS               173

Genes Stats:
    Genes with GO_term connections         14379
        Non-IEA-only annotation             1250
        IEA-only annotation                 5024
        Both IEA and non-IEA annotations    8105

GO_term Stats:
    Distinct GO_terms connected to Genes    7046
        Associated by non-IEA only          3870
        Associated by IEA only               803
        Associated by both IEA and non-IEA  2373

-=============================================================================-
-=========== Other core species data summary =================================-
-=============================================================================-

Assembly versions
    Brugia malayi             B_malayi-4.0 (project PRJNA10729, current since WS252)
    Caenorhabditis brenneri   C_brenneri-6.0.1b (project PRJNA20035, current since WS227)
    Caenorhabditis briggsae   CB4 (project PRJNA10731, current since WS254)
    Caenorhabditis japonica   C_japonica-7.0.1 (project PRJNA12591, current since WS227)
    Caenorhabditis remanei    C_remanei-15.0.1 (project PRJNA53967, current since WS185)
    Onchocerca volvulus       O_volvulus_Cameroon_v3 (project PRJEB513, current since WS241)
    Pristionchus pacificus    P_pacificus-El_Paco (project PRJNA12644, current since WS263)
    Strongyloides ratti       S_ratti_ED321_v5_0_4 (project PRJEB125, current since WS247)
    Trichuris muris           T_muris-TMUE3.1 (project PRJEB126, current since WS264)

Approved gene symbols

    Brugia malayi             4163
    Caenorhabditis brenneri   8528
    Caenorhabditis briggsae   8002
    Caenorhabditis japonica   6824
    Caenorhabditis remanei    8492
    Onchocerca volvulus       3213
    Pristionchus pacificus    4341
    Strongyloides ratti        109
    Trichuris muris              0

Gene counts

    Brugia malayi             11687 (10936 coding)
    Caenorhabditis brenneri   33291 (30705 coding)
    Caenorhabditis briggsae   23202 (21024 coding)
    Caenorhabditis japonica   32410 (29935 coding)
    Caenorhabditis remanei    59263 (57627 coding)
    Onchocerca volvulus       12605 (12109 coding)
    Pristionchus pacificus    26343 (26342 coding)
    Strongyloides ratti       12973 (12464 coding)
    Trichuris muris           15754 (14995 coding)

Brugia malayi Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
    Confirmed              8148 (53.6%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed    5704 (37.5%)  Some, but not all exon bases are covered by transcript evidence
    Predicted              1352 (8.9%)   No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis brenneri Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)

    Confirmed              1593 (5.2%)   Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed    5706 (18.6%)  Some, but not all exon bases are covered by transcript evidence
    Predicted             23426 (76.2%)  No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis briggsae Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)

    Confirmed             16526 (68.8%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed    4559 (19.0%)  Some, but not all exon bases are covered by transcript evidence
    Predicted              2923 (12.2%)  No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis japonica Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)

    Confirmed              9390 (26.1%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed   12173 (33.8%)  Some, but not all exon bases are covered by transcript evidence
    Predicted             14413 (40.1%)  No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis remanei Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)

    Confirmed               962 (3.1%)   Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed    5665 (18.0%)  Some, but not all exon bases are covered by transcript evidence
    Predicted             24830 (78.9%)  No coverage by mRNA/EST/RNASeq evidence

Onchocerca volvulus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)

    Confirmed              6186 (50.6%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed    5581 (45.7%)  Some, but not all exon bases are covered by transcript evidence
    Predicted               457 (3.7%)   No coverage by mRNA/EST/RNASeq evidence

Pristionchus pacificus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)

    Confirmed               382 (1.4%)   Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed    4992 (18.9%)  Some, but not all exon bases are covered by transcript evidence
    Predicted             21059 (79.7%)  No coverage by mRNA/EST/RNASeq evidence

Strongyloides ratti Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)

    Confirmed               877 (7.0%)   Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed    2342 (18.8%)  Some, but not all exon bases are covered by transcript evidence
    Predicted              9265 (74.2%)  No coverage by mRNA/EST/RNASeq evidence

Trichuris muris Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)

    Confirmed              6262 (41.8%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
    Partially_confirmed    3901 (26.0%)  Some, but not all exon bases are covered by transcript evidence
    Predicted              4832 (32.2%)  No coverage by mRNA/EST/RNASeq evidence

-==============================================================================-
-=========== News for this release ============================================-
-==============================================================================-

New data sets
     - 90 miRNA gene clusters were added from MirGeneDB 2.1
     - 138 MirGeneDB gene cross-references added to WormBase C. elegans genes
     - 28115 EBI AlphaFold protein cross-references added to WormBase CDSs and Proteins

New/updated reference genomes
Proposed Changes / Forthcoming Data
Model Changes
Model changes for this release are documented here:

http://wiki.wormbase.org/index.php/WS284_Models.wrm

For more information mail help@wormbase.org

-==============================================================================-
-=========== Installation guide ===============================================-
-==============================================================================-

Quick installation guide for UNIX/Linux systems
1. Create a new directory to contain your copy of WormBase,
   e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
   this directory. You will need approximately 50-60 GB of disk space.

3. Obtain and install a suitable acedb binary for your system.

4. Use the acedb 'xace' program to open your database, e.g.
   type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
   using xace.

____________ END _____________
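
Steps 1 and 2 of the quick installation guide can be sketched in Python as follows. This is a minimal sketch, not an official WormBase installer: the function name `unpack_release` is hypothetical, and it assumes the database.*.tar.gz archives have already been downloaded into a local directory. Only unpack archives obtained from a trusted source.

```python
import glob
import os
import tarfile


def unpack_release(download_dir, target_dir):
    """Extract every database.*.tar.gz archive in download_dir into target_dir.

    Returns the sorted list of archive paths that were unpacked.
    """
    # Step 1: create the directory that will hold the local database copy.
    os.makedirs(target_dir, exist_ok=True)

    # Step 2: unpack each database archive into that directory.
    pattern = os.path.join(download_dir, "database.*.tar.gz")
    archives = sorted(glob.glob(pattern))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives
```

For example, `unpack_release("/users/yourname/downloads", "/users/yourname/wormbase")` would prepare the directory that step 4 opens with xace; the downloads path is illustrative only.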