WS289

From WormBaseWiki
Revision as of 21:53, 11 May 2023 by Sdiamantakis

New release of WormBase WS288

WS288 was built by Stavros Diamantakis with the input from the WormBase Community and Team. Thank you!

-==============================================================================-
-========= FTP site structure =================================================-
-==============================================================================-

The WS288 build directory includes:

species/ DIR - contains a sub dir for each WormBase species (G_SPECIES)

species/G_SPECIES DIR - contains a sub dir for each NCBI genome sequencing BioProject (BIOPROJECT) for the species, with the following files:

    - G_SPECIES.BIOPROJECT.WS288.genomic.fa.gz                  - Unmasked genomic DNA
    - G_SPECIES.BIOPROJECT.WS288.genomic_masked.fa.gz           - Hard-masked (repeats replaced with Ns) genomic DNA
    - G_SPECIES.BIOPROJECT.WS288.genomic_softmasked.fa.gz       - Soft-masked (repeats lower-cased) genomic DNA
    - G_SPECIES.BIOPROJECT.WS288.protein.fa.gz                  - Current live protein set
    - G_SPECIES.BIOPROJECT.WS288.CDS_transcripts.fa.gz          - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
    - G_SPECIES.BIOPROJECT.WS288.mRNA_transcripts.fa.gz         - Spliced cDNA sequence for the full-length (including UTRs) mRNA for transcripts
    - G_SPECIES.BIOPROJECT.WS288.ncrna_transcripts.fa.gz        - Spliced cDNA sequence for non-coding RNA transcripts
    - G_SPECIES.BIOPROJECT.WS288.pseudogenic_transcripts.fa.gz  - Spliced cDNA sequence for pseudogenic transcripts
    - G_SPECIES.BIOPROJECT.WS288.transposon_transcripts.fa.gz   - Spliced cDNA sequence for mRNAs and pseudogenes located in Transposons
    - G_SPECIES.BIOPROJECT.WS288.transposons.fa.gz              - DNA sequence of curated and predicted Transposons
    - G_SPECIES.BIOPROJECT.WS288.transposon_cds.pep.gz          - Protein sequence of curated CDSs associated with Transposons
    - G_SPECIES.BIOPROJECT.WS288.intergenic_sequences.fa.gz     - DNA sequence between pairs of adjacent genes
    - G_SPECIES.BIOPROJECT.WS288.annotations.gff[2|3].gz        - Sequence features in either GFF2 or GFF3 format
    - G_SPECIES.BIOPROJECT.WS288.protein_annotation.gff3.gz     - Sequence features in proteins in GFF3 format
    - G_SPECIES.BIOPROJECT.WS288.canonical_geneset.gtf.gz       - Genes, transcripts and CDSs in GTF (GFF2) format
    - G_SPECIES.BIOPROJECT.WS288.ests.fa.gz                     - ESTs and mRNA sequences extracted from the public databases
    - G_SPECIES.BIOPROJECT.WS288.best_blastp_hits.txt.gz        - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
    - G_SPECIES.BIOPROJECT.WS288.*pep_package.tar.gz            - latest version of the [worm|brig|bren|rema|jap|ppa|brug]pep package (if updated since last release)
    - annotation/                    - contains additional annotations:
       - G_SPECIES.BIOPROJECT.WS288.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST &/or cDNA
       - G_SPECIES.BIOPROJECT.WS288.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
       - G_SPECIES.BIOPROJECT.WS288.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
       - G_SPECIES.BIOPROJECT.WS288.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
       - G_SPECIES.BIOPROJECT.WS288.*oligo_mapping.txt.gz               - Oligo array mapping files
       - G_SPECIES.BIOPROJECT.WS288.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
       - G_SPECIES.BIOPROJECT.WS288.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
       - G_SPECIES.BIOPROJECT.WS288.TSS.wig.tar.gz                      - Wiggle plot files of Transcription Start Sites from the papers WBPaper00042246, WBPaper00042529, WBPaper00042354
       - G_SPECIES.BIOPROJECT.WS288.repeats.fa.gz                       - Latest version of the repeat library for the genome, suitable for use with RepeatMasker
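
The naming scheme above (G_SPECIES, BIOPROJECT, release and a file suffix joined by dots) can be sketched as a small helper. This is only an illustration: the function name release_filename is hypothetical, and the example values c_elegans and PRJNA13758 are assumptions that do not appear on this page.

```python
def release_filename(g_species: str, bioproject: str, release: str, suffix: str) -> str:
    """Join the four dot-separated parts used by the per-species files."""
    return ".".join([g_species, bioproject, release, suffix])


# Example values are illustrative assumptions, not taken from the release notes.
example = release_filename("c_elegans", "PRJNA13758", "WS288", "genomic.fa.gz")
print(example)
```

The same helper covers the annotation/ files by passing a suffix such as "geneIDs.txt.gz".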

acedb DIR - Everything needed to generate a local copy of the primary database

    - database.WS288.*.tar.gz   - compressed acedb database for new release
    - models.wrm.WS288          - the latest database schema (also in above database files)
    - WS288-WS287.dbcomp        - log file reporting difference from last release
    - *Non_C_elegans_BLASTX/     - This directory contains the blastx data for non-elegans species
                                                   (reduces the size of the main database)

MULTI_SPECIES DIR - miscellaneous files with data for multiple species

    - wormpep_clw.WS288.sql.bz2 - ClustalW protein multiple alignments

ONTOLOGY DIR - gene_associations files and obo files for the phenotype, GO and anatomy ontologies, with their association files


Release notes on the web:


http://www.wormbase.org/about/release_schedule


-=====================================================================================-
-=========== C. elegans data summary =================================================-
-=====================================================================================-

Genome version


The version of the C. elegans reference genome included with this release is:

 Version Name:     WBcel235
 INSDC accession:  GCA_000002985.3
 UCSC name:        ce11

This version has been present in WormBase since release WS235


C. elegans gene data (49177 genes in total)


Protein-coding (19984 genes):

 Curated description         5098    (25.5%)
 Automated description      19874    (99.4%)
 Human disease association   3388    (17.0%)
 Approved Gene name         10769    (53.9%)
 Reference                  11827    (59.2%)
 RNAi results               18344    (91.8%)
 Microarray results         19776    (99.0%)
 Expression patterns        19648    (98.3%)
 Variations                 19980   (100.0%)
 Interaction data           16200    (81.1%)

Non-coding RNA and pseudogene (27668 genes):

 Curated description          219     (0.8%)
 Automated description       8858    (32.0%)
 Human disease association     16     (0.1%)
 Approved Gene name         16526    (59.7%)
 Reference                   5952    (21.5%)
 RNAi results                 861     (3.1%)
 Microarray results          2267     (8.2%)
 Expression patterns          899     (3.2%)
 Variations                 27632    (99.9%)
 Interaction data             941     (3.4%)

Uncloned (1525 genes):

 Curated description          778    (51.0%)
 Automated description        116     (7.6%)
 Human disease association     10     (0.7%)
 Approved Gene name          1524    (99.9%)
 Reference                   1124    (73.7%)
 RNAi results                   0     (0.0%)
 Microarray results             0     (0.0%)
 Expression patterns           18     (1.2%)
 Variations                  1184    (77.6%)
 Interaction data             122     (8.0%)


Wormpep data set:


There are 28564 CDSs, from 19981 protein-coding loci

The 28564 sequences contain 40758087 base pairs in total.

 Modified entries      1
 Deleted entries       5
 New entries           2
 Reappeared entries    1

Net change -2
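
The net change is consistent with counting new and reappeared entries as additions and deleted entries as removals, with modified entries leaving the count unchanged. A quick check under that assumption (the variable names are illustrative):

```python
# Counts taken from the Wormpep summary above.
modified, deleted, new, reappeared = 1, 5, 2, 1

# Assumption: modified entries do not change the number of entries.
net_change = new + reappeared - deleted
print(net_change)  # -2
```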

C. elegans Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed            23532  (82.4%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed   4045  (14.2%)  Some, but not all exon bases are covered by transcript evidence
 Predicted              987   (3.5%)  No coverage by mRNA/EST/RNASeq evidence
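
The percentages above can be reproduced from the counts, which also sum to the 28564 CDSs reported for the Wormpep data set. A minimal sketch (the dictionary layout is an illustrative choice):

```python
# Counts copied from the C. elegans confirmation status above.
counts = {"Confirmed": 23532, "Partially_confirmed": 4045, "Predicted": 987}

total = sum(counts.values())
percentages = {status: round(100 * n / total, 1) for status, n in counts.items()}

for status, n in counts.items():
    print(f"{status:<20}{n:>6}  ({percentages[status]:.1f}%)")
```

Because each value is rounded independently, the three percentages need not add up to exactly 100.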

C. elegans Operons Stats


 Live Operons      1385
 Genes in Operons  3688

C. elegans GO annotation status


GO_codes - used for assigning evidence

 IBA Inferred from Biological aspect of Ancestor
 IC  Inferred by Curator
 IDA Inferred from Direct Assay
 IEA Inferred from Electronic Annotation
 IEP Inferred from Expression Pattern
 IGI Inferred from Genetic Interaction
 IKR Inferred from Key Residues
 IMP Inferred from Mutant Phenotype
 IPI Inferred from Physical Interaction
 IRD Inferred from Rapid Divergence
 ISM Inferred from Sequence Model
 ISO Inferred from Sequence Orthology
 ISS Inferred from Sequence (or Structural) Similarity
 NAS Non-traceable Author Statement
 ND  No Biological Data available
 RCA Inferred from Reviewed Computational Analysis
 TAS Traceable Author Statement

Number of gene<->GO_term associations 126735

 Breakdown by annotation provider:
   WormBase             18749
   UniProt              52222
   GO_Central           28770
   InterPro             20895
   IntAct                2399
   GOC                   2329
   RHEA                  1156
   SynGO                   68
   CACAO                   35
   ParkinsonsUK-UCL        30
   MGI                     29
   BHF-UCL                 23
   HGNC-UCL                14
   CAFA                     7
   DisProt                  5
   ARUK-UCL                 2
   HGNC                     1
   FlyBase                  1
 Breakdown by evidence code:
   IEA     67031
     Interpro2GO 20895
     Other       46136
   non-IEA 59704
     EXP       2
     HDA     386
     HEP     320
     IBA   28749
     IC      117
     IDA    7874
     IEP     169
     IGI    4671
     IKR       5
     IMP   10426
     IPI    4274
     ISM       9
     ISO       1
     ISS    1906
     NAS     180
     ND      429
     RCA      13
     TAS     173
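
The evidence-code breakdown can be cross-checked against the total number of gene<->GO_term associations. The sketch below uses only the figures listed above; the variable names are illustrative.

```python
total_associations = 126735

# IEA annotations, split by source as listed above.
iea = {"Interpro2GO": 20895, "Other": 46136}

# Non-IEA annotations by evidence code.
non_iea = {
    "EXP": 2, "HDA": 386, "HEP": 320, "IBA": 28749, "IC": 117, "IDA": 7874,
    "IEP": 169, "IGI": 4671, "IKR": 5, "IMP": 10426, "IPI": 4274, "ISM": 9,
    "ISO": 1, "ISS": 1906, "NAS": 180, "ND": 429, "RCA": 13, "TAS": 173,
}

iea_total = sum(iea.values())
non_iea_total = sum(non_iea.values())
print(iea_total, non_iea_total, iea_total + non_iea_total)
```

Both subtotals match the IEA and non-IEA lines, and together they account for every association.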

Genes Stats:

 Genes with GO_term connections  14217 
   Non-IEA-only annotation             1380
   IEA-only annotation                 4879
   Both IEA and non-IEA annotations    7958

GO_term Stats:

 Distinct GO_terms connected to Genes   7030
   Associated by non-IEA only               3900
   Associated by IEA only                    835
   Associated by both IEA and non-IEA       2295

-=============================================================================-
-=========== Other core species data summary =================================-
-=============================================================================-

Assembly versions


 Brugia malayi            B_malayi-4.0            (project PRJNA10729, current since WS252)
 Caenorhabditis brenneri  C_brenneri-6.0.1b       (project PRJNA20035, current since WS227)
 Caenorhabditis briggsae  CB4                     (project PRJNA10731, current since WS254)
 Caenorhabditis japonica  C_japonica-7.0.1        (project PRJNA12591, current since WS227)
 Caenorhabditis remanei   C_remanei-15.0.1        (project PRJNA53967, current since WS185)
 Onchocerca volvulus      O_volvulus_Cameroon_v3  (project PRJEB513, current since WS241)
 Pristionchus pacificus   P_pacificus-El_Paco     (project PRJNA12644, current since WS263)
 Strongyloides ratti      S_ratti_ED321_v5_0_4    (project PRJEB125, current since WS247)
 Trichuris muris          T_muris-TMUE3.1         (project PRJEB126, current since WS264)

Approved gene symbols


 Brugia malayi            4163
 Caenorhabditis brenneri  8528
 Caenorhabditis briggsae  8002
 Caenorhabditis japonica  6824
 Caenorhabditis remanei   8492
 Onchocerca volvulus      3213
 Pristionchus pacificus   4341
 Strongyloides ratti       109
 Trichuris muris             0

Gene counts


 Brugia malayi            11687  (10936 coding)
 Caenorhabditis brenneri  33291  (30705 coding)
 Caenorhabditis briggsae  23202  (21024 coding)
 Caenorhabditis japonica  32410  (29935 coding)
 Caenorhabditis remanei   59263  (57627 coding)
 Onchocerca volvulus      12605  (12109 coding)
 Pristionchus pacificus   26343  (26342 coding)
 Strongyloides ratti      12977  (12464 coding)
 Trichuris muris          15755  (14995 coding)

Brugia malayi Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed             8148  (53.6%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed   5704  (37.5%)  Some, but not all exon bases are covered by transcript evidence
 Predicted             1352   (8.9%)  No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis brenneri Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed             1593   (5.2%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed   5706  (18.6%)  Some, but not all exon bases are covered by transcript evidence
 Predicted            23426  (76.2%)  No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis briggsae Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed            16526  (68.8%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed   4559  (19.0%)  Some, but not all exon bases are covered by transcript evidence
 Predicted             2923  (12.2%)  No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis japonica Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed             9390  (26.1%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed  12173  (33.8%)  Some, but not all exon bases are covered by transcript evidence
 Predicted            14413  (40.1%)  No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis remanei Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed              962   (3.1%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed   5665  (18.0%)  Some, but not all exon bases are covered by transcript evidence
 Predicted            24830  (78.9%)  No coverage by mRNA/EST/RNASeq evidence

Onchocerca volvulus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed             6186  (50.6%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed   5581  (45.7%)  Some, but not all exon bases are covered by transcript evidence
 Predicted              457   (3.7%)  No coverage by mRNA/EST/RNASeq evidence

Pristionchus pacificus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed              382   (1.4%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed   4992  (18.9%)  Some, but not all exon bases are covered by transcript evidence
 Predicted            21059  (79.7%)  No coverage by mRNA/EST/RNASeq evidence

Strongyloides ratti Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed              877   (7.0%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed   2342  (18.8%)  Some, but not all exon bases are covered by transcript evidence
 Predicted             9265  (74.2%)  No coverage by mRNA/EST/RNASeq evidence

Trichuris muris Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)


 Confirmed             6262  (41.8%)  Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
 Partially_confirmed   3901  (26.0%)  Some, but not all exon bases are covered by transcript evidence
 Predicted             4832  (32.2%)  No coverage by mRNA/EST/RNASeq evidence


-==============================================================================-
-=========== News for this release ============================================-
-==============================================================================-

New data sets



New/updated reference genomes



Proposed Changes / Forthcoming Data



Model Changes


Model changes for this release are documented here:

http://wiki.wormbase.org/index.php/WS288_Models.wrm

For more information mail help@wormbase.org


____________ END _____________