WS289

New release of WormBase WS288

WS288 was built by Stavros Diamantakis with input from the WormBase Community and Team. Thank you!

-==============================================================================-
-========= FTP site structure =================================================-
-==============================================================================-
The WS288 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES)
species/G_SPECIES DIR     -  contains a sub dir for each NCBI genome sequencing BioProject (BIOPROJECT) for the species, with the following files:
     - G_SPECIES.BIOPROJECT.WS288.genomic.fa.gz                  - Unmasked genomic DNA
     - G_SPECIES.BIOPROJECT.WS288.genomic_masked.fa.gz           - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.BIOPROJECT.WS288.genomic_softmasked.fa.gz       - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.BIOPROJECT.WS288.protein.fa.gz                  - Current live protein set
     - G_SPECIES.BIOPROJECT.WS288.CDS_transcripts.fa.gz          - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.BIOPROJECT.WS288.mRNA_transcripts.fa.gz         - Spliced cDNA sequence for the full-length (including UTRs) mRNA for transcripts
     - G_SPECIES.BIOPROJECT.WS288.ncrna_transcripts.fa.gz        - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.BIOPROJECT.WS288.pseudogenic_transcripts.fa.gz  - Spliced cDNA sequence for pseudogenic transcripts
     - G_SPECIES.BIOPROJECT.WS288.transposon_transcripts.fa.gz   - Spliced cDNA sequence for mRNAs and pseudogenes located in Transposons
     - G_SPECIES.BIOPROJECT.WS288.transposons.fa.gz              - DNA sequence of curated and predicted Transposons
     - G_SPECIES.BIOPROJECT.WS288.transposon_cds.pep.gz          - Protein sequence of curated CDSs associated with Transposons
     - G_SPECIES.BIOPROJECT.WS288.intergenic_sequences.fa.gz     - DNA sequence between pairs of adjacent genes
     - G_SPECIES.BIOPROJECT.WS288.annotations.gff[2|3].gz        - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.BIOPROJECT.WS288.protein_annotation.gff3.gz     - Sequence features in proteins in GFF3 format
     - G_SPECIES.BIOPROJECT.WS288.canonical_geneset.gtf.gz       - Genes, transcripts and CDSs in GTF (GFF2) format
     - G_SPECIES.BIOPROJECT.WS288.ests.fa.gz                     - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.BIOPROJECT.WS288.best_blastp_hits.txt.gz        - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
     - G_SPECIES.BIOPROJECT.WS288.*pep_package.tar.gz            - latest version of the [worm|brig|bren|rema|jap|ppa|brug]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.BIOPROJECT.WS288.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST &/or cDNA
        - G_SPECIES.BIOPROJECT.WS288.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.BIOPROJECT.WS288.geneIDs.txt.gz                      - List of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.BIOPROJECT.WS288.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.BIOPROJECT.WS288.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.BIOPROJECT.WS288.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.BIOPROJECT.WS288.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
        - G_SPECIES.BIOPROJECT.WS288.TSS.wig.tar.gz                      - Wiggle plot files of Transcription Start Sites from the papers WBPaper00042246, WBPaper00042529, WBPaper00042354
        - G_SPECIES.BIOPROJECT.WS288.repeats.fa.gz                       - Latest version of the repeat library for the genome, suitable for use with RepeatMasker
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS288.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS288          - the latest database schema (also in above database files)
     - WS288-WS287.dbcomp        - log file reporting difference from last release
     - *Non_C_elegans_BLASTX/     - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
MULTI_SPECIES DIR - miscellaneous files with data for multiple species
     - wormpep_clw.WS288.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - gene_associations, OBO files (phenotype, GO, anatomy) and their corresponding association files
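
The per-species naming scheme above can be expanded mechanically. The following is a minimal sketch, assuming that each BioProject sub-directory sits directly under its species directory, as the listing describes. The helper name release_file_path and the species token b_malayi are illustrative assumptions, not part of the release; files under annotation/ are not covered.

```python
def release_file_path(g_species: str, bioproject: str, release: str, suffix: str) -> str:
    """Build the relative path of a per-species release file.

    Follows the pattern G_SPECIES.BIOPROJECT.RELEASE.SUFFIX inside
    species/G_SPECIES/BIOPROJECT/ (the directory nesting is assumed
    from the listing, not verified against the FTP site).
    """
    filename = f"{g_species}.{bioproject}.{release}.{suffix}"
    return f"species/{g_species}/{bioproject}/{filename}"


# Hypothetical example: 'b_malayi' is an assumed G_SPECIES token;
# PRJNA10729 is the Brugia malayi BioProject listed in these notes.
print(release_file_path("b_malayi", "PRJNA10729", "WS288", "genomic.fa.gz"))
```

The same call with a different suffix (for example protein.fa.gz) yields the other file names in the list.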


Release notes on the web:
-------------------------
http://www.wormbase.org/about/release_schedule


-=====================================================================================-
-=========== C. elegans data summary =================================================-
-=====================================================================================-

Genome version
--------------

The version of the C. elegans reference genome included with this release is:

Version Name: WBcel235
INSDC accession: GCA_000002985.3
UCSC name: ce11

This version has been present in WormBase since release WS235


C. elegans gene data (49177 genes in total)
----------------------------------------------

Protein-coding (19984 genes):
  Curated description         5098    (25.5%)
  Automated description      19874    (99.4%)
  Human disease association   3388    (17.0%)
  Approved Gene name         10769    (53.9%)
  Reference                  11827    (59.2%)
  RNAi results               18344    (91.8%)
  Microarray results         19776    (99.0%)
  Expression patterns        19648    (98.3%)
  Variations                 19980   (100.0%)
  Interaction data           16200    (81.1%)

Non-coding RNA and pseudogene (27668 genes):
  Curated description          219     (0.8%)
  Automated description       8858    (32.0%)
  Human disease association     16     (0.1%)
  Approved Gene name         16526    (59.7%)
  Reference                   5952    (21.5%)
  RNAi results                 861     (3.1%)
  Microarray results          2267     (8.2%)
  Expression patterns          899     (3.2%)
  Variations                 27632    (99.9%)
  Interaction data             941     (3.4%)

Uncloned (1525 genes):
  Curated description          778    (51.0%)
  Automated description        116     (7.6%)
  Human disease association     10     (0.7%)
  Approved Gene name          1524    (99.9%)
  Reference                   1124    (73.7%)
  RNAi results                   0     (0.0%)
  Microarray results             0     (0.0%)
  Expression patterns           18     (1.2%)
  Variations                  1184    (77.6%)
  Interaction data             122     (8.0%)
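
The percentages in the three tables above appear to be each count divided by the number of genes in its group, rounded to one decimal place. A minimal sketch of that arithmetic follows; the function name coverage_percent is an illustrative assumption, and the rounding rule is inferred from the printed figures rather than documented.

```python
def coverage_percent(count: int, group_total: int) -> float:
    """Return count as a percentage of group_total, rounded to one decimal."""
    return round(100.0 * count / group_total, 1)


# Counts taken from the protein-coding table above (19984 genes).
print(coverage_percent(5098, 19984))   # curated descriptions
print(coverage_percent(19980, 19984))  # variations

# The three group sizes add up to the stated total of 49177 genes.
print(19984 + 27668 + 1525)
```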



Wormpep data set:
----------------------------

There are 28564 CDSs, from 19981 protein-coding loci

The 28564 sequences contain 40758087 base pairs in total.

Modified entries      1
Deleted entries       5
New entries           2
Reappeared entries    1

Net change  -2
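
One reading consistent with the figures above is that new and reappeared entries add to the set, deleted entries subtract from it, and modified entries leave the count unchanged. This is an assumption inferred from the numbers, not a documented definition; the function name net_change is illustrative.

```python
def net_change(new: int, reappeared: int, deleted: int) -> int:
    """Net change in entry count; modified entries are assumed not to affect it."""
    return new + reappeared - deleted


# Figures from the Wormpep summary above.
print(net_change(new=2, reappeared=1, deleted=5))
```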

C. elegans Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed             23532 (82.4%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    4045 (14.2%)	Some, but not all exon bases are covered by transcript evidence
Predicted               987 (3.5%)	No coverage by mRNA/EST/RNASeq evidence
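
The confirmation-status lines follow a fixed "status, count, (percent)" layout, so they can be read back and cross-checked against the CDS total. The sketch below is a minimal example under that assumption; the names STATUS_LINE and parse_status are illustrative, and the description text in the sample lines is shortened.

```python
import re

# Matches lines such as "Confirmed   23532 (82.4%)  description..."
STATUS_LINE = re.compile(r"^(\w+)\s+(\d+)\s+\(([\d.]+)%\)")


def parse_status(lines):
    """Return {status: (count, percent)} for confirmation-status lines."""
    parsed = {}
    for line in lines:
        match = STATUS_LINE.match(line.strip())
        if match:
            status, count, percent = match.groups()
            parsed[status] = (int(count), float(percent))
    return parsed


lines = [
    "Confirmed             23532 (82.4%)\tEvery base of every exon ...",
    "Partially_confirmed    4045 (14.2%)\tSome, but not all exon bases ...",
    "Predicted               987 (3.5%)\tNo coverage ...",
]
table = parse_status(lines)
total = sum(count for count, _ in table.values())
print(total)  # compare with the number of CDSs in the Wormpep summary
for status, (count, percent) in table.items():
    # Print the stated percentage next to the recomputed one.
    print(status, percent, round(100.0 * count / total, 1))
```

The same parser should apply to the per-species status tables later in these notes, since they use the same layout.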

C. elegans Operons Stats
------------------------
Live Operons        1385
Genes in Operons    3688

C. elegans GO annotation status
-------------------------------

GO_codes - used for assigning evidence
  EXP Inferred from Experiment
  HDA Inferred from High Throughput Direct Assay
  HEP Inferred from High Throughput Expression Pattern
  IBA Inferred from Biological aspect of Ancestor
  IC  Inferred by Curator
  IDA Inferred from Direct Assay
  IEA Inferred from Electronic Annotation
  IEP Inferred from Expression Pattern
  IGI Inferred from Genetic Interaction
  IKR Inferred from Key Residues
  IMP Inferred from Mutant Phenotype
  IPI Inferred from Physical Interaction
  IRD Inferred from Rapid Divergence
  ISM Inferred from Sequence Model
  ISO Inferred from Sequence Orthology
  ISS Inferred from Sequence (or Structural) Similarity
  NAS Non-traceable Author Statement
  ND  No Biological Data available
  RCA Inferred from Reviewed Computational Analysis
  TAS Traceable Author Statement

Number of gene<->GO_term associations    126735
  Breakdown by annotation provider:
    WormBase             18749
    UniProt              52222
    GO_Central           28770
    InterPro             20895
    IntAct                2399
    GOC                   2329
    RHEA                  1156
    SynGO                   68
    CACAO                   35
    ParkinsonsUK-UCL        30
    MGI                     29
    BHF-UCL                 23
    HGNC-UCL                14
    CAFA                     7
    DisProt                  5
    ARUK-UCL                 2
    HGNC                     1
    FlyBase                  1
  Breakdown by evidence code:
    IEA     67031
      Interpro2GO 20895
      Other       46136
    non-IEA 59704
      EXP       2
      HDA     386
      HEP     320
      IBA   28749
      IC      117
      IDA    7874
      IEP     169
      IGI    4671
      IKR       5
      IMP   10426
      IPI    4274
      ISM       9
      ISO       1
      ISS    1906
      NAS     180
      ND      429
      RCA      13
      TAS     173

Genes Stats:
  Genes with GO_term connections  14217 
    Non-IEA-only annotation             1380
    IEA-only annotation                 4879
    Both IEA and non-IEA annotations    7958

GO_term Stats:
  Distinct GO_terms connected to Genes   7030
    Associated by non-IEA only               3900
    Associated by IEA only                    835
    Associated by both IEA and non-IEA       2295
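
The GO figures above are nested breakdowns, so each set of parts should add up to its stated total. The following minimal sketch checks that using only numbers from this section; the helper name sums_to is an illustrative assumption.

```python
def sums_to(parts, total: int) -> bool:
    """True if the given parts add up exactly to the stated total."""
    return sum(parts) == total


# gene<->GO_term associations split by evidence class
print(sums_to([67031, 59704], 126735))       # IEA + non-IEA
print(sums_to([20895, 46136], 67031))        # Interpro2GO + Other

# Genes with GO_term connections
print(sums_to([1380, 4879, 7958], 14217))    # non-IEA only, IEA only, both

# Distinct GO_terms connected to genes
print(sums_to([3900, 835, 2295], 7030))      # non-IEA only, IEA only, both
```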

-=============================================================================-
-=========== Other core species data summary =================================-
-=============================================================================-

Assembly versions
----------------
Brugia malayi                B_malayi-4.0              (project PRJNA10729, current since WS252)
Caenorhabditis brenneri      C_brenneri-6.0.1b         (project PRJNA20035, current since WS227)
Caenorhabditis briggsae      CB4                       (project PRJNA10731, current since WS254)
Caenorhabditis japonica      C_japonica-7.0.1          (project PRJNA12591, current since WS227)
Caenorhabditis remanei       C_remanei-15.0.1          (project PRJNA53967, current since WS185)
Onchocerca volvulus          O_volvulus_Cameroon_v3    (project PRJEB513, current since WS241)
Pristionchus pacificus       P_pacificus-El_Paco       (project PRJNA12644, current since WS263)
Strongyloides ratti          S_ratti_ED321_v5_0_4      (project PRJEB125, current since WS247)
Trichuris muris              T_muris-TMUE3.1           (project PRJEB126, current since WS264)

Approved gene symbols
---------------------
Brugia malayi                 4163
Caenorhabditis brenneri       8528
Caenorhabditis briggsae       8002
Caenorhabditis japonica       6824
Caenorhabditis remanei        8492
Onchocerca volvulus           3213
Pristionchus pacificus        4341
Strongyloides ratti            109
Trichuris muris                  0

Gene counts
-----------
Brugia malayi                11687 (10936 coding)
Caenorhabditis brenneri      33291 (30705 coding)
Caenorhabditis briggsae      23202 (21024 coding)
Caenorhabditis japonica      32410 (29935 coding)
Caenorhabditis remanei       59263 (57627 coding)
Onchocerca volvulus          12605 (12109 coding)
Pristionchus pacificus       26343 (26342 coding)
Strongyloides ratti          12977 (12464 coding)
Trichuris muris              15755 (14995 coding)

Brugia malayi Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              8148 (53.6%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5704 (37.5%)	Some, but not all exon bases are covered by transcript evidence
Predicted              1352 (8.9%)	No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis brenneri Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              1593 (5.2%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5706 (18.6%)	Some, but not all exon bases are covered by transcript evidence
Predicted             23426 (76.2%)	No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis briggsae Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed             16526 (68.8%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    4559 (19.0%)	Some, but not all exon bases are covered by transcript evidence
Predicted              2923 (12.2%)	No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis japonica Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              9390 (26.1%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed   12173 (33.8%)	Some, but not all exon bases are covered by transcript evidence
Predicted             14413 (40.1%)	No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis remanei Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               962 (3.1%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5665 (18.0%)	Some, but not all exon bases are covered by transcript evidence
Predicted             24830 (78.9%)	No coverage by mRNA/EST/RNASeq evidence

Onchocerca volvulus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              6186 (50.6%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5581 (45.7%)	Some, but not all exon bases are covered by transcript evidence
Predicted               457 (3.7%)	No coverage by mRNA/EST/RNASeq evidence

Pristionchus pacificus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               382 (1.4%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    4992 (18.9%)	Some, but not all exon bases are covered by transcript evidence
Predicted             21059 (79.7%)	No coverage by mRNA/EST/RNASeq evidence

Strongyloides ratti Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               877 (7.0%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    2342 (18.8%)	Some, but not all exon bases are covered by transcript evidence
Predicted              9265 (74.2%)	No coverage by mRNA/EST/RNASeq evidence

Trichuris muris Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              6262 (41.8%)	Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    3901 (26.0%)	Some, but not all exon bases are covered by transcript evidence
Predicted              4832 (32.2%)	No coverage by mRNA/EST/RNASeq evidence


-==============================================================================-
-=========== News for this release ============================================-
-==============================================================================-

New data sets
--------------


New/updated reference genomes
------------------------------------


Proposed Changes / Forthcoming Data
------------------------------------


Model Changes
--------------

Model changes for this release are documented here:

http://wiki.wormbase.org/index.php/WS288_Models.wrm

For more information mail help@wormbase.org



____________  END _____________