WS284

New release of WormBase WS284

WS284 was built by Stavros Diamantakis

-==============================================================================-
-========= FTP site structure =================================================-
-==============================================================================-
The WS284 build directory includes:
species/ DIR              -  contains a sub dir for each WormBase species (G_SPECIES)
species/G_SPECIES DIR     -  contains a sub dir for each NCBI genome sequencing BioProject (BIOPROJECT) for the species, with the following files:
     - G_SPECIES.BIOPROJECT.WS284.genomic.fa.gz                  - Unmasked genomic DNA
     - G_SPECIES.BIOPROJECT.WS284.genomic_masked.fa.gz           - Hard-masked (repeats replaced with Ns) genomic DNA
     - G_SPECIES.BIOPROJECT.WS284.genomic_softmasked.fa.gz       - Soft-masked (repeats lower-cased) genomic DNA
     - G_SPECIES.BIOPROJECT.WS284.protein.fa.gz                  - Current live protein set
     - G_SPECIES.BIOPROJECT.WS284.CDS_transcripts.fa.gz          - Spliced cDNA sequence for the CDS portion of protein-coding transcripts
     - G_SPECIES.BIOPROJECT.WS284.mRNA_transcripts.fa.gz         - Spliced cDNA sequence for full-length mRNA transcripts (including UTRs)
     - G_SPECIES.BIOPROJECT.WS284.ncrna_transcripts.fa.gz        - Spliced cDNA sequence for non-coding RNA transcripts
     - G_SPECIES.BIOPROJECT.WS284.pseudogenic_transcripts.fa.gz  - Spliced cDNA sequence for pseudogenic transcripts
     - G_SPECIES.BIOPROJECT.WS284.transposon_transcripts.fa.gz   - Spliced cDNA sequence for mRNAs and pseudogenes located in Transposons
     - G_SPECIES.BIOPROJECT.WS284.transposons.fa.gz              - DNA sequence of curated and predicted Transposons
     - G_SPECIES.BIOPROJECT.WS284.transposon_cds.pep.gz          - Protein sequence of curated CDSs associated with Transposons
     - G_SPECIES.BIOPROJECT.WS284.intergenic_sequences.fa.gz     - DNA sequence between pairs of adjacent genes
     - G_SPECIES.BIOPROJECT.WS284.annotations.gff[2|3].gz        - Sequence features in either GFF2 or GFF3 format
     - G_SPECIES.BIOPROJECT.WS284.protein_annotation.gff3.gz     - Sequence features in proteins in GFF3 format
     - G_SPECIES.BIOPROJECT.WS284.canonical_geneset.gtf.gz       - Genes, transcripts and CDSs in GTF (GFF2) format
     - G_SPECIES.BIOPROJECT.WS284.ests.fa.gz                     - ESTs and mRNA sequences extracted from the public databases
     - G_SPECIES.BIOPROJECT.WS284.best_blastp_hits.txt.gz        - Best blastp matches to human, fly, yeast, and non-WormBase Uniprot proteins
     - G_SPECIES.BIOPROJECT.WS284.*pep_package.tar.gz            - latest version of the [worm|brig|bren|rema|jap|ppa|brug]pep package (if updated since last release)
     - annotation/                    - contains additional annotations:
        - G_SPECIES.BIOPROJECT.WS284.confirmed_genes.txt.gz              - DNA sequences of all genes confirmed by EST and/or cDNA
        - G_SPECIES.BIOPROJECT.WS284.cDNA2orf.txt.gz                     - Latest set of ORF connections to each cDNA (EST, OST, mRNA)
        - G_SPECIES.BIOPROJECT.WS284.geneIDs.txt.gz                      - list of all current gene identifiers with CGC & molecular names (when known)
        - G_SPECIES.BIOPROJECT.WS284.PCR_product2gene.txt.gz             - Mappings between PCR products and overlapping Genes
        - G_SPECIES.BIOPROJECT.WS284.*oligo_mapping.txt.gz               - Oligo array mapping files
        - G_SPECIES.BIOPROJECT.WS284.knockout_consortium_alleles.xml.gz  - Table of Knockout Consortium alleles
        - G_SPECIES.BIOPROJECT.WS284.SRA_gene_expression.tar.gz          - Tables of gene expression values computed from SRA RNASeq data
        - G_SPECIES.BIOPROJECT.WS284.TSS.wig.tar.gz                      - Wiggle plot files of Transcription Start Sites from the papers WBPaper00042246, WBPaper00042529, WBPaper00042354
        - G_SPECIES.BIOPROJECT.WS284.repeats.fa.gz                       - Latest version of the repeat library for the genome, suitable for use with RepeatMasker
acedb DIR                -  Everything needed to generate a local copy of the primary database
     - database.WS284.*.tar.gz   - compressed acedb database for new release
     - models.wrm.WS284          - the latest database schema (also in above database files)
     - WS284-WS283.dbcomp        - log file reporting difference from last release
     - *Non_C_elegans_BLASTX/     - This directory contains the blastx data for non-elegans species
                                                    (reduces the size of the main database)
MULTI_SPECIES DIR - miscellaneous files with data for multiple species
     - wormpep_clw.WS284.sql.bz2 - ClustalW protein multiple alignments
ONTOLOGY DIR             - OBO ontology files (phenotype, GO, anatomy) and the corresponding gene association files
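
The naming pattern in the listing above can be expanded into concrete relative paths.
The following is a minimal sketch, not an official WormBase tool. The function name
species_file and the G_SPECIES spelling "b_malayi" are assumptions made for
illustration; the BioProject PRJNA10729 is the Brugia malayi project listed in the
assembly table of these notes. No FTP host name is assumed, only the directory layout
described above.

```python
from pathlib import PurePosixPath

RELEASE = "WS284"


def species_file(g_species: str, bioproject: str, suffix: str,
                 annotation: bool = False) -> PurePosixPath:
    """Build the relative path of a per-species file from the naming pattern.

    The layout species/G_SPECIES/BIOPROJECT/ and the file name
    G_SPECIES.BIOPROJECT.WS284.<suffix> follow the listing in these notes.
    """
    name = f"{g_species}.{bioproject}.{RELEASE}.{suffix}"
    base = PurePosixPath("species") / g_species / bioproject
    if annotation:
        # Files listed under annotation/ live one directory deeper.
        base = base / "annotation"
    return base / name


# Illustrative values only: "b_malayi" is an assumed G_SPECIES spelling.
print(species_file("b_malayi", "PRJNA10729", "genomic.fa.gz"))
print(species_file("b_malayi", "PRJNA10729", "confirmed_genes.txt.gz",
                   annotation=True))
```

The returned path is relative to the WS284 build directory, so it can be joined to
whatever local mirror or download location is in use.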


Release notes on the web:
-------------------------
http://www.wormbase.org/about/release_schedule


-=====================================================================================-
-=========== C. elegans data summary =================================================-
-=====================================================================================-

Genome version
--------------

The version of the C. elegans reference genome included with this release is:

Version Name: WBcel235
INSDC accession: GCA_000002985.3
UCSC name: ce11

This version has been present in WormBase since release WS235


C. elegans gene data (49176 genes in total)
----------------------------------------------

Protein-coding (19981 genes):
  Curated description         5098    (25.5%)
  Automated description      19842    (99.3%)
  Human disease association   3499    (17.5%)
  Approved Gene name         10613    (53.1%)
  Reference                  11537    (57.7%)
  RNAi results               18338    (91.8%)
  Microarray results         19774    (99.0%)
  Expression patterns        19414    (97.2%)
  Variations                 19977   (100.0%)
  Interaction data           16116    (80.7%)

Non-coding RNA and pseudogene (27670 genes):
  Curated description          219     (0.8%)
  Automated description       7162    (25.9%)
  Human disease association     29     (0.1%)
  Approved Gene name         16526    (59.7%)
  Reference                   5901    (21.3%)
  RNAi results                 861     (3.1%)
  Microarray results          2268     (8.2%)
  Expression patterns          896     (3.2%)
  Variations                 27634    (99.9%)
  Interaction data             943     (3.4%)

Uncloned (1525 genes):
  Curated description          779    (51.1%)
  Automated description        116     (7.6%)
  Human disease association     10     (0.7%)
  Approved Gene name          1525   (100.0%)
  Reference                   1122    (73.6%)
  RNAi results                   0     (0.0%)
  Microarray results             0     (0.0%)
  Expression patterns           18     (1.2%)
  Variations                  1185    (77.7%)
  Interaction data             140     (9.2%)
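
The percentages in the three tables above appear to be each count divided by the
gene total given in the category heading, rounded to one decimal place. The sketch
below checks that reading against a few of the published figures; the helper name
percent is an assumption introduced here, not part of the WormBase build.

```python
def percent(count: int, total: int) -> str:
    """Format count/total as a percentage with one decimal place."""
    return f"{100 * count / total:.1f}%"


# Category totals taken from the headings above.
protein_coding = 19981
noncoding_and_pseudogene = 27670
uncloned = 1525

# The three categories add up to the overall gene count in the section title.
assert protein_coding + noncoding_and_pseudogene + uncloned == 49176

print(percent(5098, protein_coding))    # Curated description, protein-coding: 25.5%
print(percent(19977, protein_coding))   # Variations, protein-coding: 100.0%
print(percent(779, uncloned))           # Curated description, uncloned: 51.1%
```

Note that 19977 of 19981 is displayed as 100.0% only because of rounding; four
protein-coding genes have no variation data.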



Wormpep data set:
----------------------------

There are 28556 CDSs, from 19978 protein-coding loci

The 28556 sequences contain 40753491 base pairs in total.

Modified entries      12
Deleted entries       15
New entries           25
Reappeared entries    0

Net change  +10
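
The net change is consistent with new plus reappeared entries minus deleted entries;
modified entries do not alter the total. A small sketch of that arithmetic, assuming
this is how the figure is derived:

```python
# Counts copied from the Wormpep summary above.
new_entries = 25
reappeared_entries = 0
deleted_entries = 15

# Assumed relation: modified entries change content but not the entry count.
net_change = new_entries + reappeared_entries - deleted_entries
print(f"Net change  {net_change:+d}")
```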

C. elegans Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed             23524 (82.4%)     Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    4045 (14.2%)     Some, but not all exon bases are covered by transcript evidence
Predicted               987 (3.5%)      No coverage by mRNA/EST/RNASeq evidence
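
The three status definitions above can be read as a simple rule over per-base
transcript coverage of a gene model's exons. The sketch below illustrates that rule
only; it is not the WormBase build code, and the function name and the boolean
coverage representation are assumptions made for this example.

```python
from typing import Sequence


def confirmation_status(exon_coverage: Sequence[Sequence[bool]]) -> str:
    """Classify a gene model from per-base transcript evidence.

    exon_coverage[i][j] is True when base j of exon i is covered by
    mRNA/EST/RNASeq evidence.
    """
    bases = [covered for exon in exon_coverage for covered in exon]
    if bases and all(bases):
        return "Confirmed"            # every base of every exon is covered
    if any(bases):
        return "Partially_confirmed"  # some, but not all, bases are covered
    return "Predicted"                # no coverage at all


print(confirmation_status([[True, True], [True]]))     # Confirmed
print(confirmation_status([[True, False], [False]]))   # Partially_confirmed
print(confirmation_status([[False, False], [False]]))  # Predicted
```

The three counts above (23524, 4045 and 987) sum to 28556, the number of CDSs in
the Wormpep data set.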

C. elegans Operons Stats
------------------------
Live Operons        1385
Genes in Operons    3688

C. elegans GO annotation status
-------------------------------

GO_codes - used for assigning evidence
  EXP Inferred from Experiment
  HDA Inferred from High Throughput Direct Assay
  HEP Inferred from High Throughput Expression Pattern
  IBA Inferred from Biological aspect of Ancestor
  IC  Inferred by Curator
  IDA Inferred from Direct Assay
  IEA Inferred from Electronic Annotation
  IEP Inferred from Expression Pattern
  IGI Inferred from Genetic Interaction
  IKR Inferred from Key Residues
  IMP Inferred from Mutant Phenotype
  IPI Inferred from Physical Interaction
  IRD Inferred from Rapid Divergence
  ISM Inferred from Sequence Model
  ISO Inferred from Sequence Orthology
  ISS Inferred from Sequence (or Structural) Similarity
  NAS Non-traceable Author Statement
  ND  No Biological Data available
  RCA Inferred from Reviewed Computational Analysis
  TAS Traceable Author Statement

Number of gene<->GO_term associations    133555
  Breakdown by annotation provider:
    WormBase             18751
    UniProt              56189
    GO_Central           29928
    InterPro             21495
    GOC                   2632
    IntAct                2388
    RHEA                  1993
    SynGO                   36
    CACAO                   36
    ParkinsonsUK-UCL        30
    MGI                     29
    BHF-UCL                 23
    HGNC-UCL                14
    CAFA                     7
    ARUK-UCL                 2
    HGNC                     1
    FlyBase                  1
  Breakdown by evidence code:
    IEA     73131
      Interpro2GO 21496
      Other       51635
    non-IEA 60424
      EXP       1
      HDA     386
      HEP     319
      IBA   29915
      IC      116
      IDA    7790
      IEP     170
      IGI    4672
      IKR       5
      IMP   10169
      IPI    4232
      ISM       9
      ISO       1
      ISS    1867
      NAS     180
      ND      406
      RCA      13
      TAS     173

Genes Stats:
  Genes with GO_term connections  14379 
    Non-IEA-only annotation             1250
    IEA-only annotation                 5024
    Both IEA and non-IEA annotations    8105

GO_term Stats:
  Distinct GO_terms connected to Genes   7046
    Associated by non-IEA only               3870
    Associated by IEA only                    803
    Associated by both IEA and non-IEA       2373
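
The "Genes Stats" split into non-IEA-only, IEA-only and both can be derived from a
list of gene/GO_term/evidence-code associations with two sets. The sketch below shows
one way to do that on toy data; the function name, the tuple layout and the gene and
term identifiers are all hypothetical and are not taken from the WormBase build.

```python
from typing import Dict, Iterable, Tuple

Annotation = Tuple[str, str, str]  # (gene, GO term, evidence code)


def gene_evidence_classes(annotations: Iterable[Annotation]) -> Dict[str, int]:
    """Count genes annotated by IEA only, non-IEA only, or both."""
    has_iea = set()
    has_non_iea = set()
    for gene, _term, code in annotations:
        if code == "IEA":
            has_iea.add(gene)
        else:
            has_non_iea.add(gene)
    return {
        "Non-IEA-only annotation": len(has_non_iea - has_iea),
        "IEA-only annotation": len(has_iea - has_non_iea),
        "Both IEA and non-IEA annotations": len(has_iea & has_non_iea),
    }


# Toy data with made-up identifiers.
toy = [
    ("geneA", "GO:0000001", "IEA"),
    ("geneA", "GO:0000002", "IDA"),
    ("geneB", "GO:0000001", "IEA"),
    ("geneC", "GO:0000003", "IMP"),
]
print(gene_evidence_classes(toy))
```

Swapping the roles of gene and term in the loop gives the corresponding "GO_term
Stats" split. In the published figures the three gene classes sum to the total:
1250 + 5024 + 8105 = 14379.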

-=============================================================================-
-=========== Other core species data summary =================================-
-=============================================================================-

Assembly versions
----------------
Brugia malayi                B_malayi-4.0              (project PRJNA10729, current since WS252)
Caenorhabditis brenneri      C_brenneri-6.0.1b         (project PRJNA20035, current since WS227)
Caenorhabditis briggsae      CB4                       (project PRJNA10731, current since WS254)
Caenorhabditis japonica      C_japonica-7.0.1          (project PRJNA12591, current since WS227)
Caenorhabditis remanei       C_remanei-15.0.1          (project PRJNA53967, current since WS185)
Onchocerca volvulus          O_volvulus_Cameroon_v3    (project PRJEB513, current since WS241)
Pristionchus pacificus       P_pacificus-El_Paco       (project PRJNA12644, current since WS263)
Strongyloides ratti          S_ratti_ED321_v5_0_4      (project PRJEB125, current since WS247)
Trichuris muris              T_muris-TMUE3.1           (project PRJEB126, current since WS264)

Approved gene symbols
---------------------
Brugia malayi                 4163
Caenorhabditis brenneri       8528
Caenorhabditis briggsae       8002
Caenorhabditis japonica       6824
Caenorhabditis remanei        8492
Onchocerca volvulus           3213
Pristionchus pacificus        4341
Strongyloides ratti            109
Trichuris muris                  0

Gene counts
-----------
Brugia malayi                11687 (10936 coding)
Caenorhabditis brenneri      33291 (30705 coding)
Caenorhabditis briggsae      23202 (21024 coding)
Caenorhabditis japonica      32410 (29935 coding)
Caenorhabditis remanei       59263 (57627 coding)
Onchocerca volvulus          12605 (12109 coding)
Pristionchus pacificus       26343 (26342 coding)
Strongyloides ratti          12973 (12464 coding)
Trichuris muris              15754 (14995 coding)

Brugia malayi Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              8148 (53.6%)     Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5704 (37.5%)     Some, but not all exon bases are covered by transcript evidence
Predicted              1352 (8.9%)      No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis brenneri Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              1593 (5.2%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5706 (18.6%)     Some, but not all exon bases are covered by transcript evidence
Predicted             23426 (76.2%)     No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis briggsae Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed             16526 (68.8%)     Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    4559 (19.0%)     Some, but not all exon bases are covered by transcript evidence
Predicted              2923 (12.2%)     No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis japonica Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              9390 (26.1%)     Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed   12173 (33.8%)     Some, but not all exon bases are covered by transcript evidence
Predicted             14413 (40.1%)     No coverage by mRNA/EST/RNASeq evidence

Caenorhabditis remanei Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               962 (3.1%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5665 (18.0%)     Some, but not all exon bases are covered by transcript evidence
Predicted             24830 (78.9%)     No coverage by mRNA/EST/RNASeq evidence

Onchocerca volvulus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              6186 (50.6%)     Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    5581 (45.7%)     Some, but not all exon bases are covered by transcript evidence
Predicted               457 (3.7%)      No coverage by mRNA/EST/RNASeq evidence

Pristionchus pacificus Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               382 (1.4%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    4992 (18.9%)     Some, but not all exon bases are covered by transcript evidence
Predicted             21059 (79.7%)     No coverage by mRNA/EST/RNASeq evidence

Strongyloides ratti Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed               877 (7.0%)      Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    2342 (18.8%)     Some, but not all exon bases are covered by transcript evidence
Predicted              9265 (74.2%)     No coverage by mRNA/EST/RNASeq evidence

Trichuris muris Gene model confirmation status (based on the EST/mRNA/RNASeq evidence)
------------------------------------------------------------
Confirmed              6262 (41.8%)     Every base of every exon has transcription evidence (mRNA/EST/RNASeq)
Partially_confirmed    3901 (26.0%)     Some, but not all exon bases are covered by transcript evidence
Predicted              4832 (32.2%)     No coverage by mRNA/EST/RNASeq evidence


-==============================================================================-
-=========== News for this release ============================================-
-==============================================================================-
New data sets
--------------
90 miRNA gene clusters were added from MirGeneDB 2.1
138 MirGeneDB gene cross-references added to WormBase C. elegans genes 
28115 EBI AlphaFold protein cross-references added to WormBase CDSs and Proteins

New/updated reference genomes
------------------------------------


Proposed Changes / Forthcoming Data
------------------------------------


Model Changes
--------------

Model changes for this release are documented here:

http://wiki.wormbase.org/index.php/WS284_Models.wrm

For more information mail help@wormbase.org

-==============================================================================-
-=========== Installation guide ===============================================-
-==============================================================================-


Quick installation guide for UNIX/Linux systems
-----------------------------------------------

1. Create a new directory to contain your copy of WormBase,
        e.g. /users/yourname/wormbase

2. Unpack and untar all of the database.*.tar.gz files into
        this directory. You will need approximately 50-60 GB of disk space.

3. Obtain and install a suitable acedb binary for your system


4. Use the acedb 'xace' program to open your database, e.g.
        type 'xace /users/yourname/wormbase' at the command prompt.

5. See the acedb website for more information about acedb and
        using xace.
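
Steps 1 and 2 above can be scripted. The following is a minimal sketch using only the
Python standard library; the function name unpack_release and the idea of a separate
download directory are assumptions for illustration. It assumes the
database.*.tar.gz files have already been downloaded from a trusted source.

```python
import glob
import os
import tarfile
from typing import List


def unpack_release(download_dir: str, target_dir: str) -> List[str]:
    """Create target_dir and extract every database.*.tar.gz archive into it.

    Returns the list of archives that were unpacked, in sorted order.
    """
    os.makedirs(target_dir, exist_ok=True)  # step 1: create the directory
    pattern = os.path.join(download_dir, "database.*.tar.gz")
    archives = sorted(glob.glob(pattern))
    for archive in archives:
        # step 2: uncompress and untar each archive into the same directory
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target_dir)
    return archives
```

For example, calling unpack_release("/users/yourname/downloads",
"/users/yourname/wormbase") (the downloads path is hypothetical) would leave the
database ready for step 4, 'xace /users/yourname/wormbase'.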

____________  END _____________