Reference Genome Reports - Annotation Coverage

Purpose: To report to the GO Consortium the extent of annotation coverage for Reference Genome genes.


Annotation Coverage Criteria: The Reference Genome Project aims to provide comprehensive annotations for groups of conserved proteins.

At WormBase (WB), we do not yet have a mechanism in our curation database for marking genes as comprehensively annotated, so we currently assess annotation coverage with the following criterion:

Genes with annotations in two or more of the three ontologies (Biological Process, Molecular Function, Cellular Component) are considered comprehensively annotated.

This choice reflects the C. elegans literature, which consists predominantly of phenotypic analysis (Biological Process) and expression data (Cellular Component), with fewer papers reporting Molecular Function experiments.
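The criterion above can be written as a small check. This is only a sketch, not the reporting script itself: the function name is hypothetical, and it assumes each aspect is given as the single-letter code used in gene association files (P, F, C).

```python
# Single-letter aspect codes as used in gene association files.
ASPECTS = {
    "P": "Biological Process",
    "F": "Molecular Function",
    "C": "Cellular Component",
}

def is_comprehensively_annotated(aspects):
    """Return True if a gene is annotated in two or more GO aspects.

    `aspects` is any iterable of aspect codes; unknown codes are ignored.
    (Hypothetical helper, written for illustration only.)
    """
    return len(set(aspects) & set(ASPECTS)) >= 2

print(is_comprehensively_annotated({"P", "C"}))  # True
print(is_comprehensively_annotated({"F"}))       # False
```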


Script Requirements:

The reporting script uses a gene association file for the analysis. The file can be the combined gene_association file submitted to the GO Consortium or the gene_association file that only contains annotations generated at WormBase (minus the UniProtKB annotations).

Three files are needed to run the script:

a) A gene_association file

The combined gene_association.wb file can be downloaded from the GO Consortium.

The wormbase_gene_association.wb file is available on tazendra.


b) A gp2protein.wb file

The gp2protein.wb file can be downloaded from the GO Consortium.
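As an illustration of how this file might be used for ID conversion, the sketch below builds a UniProt-to-WBGene lookup. It assumes each line holds a gene ID such as WB:WBGene00000001, a tab, and one or more UniProtKB:accession entries separated by semicolons; check this against the actual file before relying on it. The function name and the accessions in the sample are made up.

```python
def load_uniprot_to_wbgene(lines):
    """Build a UniProt accession -> WBGene ID lookup from gp2protein lines.

    Assumed layout (not confirmed by this page):
    WB:WBGene...<TAB>UniProtKB:ACC[;UniProtKB:ACC...]
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # skip blank and comment lines
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        gene_id = fields[0].split(":", 1)[-1]  # "WB:WBGene..." -> "WBGene..."
        for protein in fields[1].split(";"):
            db, _, accession = protein.strip().partition(":")
            if db == "UniProtKB" and accession:
                mapping[accession] = gene_id
    return mapping

# Invented sample lines, for illustration only.
sample = [
    "!version: example",
    "WB:WBGene00000001\tUniProtKB:Q00001",
    "WB:WBGene00000002\tUniProtKB:Q00002;UniProtKB:Q00003",
]
uniprot_to_wbgene = load_uniprot_to_wbgene(sample)
print(uniprot_to_wbgene)
```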


c) A list of C. elegans Reference Genome gene IDs

A list of C. elegans Reference Genome genes can be obtained from the Jax FTP site:

ftp://ftp.informatics.jax.org/pub/curatorwork/GODB/refg_id_list.txt

The list must be filtered to retain only the worm gene and protein IDs.
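One possible way to do the gene-ID part of this filtering is sketched below. The layout of refg_id_list.txt is not described here, so the sketch simply assumes that worm genes appear somewhere on a line as WBGene IDs (WBGene followed by eight digits). Worm protein IDs are not handled; they would need a separate step, for example conversion through the gp2protein.wb file. The function name and sample lines are hypothetical.

```python
import re

# WBGene IDs consist of "WBGene" followed by eight digits.
WBGENE_PATTERN = re.compile(r"WBGene\d{8}")

def extract_wbgene_ids(lines):
    """Collect every WBGene ID found in the given lines (gene IDs only)."""
    ids = set()
    for line in lines:
        ids.update(WBGENE_PATTERN.findall(line))
    return ids

# Invented sample lines; the real file layout may differ.
sample_list = [
    "WB:WBGene00000001",
    "MGI:MGI:0000001",
    "SGD:S000000001",
    "WB:WBGene00000002",
]
worm_gene_ids = extract_wbgene_ids(sample_list)
print(sorted(worm_gene_ids))
```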


Script Design

The basic functionality of the script is as follows:

a) The script skips lines that contain IEA (Inferred from Electronic Annotation) annotations and, on the remaining lines, converts any UniProt IDs to WBGene IDs.

b) Using the list of C. elegans Reference Genome genes, the script determines whether each gene has annotations in zero, one, two, or three aspects of the Gene Ontology.
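Steps a) and b) can be sketched roughly as follows. This is not the actual reporting script. It assumes the standard tab-delimited gene association layout (column 1 database, column 2 object ID, column 7 evidence code, column 9 aspect, comment lines starting with "!"), and the function names, the helper that builds sample lines, and all sample values are invented for illustration.

```python
def aspects_per_gene(gaf_lines, ref_genes, uniprot_to_wbgene):
    """Map each Reference Genome gene to the set of GO aspects it is annotated in."""
    aspects = {gene: set() for gene in ref_genes}
    for line in gaf_lines:
        if line.startswith("!"):       # header/comment line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:              # malformed or truncated line
            continue
        db, object_id, evidence, aspect = cols[0], cols[1], cols[6], cols[8]
        if evidence == "IEA":          # step a: ignore electronic annotations
            continue
        if db == "UniProtKB":          # step a: convert UniProt ID to WBGene ID
            object_id = uniprot_to_wbgene.get(object_id)
        if object_id in aspects:       # step b: record aspect for listed genes
            aspects[object_id].add(aspect)
    return aspects

def make_line(db, object_id, evidence, aspect):
    """Build a placeholder 9-column annotation line (sample data only)."""
    cols = [db, object_id, "sym", "", "GO:0000000", "REF:0", evidence, "", aspect]
    return "\t".join(cols)

gaf_lines = [
    "!gaf-version: 2.0",
    make_line("WB", "WBGene00000001", "IMP", "P"),
    make_line("WB", "WBGene00000001", "IEA", "F"),   # skipped: IEA
    make_line("UniProtKB", "Q00002", "IDA", "C"),    # converted via lookup
    make_line("WB", "WBGene00000002", "IMP", "P"),
]
ref_genes = ["WBGene00000001", "WBGene00000002", "WBGene00000003"]
result = aspects_per_gene(gaf_lines, ref_genes, {"Q00002": "WBGene00000002"})
for gene in ref_genes:
    print(gene, len(result[gene]))
```

In this sample the three genes end up with one, two, and zero annotated aspects respectively, because the IEA line for the first gene is ignored.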


Script Output

The script outputs four files, one for each possible number of annotated aspects (zero, one, two, or three). Each file contains the number of genes in that category along with a list of their gene IDs.

Files are named:

ref_genome_0_ontology

ref_genome_1_ontology

ref_genome_2_ontology

ref_genome_3_ontology
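A minimal sketch of how such files could be written is shown below. The exact layout inside each file is an assumption (gene count on the first line, then one gene ID per line), and the function name and sample data are hypothetical. The example writes to a temporary directory so that it leaves nothing behind.

```python
import os
import tempfile

def write_reports(aspects_by_gene, out_dir="."):
    """Write ref_genome_N_ontology files for N = 0..3 and return the buckets."""
    buckets = {n: [] for n in range(4)}
    for gene, aspects in sorted(aspects_by_gene.items()):
        # Count only the three recognised aspect codes.
        n = len(set(aspects) & {"P", "F", "C"})
        buckets[n].append(gene)
    for n, genes in buckets.items():
        path = os.path.join(out_dir, f"ref_genome_{n}_ontology")
        with open(path, "w") as handle:
            handle.write(f"{len(genes)}\n")                 # assumed: count first
            handle.writelines(f"{g}\n" for g in genes)      # then one ID per line
    return buckets

# Invented sample data.
example = {
    "WBGene00000001": {"P"},
    "WBGene00000002": {"P", "C"},
    "WBGene00000003": set(),
}
with tempfile.TemporaryDirectory() as tmp:
    buckets = write_reports(example, tmp)
    written = sorted(os.listdir(tmp))
    with open(os.path.join(tmp, "ref_genome_2_ontology")) as handle:
        two_aspect_report = handle.read()
print(written)
print(two_aspect_report)
```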

