New GO Progress Report Script

Latest revision as of 18:52, 2 December 2014

GO is now requiring quarterly progress reports, with the first one due at the meeting this month (2014-03-16).

We've been wanting to provide a more detailed progress report for GO for some time now, so this is a good opportunity to do that.

Here's one idea for C. elegans manual annotations:

Input files (available on tazendra in home/acedb/ranjana/GO/Progress_Reports/Test):

  • gp2protein.wb
  • gp_association.wb
  1. Ignore all lines with IEA evidence code, ECO:0000256, in Column 6
  2. Replace UniProtKB identifiers in Column 2 with WBGene ID using gp2protein.wb
  3. Remove (i.e. ignore for further reporting) any resulting lines that are exact duplicate lines of annotation
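
The three steps above could be sketched in Python roughly as follows. This is only a sketch under some assumptions that the text above does not state: that both files are tab-delimited with `!` comment lines, that gp2protein.wb has the gene ID in its first column (e.g. `WB:WBGene00000001`) and one or more `;`-separated protein IDs in its second (e.g. `UniProtKB:P12345`), and that Column 2 of gp_association.wb holds the bare accession. The function names `load_gp2protein` and `preprocess` are made up for illustration.

```python
IEA_CODE = "ECO:0000256"  # the code named above for IEA lines (Column 6)


def load_gp2protein(handle):
    """Map UniProtKB accession -> WBGene ID.

    Assumed layout (not confirmed by the spec above): tab-separated,
    gene ID in column 1, ';'-separated protein IDs in column 2.
    """
    mapping = {}
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        gene_id = cols[0].split(":", 1)[-1]  # drop the "WB:" prefix
        for protein in cols[1].split(";"):
            db, _, accession = protein.partition(":")
            if db == "UniProtKB" and accession:
                mapping[accession] = gene_id
    return mapping


def preprocess(handle, mapping):
    """Return (unique_rows, unmapped_rows) from a gp_association file."""
    seen = set()
    unique, unmapped = [], []
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Step 1: ignore IEA lines (Column 6 is index 5); short lines are skipped too.
        if len(cols) < 6 or cols[5] == IEA_CODE:
            continue
        # Step 2: replace the UniProtKB identifier in Column 2 with the WBGene ID.
        gene_id = mapping.get(cols[1])
        if gene_id is None:
            unmapped.append(cols)  # kept aside so it can be reported later
            continue
        cols[1] = gene_id
        # Step 3: keep only the first copy of exact duplicate lines.
        key = tuple(cols)
        if key not in seen:
            seen.add(key)
            unique.append(cols)
    return unique, unmapped
```

Something like `preprocess(open("gp_association.wb"), load_gp2protein(open("gp2protein.wb")))` would then give the de-duplicated rows plus the rows whose identifier has no mapping. Duplicates are checked after the ID replacement, so two proteins of the same gene with otherwise identical lines collapse into one annotation.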

Then determine:

  1. Total number of unique annotations, i.e. unique lines.
  2. Total number of unique WBGenes.
  3. For each unique value in the qualifier column (Column 3), count the number of annotations for each evidence code in Column 6, and the number of those annotations that also have an entry in Column 11. In other words, for each Column 3 value, I'd like the annotation count per Column 6 value, and then the count per Column 6 value restricted to annotations that additionally have an entry in Column 11.
  4. Group and sort the results by the unique entries in Column 10 (i.e., by each contributing group).
  5. Also report on any lines where the UniProtKB identifier cannot be converted to a WBGene, i.e. where there is no mapping between a UniProtKB identifier and a WBGene according to the gp2protein file.
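
One possible way to do the counting, again only as a sketch: it assumes each annotation is already a list of column values with at least 10 columns, IEA lines removed, WBGene IDs in Column 2, and exact duplicates dropped. The names `totals` and `summarize` and the shape of the returned dictionary are my own choices, not part of the request above.

```python
from collections import Counter, defaultdict


def totals(rows):
    """Overall number of unique annotations and unique WBGenes."""
    return len(rows), len({cols[1] for cols in rows})


def summarize(rows):
    """Per-group counts, keyed and sorted by Column 10 (contributing group)."""
    report = defaultdict(lambda: {
        "annotations": 0,             # unique annotation lines for this group
        "genes": set(),               # unique WBGene IDs for this group
        "by_code": Counter(),         # (Column 3, Column 6) -> annotation count
        "with_extension": Counter(),  # same key, only rows with a Column 11 entry
    })
    for cols in rows:
        group = report[cols[9]]       # Column 10 is index 9
        key = (cols[2], cols[5])      # qualifier (Column 3), evidence code (Column 6)
        group["annotations"] += 1
        group["genes"].add(cols[1])
        group["by_code"][key] += 1
        if len(cols) > 10 and cols[10].strip():  # Column 11 is index 10
            group["with_extension"][key] += 1
    return dict(sorted(report.items()))
```

For each group, `by_code` and `with_extension` share the same (Column 3, Column 6) keys, so they can be printed side by side in one table row per qualifier and evidence code. Lines that could not be converted to a WBGene would be listed separately rather than counted here.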

The goal is to have a report that shows, for each group contributing manual annotations to C. elegans genes, how many genes they've annotated, how many annotations they've made, what is the breakdown by ontology and evidence code, and how many of those annotations have an annotation extension.


For an idea of the kind of table I'm hoping to be able to produce, here's a link to ZFIN's progress report from December 2013:

http://wiki.geneontology.org/index.php/ZFIN_December_2013#Annotation_Progress
