Difference between revisions of "GPAD to .ace file"

From WormBaseWiki
Latest revision as of 18:00, 11 November 2013


This page outlines how we'll go forward with converting the gpad file we get back from UniProt into a .ace file for upload to citace.

Script to convert gp_association.wb to gp_association.ace

Located here: /home/acedb/ranjana/citace_upload/go_curation/ptgo_to_ace

Called: gpToAce.pl

Script takes as input the gp_association.wb and the gp2protein.wb files and converts the UniProtKB file to a .ace file for upload to citace.

When comparing two files with diff, the results can be filtered to exclude lines that are expected to differ, for example:

diff file1 file2 | grep -v "Date_last_updated"

If that does not work, we can remove Date_last_updated from the dump by putting a # at the beginning of the line that says "Date_last_updated" (currently line 89).
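As a rough illustration only (this is not part of gpToAce.pl), the same comparison can be sketched in Python with the standard difflib module. The function name diff_ignoring is hypothetical. Unlike the grep pipeline, which filters the diff output, this sketch drops the Date_last_updated lines from both inputs before diffing.

```python
import difflib

def diff_ignoring(old_lines, new_lines, ignore="Date_last_updated"):
    """Return unified-diff lines after dropping input lines that contain `ignore`."""
    def keep(lines):
        return [line for line in lines if ignore not in line]
    return list(difflib.unified_diff(keep(old_lines), keep(new_lines), lineterm=""))
```

An empty list means the two inputs differ only in the ignored lines.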



The final file specifications for the GPAD file are available here:

http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format

We will download the file from the UniProtKB FTP site:

ftp://ftp.ebi.ac.uk/pub/contrib/goa/

.ace file output format:

Gene : "WBGene00000001"

GO_term "GO:0008340" "IMP" Paper_evidence "WBPaper00005614"

GO_term "GO:0008340" "IMP" Curator_confirmed "WBPerson324"

GO_term "GO:0040024" "IMP" Paper_evidence "WBPaper00005614"

GO_term "GO:0040024" "IMP" Curator_confirmed "WBPerson324"

GO_term "GO:0008286" "IMP" Paper_evidence "WBPaper00005614"

GO_term "GO:0008286" "IMP" Curator_confirmed "WBPerson324"

GO_term "GO:0005623" "IDA" Paper_evidence "WBPaper00005614"

GO_term "GO:0005623" "IDA" Curator_confirmed "WBPerson324"

GO_term "GO:0019901" "IPI" Paper_evidence "WBPaper00005614"

GO_term "GO:0019901" "IPI" Curator_confirmed "WBPerson1843"
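A minimal sketch of how lines in this format could be generated is given below. It is written in Python for illustration and is not the gpToAce.pl implementation; the function name format_ace and the tuple layout are assumptions, as is the use of a single space between fields.

```python
def format_ace(gene_id, annotations):
    """Build the .ace text for one Gene object.

    `annotations` is an iterable of (go_id, evidence_code, tag, value)
    tuples, where tag is e.g. "Paper_evidence" or "Curator_confirmed".
    """
    lines = [f'Gene : "{gene_id}"']
    for go_id, code, tag, value in annotations:
        lines.append(f'GO_term "{go_id}" "{code}" {tag} "{value}"')
    return "\n".join(lines) + "\n"
```

For example, format_ace("WBGene00000001", [("GO:0008340", "IMP", "Paper_evidence", "WBPaper00005614")]) gives the Gene line followed by the first GO_term line shown above.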


Things to confirm:

  1. We get back from UniProt all C. elegans annotations, regardless of annotation source (e.g., WormBase, UniProt, IntAct, etc).
  2. We get back all data in the with/from column (e.g., variations, RNAi experiments)
  3. How many duplicate annotations from different sources? How do we want to handle that - only take WB or have multiple Curator_confirmed values (I think I favor the latter).
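If we go with multiple Curator_confirmed values for item 3, the consolidation could look roughly like the Python sketch below. The record layout (gene, GO ID, evidence code, paper, curator) and the function name consolidate are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

def consolidate(records):
    """Group (gene, go_id, code, paper, curator) tuples by annotation.

    Returns a dict mapping (gene, go_id, code, paper) to the list of
    distinct curators, in the order they were first seen.
    """
    merged = defaultdict(list)
    for gene, go_id, code, paper, curator in records:
        curators = merged[(gene, go_id, code, paper)]
        if curator not in curators:
            curators.append(curator)
    return dict(merged)
```

Each key would then produce one Paper_evidence line and one Curator_confirmed line per curator in the list.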

A table mapping gpad columns to .ace:

gp_association files (GPAD)

N.B. The first line in the gp_association file should be:

!gpa-version: 1.1
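A quick check for this header could be sketched as follows (Python, for illustration; the function name is hypothetical and the check simply ignores surrounding whitespace):

```python
def has_gpad_header(first_line, expected="!gpa-version: 1.1"):
    """Return True if the first line of the file is the expected version header."""
    return first_line.strip() == expected
```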


Final format (09 Jan 2013, updated 07 Nov 2013)

{|
! column !! name !! required? !! cardinality !! old column # !! extra info !! .ace file equivalent !! how to populate
|-
| 1 || DB || required || 1 || 1 || must be in xrf_abbs || n/a || n/a
|-
| 2 || DB_Object_ID || required || 1 || 2 || canonical or spliceform ID || Gene : "WBGene00000001" || We will need to map UniProtKB IDs to WormBase WBGene IDs, using the current gp2protein file (available here: http://www.geneontology.org/gp2protein/) or, later, the gpi file. If this column contains a UniProtKB ID followed by a dash and then a number, we can ignore the dash and the number for mapping to a WormBase ID.
|-
| 3 || Qualifier || required || 0 or greater || 4 || qualifiers to be confirmed || n/a || Skip lines that have qualifier values preceded by NOT, colocalizes_with, or contributes_to. For example, skip NOT(pipe)enables but parse if just enables. (The explicit relations, involved_in, enables, part_of, have been added to every annotation and are now placed in the qualifier column of each annotation line.)
|-
| 4 || GO ID || required || 1 || 5 || must be extant GO ID || GO_term "GO:0008340" || Can take directly from file
|-
| 5 || DB:Reference(s) || required || 1 or greater || 6 || DB must be in xrf_abbs || Paper_evidence "WBPaper00005614" || We will need to map PMIDs or DOIs to WBPaper IDs using the information contained in the pap_identifier table
|-
| 6 || Evidence code || required || 1 || 7 || from ECO || "IMP" or others (e.g., "IGI") || This column is populated with ECO codes, but the corresponding three-letter GO evidence code is in Column 12 with the heading go_evidence=
|-
| 7 || With (or) From || optional || 0 or greater || 8 || || n/a || n/a
|-
| 8 || Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || NCBI taxon ID || n/a || n/a
|-
| 9 || Date || required || 1 || 14 || YYYYMMDD || Date_last_updated evidence || Can take directly from file
|-
| 10 || Assigned_by || required || 1 || 15 || from xrf_abbs || Curator_confirmed || If WormBase, take the value from Annotation Properties preceded by curator_name=. If not WB, we will need to create Person objects (other thoughts?) for other databases or projects, e.g., UniProt, IntAct, RefGenome
|-
| 11 || Annotation Extension || optional || 0 or greater || 16 || || n/a || n/a
|-
| 12 || Annotation Properties || optional || 0 or greater || || See Note 1 below || GO_evidence, Curator_confirmed || Use this column for GO_evidence and also for Curator_confirmed if Column 10 is WormBase. Order of information has switched here: it is now GO evidence first, then id, then curator.
|}
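The column rules above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the gpToAce.pl code: the function name, the returned dict layout, the uniprot_to_wbgene mapping (assumed to be built from the gp2protein file), the spellings accepted for WormBase in Column 10, and the choice to skip lines whose ID cannot be mapped are all hypothetical. Mapping references to WBPaper IDs is left out.

```python
SKIP_QUALIFIERS = {"NOT", "colocalizes_with", "contributes_to"}

def parse_gpad_line(line, uniprot_to_wbgene, wormbase_names=("WB", "WormBase")):
    """Parse one tab-delimited GPAD line; return a dict, or None if skipped."""
    if line.startswith("!"):                      # header or comment line
        return None
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (12 - len(cols))               # pad missing optional columns
    if set(cols[2].split("|")) & SKIP_QUALIFIERS:  # column 3: qualifier
        return None
    accession = cols[1].split("-")[0]             # column 2: drop "-N" suffix
    gene = uniprot_to_wbgene.get(accession)
    if gene is None:                              # assumption: skip unmapped IDs
        return None
    # Column 12: pipe-separated name=value pairs
    props = dict(p.split("=", 1) for p in cols[11].split("|") if "=" in p)
    from_wb = cols[9] in wormbase_names           # column 10: Assigned_by
    return {
        "gene": gene,
        "go_id": cols[3],                         # column 4
        "references": cols[4].split("|"),         # column 5, still unmapped
        "evidence": props.get("go_evidence"),     # three-letter code from column 12
        "date": cols[8],                          # column 9
        "assigned_by": cols[9],
        "curator": props.get("curator_name") if from_wb else None,
    }
```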

Notes

1. The Annotation Properties column can be filled with a pipe separated list of "property_name = property_value". There will be a fixed vocabulary for the property names and this list can be extended when necessary. The initial supported properties would be curator_name and annotation_identifier*, but can be extended to include e.g. curator_ID, modification_date, creation_date, annotation_notes...etc.

* curator_name and annotation_identifier will be useful for groups that are using Protein2GO for protein annotation who wish to maintain their annotations in their own database. These values can be used to keep track of individual annotations.
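Since the property names come from a fixed vocabulary, a parser for this column can reject anything it does not recognise. The Python sketch below assumes the vocabulary is the three names mentioned on this page (go_evidence, annotation_identifier, curator_name); the function name and the choice to raise an error on unknown names are assumptions for illustration.

```python
KNOWN_PROPERTIES = {"go_evidence", "annotation_identifier", "curator_name"}

def parse_properties(field, known=KNOWN_PROPERTIES):
    """Parse a pipe-separated 'name=value' list into a dict.

    Whitespace around names and values is stripped; an unknown or
    malformed property raises ValueError.
    """
    props = {}
    for item in field.split("|"):
        if not item.strip():
            continue                      # tolerate an empty column
        name, sep, value = item.partition("=")
        name = name.strip()
        if not sep or name not in known:
            raise ValueError(f"unrecognised annotation property: {item!r}")
        props[name] = value.strip()
    return props
```

Extending the vocabulary then only requires adding a name to KNOWN_PROPERTIES.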

Further questions/discussion points

1. Qualifiers column. a. Are the explicit relations mandatory? b. If so, what are they?

2. Evidence column. a. Chain of evidence

3. Annotation properties column. Tony has suggested including the GO evidence code here to avoid using a lookup to reverse engineer the file.