Specifications for Converting GO-CAM GPAD 2.0 to .ace File

WormBase Data Model

https://wormbase.org/tools/schema/run?class=GO_annotation

Source File Specifications

https://geneontology.org/docs/gene-product-association-data-gpad-format-2.0/

Source File Location

http://current.geneontology.org/products/upstream_and_raw_data/noctua_wb.gpad.gz

GO_Annotation IDs - finding the right number from which to start

Consult the gp_annotation.ace GO annotation file (generated first) to find the correct starting number for the GO_annotation objects generated from the Noctua GPAD file.
- See code at lines 45-49 of /home/postgres/work/citace_upload/go_curation/get_go_annotation_ace.pm

 my $annotCounter = 0;
 my $annotFile = '/home/acedb/kimberly/citace_upload/go/gp_annotation.ace';
 open (IN, "<$annotFile") or die "Cannot open $annotFile : $!";
 while (my $line = <IN>) { if ($line =~ m/GO_annotation : "(\d+)"/) { $annotCounter = $1; } }
 close (IN) or die "Cannot close $annotFile : $!";

In this case, I would move the go_annotation.ace file to the go/ directory, as I do for the gp_annotation.ace file above, and numbering would start from the next number after the highest one found in go_annotation.ace.
New file: gocam_annotation.ace
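
A minimal sketch of the sequential lookup described above, assuming each .ace file is scanned in turn and the new file starts one past the highest number found. The sub names highest_annotation_id and next_annotation_id are hypothetical, and the sketch keeps the maximum value rather than the last value seen, which differs from the excerpt above only if the IDs in a file are not in ascending order. Any zero-padding of the final ID is left to the caller.

```perl
use strict;
use warnings;

# Return the highest GO_annotation number found on an open filehandle,
# or $floor when the handle holds no larger number.
sub highest_annotation_id {
    my ($fh, $floor) = @_;
    my $highest = $floor;
    while (my $line = <$fh>) {
        if ($line =~ m/GO_annotation : "(\d+)"/ and $1 > $highest) {
            $highest = $1 + 0;    # force numeric, dropping any leading zeros
        }
    }
    return $highest;
}

# Hypothetical wrapper: scan each .ace file in order and return the
# first unused number for the GO-CAM annotations.
sub next_annotation_id {
    my (@paths) = @_;
    my $highest = 0;
    foreach my $path (@paths) {
        open(my $in, '<', $path) or die "Cannot open $path : $!";
        $highest = highest_annotation_id($in, $highest);
        close($in) or die "Cannot close $path : $!";
    }
    return $highest + 1;
}
```

Calling next_annotation_id with the gp_annotation.ace path followed by the go_annotation.ace path would give the first number to use in gocam_annotation.ace.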

Annotations to Skip

Skip all annotations that are Assigned_by SynGO; take these from UniProt-GOA instead, where the evidence codes are already mapped for the GAF.
- This matters because SynGO uses evidence codes other than the traditional three-letter codes, so they would otherwise need to be mapped.
Skip all annotations that have values with these prefixes in the With/From field:
- EC:
- InterPro:
- PANTHER:
- UniPathway:
- UniProtKB-KW:
- UniProtKB-SubCell:
- UniRule:
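
The skip rules above could be expressed as a single predicate, sketched here under some assumptions: the sub name should_skip_annotation is hypothetical, With/From values are assumed to be separated by '|' or ',', and both the UniProt-* and UniProtKB-* spellings of the keyword and subcellular-location prefixes are accepted.

```perl
use strict;
use warnings;

# With/From prefixes that cause an annotation to be skipped.
# Both UniProt-* and UniProtKB-* spellings are accepted (assumption).
my @SKIP_WITH_PREFIXES = (
    'EC', 'InterPro', 'PANTHER', 'UniPathway',
    'UniProt-KW', 'UniProtKB-KW',
    'UniProt-SubCell', 'UniProtKB-SubCell',
    'UniRule',
);

# Return 1 when the annotation should be skipped, 0 otherwise.
sub should_skip_annotation {
    my ($assigned_by, $with) = @_;
    return 1 if $assigned_by eq 'SynGO';
    foreach my $value (split /[|,]/, $with // '') {
        foreach my $prefix (@SKIP_WITH_PREFIXES) {
            # Prefix must match at the very start of the value.
            return 1 if index($value, $prefix . ':') == 0;
        }
    }
    return 0;
}
```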

Columns and Mappings

Column 1: DB Prefix and DB Object ID

Populate 'Gene' field in ?GO_annotation
- Strip WB prefix and just populate WBGene ID
- As of 2024-12-11, there are no other types of IDs in Column 1
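
As a small sketch of the prefix stripping, assuming Column 1 always has the form WB:WBGene followed by digits (the sub name gene_id_from_column1 is hypothetical). Anything else returns undef so the caller can log it, since other ID types are not expected but could appear later.

```perl
use strict;
use warnings;

# Strip the WB: prefix from Column 1 and return the bare WBGene ID.
# Returns undef for any value that is not a WB:WBGene identifier.
sub gene_id_from_column1 {
    my ($col1) = @_;
    if ($col1 =~ m/^WB:(WBGene\d+)$/) { return $1; }
    return;
}
```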

Column 2: Negation

When Column 2 contains NOT, populate the 'Annotation_relation_not' field in the .ace model with the relation from Column 3
- As of 2024-12-11, there are no NOT annotations in the WB Noctua GPAD

Column 3: Relation

Populate 'Annotation_relation' field with RO identifier

Column 4: GO ID

Populate value directly from GPAD file

Column 5: Reference

Convert PMID to WBPaper
Convert DOI to WBPaper
Populate GO_REF as usual
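
A rough sketch of the reference handling, assuming PMID and DOI lookups come from hashes loaded elsewhere. The hash names %pmidToWbpaper and %doiToWbpaper, the sub name convert_reference, and the placeholder entries are all hypothetical, and GO_REF values are simply passed through unchanged here because the usual GO_REF handling is not spelled out in this section.

```perl
use strict;
use warnings;

# Hypothetical lookup tables with placeholder entries; the real
# mappings would be loaded from WormBase paper data.
my %pmidToWbpaper = ('12345678'        => 'WBPaper00000001');
my %doiToWbpaper  = ('10.1000/example' => 'WBPaper00000002');

# Convert one Column 5 reference. Returns undef when the reference
# has no mapping or an unrecognized prefix, so the caller can log it.
sub convert_reference {
    my ($ref) = @_;
    if ($ref =~ m/^PMID:(\d+)$/) { return $pmidToWbpaper{$1}; }
    if ($ref =~ m/^DOI:(.+)$/)   { return $doiToWbpaper{$1}; }
    if ($ref =~ m/^GO_REF:\d+$/) { return $ref; }    # passed through (assumption)
    return;
}
```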

Column 6: Evidence Code

Populate value directly from GPAD file (i.e. mapping from three-letter code to ECO ID no longer needed)

Column 7: With/From

Existing logic in the go_gpad parser:

 my (@withunis) = split/\|/, $with;
   my @withConverted;
   foreach my $withuni (@withunis) {
     if ($withuni =~ m/^UniProtKB:/) {                         # convert UniProtKB: to wbgene
        $withuni =~ s/^UniProtKB://;
        my $withuniStripped = $withuni;
        $withuniStripped =~ s/\-\d$//g;
        my $wbgWith = $withuni;
        if ($dbidToWb{$withuniStripped}) { $wbgWith = $dbidToWb{$withuniStripped}; push @withConverted, $wbgWith; }
          else {
            push @withConverted, "UniProtKB:$withuni";         # add uniprotkb value if does not map to wormbase value (kimberly 2015 02 05)
            $err .= qq($withuni doesn't map to WBGene from column 8, line $count\n); } }
      else {
        push @withConverted, $withuni; }                           # those that are not UniProtKB: just get added back
   } # foreach my $withuni (@withunis)
   my $withConverted = join"|", @withConverted;

Need to add a branch for the bare UniProt: prefix that prints the same line as the UniProtKB: branch

 foreach my $with (@withs) {
          if ($with =~ m/With:Not_supplied/) { 1; }            # do nothing
            elsif ($with =~ m/^UniProtKB:(\w+)/) {             print ACE qq(Database\t"UniProt"\t"UniProtAcc"\t"$1"\n); }
            elsif ($with =~ m/^InterPro:(IPR\d+)/) {           print ACE qq(Motif\t"INTERPRO:$1"\n); }
            elsif ($with =~ m/^HGNC:(\d+)/) {                  print ACE qq(Database\t"HGNC"\t"HGNCID"\t"$1"\n); }
            elsif ($with =~ m/^MGI:(MGI:\d+)/) {               print ACE qq(Database\t"MGI"\t"MGIID"\t"$1"\n); }
            elsif ($with =~ m/^HAMAP:(MF_\d+)/) {              print ACE qq(Database\t"HAMAP"\t"HAMAP_annotation_rule"\t"$1"\n); }
            elsif ($with =~ m/^EC:([\.\d]+)/) {                print ACE qq(Database\t"KEGG"\t"KEGG_id"\t"$1"\n); }
            elsif ($with =~ m/^UniPathway:(UPA\d+)/) {         print ACE qq(Database\t"UniPathway"\t"Pathway_id"\t"$1"\n); }
            elsif ($with =~ m/^UniProtKB-KW:(KW-\d+)/) {       print ACE qq(Database\t"UniProt"\t"UniProtKB-KW"\t"$1"\n); }
            elsif ($with =~ m/^UniProtKB-SubCell:(SL-\d+)/) {  print ACE qq(Database\t"UniProt"\t"UniProtKB-SubCell"\t"$1"\n); }
            elsif ($with =~ m/^UniRule:(\w+)/) {               print ACE qq(Database\t"UniProt"\t"UniRule"\t"$1"\n); }
            elsif ($with =~ m/^(GO:\d+)/) {                    print ACE qq(Inferred_from_GO_term\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBVar\d+)/) {               print ACE qq(Variation\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBRNAi\d+)/) {              print ACE qq(RNAi_result\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBGene\d+)/) {              print ACE qq(Interacting_gene\t"$1"\n); }
            elsif ($with =~ m/^(WBGene\d+)/) {                 print ACE qq(Interacting_gene\t"$1"\n); }
            elsif ($with =~ m/^PomBase:([\.\w]+)/) {           print ACE qq(Database\t"PomBase"\t"PomBase_systematic_name"\t"$1"\n); }
            elsif ($with =~ m/^SGD:(S\d+)/) {                  print ACE qq(Database\t"SGD"\t"SGDID"\t"$1"\n); }
            elsif ($with =~ m/^PANTHER:(PTN\d+)/) {            print ACE qq(Database\t"Panther"\t"PanTree_node"\t"$1"\n); }
            elsif ($with =~ m/^TAIR:locus:(\d+)/) {            print ACE qq(Database\t"TAIR"\t"TAIR_locus_id"\t"$1"\n); }
            elsif ($with =~ m/^FB:(FBgn\d+)/) {                print ACE qq(Database\t"FLYBASE"\t"FLYBASEID"\t"$1"\n); }
            elsif ($with =~ m/^RGD:(\d+)/) {                   print ACE qq(Database\t"RGD"\t"RGDID"\t"$1"\n); }
            elsif ($with =~ m/^dictyBase:(DDB_G\d+)/) {        print ACE qq(Database\t"dictyBase"\t"dictyBaseID"\t"$1"\n); }
            else {                                             print ERR qq(WITH $with not accounted for in .ace file\n); }
        } # foreach my $with (@withs)

Column 8: Interacting taxon

There are currently no entries for this column; if any appear, the code from the go_gpad parser could be reused

Column 9: Date

YYYYMMDD - need to convert to YYYY-MM-DD
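
A minimal sketch of the date conversion (the sub name ace_date is hypothetical). As an added assumption, a value that already starts with YYYY-MM-DD is passed through with only the date part kept, so the same sub is safe if the file ever supplies hyphenated dates.

```perl
use strict;
use warnings;

# Convert a GPAD date to the YYYY-MM-DD form used in the .ace file.
# Returns undef for anything that is not a recognizable date.
sub ace_date {
    my ($date) = @_;
    if ($date =~ m/^(\d{4})(\d{2})(\d{2})$/) { return "$1-$2-$3"; }
    if ($date =~ m/^(\d{4}-\d{2}-\d{2})/)    { return $1; }    # already hyphenated (assumption)
    return;
}
```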

Column 10: Assigned_by

Convert WB to 'WormBase'.
SynGO: the SynGO annotations will not be taken from this file; however, an ?Analysis object with the relevant details still needs to be uploaded (file done 2018-10-23).

Column 11: Annotation Extensions

Parse as for go_gpad parsing script

 my (@annExtsComma) = split/\|/, $annotExtConverted;
        foreach my $annExtComma (@annExtsComma) {
          my (@annExts) = split/,/, $annExtComma;
          foreach my $annExt (@annExts) {
            my $relation = '';                                    # stays empty when there is no relation(...) wrapper
            if ($annExt =~ m/^(.*?)\((.*?)\)/) { $relation = $1; $annExt = $2; }
            if ($annExt =~ m/(WBls:\d+)/) {                       print ACE qq(Life_stage_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/WB:(WBGene\d+)/) {              print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/(WBGene\d+)/) {                 print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/CHEBI:(\d+)/) {
                if ($chebiToMol{$1}) {                           print ACE qq(Molecule_relation\t"$relation"\t"$chebiToMol{$1}"\n); }
                  else {                                         print ERR qq(CHEBI $1 does not map to WBMol\n); } }
              elsif ($annExt =~ m/(WBbt:\d+)/) {                  print ACE qq(Anatomy_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/(GO:\d+)/) {                    print ACE qq(GO_term_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/UniProtKB:/) {                  1; }   # do nothing
              elsif ($annExt =~ m/FB:/) {                         1; }   # do nothing
              else {                                             print ERR qq(Annotation Extension $annExt not accounted for in .ace file\n); }
          } # foreach my $annExt (@annExts)
        } # foreach my $annExtComma (@annExtsComma)

- Note that these are all RO relations
  - Also, SynGO is using UBERON for occurs_in and evidence codes like ‘knockout’ that don’t make sense for the papers they’re curating.

Column 12: Annotation Properties

Nothing to convert here.
At some point, we may want to think about capturing some of this metadata, but will need thorough review first.
Check curator; I’m still not sure we’re handling this correctly when we import existing annotations into GO-CAM models.

Specifications for Converting GO-CAM GPAD 1.1 to .ace File

Source File Location

http://build.berkeleybop.org/job/export-lego-to-gpad-sparql/lastSuccessfulBuild/artifact/legacy/wb.gpad

GO_Annotation IDs - finding the right number from which to start

Will need to handle similarly to the OA annotations in that we'll need to consult .ace GO annotation files sequentially to get the correct starting number for the GO_annotation objects generated from this GPAD file.
- See code at lines 45-49 of /home/postgres/work/citace_upload/go_curation/get_go_annotation_ace.pm

 my $annotCounter = 0;
 my $annotFile = '/home/acedb/kimberly/citace_upload/go/gp_annotation.ace';
 open (IN, "<$annotFile") or die "Cannot open $annotFile : $!";
 while (my $line = <IN>) { if ($line =~ m/GO_annotation : "(\d+)"/) { $annotCounter = $1; } }
 close (IN) or die "Cannot close $annotFile : $!";

In this case, I would move the go_annotation.ace file to the go/ directory, as I do for the gp_annotation.ace file above, and numbering would start from the next number after the highest one found in go_annotation.ace.
New file: gocam_annotation.ace

Annotations to Skip

Skip all annotations that are Assigned_by SynGO; take these from UniProt-GOA instead, where the evidence codes are already mapped for the GAF.
- This matters because SynGO uses evidence codes other than the traditional three-letter codes, so they would otherwise need to be mapped.
- However, we also need to double-check that all of the SynGO annotations in the Noctua GPAD are also in Protein2GO, and confirm with Tony, Alex, Dustin, Paul, etc. how new SynGO annotations will enter the pipeline.
- In Protein2GO, SynGO annotations have Source 'SYNX'.
- Application of evidence codes looks to be different from GO guidelines.

Skip all annotations that have values with these prefixes in the With/From field:
- EC:
- InterPro:
- PANTHER:
- UniPathway:
- UniProtKB-KW:
- UniProtKB-SubCell:
- UniRule:

Columns and Mappings

Column 1: DB Prefix

Ignore

Column 2: DB Object ID

Populate Gene field in ?GO_annotation (no conversion needed)

Column 3: Relation

Populate Annotation_relation field with RO identifier
- For now, need to map the text string to an RO identifier (no NOT annotations here, yet)
- Subroutine exists in goa gpad parsing script, but we need to add all of the relations
- And check on how NOT is handled

 sub populateAnnotrelToRo {
 $annotToRo{'colocalizes_with'} = 'RO:0002325';
 $annotToRo{'contributes_to'}   = 'RO:0002326';
 $annotToRo{'enables'}          = 'RO:0002327';
 $annotToRo{'involved_in'}      = 'RO:0002331';
 $annotToRo{'part_of'}          = 'BFO:0000050';
 $annotToRo{'acts_upstream_of_or_within'}   = 'RO:0002264';
 $annotToRo{'acts_upstream_of'}   = 'RO:0002263';
 $annotToRo{'acts_upstream_of_or_within_negative_effect'}   = 'RO:0004033';
 $annotToRo{'acts_upstream_of_or_within_positive_effect'}   = 'RO:0004032';
 $annotToRo{'acts_upstream_of_negative_effect'}   = 'RO:0004035';
 $annotToRo{'acts_upstream_of_positive_effect'}   = 'RO:0004034';
 } # sub populateAnnotrelToRo

Column 4: GO ID

GO_term (no conversion needed)

Column 5: Reference

Convert PMID to WBPaper
Convert DOI to WBPaper
Populate GO_REF as usual
PAINT_REF - convert to appropriate WBPaper (but why do we have annotations with PAINT_REF and are they in the GO database somewhere? They need to be updated in the GO database.)

 See lines 217 - 228 in /home/acedb/kimberly/citace_upload/go/gpad2ace/2018_October/go_gpad_parser.pl

Column 6: Evidence Code

Convert ECO ID to 3-letter GO code
- This part will probably need to be written from scratch: Tony maps the ECO ID to the GO code in an annotation properties field, but this GPAD currently doesn't have that.

 ECO:0000250 ISS
 ECO:0000270 IEP
 ECO:0000304 TAS
 ECO:0000307 ND
 ECO:0000314 IDA
 ECO:0000315 IMP
 ECO:0000316 IGI
 ECO:0000318 IBA
 ECO:0000352 IEA (but this looks to have been a mistake somewhere in the pipeline)
 ECO:0000353 IPI
 ECO:0000501 IEA
 ECO:0001225 IMP
 ECO:0001232 IDA
 ECO:0005589 IDA
 ECO:0005593 IDA (needs a manual assertion version)
 ECO:0005611 IDA (but check these as ECO definition is not quite clear/familiar)
 ECO:0006003 IDA
 ECO:0006013 IDA
 ECO:0006062 IDA
 ECO:0006063 IMP (would need to double-check these)
 ECO:0006064 IDA
 ECO:0007007 HEP
 ECO:0007293 IEA (also check this one as it is: experimental evidence used in automatic assertion)
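
Since this conversion needs to be written new, the table above could be loaded into a hash and looked up per annotation, as in this sketch. The names %ecoToGoCode and go_code_for_eco are hypothetical; the entries are copied from the table, including the ones flagged for checking, so they carry the same caveats.

```perl
use strict;
use warnings;

# ECO ID to three-letter GO evidence code, copied from the table above.
# Entries flagged for review in the table are unverified here as well.
my %ecoToGoCode = (
    'ECO:0000250' => 'ISS', 'ECO:0000270' => 'IEP', 'ECO:0000304' => 'TAS',
    'ECO:0000307' => 'ND',  'ECO:0000314' => 'IDA', 'ECO:0000315' => 'IMP',
    'ECO:0000316' => 'IGI', 'ECO:0000318' => 'IBA', 'ECO:0000352' => 'IEA',
    'ECO:0000353' => 'IPI', 'ECO:0000501' => 'IEA', 'ECO:0001225' => 'IMP',
    'ECO:0001232' => 'IDA', 'ECO:0005589' => 'IDA', 'ECO:0005593' => 'IDA',
    'ECO:0005611' => 'IDA', 'ECO:0006003' => 'IDA', 'ECO:0006013' => 'IDA',
    'ECO:0006062' => 'IDA', 'ECO:0006063' => 'IMP', 'ECO:0006064' => 'IDA',
    'ECO:0007007' => 'HEP', 'ECO:0007293' => 'IEA',
);

# Return the GO evidence code for an ECO ID, or undef when the ECO ID
# is not in the table so the caller can report it.
sub go_code_for_eco {
    my ($eco) = @_;
    return $ecoToGoCode{$eco};
}
```

Returning undef for unmapped ECO IDs, rather than a default code, makes new ECO terms in the GPAD show up in the error log instead of being silently mislabeled.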

Column 7: With/From

Existing logic in the go_gpad parser:

 my (@withunis) = split/\|/, $with;
   my @withConverted;
   foreach my $withuni (@withunis) {
     if ($withuni =~ m/^UniProtKB:/) {                         # convert UniProtKB: to wbgene
        $withuni =~ s/^UniProtKB://;
        my $withuniStripped = $withuni;
        $withuniStripped =~ s/\-\d$//g;
        my $wbgWith = $withuni;
        if ($dbidToWb{$withuniStripped}) { $wbgWith = $dbidToWb{$withuniStripped}; push @withConverted, $wbgWith; }
          else {
            push @withConverted, "UniProtKB:$withuni";         # add uniprotkb value if does not map to wormbase value (kimberly 2015 02 05)
            $err .= qq($withuni doesn't map to WBGene from column 8, line $count\n); } }
      else {
        push @withConverted, $withuni; }                           # those that are not UniProtKB: just get added back
   } # foreach my $withuni (@withunis)
   my $withConverted = join"|", @withConverted;

Need to add a branch for the bare UniProt: prefix that prints the same line as the UniProtKB: branch

 foreach my $with (@withs) {
          if ($with =~ m/With:Not_supplied/) { 1; }            # do nothing
            elsif ($with =~ m/^UniProtKB:(\w+)/) {             print ACE qq(Database\t"UniProt"\t"UniProtAcc"\t"$1"\n); }
            elsif ($with =~ m/^InterPro:(IPR\d+)/) {           print ACE qq(Motif\t"INTERPRO:$1"\n); }
            elsif ($with =~ m/^HGNC:(\d+)/) {                  print ACE qq(Database\t"HGNC"\t"HGNCID"\t"$1"\n); }
            elsif ($with =~ m/^MGI:(MGI:\d+)/) {               print ACE qq(Database\t"MGI"\t"MGIID"\t"$1"\n); }
            elsif ($with =~ m/^HAMAP:(MF_\d+)/) {              print ACE qq(Database\t"HAMAP"\t"HAMAP_annotation_rule"\t"$1"\n); }
            elsif ($with =~ m/^EC:([\.\d]+)/) {                print ACE qq(Database\t"KEGG"\t"KEGG_id"\t"$1"\n); }
            elsif ($with =~ m/^UniPathway:(UPA\d+)/) {         print ACE qq(Database\t"UniPathway"\t"Pathway_id"\t"$1"\n); }
            elsif ($with =~ m/^UniProtKB-KW:(KW-\d+)/) {       print ACE qq(Database\t"UniProt"\t"UniProtKB-KW"\t"$1"\n); }
            elsif ($with =~ m/^UniProtKB-SubCell:(SL-\d+)/) {  print ACE qq(Database\t"UniProt"\t"UniProtKB-SubCell"\t"$1"\n); }
            elsif ($with =~ m/^UniRule:(\w+)/) {               print ACE qq(Database\t"UniProt"\t"UniRule"\t"$1"\n); }
            elsif ($with =~ m/^(GO:\d+)/) {                    print ACE qq(Inferred_from_GO_term\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBVar\d+)/) {               print ACE qq(Variation\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBRNAi\d+)/) {              print ACE qq(RNAi_result\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBGene\d+)/) {              print ACE qq(Interacting_gene\t"$1"\n); }
            elsif ($with =~ m/^(WBGene\d+)/) {                 print ACE qq(Interacting_gene\t"$1"\n); }
            elsif ($with =~ m/^PomBase:([\.\w]+)/) {           print ACE qq(Database\t"PomBase"\t"PomBase_systematic_name"\t"$1"\n); }
            elsif ($with =~ m/^SGD:(S\d+)/) {                  print ACE qq(Database\t"SGD"\t"SGDID"\t"$1"\n); }
            elsif ($with =~ m/^PANTHER:(PTN\d+)/) {            print ACE qq(Database\t"Panther"\t"PanTree_node"\t"$1"\n); }
            elsif ($with =~ m/^TAIR:locus:(\d+)/) {            print ACE qq(Database\t"TAIR"\t"TAIR_locus_id"\t"$1"\n); }
            elsif ($with =~ m/^FB:(FBgn\d+)/) {                print ACE qq(Database\t"FLYBASE"\t"FLYBASEID"\t"$1"\n); }
            elsif ($with =~ m/^RGD:(\d+)/) {                   print ACE qq(Database\t"RGD"\t"RGDID"\t"$1"\n); }
            elsif ($with =~ m/^dictyBase:(DDB_G\d+)/) {        print ACE qq(Database\t"dictyBase"\t"dictyBaseID"\t"$1"\n); }
            else {                                             print ERR qq(WITH $with not accounted for in .ace file\n); }
        } # foreach my $with (@withs)

Column 8: Interacting taxon

There are currently no entries for this column; if any appear, the code from the go_gpad parser could be reused

Column 9: Date

YYYYMMDD - need to convert to YYYY-MM-DD

Column 10: Assigned_by

Convert WB to 'WormBase'.
SynGO: the SynGO annotations will not be taken from this file; however, an ?Analysis object with the relevant details still needs to be uploaded (file done 2018-10-23).

Column 11: Annotation Extensions

Parse as for go_gpad parsing script

 my (@annExtsComma) = split/\|/, $annotExtConverted;
        foreach my $annExtComma (@annExtsComma) {
          my (@annExts) = split/,/, $annExtComma;
          foreach my $annExt (@annExts) {
            my $relation = '';                                    # stays empty when there is no relation(...) wrapper
            if ($annExt =~ m/^(.*?)\((.*?)\)/) { $relation = $1; $annExt = $2; }
            if ($annExt =~ m/(WBls:\d+)/) {                       print ACE qq(Life_stage_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/WB:(WBGene\d+)/) {              print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/(WBGene\d+)/) {                 print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/CHEBI:(\d+)/) {
                if ($chebiToMol{$1}) {                           print ACE qq(Molecule_relation\t"$relation"\t"$chebiToMol{$1}"\n); }
                  else {                                         print ERR qq(CHEBI $1 does not map to WBMol\n); } }
              elsif ($annExt =~ m/(WBbt:\d+)/) {                  print ACE qq(Anatomy_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/(GO:\d+)/) {                    print ACE qq(GO_term_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/UniProtKB:/) {                  1; }   # do nothing
              elsif ($annExt =~ m/FB:/) {                         1; }   # do nothing
              else {                                             print ERR qq(Annotation Extension $annExt not accounted for in .ace file\n); }
          } # foreach my $annExt (@annExts)
        } # foreach my $annExtComma (@annExtsComma)

- Note that these are all RO relations
  - Also, SynGO is using UBERON for occurs_in and evidence codes like ‘knockout’ that don’t make sense for the papers they’re curating.

Column 12: Annotation Properties

Nothing to convert here.
At some point, we may want to think about capturing some of this metadata, but will need thorough review first.
Check curator; I’m still not sure we’re handling this correctly when we import existing annotations into GO-CAM models.

GO-CAM GPAD
