
Specifications for Converting GO-CAM GPAD to .ace File

Source File Location

GO_Annotation IDs - finding the right number from which to start

  • Handle this the same way as the OA annotations: consult the .ace GO annotation files sequentially to find the correct starting number for the GO_annotation objects generated from this GPAD file.
    • See code at lines 45-49 of /home/postgres/work/citace_upload/go_curation/get_go_annotation_ace.pm
 my $annotCounter = 0;
 my $annotFile = '/home/acedb/kimberly/citace_upload/go/gp_annotation.ace';
 open (IN, "<$annotFile") or die "Cannot open $annotFile : $!";
 while (my $line = <IN>) { if ($line =~ m/GO_annotation : "(\d+)"/) { $annotCounter = $1; } }
 close (IN) or die "Cannot close $annotFile : $!";
  • In this case, I would move the go_annotation.ace file to the go/ directory, as I do for the gp_annotation.ace file above, and we would start numbering from the next number after the highest one found in go_annotation.ace.
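
The sequential lookup described above could be sketched as follows. This is a minimal sketch, assuming every file contains GO_annotation : "<number>" header lines as matched by the code above. The sub name highest_annotation_id is hypothetical, and taking the maximum across files (rather than the last match in one file, as the existing lines do) is a defensive assumption.

```perl
use strict;
use warnings;

# Return the highest GO_annotation ID found across the given .ace files.
# Hypothetical helper: the existing script keeps the last ID seen in one file.
sub highest_annotation_id {
    my (@files) = @_;
    my $max = 0;
    foreach my $file (@files) {
        open(my $in, '<', $file) or die "Cannot open $file : $!";
        while (my $line = <$in>) {
            if ($line =~ m/GO_annotation : "(\d+)"/) {
                my $id = $1 + 0;              # numeric value, drops any zero padding
                $max = $id if $id > $max;
            }
        }
        close($in) or die "Cannot close $file : $!";
    }
    return $max;
}
```

It would be called with the gp_annotation.ace and go_annotation.ace paths, in that order, and the first new GO_annotation object would get the returned value plus one.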

Annotations to Skip

  • Skip all annotations that are Assigned_by SynGO and take them from UniProt-GOA instead, because there the evidence codes are already mapped for the GAF.
    • This matters because SynGO uses evidence codes other than the three-letter GO codes, so its codes must be mapped to the traditional three-letter codes.
    • However, we also need to double-check that all of the SynGO annotations in the Noctua GPAD are also in Protein2GO, and confirm with Tony, Alex, Dustin, Paul, etc. how new SynGO annotations will enter the pipeline.
    • In Protein2GO, SynGO annotations have Source 'SYNX'.
    • SynGO's application of evidence codes appears to differ from GO guidelines.
  • Skip all annotations that have these values in the With/From field:
    • EC:
    • InterPro:
    • PANTHER:
    • UniPathway:
    • UniProt-KW:
    • UniProt-SubCell:
    • UniRule:
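
The two skip rules above could be combined into one check. This is a sketch only: the sub name skipAnnotation is hypothetical, the prefixes are copied from the list above, and it assumes that the With/From value is pipe-separated and that a single matching entry is enough to skip the annotation.

```perl
use strict;
use warnings;

# With/From prefixes to skip, as listed above (without the trailing colon).
my @skipWithPrefixes = ('EC', 'InterPro', 'PANTHER', 'UniPathway',
                        'UniProt-KW', 'UniProt-SubCell', 'UniRule');

# Return 1 if the annotation should be skipped, 0 otherwise.
# $assignedBy is the Assigned_by value; $with is the raw With/From value.
sub skipAnnotation {
    my ($assignedBy, $with) = @_;
    return 1 if $assignedBy eq 'SynGO';
    foreach my $entry (split /\|/, $with) {
        foreach my $prefix (@skipWithPrefixes) {
            # Match only at the start of the entry, including the colon.
            return 1 if index($entry, "$prefix:") == 0;
        }
    }
    return 0;
}
```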

Columns and Mappings

Column 1: DB Prefix

  • Ignore

Column 2: DB Object ID

  • Populate Gene field in ?GO_annotation (no conversion needed)

Column 3: Relation

  • Populate Annotation_relation field with RO identifier
    • For now, need to map the text string to an RO identifier (there are no NOT annotations here yet)
    • A subroutine exists in the goa gpad parsing script, but we need to add all of the relations
    • Also need to check how NOT is handled
 sub populateAnnotrelToRo {
 $annotToRo{'colocalizes_with'} = 'RO:0002325';
 $annotToRo{'contributes_to'}   = 'RO:0002326';
 $annotToRo{'enables'}          = 'RO:0002327';
 $annotToRo{'involved_in'}      = 'RO:0002331';
 $annotToRo{'part_of'}          = 'BFO:0000050';
 $annotToRo{'acts_upstream_of_or_within'}   = 'RO:0002264';
 $annotToRo{'acts_upstream_of'}   = 'RO:0002263';
 $annotToRo{'acts_upstream_of_or_within_negative_effect'}   = 'RO:0004033';
 $annotToRo{'acts_upstream_of_or_within_positive_effect'}   = 'RO:0004032';
 $annotToRo{'acts_upstream_of_negative_effect'}   = 'RO:0004035';
 $annotToRo{'acts_upstream_of_positive_effect'}   = 'RO:0004034';
 } # sub populateAnnotrelToRo
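
Since NOT handling is still to be checked, the following is only one possible sketch. It assumes that a negated annotation would carry a leading "NOT|" token in the Relation column, which needs to be confirmed against the actual file. The sub name parseRelation is hypothetical, and the hash holds just a subset of the table above to keep the example short.

```perl
use strict;
use warnings;

# Subset of the relation table above, for illustration only.
my %annotToRo = (
    'enables'     => 'RO:0002327',
    'involved_in' => 'RO:0002331',
    'part_of'     => 'BFO:0000050',
);

# Split a Relation column value into (is_not, RO identifier).
# Assumes negation, if present, is a leading "NOT|" token; the identifier
# is undef when the relation text is not in the table.
sub parseRelation {
    my ($column) = @_;
    my @tokens = split /\|/, $column;
    my $isNot  = (@tokens && $tokens[0] eq 'NOT') ? 1 : 0;
    shift @tokens if $isNot;
    my $relation = @tokens ? $tokens[0] : '';
    return ($isNot, $annotToRo{$relation});
}
```

An undefined identifier would be written to the error log rather than to the .ace file, so that relations missing from the table are noticed.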

Column 4: GO ID

  • GO_term (no conversion needed)

Column 5: Reference

  • Convert PMID to WBPaper
  • Convert DOI to WBPaper
  • Populate GO_REF as usual
  • PAINT_REF - convert to the appropriate WBPaper. (Open question: why do we have annotations with PAINT_REF, and are they in the GO database somewhere? They need to be updated in the GO database.)
 See lines 217 - 228 in /home/acedb/kimberly/citace_upload/go/gpad2ace/2018_October/go_gpad_parser.pl
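
The reference handling above could be sketched as below. This is not the code at the lines cited: the sub name convertReference and the two lookup hashes are hypothetical, and PAINT_REF is deliberately left unmapped because the appropriate WBPaper is not specified here.

```perl
use strict;
use warnings;

# Convert one Reference column value to a WBPaper ID where possible.
# $pmidToPaper and $doiToPaper are hypothetical lookup hashes (hash refs)
# keyed by the bare PMID or DOI; GO_REF values pass through unchanged.
# Returns undef when no mapping is found so the caller can log it.
sub convertReference {
    my ($ref, $pmidToPaper, $doiToPaper) = @_;
    if    ($ref =~ m/^PMID:(\d+)$/)  { return $pmidToPaper->{$1}; }
    elsif ($ref =~ m/^DOI:(\S+)$/i)  { return $doiToPaper->{$1}; }
    elsif ($ref =~ m/^GO_REF:\d+$/)  { return $ref; }
    return undef;                    # PAINT_REF and anything else: unmapped
}
```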

Column 6: Evidence Code

  • Convert ECO ID to 3-letter GO code
    • This part will probably need to be written from scratch: Tony maps the ECO ID to the GO code in an annotation properties field, but this GPAD currently doesn't have that.
 ECO:0000250 ISS
 ECO:0000270 IEP
 ECO:0000304 TAS
 ECO:0000307 ND
 ECO:0000314 IDA
 ECO:0000315 IMP
 ECO:0000316 IGI
 ECO:0000318 IBA
 ECO:0000352 IEA (but this looks to have been a mistake somewhere in the pipeline)
 ECO:0000353 IPI
 ECO:0000501 IEA
 ECO:0001225 IMP
 ECO:0001232 IDA
 ECO:0005589 IDA
 ECO:0005611 IDA (but check these as ECO definition is not quite clear/familiar)
 ECO:0007007 HEP
 ECO:0007293 IEA (also check this one as it is: experimental evidence used in automatic assertion)
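
The table above translates directly into a lookup hash. This is a minimal sketch: the names %ecoToGo and ecoToGoCode are hypothetical, and the entries flagged above as questionable are copied as listed, with comments, pending review.

```perl
use strict;
use warnings;

# ECO ID to three-letter GO evidence code, copied from the table above.
my %ecoToGo = (
    'ECO:0000250' => 'ISS',
    'ECO:0000270' => 'IEP',
    'ECO:0000304' => 'TAS',
    'ECO:0000307' => 'ND',
    'ECO:0000314' => 'IDA',
    'ECO:0000315' => 'IMP',
    'ECO:0000316' => 'IGI',
    'ECO:0000318' => 'IBA',
    'ECO:0000352' => 'IEA',    # flagged above as a likely pipeline mistake
    'ECO:0000353' => 'IPI',
    'ECO:0000501' => 'IEA',
    'ECO:0001225' => 'IMP',
    'ECO:0001232' => 'IDA',
    'ECO:0005589' => 'IDA',
    'ECO:0005611' => 'IDA',    # flagged above: check the ECO definition
    'ECO:0007007' => 'HEP',
    'ECO:0007293' => 'IEA',    # flagged above: check this one
);

# Return the GO evidence code for an ECO ID, or undef if it is not mapped.
sub ecoToGoCode {
    my ($eco) = @_;
    return $ecoToGo{$eco};
}
```

Returning undef for an unmapped ECO ID lets the caller write the line to the error log instead of emitting an annotation with a guessed code.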

Column 7: With/From

  • Existing code in the go_gpad parser:
 my (@withunis) = split/\|/, $with;
   my @withConverted;
   foreach my $withuni (@withunis) {
     if ($withuni =~ m/^UniProtKB:/) {                         # convert UniProtKB: to wbgene
        $withuni =~ s/^UniProtKB://;
        my $withuniStripped = $withuni;
        $withuniStripped =~ s/\-\d$//g;
        my $wbgWith = $withuni;
        if ($dbidToWb{$withuniStripped}) { $wbgWith = $dbidToWb{$withuniStripped}; push @withConverted, $wbgWith; }
          else {
            push @withConverted, "UniProtKB:$withuni";         # add uniprotkb value if does not map to wormbase value (kimberly 2015 02 05)
            $err .= qq($withuni doesn't map to WBGene from column 8, line $count\n); } }
      else {
        push @withConverted, $withuni; }                           # those that are not UniProtKB: just get added back
   } # foreach my $withuni (@withunis)
   my $withConverted = join"|", @withConverted;

Need to add a branch for the bare UniProt: prefix that prints the same output as the UniProtKB: branch below.

 foreach my $with (@withs) {
          if ($with =~ m/With:Not_supplied/) { 1; }            # do nothing
            elsif ($with =~ m/^UniProtKB:(\w+)/) {             print ACE qq(Database\t"UniProt"\t"UniProtAcc"\t"$1"\n); }
            elsif ($with =~ m/^InterPro:(IPR\d+)/) {           print ACE qq(Motif\t"INTERPRO:$1"\n); }
            elsif ($with =~ m/^HGNC:(\d+)/) {                  print ACE qq(Database\t"HGNC"\t"HGNCID"\t"$1"\n); }
            elsif ($with =~ m/^MGI:(MGI:\d+)/) {               print ACE qq(Database\t"MGI"\t"MGIID"\t"$1"\n); }
            elsif ($with =~ m/^HAMAP:(MF_\d+)/) {              print ACE qq(Database\t"HAMAP"\t"HAMAP_annotation_rule"\t"$1"\n); }
            elsif ($with =~ m/^EC:([\.\d]+)/) {                print ACE qq(Database\t"KEGG"\t"KEGG_id"\t"$1"\n); }
            elsif ($with =~ m/^UniPathway:(UPA\d+)/) {         print ACE qq(Database\t"UniPathway"\t"Pathway_id"\t"$1"\n); }
            elsif ($with =~ m/^UniProtKB-KW:(KW-\d+)/) {       print ACE qq(Database\t"UniProt"\t"UniProtKB-KW"\t"$1"\n); }
            elsif ($with =~ m/^UniProtKB-SubCell:(SL-\d+)/) {  print ACE qq(Database\t"UniProt"\t"UniProtKB-SubCell"\t"$1"\n); }
            elsif ($with =~ m/^UniRule:(\w+)/) {               print ACE qq(Database\t"UniProt"\t"UniRule"\t"$1"\n); }
            elsif ($with =~ m/^(GO:\d+)/) {                    print ACE qq(Inferred_from_GO_term\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBVar\d+)/) {               print ACE qq(Variation\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBRNAi\d+)/) {              print ACE qq(RNAi_result\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBGene\d+)/) {              print ACE qq(Interacting_gene\t"$1"\n); }
            elsif ($with =~ m/^(WBGene\d+)/) {                 print ACE qq(Interacting_gene\t"$1"\n); }
            elsif ($with =~ m/^PomBase:([\.\w]+)/) {           print ACE qq(Database\t"PomBase"\t"PomBase_systematic_name"\t"$1"\n); }
            elsif ($with =~ m/^SGD:(S\d+)/) {                  print ACE qq(Database\t"SGD"\t"SGDID"\t"$1"\n); }
            elsif ($with =~ m/^PANTHER:(PTN\d+)/) {            print ACE qq(Database\t"Panther"\t"PanTree_node"\t"$1"\n); }
            elsif ($with =~ m/^TAIR:locus:(\d+)/) {            print ACE qq(Database\t"TAIR"\t"TAIR_locus_id"\t"$1"\n); }
            elsif ($with =~ m/^FB:(FBgn\d+)/) {                print ACE qq(Database\t"FLYBASE"\t"FLYBASEID"\t"$1"\n); }
            elsif ($with =~ m/^RGD:(\d+)/) {                   print ACE qq(Database\t"RGD"\t"RGDID"\t"$1"\n); }
            elsif ($with =~ m/^dictyBase:(DDB_G\d+)/) {        print ACE qq(Database\t"dictyBase"\t"dictyBaseID"\t"$1"\n); }
             else {                                             print ERR qq(WITH $with not accounted for in .ace file\n); }
        } # foreach my $with (@withs)
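
One way to cover both the UniProt: and UniProtKB: prefixes is a single pattern with an optional "KB". This is a sketch, not the parser's code: the sub name uniprotWithLine is hypothetical, and it returns the line rather than printing to ACE so that it can be tested on its own.

```perl
use strict;
use warnings;

# Return the Database line for a UniProt: or UniProtKB: With/From entry,
# or undef for any other prefix (including UniProtKB-KW: and similar).
sub uniprotWithLine {
    my ($with) = @_;
    if ($with =~ m/^UniProt(?:KB)?:(\w+)/) {
        return qq(Database\t"UniProt"\t"UniProtAcc"\t"$1"\n);
    }
    return undef;
}
```

Because the colon must follow "UniProt" or "UniProtKB" directly, prefixes such as UniProtKB-KW: and UniProtKB-SubCell: still fall through to their own branches.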

Column 8: Interacting taxon

  • Currently don’t have any entries for this, but could reuse code from go_gpad parser

Column 9: Date

  • YYYYMMDD - (no conversion needed)

Column 10: Assigned_by

  • WB will need to be converted to 'WormBase'
  • SynGO (we will not take the SynGO annotations from this file; however, we need to upload an ?Analysis object that has the relevant details - file done 2018-10-23)
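
The Assigned_by conversion is a simple lookup. In this sketch the names %assignedByToAce and convertAssignedBy are hypothetical, and passing any other value through unchanged is an assumption.

```perl
use strict;
use warnings;

# Assigned_by values that need renaming for the .ace file.
my %assignedByToAce = ('WB' => 'WormBase');

# Return the .ace value for an Assigned_by entry; unlisted values are
# returned unchanged (an assumption, not confirmed by the specification).
sub convertAssignedBy {
    my ($assignedBy) = @_;
    return exists $assignedByToAce{$assignedBy} ? $assignedByToAce{$assignedBy}
                                                : $assignedBy;
}
```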

Column 11: Annotation Extensions

  • Parse as in the go_gpad parsing script:
 my (@annExtsComma) = split/\|/, $annotExtConverted;
        foreach my $annExtComma (@annExtsComma) {
          my (@annExts) = split/,/, $annExtComma;
          foreach my $annExt (@annExts) {
             my $relation = '';
            if ($annExt =~ m/^(.*?)\((.*?)\)/) { $relation = $1; $annExt = $2; }
            if ($annExt =~ m/(WBls:\d+)/) {                       print ACE qq(Life_stage_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/WB:(WBGene\d+)/) {              print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/(WBGene\d+)/) {                 print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/CHEBI:(\d+)/) {
                if ($chebiToMol{$1}) {                           print ACE qq(Molecule_relation\t"$relation"\t"$chebiToMol{$1}"\n); }
                  else {                                         print ERR qq(CHEBI $1 does not map to WBMol\n); } }
              elsif ($annExt =~ m/(WBbt:\d+)/) {                  print ACE qq(Anatomy_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/(GO:\d+)/) {                    print ACE qq(GO_term_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/UniProtKB:/) {                  1; }
              elsif ($annExt =~ m/FB:/) {                         1; }
               else {                                             print ERR qq(Annotation Extension $annExt not accounted for in .ace file\n); }
           } # foreach my $annExt (@annExts)
         } # foreach my $annExtComma (@annExtsComma)
    • Note that these are all RO relations
      • Also, SynGO is using UBERON for occurs_in, as well as evidence codes like ‘knockout’ that don’t make sense for the papers they’re curating.

Column 12: Annotation Properties

  • Nothing to convert here.
  • At some point, we may want to think about capturing some of this metadata, but will need thorough review first.
  • Check curator; I’m still not sure we’re handling this correctly when we import existing annotations into GO-CAM models.
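
The twelve-column layout described above could be read into a hash per line as sketched below. The key names and the sub name parseGpadLine are hypothetical. The sketch assumes tab-delimited columns in the order given in this specification, with header lines starting with "!".

```perl
use strict;
use warnings;

# Hypothetical key names, in the column order given in this specification.
my @gpadColumns = qw(db db_object_id relation go_id reference evidence
                     with_from interacting_taxon date assigned_by
                     annotation_extensions annotation_properties);

# Split one tab-delimited GPAD line into a hash ref keyed by column name.
# Returns undef for header lines (starting with "!") and blank lines.
sub parseGpadLine {
    my ($line) = @_;
    chomp $line;
    return undef if $line =~ m/^!/ || $line !~ m/\S/;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty columns
    my %row;
    @row{@gpadColumns} = map { defined $_ ? $_ : '' } @fields[0 .. $#gpadColumns];
    return \%row;
}
```

Each per-column conversion above would then take its input from the corresponding key, for example $row->{evidence} for the ECO mapping.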