GO-CAM GPAD

From WormBaseWiki

Latest revision as of 22:10, 5 November 2018

Specifications for Converting GO-CAM GPAD to .ace File

Source File Location

GO_Annotation IDs - finding the right number from which to start

  • Handle these like the OA annotations: consult the .ace GO annotation files sequentially to get the correct starting number for the GO_annotation objects generated from this GPAD file.
    • See code at lines 45-49 of /home/postgres/work/citace_upload/go_curation/get_go_annotation_ace.pm
 my $annotCounter = 0;
 my $annotFile = '/home/acedb/kimberly/citace_upload/go/gp_annotation.ace';
 open (IN, "<$annotFile") or die "Cannot open $annotFile : $!";
 while (my $line = <IN>) { if ($line =~ m/GO_annotation : "(\d+)"/) { $annotCounter = $1; } }
 close (IN) or die "Cannot close $annotFile : $!";
  • In this case, I would move the go_annotation.ace file to the go/ directory as I do for the gp_annotation.ace file above and we would start from the next highest number after consulting the go_annotation.ace file.
  • New file: gocam_annotation.ace
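
The sequential lookup described above could be sketched as follows. This is a minimal sketch, not the code in get_go_annotation_ace.pm: the sub name highest_annotation_id is hypothetical, and it keeps the largest ID seen rather than the last one, on the assumption that IDs may not be strictly ascending across files.

```perl
use strict;
use warnings;

# Hypothetical helper: return the highest GO_annotation ID found in the given .ace files.
# Returns 0 when no file is given or no GO_annotation object is found.
sub highest_annotation_id {
  my (@files) = @_;
  my $max = 0;
  foreach my $file (@files) {
    open (my $in, '<', $file) or die "Cannot open $file : $!";
    while (my $line = <$in>) {
      if ($line =~ m/GO_annotation : "(\d+)"/) {
        my $id = 0 + $1;                       # force numeric, drops any zero padding
        if ($id > $max) { $max = $id; }
      }
    }
    close ($in) or die "Cannot close $file : $!";
  }
  return $max;
}
```

Called with the gp_annotation.ace and go_annotation.ace files from the go/ directory, the first object written to gocam_annotation.ace would be numbered one above the returned value.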

Annotations to Skip

  • Skip all annotations that are Assigned_by SynGO and get these from UniProt-GOA because the evidence codes are already mapped for us for GAF.
    • The reason this matters is that SynGO uses evidence codes other than the three-letter codes and so their evidence codes need to be mapped to the traditional three-letter codes.
    • We also need to double-check that all of the SynGO annotations in Noctua GPAD are also in Protein2GO, and confirm with Tony, Alex, Dustin, Paul, etc. how new SynGO annotations will enter the pipeline.
    • In Protein2GO, SynGO annotations have Source 'SYNX'.
    • Application of evidence codes looks to be different from GO guidelines.
  • Skip all annotations that have these values in the With/From field:
    • EC:
    • InterPro:
    • PANTHER:
    • UniPathway:
    • UniProt-KW:
    • UniProt-SubCell:
    • UniRule:
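
The two skip rules above could be applied per GPAD row roughly like this. This is a sketch under assumptions: the sub name skip_annotation is hypothetical, the prefixes are copied as spelled in the list above (the file itself may use UniProtKB-KW: and UniProtKB-SubCell:), and a row is skipped if any pipe-separated With/From value carries one of the prefixes.

```perl
use strict;
use warnings;

# Prefixes copied from the list above; adjust if the GPAD file spells them differently.
my @skipWithPrefixes = ('EC:', 'InterPro:', 'PANTHER:', 'UniPathway:',
                        'UniProt-KW:', 'UniProt-SubCell:', 'UniRule:');

# Hypothetical helper: returns 1 if the row should be skipped, 0 otherwise.
sub skip_annotation {
  my ($assignedBy, $with) = @_;
  $with = '' unless (defined $with);
  if ($assignedBy eq 'SynGO') { return 1; }              # taken from UniProt-GOA instead
  foreach my $value (split /\|/, $with) {                # With/From values are pipe-separated
    foreach my $prefix (@skipWithPrefixes) {
      if (index($value, $prefix) == 0) { return 1; }     # value starts with a skipped prefix
    }
  }
  return 0;
}
```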

Columns and Mappings

Column 1: DB Prefix

  • Ignore

Column 2: DB Object ID

  • Populate Gene field in ?GO_annotation (no conversion needed)

Column 3: Relation

  • Populate Annotation_relation field with RO identifier
    • For now, need to map the text string to an RO identifier (no NOT annotations here, yet)
    • Subroutine exists in goa gpad parsing script, but we need to add all of the relations
    • And check on how NOT is handled
 sub populateAnnotrelToRo {
 $annotToRo{'colocalizes_with'} = 'RO:0002325';
 $annotToRo{'contributes_to'}   = 'RO:0002326';
 $annotToRo{'enables'}          = 'RO:0002327';
 $annotToRo{'involved_in'}      = 'RO:0002331';
 $annotToRo{'part_of'}          = 'BFO:0000050';
 $annotToRo{'acts_upstream_of_or_within'}   = 'RO:0002264';
 $annotToRo{'acts_upstream_of'}   = 'RO:0002263';
 $annotToRo{'acts_upstream_of_or_within_negative_effect'}   = 'RO:0004033';
 $annotToRo{'acts_upstream_of_or_within_positive_effect'}   = 'RO:0004032';
 $annotToRo{'acts_upstream_of_negative_effect'}   = 'RO:0004035';
 $annotToRo{'acts_upstream_of_positive_effect'}   = 'RO:0004034';
 } # sub populateAnnotrelToRo
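
One possible way to look up the relation while also detecting NOT is sketched below. It assumes that NOT would arrive as a pipe-separated part of the qualifier (for example NOT|enables), which still needs to be checked against the actual file. The sub name parse_qualifier is hypothetical, and the hash holds only a subset of the mappings above for illustration.

```perl
use strict;
use warnings;

# Subset of the relation mappings above, for illustration only.
my %annotToRo = (
  'enables'     => 'RO:0002327',
  'involved_in' => 'RO:0002331',
  'part_of'     => 'BFO:0000050',
);

# Hypothetical helper: split a qualifier into a NOT flag and an RO identifier.
# Dies on a relation that has no mapping so that it is not silently dropped.
sub parse_qualifier {
  my ($qualifier) = @_;
  my $isNot = 0;
  my $ro = '';
  foreach my $part (split /\|/, $qualifier) {
    if ($part eq 'NOT') { $isNot = 1; }
      elsif ($annotToRo{$part}) { $ro = $annotToRo{$part}; }
      else { die "Relation $part has no RO mapping\n"; }
  }
  return ($isNot, $ro);
}
```

How a NOT flag should then be recorded in the ?GO_annotation object is left open here.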

Column 4: GO ID

  • GO_term (no conversion needed)

Column 5: Reference

  • Convert PMID to WBPaper
  • Convert DOI to WBPaper
  • Populate GO_REF as usual
  • PAINT_REF - convert to the appropriate WBPaper. (Open question: why do we have annotations with PAINT_REF, and are they in the GO database somewhere? They need to be updated in the GO database.)
 See lines 217 - 228 in /home/acedb/kimberly/citace_upload/go/gpad2ace/2018_October/go_gpad_parser.pl
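
The reference handling could look roughly like the sketch below. It is not the code from go_gpad_parser.pl: the sub name convert_reference and the two lookup hashes are hypothetical, and passing GO_REF through unchanged is an assumption about what "as usual" means. PAINT_REF is deliberately left unmapped because the right WBPaper is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one GPAD reference to the value used in the .ace file.
# $pmidToPaper and $doiToPaper are hash references supplied by the caller (assumed lookups).
# Returns undef when the reference cannot be converted and needs review.
sub convert_reference {
  my ($ref, $pmidToPaper, $doiToPaper) = @_;
  if ($ref =~ m/^PMID:(\d+)$/)  { return $pmidToPaper->{$1}; }
  if ($ref =~ m/^DOI:(.+)$/)    { return $doiToPaper->{$1}; }
  if ($ref =~ m/^GO_REF:\d+$/)  { return $ref; }            # assumed to pass through unchanged
  return undef;                                             # PAINT_REF and anything else
}
```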

Column 6: Evidence Code

  • Convert ECO ID to 3-letter GO code
    • This part will probably need to be written from scratch: Tony maps the ECO ID to the GO code in an annotation properties field, but this GPAD currently doesn't have that.
 ECO:0000250 ISS
 ECO:0000270 IEP
 ECO:0000304 TAS
 ECO:0000307 ND
 ECO:0000314 IDA
 ECO:0000315 IMP
 ECO:0000316 IGI
 ECO:0000318 IBA
 ECO:0000352 IEA (but this looks to have been a mistake somewhere in the pipeline)
 ECO:0000353 IPI
 ECO:0000501 IEA
 ECO:0001225 IMP
 ECO:0001232 IDA
 ECO:0005589 IDA
 ECO:0005593 IDA (needs a manual assertion version)
 ECO:0005611 IDA (but check these as ECO definition is not quite clear/familiar)
 ECO:0006003 IDA
 ECO:0006013 IDA
 ECO:0006062 IDA
 ECO:0006063 IMP (would need to double-check these)
 ECO:0006064 IDA
 ECO:0007007 HEP
 ECO:0007293 IEA (also check this one as it is: experimental evidence used in automatic assertion)
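
Since this mapping has to be written new, the table above could be carried in a hash along these lines. This is a sketch only: the names %ecoToGo and eco_to_go_code are hypothetical, and the entries flagged above as needing a check are copied as-is, not verified.

```perl
use strict;
use warnings;

# ECO ID to 3-letter GO evidence code, copied from the table above.
my %ecoToGo = (
  'ECO:0000250' => 'ISS', 'ECO:0000270' => 'IEP', 'ECO:0000304' => 'TAS',
  'ECO:0000307' => 'ND',  'ECO:0000314' => 'IDA', 'ECO:0000315' => 'IMP',
  'ECO:0000316' => 'IGI', 'ECO:0000318' => 'IBA', 'ECO:0000352' => 'IEA',
  'ECO:0000353' => 'IPI', 'ECO:0000501' => 'IEA', 'ECO:0001225' => 'IMP',
  'ECO:0001232' => 'IDA', 'ECO:0005589' => 'IDA', 'ECO:0005593' => 'IDA',
  'ECO:0005611' => 'IDA', 'ECO:0006003' => 'IDA', 'ECO:0006013' => 'IDA',
  'ECO:0006062' => 'IDA', 'ECO:0006063' => 'IMP', 'ECO:0006064' => 'IDA',
  'ECO:0007007' => 'HEP', 'ECO:0007293' => 'IEA',
);

# Hypothetical helper: returns the GO code, or undef if the ECO ID is not in the table.
sub eco_to_go_code {
  my ($eco) = @_;
  return $ecoToGo{$eco};
}
```

An undef return would be the place to write to the error file, so that unmapped ECO IDs are reviewed and not dropped silently.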

Column 7: With/From

  • Existing code from the go_gpad parser:
 my (@withunis) = split/\|/, $with;
   my @withConverted;
   foreach my $withuni (@withunis) {
     if ($withuni =~ m/^UniProtKB:/) {                         # convert UniProtKB: to wbgene
        $withuni =~ s/^UniProtKB://;
        my $withuniStripped = $withuni;
        $withuniStripped =~ s/\-\d$//g;
        my $wbgWith = $withuni;
        if ($dbidToWb{$withuniStripped}) { $wbgWith = $dbidToWb{$withuniStripped}; push @withConverted, $wbgWith; }
          else {
            push @withConverted, "UniProtKB:$withuni";         # add uniprotkb value if does not map to wormbase value (kimberly 2015 02 05)
            $err .= qq($withuni doesn't map to WBGene from column 8, line $count\n); } }
      else {
        push @withConverted, $withuni; }                           # those that are not UniProtKB: just get added back
   } # foreach my $withuni (@withunis)
   my $withConverted = join"|", @withConverted;

Need to add an entry for the plain UniProt: prefix that prints the same output as the UniProtKB: entry below.

 foreach my $with (@withs) {
          if ($with =~ m/With:Not_supplied/) { 1; }            # do nothing
            elsif ($with =~ m/^UniProtKB:(\w+)/) {             print ACE qq(Database\t"UniProt"\t"UniProtAcc"\t"$1"\n); }
            elsif ($with =~ m/^InterPro:(IPR\d+)/) {           print ACE qq(Motif\t"INTERPRO:$1"\n); }
            elsif ($with =~ m/^HGNC:(\d+)/) {                  print ACE qq(Database\t"HGNC"\t"HGNCID"\t"$1"\n); }
            elsif ($with =~ m/^MGI:(MGI:\d+)/) {               print ACE qq(Database\t"MGI"\t"MGIID"\t"$1"\n); }
            elsif ($with =~ m/^HAMAP:(MF_\d+)/) {              print ACE qq(Database\t"HAMAP"\t"HAMAP_annotation_rule"\t"$1"\n); }
            elsif ($with =~ m/^EC:([\.\d]+)/) {                print ACE qq(Database\t"KEGG"\t"KEGG_id"\t"$1"\n); }
            elsif ($with =~ m/^UniPathway:(UPA\d+)/) {         print ACE qq(Database\t"UniPathway"\t"Pathway_id"\t"$1"\n); }
            elsif ($with =~ m/^UniProtKB-KW:(KW-\d+)/) {       print ACE qq(Database\t"UniProt"\t"UniProtKB-KW"\t"$1"\n); }
            elsif ($with =~ m/^UniProtKB-SubCell:(SL-\d+)/) {  print ACE qq(Database\t"UniProt"\t"UniProtKB-SubCell"\t"$1"\n); }
            elsif ($with =~ m/^UniRule:(\w+)/) {               print ACE qq(Database\t"UniProt"\t"UniRule"\t"$1"\n); }
            elsif ($with =~ m/^(GO:\d+)/) {                    print ACE qq(Inferred_from_GO_term\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBVar\d+)/) {               print ACE qq(Variation\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBRNAi\d+)/) {              print ACE qq(RNAi_result\t"$1"\n); }
            elsif ($with =~ m/^WB:(WBGene\d+)/) {              print ACE qq(Interacting_gene\t"$1"\n); }
            elsif ($with =~ m/^(WBGene\d+)/) {                 print ACE qq(Interacting_gene\t"$1"\n); }
            elsif ($with =~ m/^PomBase:([\.\w]+)/) {           print ACE qq(Database\t"PomBase"\t"PomBase_systematic_name"\t"$1"\n); }
            elsif ($with =~ m/^SGD:(S\d+)/) {                  print ACE qq(Database\t"SGD"\t"SGDID"\t"$1"\n); }
            elsif ($with =~ m/^PANTHER:(PTN\d+)/) {            print ACE qq(Database\t"Panther"\t"PanTree_node"\t"$1"\n); }
            elsif ($with =~ m/^TAIR:locus:(\d+)/) {            print ACE qq(Database\t"TAIR"\t"TAIR_locus_id"\t"$1"\n); }
            elsif ($with =~ m/^FB:(FBgn\d+)/) {                print ACE qq(Database\t"FLYBASE"\t"FLYBASEID"\t"$1"\n); }
            elsif ($with =~ m/^RGD:(\d+)/) {                   print ACE qq(Database\t"RGD"\t"RGDID"\t"$1"\n); }
            elsif ($with =~ m/^dictyBase:(DDB_G\d+)/) {        print ACE qq(Database\t"dictyBase"\t"dictyBaseID"\t"$1"\n); }
            else {                                             print ERR qq(WITH $with not accounted for in .ace file\n); }
        } # foreach my $with (@withs)
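
One way to cover the plain UniProt: prefix is to make the KB part of the existing pattern optional. The sketch below isolates that branch in a hypothetical sub, uniprot_with_line, so it can be checked on its own. In the parser it would simply replace the UniProtKB: pattern in the elsif chain.

```perl
use strict;
use warnings;

# Hypothetical helper: return the .ace line for a UniProt: or UniProtKB: With/From value.
# Returns undef for anything else, including UniProtKB-KW: and UniProtKB-SubCell:,
# because the colon must follow "UniProt" or "UniProtKB" directly.
sub uniprot_with_line {
  my ($with) = @_;
  if ($with =~ m/^UniProt(?:KB)?:(\w+)/) {
    return qq(Database\t"UniProt"\t"UniProtAcc"\t"$1"\n);
  }
  return undef;
}
```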

Column 8: Interacting taxon

  • Currently don’t have any entries for this, but could reuse code from go_gpad parser

Column 9: Date

  • YYYYMMDD - need to convert to YYYY-MM-DD
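
The date conversion is a simple reformat. A minimal sketch, with convert_date as a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical helper: convert YYYYMMDD to YYYY-MM-DD.
# Returns undef if the input is not exactly eight digits, so bad dates can be reported.
sub convert_date {
  my ($date) = @_;
  if ($date =~ m/^(\d{4})(\d{2})(\d{2})$/) { return "$1-$2-$3"; }
  return undef;
}
```

This checks only the shape of the string, not whether the month and day are valid.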

Column 10: Assigned_by

  • WB will need to be converted to 'WormBase'
  • SynGO: we will not take the SynGO annotations from this file, but we do need to upload an ?Analysis object with the relevant details (file done 2018-10-23).

Column 11: Annotation Extensions

  • Parse as for go_gpad parsing script
 my (@annExtsComma) = split/\|/, $annotExtConverted;
        foreach my $annExtComma (@annExtsComma) {
          my (@annExts) = split/,/, $annExtComma;
          foreach my $annExt (@annExts) {
            my $relation = '';
            if ($annExt =~ m/^(.*?)\((.*?)\)/) { $relation = $1; $annExt = $2; }
            if ($annExt =~ m/(WBls:\d+)/) {                       print ACE qq(Life_stage_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/WB:(WBGene\d+)/) {              print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/(WBGene\d+)/) {                 print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/CHEBI:(\d+)/) {
                if ($chebiToMol{$1}) {                           print ACE qq(Molecule_relation\t"$relation"\t"$chebiToMol{$1}"\n); }
                  else {                                         print ERR qq(CHEBI $1 does not map to WBMol\n); } }
              elsif ($annExt =~ m/(WBbt:\d+)/) {                  print ACE qq(Anatomy_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/(GO:\d+)/) {                    print ACE qq(GO_term_relation\t"$relation"\t"$1"\n); }
              elsif ($annExt =~ m/UniProtKB:/) {                  1; }
              elsif ($annExt =~ m/FB:/) {                         1; }
              else {                                             print ERR qq(Annotation Extension $annExt not accounted for in .ace file\n); }
          } # foreach my $annExt (@annExts)
        } # foreach my $annExtComma (@annExtsComma)
    • Note that these are all RO relations
      • Also, SynGO is using UBERON for occurs_in and evidence codes like ‘knockout’ that don’t make sense for the papers they’re curating.

Column 12: Annotation Properties

  • Nothing to convert here.
  • At some point, we may want to think about capturing some of this metadata, but will need thorough review first.
  • Check curator; I’m still not sure we’re handling this correctly when we import existing annotations into GO-CAM models.