Specifications for Converting GO-CAM GPAD to .ace File

GO_Annotation IDs - finding the right number from which to start

  • Handle these like the OA annotations: consult the existing .ace GO annotation files sequentially to get the correct starting number for the GO_annotation objects generated from this GPAD file.
    • See code at lines 45-49 of /home/postgres/work/citace_upload/go_curation/get_go_annotation_ace.pm
 my $annotCounter = 0;
 my $annotFile = '/home/acedb/kimberly/citace_upload/go/gp_annotation.ace';
 open (IN, "<$annotFile") or die "Cannot open $annotFile : $!";
 while (my $line = <IN>) { if ($line =~ m/GO_annotation : "(\d+)"/) { $annotCounter = $1; } }
 close (IN) or die "Cannot close $annotFile : $!";
  • In this case, I would move the go_annotation.ace file to the go/ directory, as I do for the gp_annotation.ace file above, and we would start from the number after the highest one found in go_annotation.ace.
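A minimal sketch of the starting-number lookup described above, assuming the object header format matched by the existing regex. The subroutine name next_annotation_id is hypothetical. Unlike the excerpt from get_go_annotation_ace.pm, it keeps the highest ID seen instead of the last one, so it does not depend on the objects being in numeric order. It returns a plain integer; any zero-padding of the IDs is left to the caller.

```perl
use strict;
use warnings;

# Hypothetical helper: scan one or more .ace files and return the next
# unused GO_annotation number (highest number found, plus one).
sub next_annotation_id {
    my @files = @_;
    my $max = 0;
    for my $file (@files) {
        open(my $in, '<', $file) or die "Cannot open $file : $!";
        while (my $line = <$in>) {
            # Same pattern as the existing script; numeric comparison
            # treats "00000012" as 12.
            if ($line =~ m/GO_annotation : "(\d+)"/) {
                $max = $1 if $1 > $max;
            }
        }
        close($in) or die "Cannot close $file : $!";
    }
    return $max + 1;
}
```

It could be called with both files in the go/ directory, for example next_annotation_id($gpAnnotFile, $goAnnotFile), where both variable names are placeholders for the real paths.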

SynGO Annotations

  • Proposal: skip all annotations that are Assigned_by SynGO and get these from UniProt-GOA because the evidence codes are already mapped for us for GAF.
    • This matters because SynGO does not use the traditional three-letter evidence codes, so its evidence codes would have to be mapped to them.
    • BUT we also need to double-check that all of the SynGO annotations in Noctua GPAD are also in Protein2GO, and confirm with Tony, Alex, Dustin, Paul, etc. how new SynGO annotations will enter the pipeline.
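If the proposal is adopted, the skip could look roughly like the sketch below. It assumes tab-delimited GPAD rows with Assigned_by in column 10 and header lines starting with "!". The subroutine name partition_by_assigned_by is hypothetical. The skipped rows are returned rather than discarded so they can be compared against Protein2GO for the double-check noted above.

```perl
use strict;
use warnings;

# Hypothetical helper: split GPAD lines into rows to convert and rows
# whose Assigned_by (column 10) equals the source we want to skip.
sub partition_by_assigned_by {
    my ($lines, $skip_source) = @_;
    my (@kept, @skipped);
    for my $line (@$lines) {
        chomp(my $row = $line);
        next if $row =~ /^!/ or $row !~ /\S/;   # header or blank line
        my @cols = split /\t/, $row, -1;        # -1 keeps trailing empty columns
        my $assigned_by = defined $cols[9] ? $cols[9] : '';
        if ($assigned_by eq $skip_source) {
            push @skipped, $row;
        }
        else {
            push @kept, $row;
        }
    }
    return (\@kept, \@skipped);
}
```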

Columns and Mappings

Column 1: DB Prefix

  • Ignore

Column 2: DB Object ID

  • Populate Gene field in ?GO_annotation (no conversion needed)

Column 3: Relation

  • Populate Annotation_relation field with RO identifier
    • For now, need to map the text string to an RO identifier (no NOT annotations here, yet)
    • Subroutine exists in goa gpad parsing script, but we need to add all of the relations
    • And check on how NOT is handled
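A rough sketch of the text-to-identifier mapping, not the existing subroutine from the gpad parsing script. The table is deliberately partial and every ID should be checked against the Relation Ontology before use; note that part_of carries a BFO identifier rather than an RO one. The name map_relation is hypothetical. The sketch assumes a negated row would appear as NOT|relation in column 3; it only reports the NOT flag, since how NOT should be written to the .ace file is still open.

```perl
use strict;
use warnings;

# Partial relation label => identifier table (assumption: extend with
# every relation that actually occurs in the GO-CAM GPAD).
my %relation_id = (
    enables                    => 'RO:0002327',
    contributes_to             => 'RO:0002326',
    involved_in                => 'RO:0002331',
    acts_upstream_of           => 'RO:0002263',
    acts_upstream_of_or_within => 'RO:0002264',
    part_of                    => 'BFO:0000050',
    located_in                 => 'RO:0001025',
    is_active_in               => 'RO:0002432',
    colocalizes_with           => 'RO:0002325',
);

# Returns (identifier, is_negated). The identifier is undef when the
# label is not in the table, so the caller can log it.
sub map_relation {
    my ($column3) = @_;
    my @parts   = split /\|/, $column3;
    my $is_not  = (grep { $_ eq 'NOT' } @parts) ? 1 : 0;
    my ($label) = grep { $_ ne 'NOT' } @parts;
    return (undef, $is_not) unless defined $label;
    return ($relation_id{$label}, $is_not);
}
```

Returning undef for unknown labels, instead of dying, makes it easy to collect the full list of relations that still need to be added.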

Column 4: GO ID

  • GO_term (no conversion needed)

Column 5: Reference

  • Convert PMID to WBPaper
  • Populate GO_REF as usual (see other gpad parsing script)
  • PAINT_REF - convert to appropriate PMID (but why do we have these and are they in the GO database somewhere?)
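The reference handling might be sketched as follows. The PMID-to-WBPaper lookup is passed in as a hash reference because its real source is not specified here; the name convert_reference and the returned kind labels ('paper', 'go_ref', 'unmapped') are hypothetical. PAINT_REF values are returned as unmapped, since the PAINT_REF-to-PMID conversion is still an open question.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one value from column 5.
# Returns (kind, value), where kind is 'paper', 'go_ref' or 'unmapped'.
sub convert_reference {
    my ($ref, $pmid_to_wbpaper) = @_;
    if ($ref =~ /^PMID:(\d+)$/) {
        my $paper = $pmid_to_wbpaper->{$1};
        return defined $paper ? ('paper', $paper) : ('unmapped', $ref);
    }
    return ('go_ref', $ref) if $ref =~ /^GO_REF:\d+$/;
    return ('unmapped', $ref);    # e.g. PAINT_REF, or an unknown prefix
}
```

Anything that comes back as unmapped can be written to a log for curator review rather than silently dropped.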

Column 6: Evidence Code

  • Convert ECO ID to 3-letter GO code
    • This part will probably need to be written from scratch: Tony supplies the ECO-to-GO-code mapping in an annotation properties field, but this GPAD currently doesn't have that.
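One possible shape for the new ECO conversion is a lookup table, sketched below. It covers only the common default ECO IDs for a handful of GO codes and should be checked against the GO Consortium's ECO-to-GO mapping file before use; more specific ECO terms, including anything SynGO uses, would return undef. The name eco_to_go_code is hypothetical.

```perl
use strict;
use warnings;

# Partial ECO ID => three-letter GO evidence code table (assumption:
# default ECO terms only; extend from the official mapping file).
my %go_code_for = (
    'ECO:0000269' => 'EXP',
    'ECO:0000314' => 'IDA',
    'ECO:0000315' => 'IMP',
    'ECO:0000316' => 'IGI',
    'ECO:0000353' => 'IPI',
    'ECO:0000270' => 'IEP',
    'ECO:0000250' => 'ISS',
    'ECO:0000318' => 'IBA',
    'ECO:0000305' => 'IC',
    'ECO:0000304' => 'TAS',
    'ECO:0000303' => 'NAS',
    'ECO:0000307' => 'ND',
);

# Returns the GO code, or undef when the ECO ID is not in the table.
sub eco_to_go_code {
    my ($eco_id) = @_;
    return $go_code_for{$eco_id};
}
```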

Column 7: With/From

  • What’s here from go_gpad parser:
    • UniProtKB
    • UniProtKB-KW
    • UniProtKB-SubCell:
    • UniPathway
    • UniRule
    • WB:WBGene
    • WB:WBVar
    • InterPro
    • EC
    • PANTHER
    • PomBase
    • SGD
    • UniProt (this needs to be fixed in our GAF!! - see ‘Modification to our GAF file output’ email thread with Kevin Howe beginning on 2018-07-17; should be okay in WS267)

Column 8: Interacting taxon

  • Currently don’t have any entries for this, but could reuse code from go_gpad parser

Column 9: Date

  • YYYYMMDD - (no conversion needed)

Column 10: Assigned_by

  • WB
    • (SynGO - Need to add as an ?Analysis object for go_gpad parsing script)

Column 11: Annotation Extensions

  • Parse as for go_gpad parsing script
    • Note that these are all RO relations
      • Also, SynGO is using UBERON for occurs_in and evidence codes like ‘knockout’ that don’t make sense for the papers they’re curating.
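For reference, the column 11 syntax can be parsed along these lines. This is a sketch, not the existing go_gpad code: it assumes the standard GPAD layout of relation(ID) items, joined by commas within a group and by pipes between groups. The name parse_extensions is hypothetical. The relation part is kept as written, so it works whether the file carries labels or RO identifiers.

```perl
use strict;
use warnings;

# Hypothetical helper: parse an Annotation Extensions field into a list
# of groups, each a list of { relation => ..., target => ... } hashes.
sub parse_extensions {
    my ($field) = @_;
    my @groups;
    for my $group (split /\|/, $field) {
        my @items;
        for my $item (split /,/, $group) {
            if ($item =~ /^\s*([^()\s]+)\(([^()]+)\)\s*$/) {
                push @items, { relation => $1, target => $2 };
            }
            else {
                die "Unparseable annotation extension: '$item'";
            }
        }
        push @groups, \@items if @items;
    }
    return \@groups;
}
```

Checking each target's prefix after parsing (for example UBERON versus the expected anatomy ontology) would be one way to flag the SynGO occurs_in values mentioned above.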

Column 12: Annotation Properties

  • Check curator; I’m still not sure we’re handling this correctly when we import annotations.