GO-CAM GPAD
From WormBaseWiki
Revision as of 17:54, 11 December 2024 by Vanaukenk (talk | contribs) (→Column 1: DB Prefix and DB Object ID)
Specifications for Converting GO-CAM GPAD 2.0 to .ace File
WormBase Data Model
Source File Specifications
Source File Location
GO_Annotation IDs - finding the right number from which to start
- Consult the gp_annotation.ace GO annotation file (generated first) to get the correct starting number for the GO_annotation objects generated from the Noctua GPAD file.
- See code at lines 45-49 of /home/postgres/work/citace_upload/go_curation/get_go_annotation_ace.pm
    my $annotCounter = 0;
    my $annotFile = '/home/acedb/kimberly/citace_upload/go/gp_annotation.ace';
    open (IN, "<$annotFile") or die "Cannot open $annotFile : $!";
    while (my $line = <IN>) { if ($line =~ m/GO_annotation : "(\d+)"/) { $annotCounter = $1; } }
    close (IN) or die "Cannot close $annotFile : $!";
- In this case, I would move the go_annotation.ace file to the go/ directory as I do for the gp_annotation.ace file above and we would start from the next highest number after consulting the go_annotation.ace file.
- New file: gocam_annotation.ace
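The snippet above keeps the last GO_annotation number it reads, which is the highest one only if the objects in the file are in ascending order. A minimal sketch of a variant that tracks the maximum across one or more .ace files follows. The sub name next_annotation_id is hypothetical, and any zero-padding of the returned number is left to the caller.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the existing parser): scan one or more
# .ace files and return the next free GO_annotation number.
sub next_annotation_id {
    my (@ace_files) = @_;
    my $max = 0;
    foreach my $file (@ace_files) {
        open(my $in, '<', $file) or die "Cannot open $file : $!";
        while (my $line = <$in>) {
            if ($line =~ m/GO_annotation : "(\d+)"/) {
                # Track the maximum so the result does not depend on file order.
                $max = $1 if $1 > $max;
            }
        }
        close($in) or die "Cannot close $file : $!";
    }
    return $max + 1;
}
```

Passing both gp_annotation.ace and go_annotation.ace in one call would cover the case where the files are consulted sequentially.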
Annotations to Skip
- Skip all annotations that are Assigned_by SynGO and get these from UniProt-GOA because the evidence codes are already mapped for us for GAF.
- The reason this matters is that SynGO uses evidence codes other than the three-letter codes and so their evidence codes need to be mapped to the traditional three-letter codes.
- Skip all annotations that have values with these prefixes in the With/From field:
  - EC:
  - InterPro:
  - PANTHER:
  - UniPathway:
  - UniProtKB-KW:
  - UniProtKB-SubCell:
  - UniRule:
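The skip rules above could be applied as a single check per GPAD row. This is a sketch only: the sub name skip_annotation is hypothetical, it assumes With/From values are separated by pipes or commas, and it accepts both the "UniProt-" and "UniProtKB-" spellings of the keyword and subcellular-location prefixes as a precaution.

```perl
use strict;
use warnings;

# With/From prefixes whose annotations are skipped. Accepting both the
# "UniProt-" and "UniProtKB-" spellings is an assumption, not a confirmed rule.
my $SKIP_WITH = qr/^(?:EC|InterPro|PANTHER|UniPathway|UniProt(?:KB)?-KW|UniProt(?:KB)?-SubCell|UniRule):/;

# Hypothetical helper: returns 1 if the row should be skipped, 0 otherwise.
sub skip_annotation {
    my ($assigned_by, $with) = @_;
    return 1 if $assigned_by eq 'SynGO';
    # Check every With/From value; one skipped prefix is enough.
    foreach my $value (split /[|,]/, $with) {
        return 1 if $value =~ $SKIP_WITH;
    }
    return 0;
}
```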
Columns and Mappings
Column 1: DB Prefix and DB Object ID
- Populate 'Gene' field in ?GO_annotation
- Strip WB prefix and just populate WBGene ID
- As of 2024-12-11, there are no other types of IDs in Column 1
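Since Column 1 currently holds only WB gene IDs, a strict pattern both strips the prefix and flags anything unexpected. A minimal sketch, with the hypothetical sub name gene_from_column1:

```perl
use strict;
use warnings;

# Hypothetical helper: return the bare WBGene ID from a Column 1 value such
# as "WB:WBGene00001234", or undef if the value is some other kind of ID.
sub gene_from_column1 {
    my ($col1) = @_;
    return $1 if $col1 =~ m/^WB:(WBGene\d+)$/;
    return;    # caller should log the unexpected ID rather than write it out
}
```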
Column 2: Negation
- For NOT annotations, populate the 'Annotation_relation_not' field in the .ace model with the relation from Column 3
- As of 2024-12-11, there are no NOT annotations in the WB Noctua GPAD
Column 3: Relation
- Populate 'Annotation_relation' field with RO identifier
Column 4: GO ID
- Populate value directly from GPAD file
Column 5: Reference
- Convert PMID to WBPaper
- Convert DOI to WBPaper
- Populate GO_REF as usual
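The reference conversion could be sketched as a lookup against PMID-to-WBPaper and DOI-to-WBPaper tables. The sub name convert_reference and both hash references are hypothetical stand-ins for whatever mapping the existing parser already loads; the sketch handles one reference value at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: map one Column 5 value to what goes in the .ace file.
# $pmid_to_paper and $doi_to_paper are assumed hash references keyed by the
# bare PMID and the bare DOI string, respectively.
sub convert_reference {
    my ($ref, $pmid_to_paper, $doi_to_paper) = @_;
    if ($ref =~ m/^PMID:(\d+)$/) { return $pmid_to_paper->{$1}; }
    if ($ref =~ m/^DOI:(.+)$/)   { return $doi_to_paper->{$1}; }
    if ($ref =~ m/^GO_REF:\d+$/) { return $ref; }    # passed through unchanged
    return;                                          # unrecognised prefix
}
```

An undefined return value means the reference did not map and should be written to the error log.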
Column 6: Evidence Code
- Populate value directly from GPAD file (i.e. mapping from three-letter code to ECO ID no longer needed)
Column 7: With/From
- What’s here from go_gpad parser:
    my (@withunis) = split/\|/, $with;
    my @withConverted;
    foreach my $withuni (@withunis) {
      if ($withuni =~ m/^UniProtKB:/) {    # convert UniProtKB: to wbgene
        $withuni =~ s/^UniProtKB://;
        my $withuniStripped = $withuni;
        $withuniStripped =~ s/\-\d$//g;
        my $wbgWith = $withuni;
        if ($dbidToWb{$withuniStripped}) {
          $wbgWith = $dbidToWb{$withuniStripped};
          push @withConverted, $wbgWith;
        } else {
          push @withConverted, "UniProtKB:$withuni";    # add uniprotkb value if does not map to wormbase value (kimberly 2015 02 05)
          $err .= qq($withuni doesn't map to WBGene from column 8, line $count\n);
        }
      } else {
        push @withConverted, $withuni;    # those that are not UniProtKB: just get added back
      }
    } # foreach my $withuni (@withunis)
    my $withConverted = join"|", @withConverted;
Need to add an entry for just UniProt that would have the same print as UniProtKB
    foreach my $with (@withs) {
      if ($with =~ m/With:Not_supplied/) { 1; }    # do nothing
      elsif ($with =~ m/^UniProtKB:(\w+)/) { print ACE qq(Database\t"UniProt"\t"UniProtAcc"\t"$1"\n); }
      elsif ($with =~ m/^InterPro:(IPR\d+)/) { print ACE qq(Motif\t"INTERPRO:$1"\n); }
      elsif ($with =~ m/^HGNC:(\d+)/) { print ACE qq(Database\t"HGNC"\t"HGNCID"\t"$1"\n); }
      elsif ($with =~ m/^MGI:(MGI:\d+)/) { print ACE qq(Database\t"MGI"\t"MGIID"\t"$1"\n); }
      elsif ($with =~ m/^HAMAP:(MF_\d+)/) { print ACE qq(Database\t"HAMAP"\t"HAMAP_annotation_rule"\t"$1"\n); }
      elsif ($with =~ m/^EC:([\.\d]+)/) { print ACE qq(Database\t"KEGG"\t"KEGG_id"\t"$1"\n); }
      elsif ($with =~ m/^UniPathway:(UPA\d+)/) { print ACE qq(Database\t"UniPathway"\t"Pathway_id"\t"$1"\n); }
      elsif ($with =~ m/^UniProtKB-KW:(KW-\d+)/) { print ACE qq(Database\t"UniProt"\t"UniProtKB-KW"\t"$1"\n); }
      elsif ($with =~ m/^UniProtKB-SubCell:(SL-\d+)/) { print ACE qq(Database\t"UniProt"\t"UniProtKB-SubCell"\t"$1"\n); }
      elsif ($with =~ m/^UniRule:(\w+)/) { print ACE qq(Database\t"UniProt"\t"UniRule"\t"$1"\n); }
      elsif ($with =~ m/^(GO:\d+)/) { print ACE qq(Inferred_from_GO_term\t"$1"\n); }
      elsif ($with =~ m/^WB:(WBVar\d+)/) { print ACE qq(Variation\t"$1"\n); }
      elsif ($with =~ m/^WB:(WBRNAi\d+)/) { print ACE qq(RNAi_result\t"$1"\n); }
      elsif ($with =~ m/^WB:(WBGene\d+)/) { print ACE qq(Interacting_gene\t"$1"\n); }
      elsif ($with =~ m/^(WBGene\d+)/) { print ACE qq(Interacting_gene\t"$1"\n); }
      elsif ($with =~ m/^PomBase:([\.\w]+)/) { print ACE qq(Database\t"PomBase"\t"PomBase_systematic_name"\t"$1"\n); }
      elsif ($with =~ m/^SGD:(S\d+)/) { print ACE qq(Database\t"SGD"\t"SGDID"\t"$1"\n); }
      elsif ($with =~ m/^PANTHER:(PTN\d+)/) { print ACE qq(Database\t"Panther"\t"PanTree_node"\t"$1"\n); }
      elsif ($with =~ m/^TAIR:locus:(\d+)/) { print ACE qq(Database\t"TAIR"\t"TAIR_locus_id"\t"$1"\n); }
      elsif ($with =~ m/^FB:(FBgn\d+)/) { print ACE qq(Database\t"FLYBASE"\t"FLYBASEID"\t"$1"\n); }
      elsif ($with =~ m/^RGD:(\d+)/) { print ACE qq(Database\t"RGD"\t"RGDID"\t"$1"\n); }
      elsif ($with =~ m/^dictyBase:(DDB_G\d+)/) { print ACE qq(Database\t"dictyBase"\t"dictyBaseID"\t"$1"\n); }
      else { print ERR qq(WITH $with not accounted for in .ace file\n); }
    } # foreach my $with (@withs)
Column 8: Interacting taxon
- Currently don’t have any entries for this, but could reuse code from go_gpad parser
Column 9: Date
- YYYYMMDD - need to convert to YYYY-MM-DD
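The date conversion is a fixed-width split. A minimal sketch, assuming the hypothetical sub name gpad_date_to_ace; as a precaution it also passes through values that already start with a hyphenated YYYY-MM-DD date, which is an assumption about what the file might contain rather than something stated above.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a GPAD date to the YYYY-MM-DD form.
sub gpad_date_to_ace {
    my ($date) = @_;
    # Compact form: YYYYMMDD
    if ($date =~ m/^(\d{4})(\d{2})(\d{2})$/) { return "$1-$2-$3"; }
    # Already hyphenated (assumption): keep only the date part
    if ($date =~ m/^(\d{4}-\d{2}-\d{2})/)    { return $1; }
    return;    # unrecognised format; caller should log it
}
```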
Column 10: Assigned_by
- WB will need to be converted to 'WormBase'
- SynGO: we will not take the SynGO annotations from this file; however, we need to upload an ?Analysis object that has the relevant details (file done 2018-10-23)
Column 11: Annotation Extensions
- Parse as for go_gpad parsing script
    my (@annExtsComma) = split/\|/, $annotExtConverted;
    foreach my $annExtComma (@annExtsComma) {
      my (@annExts) = split/,/, $annExtComma;
      foreach my $annExt (@annExts) {
        my $relation = '';
        if ($annExt =~ m/^(.*?)\((.*?)\)/) { $relation = $1; $annExt = $2; }
        if ($annExt =~ m/(WBls:\d+)/) { print ACE qq(Life_stage_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/WB:(WBGene\d+)/) { print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/(WBGene\d+)/) { print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/CHEBI:(\d+)/) {
          if ($chebiToMol{$1}) { print ACE qq(Molecule_relation\t"$relation"\t"$chebiToMol{$1}"\n); }
          else { print ERR qq(CHEBI $1 does not map to WBMol\n); }
        }
        elsif ($annExt =~ m/(WBbt:\d+)/) { print ACE qq(Anatomy_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/(GO:\d+)/) { print ACE qq(GO_term_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/UniProtKB:/) { 1; }
        elsif ($annExt =~ m/FB:/) { 1; }
        else { print ERR qq(Annotation Extension $annExt not accounted for in .ace file\n); }
      } # foreach my $annExt (@annExts)
    } # foreach my $annExtComma (@annExtsComma)
- Note that these are all RO relations
- Also, SynGO is using UBERON for occurs_in and evidence codes like ‘knockout’ that don’t make sense for the papers they’re curating.
Column 12: Annotation Properties
- Nothing to convert here.
- At some point, we may want to think about capturing some of this metadata, but will need thorough review first.
- Check curator; I’m still not sure we’re handling this correctly when we import existing annotations into GO-CAM models.
Specifications for Converting GO-CAM GPAD 1.1 to .ace File
Source File Location
GO_Annotation IDs - finding the right number from which to start
- Will need to handle similarly to the OA annotations in that we'll need to consult .ace GO annotation files sequentially to get the correct starting number for the GO_annotation objects generated from this GPAD file.
- See code at lines 45-49 of /home/postgres/work/citace_upload/go_curation/get_go_annotation_ace.pm
    my $annotCounter = 0;
    my $annotFile = '/home/acedb/kimberly/citace_upload/go/gp_annotation.ace';
    open (IN, "<$annotFile") or die "Cannot open $annotFile : $!";
    while (my $line = <IN>) { if ($line =~ m/GO_annotation : "(\d+)"/) { $annotCounter = $1; } }
    close (IN) or die "Cannot close $annotFile : $!";
- In this case, I would move the go_annotation.ace file to the go/ directory as I do for the gp_annotation.ace file above and we would start from the next highest number after consulting the go_annotation.ace file.
- New file: gocam_annotation.ace
Annotations to Skip
- Skip all annotations that are Assigned_by SynGO and get these from UniProt-GOA because the evidence codes are already mapped for us for GAF.
- The reason this matters is that SynGO uses evidence codes other than the three-letter codes and so their evidence codes need to be mapped to the traditional three-letter codes.
- However, we also need to double-check that all of the SynGO annotations in the Noctua GPAD are also in Protein2GO, and confirm with Tony, Alex, Dustin, Paul, etc. how new SynGO annotations will enter the pipeline.
- In Protein2GO, SynGO annotations have Source 'SYNX'.
- Application of evidence codes looks to be different from GO guidelines.
- Skip all annotations that have values with these prefixes in the With/From field:
  - EC:
  - InterPro:
  - PANTHER:
  - UniPathway:
  - UniProtKB-KW:
  - UniProtKB-SubCell:
  - UniRule:
Columns and Mappings
Column 1: DB Prefix
- Ignore
Column 2: DB Object ID
- Populate Gene field in ?GO_annotation (no conversion needed)
Column 3: Relation
- Populate Annotation_relation field with RO identifier
- For now, need to map the text string to an RO identifier (no NOT annotations here, yet)
- Subroutine exists in goa gpad parsing script, but we need to add all of the relations
- And check on how NOT is handled
    sub populateAnnotrelToRo {
      $annotToRo{'colocalizes_with'}                           = 'RO:0002325';
      $annotToRo{'contributes_to'}                             = 'RO:0002326';
      $annotToRo{'enables'}                                    = 'RO:0002327';
      $annotToRo{'involved_in'}                                = 'RO:0002331';
      $annotToRo{'part_of'}                                    = 'BFO:0000050';
      $annotToRo{'acts_upstream_of_or_within'}                 = 'RO:0002264';
      $annotToRo{'acts_upstream_of'}                           = 'RO:0002263';
      $annotToRo{'acts_upstream_of_or_within_negative_effect'} = 'RO:0004033';
      $annotToRo{'acts_upstream_of_or_within_positive_effect'} = 'RO:0004032';
      $annotToRo{'acts_upstream_of_negative_effect'}           = 'RO:0004035';
      $annotToRo{'acts_upstream_of_positive_effect'}           = 'RO:0004034';
    } # sub populateAnnotrelToRo
Column 4: GO ID
- GO_term (no conversion needed)
Column 5: Reference
- Convert PMID to WBPaper
- Convert DOI to WBPaper
- Populate GO_REF as usual
- PAINT_REF - convert to appropriate WBPaper (but why do we have annotations with PAINT_REF and are they in the GO database somewhere? They need to be updated in the GO database.)
See lines 217 - 228 in /home/acedb/kimberly/citace_upload/go/gpad2ace/2018_October/go_gpad_parser.pl
Column 6: Evidence Code
- Convert ECO ID to 3-letter GO code
- This part will probably need to be written new as Tony maps the ECO to GO code in an annotation properties field, but this GPAD currently doesn't have that.
    ECO:0000250  ISS
    ECO:0000270  IEP
    ECO:0000304  TAS
    ECO:0000307  ND
    ECO:0000314  IDA
    ECO:0000315  IMP
    ECO:0000316  IGI
    ECO:0000318  IBA
    ECO:0000352  IEA (but this looks to have been a mistake somewhere in the pipeline)
    ECO:0000353  IPI
    ECO:0000501  IEA
    ECO:0001225  IMP
    ECO:0001232  IDA
    ECO:0005589  IDA
    ECO:0005593  IDA (needs a manual assertion version)
    ECO:0005611  IDA (but check these as ECO definition is not quite clear/familiar)
    ECO:0006003  IDA
    ECO:0006013  IDA
    ECO:0006062  IDA
    ECO:0006063  IMP (would need to double-check these)
    ECO:0006064  IDA
    ECO:0007007  HEP
    ECO:0007293  IEA (also check this one as it is: experimental evidence used in automatic assertion)
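One way to apply the mapping above is a plain hash lookup that reports any ECO ID it does not know. This is a sketch only: the names %ecoToGo and eco_to_go_code are hypothetical, and the entries flagged above as needing review are copied as listed rather than verified.

```perl
use strict;
use warnings;

# ECO ID to three-letter GO evidence code, copied from the list above.
# Entries the list flags for review are included as-is, not verified.
my %ecoToGo = (
    'ECO:0000250' => 'ISS', 'ECO:0000270' => 'IEP', 'ECO:0000304' => 'TAS',
    'ECO:0000307' => 'ND',  'ECO:0000314' => 'IDA', 'ECO:0000315' => 'IMP',
    'ECO:0000316' => 'IGI', 'ECO:0000318' => 'IBA', 'ECO:0000352' => 'IEA',
    'ECO:0000353' => 'IPI', 'ECO:0000501' => 'IEA', 'ECO:0001225' => 'IMP',
    'ECO:0001232' => 'IDA', 'ECO:0005589' => 'IDA', 'ECO:0005593' => 'IDA',
    'ECO:0005611' => 'IDA', 'ECO:0006003' => 'IDA', 'ECO:0006013' => 'IDA',
    'ECO:0006062' => 'IDA', 'ECO:0006063' => 'IMP', 'ECO:0006064' => 'IDA',
    'ECO:0007007' => 'HEP', 'ECO:0007293' => 'IEA',
);

# Hypothetical helper: return the GO code for an ECO ID, or undef if the
# ID is not in the table so the caller can log it instead of guessing.
sub eco_to_go_code {
    my ($eco) = @_;
    return $ecoToGo{$eco};
}
```

Returning undef for unmapped IDs keeps new ECO terms from silently entering the .ace file with a wrong code.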
Column 7: With/From
- What’s here from go_gpad parser:
    my (@withunis) = split/\|/, $with;
    my @withConverted;
    foreach my $withuni (@withunis) {
      if ($withuni =~ m/^UniProtKB:/) {    # convert UniProtKB: to wbgene
        $withuni =~ s/^UniProtKB://;
        my $withuniStripped = $withuni;
        $withuniStripped =~ s/\-\d$//g;
        my $wbgWith = $withuni;
        if ($dbidToWb{$withuniStripped}) {
          $wbgWith = $dbidToWb{$withuniStripped};
          push @withConverted, $wbgWith;
        } else {
          push @withConverted, "UniProtKB:$withuni";    # add uniprotkb value if does not map to wormbase value (kimberly 2015 02 05)
          $err .= qq($withuni doesn't map to WBGene from column 8, line $count\n);
        }
      } else {
        push @withConverted, $withuni;    # those that are not UniProtKB: just get added back
      }
    } # foreach my $withuni (@withunis)
    my $withConverted = join"|", @withConverted;
Need to add an entry for just UniProt that would have the same print as UniProtKB
    foreach my $with (@withs) {
      if ($with =~ m/With:Not_supplied/) { 1; }    # do nothing
      elsif ($with =~ m/^UniProtKB:(\w+)/) { print ACE qq(Database\t"UniProt"\t"UniProtAcc"\t"$1"\n); }
      elsif ($with =~ m/^InterPro:(IPR\d+)/) { print ACE qq(Motif\t"INTERPRO:$1"\n); }
      elsif ($with =~ m/^HGNC:(\d+)/) { print ACE qq(Database\t"HGNC"\t"HGNCID"\t"$1"\n); }
      elsif ($with =~ m/^MGI:(MGI:\d+)/) { print ACE qq(Database\t"MGI"\t"MGIID"\t"$1"\n); }
      elsif ($with =~ m/^HAMAP:(MF_\d+)/) { print ACE qq(Database\t"HAMAP"\t"HAMAP_annotation_rule"\t"$1"\n); }
      elsif ($with =~ m/^EC:([\.\d]+)/) { print ACE qq(Database\t"KEGG"\t"KEGG_id"\t"$1"\n); }
      elsif ($with =~ m/^UniPathway:(UPA\d+)/) { print ACE qq(Database\t"UniPathway"\t"Pathway_id"\t"$1"\n); }
      elsif ($with =~ m/^UniProtKB-KW:(KW-\d+)/) { print ACE qq(Database\t"UniProt"\t"UniProtKB-KW"\t"$1"\n); }
      elsif ($with =~ m/^UniProtKB-SubCell:(SL-\d+)/) { print ACE qq(Database\t"UniProt"\t"UniProtKB-SubCell"\t"$1"\n); }
      elsif ($with =~ m/^UniRule:(\w+)/) { print ACE qq(Database\t"UniProt"\t"UniRule"\t"$1"\n); }
      elsif ($with =~ m/^(GO:\d+)/) { print ACE qq(Inferred_from_GO_term\t"$1"\n); }
      elsif ($with =~ m/^WB:(WBVar\d+)/) { print ACE qq(Variation\t"$1"\n); }
      elsif ($with =~ m/^WB:(WBRNAi\d+)/) { print ACE qq(RNAi_result\t"$1"\n); }
      elsif ($with =~ m/^WB:(WBGene\d+)/) { print ACE qq(Interacting_gene\t"$1"\n); }
      elsif ($with =~ m/^(WBGene\d+)/) { print ACE qq(Interacting_gene\t"$1"\n); }
      elsif ($with =~ m/^PomBase:([\.\w]+)/) { print ACE qq(Database\t"PomBase"\t"PomBase_systematic_name"\t"$1"\n); }
      elsif ($with =~ m/^SGD:(S\d+)/) { print ACE qq(Database\t"SGD"\t"SGDID"\t"$1"\n); }
      elsif ($with =~ m/^PANTHER:(PTN\d+)/) { print ACE qq(Database\t"Panther"\t"PanTree_node"\t"$1"\n); }
      elsif ($with =~ m/^TAIR:locus:(\d+)/) { print ACE qq(Database\t"TAIR"\t"TAIR_locus_id"\t"$1"\n); }
      elsif ($with =~ m/^FB:(FBgn\d+)/) { print ACE qq(Database\t"FLYBASE"\t"FLYBASEID"\t"$1"\n); }
      elsif ($with =~ m/^RGD:(\d+)/) { print ACE qq(Database\t"RGD"\t"RGDID"\t"$1"\n); }
      elsif ($with =~ m/^dictyBase:(DDB_G\d+)/) { print ACE qq(Database\t"dictyBase"\t"dictyBaseID"\t"$1"\n); }
      else { print ERR qq(WITH $with not accounted for in .ace file\n); }
    } # foreach my $with (@withs)
Column 8: Interacting taxon
- Currently don’t have any entries for this, but could reuse code from go_gpad parser
Column 9: Date
- YYYYMMDD - need to convert to YYYY-MM-DD
Column 10: Assigned_by
- WB will need to be converted to 'WormBase'
- SynGO: we will not take the SynGO annotations from this file; however, we need to upload an ?Analysis object that has the relevant details (file done 2018-10-23)
Column 11: Annotation Extensions
- Parse as for go_gpad parsing script
    my (@annExtsComma) = split/\|/, $annotExtConverted;
    foreach my $annExtComma (@annExtsComma) {
      my (@annExts) = split/,/, $annExtComma;
      foreach my $annExt (@annExts) {
        my $relation = '';
        if ($annExt =~ m/^(.*?)\((.*?)\)/) { $relation = $1; $annExt = $2; }
        if ($annExt =~ m/(WBls:\d+)/) { print ACE qq(Life_stage_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/WB:(WBGene\d+)/) { print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/(WBGene\d+)/) { print ACE qq(Gene_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/CHEBI:(\d+)/) {
          if ($chebiToMol{$1}) { print ACE qq(Molecule_relation\t"$relation"\t"$chebiToMol{$1}"\n); }
          else { print ERR qq(CHEBI $1 does not map to WBMol\n); }
        }
        elsif ($annExt =~ m/(WBbt:\d+)/) { print ACE qq(Anatomy_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/(GO:\d+)/) { print ACE qq(GO_term_relation\t"$relation"\t"$1"\n); }
        elsif ($annExt =~ m/UniProtKB:/) { 1; }
        elsif ($annExt =~ m/FB:/) { 1; }
        else { print ERR qq(Annotation Extension $annExt not accounted for in .ace file\n); }
      } # foreach my $annExt (@annExts)
    } # foreach my $annExtComma (@annExtsComma)
- Note that these are all RO relations
- Also, SynGO is using UBERON for occurs_in and evidence codes like ‘knockout’ that don’t make sense for the papers they’re curating.
Column 12: Annotation Properties
- Nothing to convert here.
- At some point, we may want to think about capturing some of this metadata, but will need thorough review first.
- Check curator; I’m still not sure we’re handling this correctly when we import existing annotations into GO-CAM models.