Difference between revisions of "Noctua - Upload of WB Manual Annotations"

 
(60 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md
 
https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md
  
= OA Annotations =
+
= OA Annotations - this work is done =
  
 
== Mapping from gop_ postgres tables to GPAD 2.0 column ==
 
== Mapping from gop_ postgres tables to GPAD 2.0 column ==
  
Skip all entries that have the value 'False Positive' in gop_falsepositive
+
'''Skip all entries that have the value 'False Positive' in gop_falsepositive'''
  
 
{| cellspacing="2" border="1"
 
{| cellspacing="2" border="1"
Line 18: Line 18:
 
| gop_wbgene || 1 || Preface each value with 'WB:' || WB:WBGene00006925
 
| gop_wbgene || 1 || Preface each value with 'WB:' || WB:WBGene00006925
 
|-
 
|-
| gop_qualifier || 2|| Add text string "NOT" (Note: an OA query for "NOT" didn't return any values, so I don't think we actually have any of these in the OA.) || NOT
+
| gop_qualifier || 2|| This is the NEGATION field in the GPAD2.0 file, but no action needed as we don't have any negation in our GO OA annotations. || n/a
 
|-
 
|-
| gop_qualifier || 3 || Map text value to Relations Ontology (RO) term id (see below); add RO id. || RO:0001025
+
| gop_qualifier || 3 || Map text value to Relations Ontology (RO) term id (see table below); populate with RO id. || RO:0001025
 
|-
 
|-
 
| gop_goid || 4|| Add GO term id as it exists in table. || GO:0051306
 
| gop_goid || 4|| Add GO term id as it exists in table. || GO:0051306
 
|-
 
|-
| gop_accession || 5 || Add id as it exists in table. || GO_REF:0000015
+
| gop_accession || 5 || Remove quotes and add id in table; we only have seven of these. || GO_REF:0000015
 
|-
 
|-
 
| gop_paper || 5 || Add WBPaper ID and corresponding PMID, pipe separated. || PMID:10978280|WB:WBPaper00004310
 
| gop_paper || 5 || Add WBPaper ID and corresponding PMID, pipe separated. || PMID:10978280|WB:WBPaper00004310
Line 50: Line 50:
 
| ?? || 12 || Add postgres annotation id, prefixed with 'id=WBOA:' || id=WBOA:3565
 
| ?? || 12 || Add postgres annotation id, prefixed with 'id=WBOA:' || id=WBOA:3565
 
|-
 
|-
| gop_curator || 12 || If available, map curator to ORCID and prefix with 'contributor-id=https://orcid.org/'.  If no ORICD, add 'GOC:cab1' || contributor-id=https://orcid.org/0000-0002-1478-7671 or GOC:cab1
+
| gop_curator || 12 || If available, map curator to ORCID and prefix with 'contributor-id=https://orcid.org/'.  If no ORCID, add 'contributor-id=GOC:cab1' || contributor-id=https://orcid.org/0000-0002-1478-7671 or contributor-id=GOC:cab1
 
|-
 
|-
 
| gop_comment || 12 || Add free text, prefixed with 'comment=' || comment=2020-03-17; flagged FP prior to Noctua upload; no ISS With/From; more specific PAINT annotation exists.
 
| gop_comment || 12 || Add free text, prefixed with 'comment=' || comment=2020-03-17; flagged FP prior to Noctua upload; no ISS With/From; more specific PAINT annotation exists.
 +
|-
 +
| gop_lastupdate || 12 || Add creation-date=YYYY-MM-DD (or YYYY-MM-DDTHH:MM) || creation-date=2021-06-29 (or creation-date=2021-07-15T16:52)
 +
|-
 +
| gop_lastupdate || 12 || Add modification-date=YYYY-MM-DD (or YYYY-MM-DDTHH:MM) || modification-date=2021-06-29 (or modification-date=2021-07-15T16:52)
 
|-
 
|-
 
|}
 
|}
  
=== Mapping relation names to RO ids. ===
+
=== Mapping gene product-to-term relation names to RO ids. ===
  
 
{| cellspacing="2" border="1"
 
{| cellspacing="2" border="1"
Line 62: Line 66:
 
! qualifier name (gop_qualifier)
 
! qualifier name (gop_qualifier)
 
! RO ID
 
! RO ID
 +
! number of annotations in WS280 ace file (353 annotations total)
  
 
|-
 
|-
| part_of || BFO:0000050
+
| acts_upstream_of_or_within || RO:0002264 || 9
 
|-
 
|-
| enables || RO:0002327
+
| located_in || RO:0001025 || 10
 
|-
 
|-
| acts_upstream_of_or_within || RO:0002264
+
| involved_in || RO:0002331 || 307
 
|-
 
|-
| colocalizes_with || RO:0002325
+
| enables || RO:0002327 || 18
 
|-
 
|-
| involved_in || RO:0002331
+
| part_of || BFO:0000050 || 8
 
|-
 
|-
| located_in || RO:0001025
+
|}
 +
 
 +
*Note: no instances of these gp2term relations in the OA:
 +
** colocalizes_with (RO:0002325)
 +
** contributes_to (RO:0002326)
 +
*Note: found one annotation coming from the OA that lacked a gp2term relation; updated that for WS281
 +
 
 +
=== Mapping annotation extension relations to RO ids ===
 +
{| cellspacing="2" border="1"
 
|-
 
|-
| contributes_to || RO:0002326
+
! relation label
|-
+
! RO ID
| has_input || RO:0002233
+
! number of annotations in WS280 ace file (353 annotations total)
|-
 
| happens_during || RO:0002092
 
|-
 
| has_direct_input || GOREL:0000752
 
 
|-
 
|-
| in_absence_of || GOREL:0000755
+
| has_input || RO:0002233 || 10
 
|-
 
|-
| in_presence_of || GOREL:0000027
+
| happens_during || RO:0002092 || 4
 
|-
 
|-
| localization_dependent_on ||GOREL:0000009
+
| occurs_in || BFO:0000066 || 1
 
|-
 
|-
| RO:0002211_activity_of || GOREL:0098702
+
| part_of || BFO:0000050 || 8
|-
 
| dependent_on || GOREL:0000004
 
 
|-
 
|-
 
|}
 
|}
Line 132: Line 139:
 
= Protein2GO Annotations =
 
= Protein2GO Annotations =
 
* Input files:
 
* Input files:
** GPAD from Protein2GO
+
** GPAD from Protein2GO - from GOA ftp - WB_gpad2noctua.gpa (last one created was 2021-01-11 - will need to ask Alex for a new file)
 
** Latest WB gpi file: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.canonical_bioproject.current_development.gene_product_info.gpi.gz
 
** Latest WB gpi file: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.canonical_bioproject.current_development.gene_product_info.gpi.gz
 +
 
* Other mappings needed:
 
* Other mappings needed:
 
** PMID to WBPaper
 
** PMID to WBPaper
Line 149: Line 157:
  
 
|-
 
|-
| 1 || Annotated entity || Convert each UniProtKB: accession to a WB:WBGene id using the latest WB gpi file || UniProtKB:G5ED58 || WB:WBGene00006925 || Yes
+
| 1 || Annotated entity* || UniProtKB: accession convert to a WBGene id using the latest WB gpi file || UniProtKB:G5ED58 || WB:WBGene00006925 || Yes
 +
|-
 +
| 1 || Annotated entity* || UniProtKB: strip digit after '-', then convert to a WBGene id using the latest WB gpi file || UniProtKB:P34708-1 || WB:WBGene00006604 || Yes
 +
|-
 +
| 1 || Annotated entity* || ComplexPortal: Ignore || ComplexPortal:CPX-1000 || n/a || n/a
 +
|-
 +
| 1 || Annotated entity* || RNAcentral: Ignore || RNAcentral:URS00000082FF_6239 || n/a || n/a
 
|-
 
|-
 
| 2|| Negation || Leave as is || NOT || NOT || No
 
| 2|| Negation || Leave as is || NOT || NOT || No
 
|-
 
|-
| 3 || Qualifier || Leave as is, except for BFO:0000050* || RO:0001025 || RO:0001025 || No
+
| 3 || Qualifier || Leave as is || RO:0002327 || RO:0002327 || No
 
|-
 
|-
| 4 || GO term ID || Leave as it || GO:0051306
+
| 4 || GO term ID || Leave as is || GO:0051306 || GO:0051306 || No
 
|-
 
|-
| 5 || Reference || Leave PMID or DOI as is; add corresponding WB:WBPaper id; identifiers are pipe-separated || PMID:10978280|WB:WBPaper00004310
+
| 5 || Reference* || Leave GO_REFs as is; map PMID or DOI to corresponding WBPaper id and add WBPaper id as a pipe-separated value || PMID:10978280 || PMID:10978280|WB:WBPaper00004310 || Yes
 
|-
 
|-
| 6 || Evidence || Leave as is || ECO:0000314
+
| 6 || Evidence || Leave as is || ECO:0000314 || ECO:0000314 || No
 
|-
 
|-
| 7 || With/From || Where possible, convert each UniProtKB: accession to a WB:WBGene id using the latest WB gpi file; otherwise leave entry as is || WB:WBGene00000001
+
| 7 || With/From* || Leave as is, except for UniProtKB: accessions; for UniProtKB: accessions, try to map to a WBGene id using the latest WB gpi file; if UniProtKB: accession doesn't map to a WBGene id, then leave as is || UniProtKB:D9PTP8 || WB:WBGene00013354 || Yes - output a list of UniProtKB accessions that didn't map to a WBGene
 
|-  
 
|-  
| 8 || Interacting taxon || Leave as is || NCBITaxon:273526
+
| 8 || Interacting taxon || Leave as is || NCBITaxon:273526 || NCBITaxon:273526 || No
 
|-
 
|-
| 9 || Annotation date || Leave as is || 2006-02-03T12:26
+
| 9 || Annotation date || Leave as is || 2006-02-03T12:26 || 2006-02-03T12:26 || No
 
|-
 
|-
| 10 || Assigned_by || Leave as it || WB
+
| 10 || Assigned_by || Leave as is || WB || WB || No
 
|-
 
|-
| 11 || Annotation extensions || If relation is a text string, convert to an id according RO mapping table above.  Otherwise, leave as is. Report on any relation text strings that didn't map to an id. || RO:0002233(WB:WBGene00000584)
+
| 11 || Annotation extensions || If relation is a text string, convert to an id according RO mapping table above.  Otherwise, leave as is.   || RO:0002233(WB:WBGene00000584) || RO:0002233(WB:WBGene00000584) || Yes - report on any relation text strings that don't map to an ontology id.
 
|-
 
|-
| 12 || Annotation properties* || Most will stay as is, except for history (see below). || id=GOA:2113483848|contributor-id=https://orcid.org/0000-0002-1706-4196|comment=action:Updated by Kimberly Van Auken|model-state=deleted
+
| 12 || Annotation properties* || Most will stay as is, except for history (see below). || id=GOA:2113483848|contributor-id=https://orcid.org/0000-0002-1706-4196|comment=action:Updated by Kimberly Van Auken|model-state=???
 
|-
 
|-
 
|}
 
|}
  
=== Qualifiers ===
 
*For annotation lines that use BFO:0000050, check GO ID parentage.  If GO:0032991 is a parent term, leave BFO:0000050.  If GO:00032991 is not a parent term, change to RO:0001025.
 
  
=== *Annotation Properties ===
+
=== Annotated Entity ===
 +
* Use latest WormBase gpi file to map UniProtKB accessions to WBGene ids (take column 1 value in incoming GPAD file and find corresponding column 9 value in WB gpi file) then map to corresponding WBGene ID (column 2 in WB gpi file)
 +
* Latest WormBase gpi file:  ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.canonical_bioproject.current_development.gene_product_info.gpi.gz
 +
* For input values that include a '-digit', e.g. UniProtKB:P37806-1, strip the '-digit' for the purposes of mapping to a WBGene and add a comment in the Annotation properties field: comment=Original annotation made to whatever the full UniProtKB accession is.  For example: comment=Original annotation made to UniProtKB:P37806-1.
 +
* We will be ignoring any GPAD lines that have ComplexPortal or RNAcentral ids in Column 1.
 +
* We have made some manual annotations to organisms other than C. elegans, so there will be some GPAD lines for which there is no mapping from the UniProtKB accession to a WBGene id but for which we want to migrate the annotation to Noctua.  To find these, we'll need to check if the unmappable UniProtKB accession is also in the 6239 GPAD file.  If yes, discard; if no, keep.  Report on which ones we kept and which ones we discarded.
 +
 
 +
=== References ===
 +
* Map incoming doi or PMID to WBPaper id using the pap_identifier table.
 +
 
 +
=== With/From ===
 +
* Use latest WormBase gpi file to map UniProtKB accessions (column 1 in incoming GPAD file; column 9 in WB gpi file) to corresponding WBGene ID (column 2 in WB gpi file)
 +
* Latest WormBase gpi file:  ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.canonical_bioproject.current_development.gene_product_info.gpi.gz
 +
 
 +
=== Annotation Properties ===
 
* Contributor
 
* Contributor
 
** Most contributors are captured with an orcid.
 
** Most contributors are captured with an orcid.
 
** However, Carol and Josh do not have orcids, so we need to populate a GOC abbreviation for their contributor id.
 
** However, Carol and Josh do not have orcids, so we need to populate a GOC abbreviation for their contributor id.
*** If contributor-id is blank, and comment=action:Added by Josh Jaffery [Expired account], then populate contributor-id=GOC:jja
+
*** If contributor-id is blank and comment=action:Added by Josh Jaffery [Expired account], then populate contributor-id=GOC:jja
*** If contributor-id is blank, and comment=action:Added by Carol Bastiani [Expired account], then populate contributor-id=GOC:cab1
+
*** If contributor-id is blank and comment=action:Added by Carol Bastiani [Expired account], then populate contributor-id=GOC:cab1
 
* History
 
* History
 
** Group annotations by id, e.g. id=GOA:2113472118
 
** Group annotations by id, e.g. id=GOA:2113472118
 
** If more than one line with same id, check corresponding date field (column 9) for each line
 
** If more than one line with same id, check corresponding date field (column 9) for each line
** For most recent date of grouped annotations, leave line as is
+
** For most recent date of grouped annotations, add creation-date=YYYY-MM-DD from earliest annotation, modification-date=YYYY-MM-DD from each subsequent date
** For earlier dates of grouped annotations, add 'model-state=deleted' to annotation properties field (column 12) by pipe-separating it from the last annotation properties entry
 
  
 
=== Final File for Import ===
 
=== Final File for Import ===
* The final file for import will be a concatenated file of OA and Protein2GO GPAD files
+
* The final files for import will be a file of OA annotations and a file of Protein2GO GPAD files
* We'll need to make the file available for Dustin to pick up somewhere for the import.
+
* We'll need to make the files available for Dustin to pick up somewhere for the import or for me to gzip and email him.

Latest revision as of 17:43, 9 November 2023

GOC GPAD/GPI 2.0 Specifications

https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

OA Annotations - this work is done

Mapping from gop_ postgres tables to GPAD 2.0 column

Skip all entries that have the value 'False Positive' in gop_falsepositive

{| cellspacing="2" border="1"
|-
! gop_ postgres table name
! GPAD 2.0 column
! Action
! Example
|-
| gop_wbgene || 1 || Preface each value with 'WB:' || WB:WBGene00006925
|-
| gop_qualifier || 2 || This is the NEGATION field in the GPAD 2.0 file, but no action is needed as we don't have any negation in our GO OA annotations. || n/a
|-
| gop_qualifier || 3 || Map text value to Relations Ontology (RO) term id (see table below); populate with RO id. || RO:0001025
|-
| gop_goid || 4 || Add GO term id as it exists in table. || GO:0051306
|-
| gop_accession || 5 || Remove quotes and add id in table; we only have seven of these. || GO_REF:0000015
|-
| gop_paper || 5 || Add WBPaper ID and corresponding PMID, pipe separated. || PMID:10978280|WB:WBPaper00004310
|-
| gop_goinference || 6 || Map three-letter GO code to ECO code (see below); add ECO id. || ECO:0000314
|-
| gop_with_wbgene || 7 || Preface each value with 'WB:'; comma-separate multiple values || WB:WBGene00000001
|-
| gop_with || 7 || Add id as it exists in table; comma-separate multiple values || FB:FBgn0003719
|-
| gop_with_phenotype || 7 || Add id as it exists in table; comma-separate multiple values || WBPhenotype:0000689
|-
| gop_with_rnai || 7 || Preface each value with 'WB:'; comma-separate multiple values || WB:WBRNAi00001974
|-
| gop_with_wbvariation || 7 || Preface each value with 'WB:'; comma-separate multiple values || WB:WBVar00242156
|-
| - || 8 || No action; I don't think we have any values for an interacting taxon. ||
|-
| gop_lastupdate || 9 || If YYYY-MM-DD, add as it exists in table. If YYYY-MM-DD HH:MM:SS, convert to YYYY-MM-DDTHH:MM. || 2020-05-13 or 2006-02-03T12:26
|-
| no OA table || 10 || Add WB || WB
|-
| gop_xrefto || 11 || Convert relation name to RO id; add the value in parentheses directly after the RO id. || RO:0002233(WB:WBGene00000584)
|-
| ?? || 12 || Add postgres annotation id, prefixed with 'id=WBOA:' || id=WBOA:3565
|-
| gop_curator || 12 || If available, map curator to ORCID and prefix with 'contributor-id=https://orcid.org/'. If no ORCID, add 'contributor-id=GOC:cab1' || contributor-id=https://orcid.org/0000-0002-1478-7671 or contributor-id=GOC:cab1
|-
| gop_comment || 12 || Add free text, prefixed with 'comment=' || comment=2020-03-17; flagged FP prior to Noctua upload; no ISS With/From; more specific PAINT annotation exists.
|-
| gop_lastupdate || 12 || Add creation-date=YYYY-MM-DD (or YYYY-MM-DDTHH:MM) || creation-date=2021-06-29 (or creation-date=2021-07-15T16:52)
|-
| gop_lastupdate || 12 || Add modification-date=YYYY-MM-DD (or YYYY-MM-DDTHH:MM) || modification-date=2021-06-29 (or modification-date=2021-07-15T16:52)
|-
|}
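
The date handling for column 9 and the assembly of column 12 can be sketched as follows. This is only a minimal illustration, assuming gop_lastupdate holds either YYYY-MM-DD or YYYY-MM-DD HH:MM:SS as described in the table; the function names and arguments (format_gpad_date, build_properties, pgid, orcid) are hypothetical and not taken from any existing WormBase script.

```python
from datetime import datetime


def format_gpad_date(lastupdate: str) -> str:
    """Convert a gop_lastupdate value to the GPAD 2.0 date form."""
    value = lastupdate.strip()
    if len(value) == 10:
        # Already YYYY-MM-DD: parse only to validate, then pass through.
        datetime.strptime(value, "%Y-%m-%d")
        return value
    # Assumed input form: YYYY-MM-DD HH:MM:SS (seconds are dropped).
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%dT%H:%M")


def build_properties(pgid, orcid, comment, lastupdate):
    """Assemble the pipe-separated annotation properties (column 12)."""
    date = format_gpad_date(lastupdate)
    # Fall back to GOC:cab1 when no ORCID is available, as in the table.
    contributor = f"https://orcid.org/{orcid}" if orcid else "GOC:cab1"
    parts = [f"id=WBOA:{pgid}", f"contributor-id={contributor}"]
    if comment:
        parts.append(f"comment={comment}")
    # Both dates come from gop_lastupdate, per the mapping table.
    parts.append(f"creation-date={date}")
    parts.append(f"modification-date={date}")
    return "|".join(parts)


print(format_gpad_date("2006-02-03 12:26:41"))
print(build_properties(3565, "0000-0002-1478-7671", "", "2021-06-29"))
```

Timestamps with fractional seconds or time zones would need an extra parsing step that this sketch does not attempt.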

Mapping gene product-to-term relation names to RO ids.

{| cellspacing="2" border="1"
|-
! qualifier name (gop_qualifier)
! RO ID
! number of annotations in WS280 ace file (353 annotations total)
|-
| acts_upstream_of_or_within || RO:0002264 || 9
|-
| located_in || RO:0001025 || 10
|-
| involved_in || RO:0002331 || 307
|-
| enables || RO:0002327 || 18
|-
| part_of || BFO:0000050 || 8
|-
|}
  • Note: no instances of these gp2term relations in the OA:
    • colocalizes_with (RO:0002325)
    • contributes_to (RO:0002326)
  • Note: found one annotation coming from the OA that lacked a gp2term relation; updated that for WS281

Mapping annotation extension relations to RO ids

{| cellspacing="2" border="1"
|-
! relation label
! RO ID
! number of annotations in WS280 ace file (353 annotations total)
|-
| has_input || RO:0002233 || 10
|-
| happens_during || RO:0002092 || 4
|-
| occurs_in || BFO:0000066 || 1
|-
| part_of || BFO:0000050 || 8
|-
|}
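
The two relation tables above can be held as simple lookups. The sketch below shows one possible way to rewrite an annotation extension such as has_input(WB:WBGene00000584) into its id form; the dictionary and function names are hypothetical, and the ids are copied from the tables and notes in this section.

```python
import re

# Gene product-to-term relations (GPAD column 3).
GP2TERM_RELATIONS = {
    "acts_upstream_of_or_within": "RO:0002264",
    "located_in": "RO:0001025",
    "involved_in": "RO:0002331",
    "enables": "RO:0002327",
    "part_of": "BFO:0000050",
    "colocalizes_with": "RO:0002325",  # no instances in the OA
    "contributes_to": "RO:0002326",    # no instances in the OA
}

# Annotation extension relations (GPAD column 11).
EXTENSION_RELATIONS = {
    "has_input": "RO:0002233",
    "happens_during": "RO:0002092",
    "occurs_in": "BFO:0000066",
    "part_of": "BFO:0000050",
}


def convert_extension(extension: str) -> str:
    """Rewrite relation(target) so the relation is an ontology id."""
    text = extension.strip()
    match = re.fullmatch(r"([^()]+)\((.+)\)", text)
    if match is None:
        raise ValueError(f"Not of the form relation(target): {extension!r}")
    relation, target = match.groups()
    if ":" in relation:
        # Already an id such as RO:0002233; leave untouched.
        return text
    if relation not in EXTENSION_RELATIONS:
        # Unmapped labels are raised so the caller can report them.
        raise KeyError(relation)
    return f"{EXTENSION_RELATIONS[relation]}({target})"


print(convert_extension("has_input(WB:WBGene00000584)"))
```

Raising on an unmapped label, rather than passing it through, is a design choice made here so that unexpected relation names surface in a report.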

Mapping three-letter GO codes to ECO ids.

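{| cellspacing="2" border="1"
|-
! three-letter GO code
! ECO ID
|-
| ISS || ECO:0000250
|-
| IEP || ECO:0000270
|-
| NAS || ECO:0000303
|-
| TAS || ECO:0000304
|-
| IC || ECO:0000305
|-
| ND || ECO:0000307
|-
| IDA || ECO:0000314
|-
| IMP || ECO:0000315
|-
| IGI || ECO:0000316
|-
| IPI || ECO:0000353
|-
|}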
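
For column 6, the evidence code table above translates directly into a lookup. A minimal sketch, with a hypothetical helper name (eco_id) and the ids copied from the table:

```python
GO_CODE_TO_ECO = {
    "ISS": "ECO:0000250",
    "IEP": "ECO:0000270",
    "NAS": "ECO:0000303",
    "TAS": "ECO:0000304",
    "IC": "ECO:0000305",
    "ND": "ECO:0000307",
    "IDA": "ECO:0000314",
    "IMP": "ECO:0000315",
    "IGI": "ECO:0000316",
    "IPI": "ECO:0000353",
}


def eco_id(go_code: str) -> str:
    """Return the ECO id for a GO evidence code from gop_goinference."""
    key = go_code.strip().upper()
    if key not in GO_CODE_TO_ECO:
        raise ValueError(f"No ECO mapping for GO evidence code {go_code!r}")
    return GO_CODE_TO_ECO[key]


print(eco_id("IDA"))
```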

Questions

  • What about blank OA entries, e.g. pgid 14222?
    • Blank entries were ignored.

Protein2GO Annotations

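  • Input files:
    • GPAD from Protein2GO, from the GOA ftp site: WB_gpad2noctua.gpa (last one created was 2021-01-11; will need to ask Alex for a new file)
    • Latest WB gpi file: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.canonical_bioproject.current_development.gene_product_info.gpi.gz
  • Other mappings needed: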
    • PMID to WBPaper
    • Relation text to gorel id
    • Curators without orcids to GOC abbreviations

{| cellspacing="2" border="1"
|-
! GPAD 2.0 column number
! GPAD 2.0 column name
! Action
! UniProt Source File Example
! WormBase Output File Example
! Report on parsing failures
|-
| 1 || Annotated entity* || UniProtKB: convert the accession to a WBGene id using the latest WB gpi file || UniProtKB:G5ED58 || WB:WBGene00006925 || Yes
|-
| 1 || Annotated entity* || UniProtKB: strip digit after '-', then convert to a WBGene id using the latest WB gpi file || UniProtKB:P34708-1 || WB:WBGene00006604 || Yes
|-
| 1 || Annotated entity* || ComplexPortal: Ignore || ComplexPortal:CPX-1000 || n/a || n/a
|-
| 1 || Annotated entity* || RNAcentral: Ignore || RNAcentral:URS00000082FF_6239 || n/a || n/a
|-
| 2 || Negation || Leave as is || NOT || NOT || No
|-
| 3 || Qualifier || Leave as is || RO:0002327 || RO:0002327 || No
|-
| 4 || GO term ID || Leave as is || GO:0051306 || GO:0051306 || No
|-
| 5 || Reference* || Leave GO_REFs as is; map PMID or DOI to corresponding WBPaper id and add WBPaper id as a pipe-separated value || PMID:10978280 || PMID:10978280|WB:WBPaper00004310 || Yes
|-
| 6 || Evidence || Leave as is || ECO:0000314 || ECO:0000314 || No
|-
| 7 || With/From* || Leave as is, except for UniProtKB: accessions; for UniProtKB: accessions, try to map to a WBGene id using the latest WB gpi file; if the UniProtKB: accession doesn't map to a WBGene id, then leave as is || UniProtKB:D9PTP8 || WB:WBGene00013354 || Yes - output a list of UniProtKB accessions that didn't map to a WBGene
|-
| 8 || Interacting taxon || Leave as is || NCBITaxon:273526 || NCBITaxon:273526 || No
|-
| 9 || Annotation date || Leave as is || 2006-02-03T12:26 || 2006-02-03T12:26 || No
|-
| 10 || Assigned_by || Leave as is || WB || WB || No
|-
| 11 || Annotation extensions || If relation is a text string, convert to an id according to the RO mapping table above. Otherwise, leave as is. || RO:0002233(WB:WBGene00000584) || RO:0002233(WB:WBGene00000584) || Yes - report on any relation text strings that don't map to an ontology id.
|-
| 12 || Annotation properties* || Most will stay as is, except for history (see below). || id=GOA:2113483848|contributor-id=https://orcid.org/0000-0002-1706-4196|comment=action:Updated by Kimberly Van Auken|model-state=???
|-
|}


Annotated Entity

  • Use the latest WormBase gpi file to map UniProtKB accessions to WBGene ids: take the column 1 value in the incoming GPAD file, find the matching column 9 value in the WB gpi file, and use the WBGene ID in column 2 of that gpi line.
  • Latest WormBase gpi file: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.canonical_bioproject.current_development.gene_product_info.gpi.gz
  • For input values that include a '-digit' suffix, e.g. UniProtKB:P37806-1, strip the '-digit' for the purposes of mapping to a WBGene id and add a comment in the Annotation properties field that records the full original UniProtKB accession. For example: comment=Original annotation made to UniProtKB:P37806-1.
  • We will be ignoring any GPAD lines that have ComplexPortal or RNAcentral ids in Column 1.
  • We have made some manual annotations to organisms other than C. elegans, so there will be some GPAD lines for which there is no mapping from the UniProtKB accession to a WBGene id but for which we want to migrate the annotation to Noctua. To find these, we'll need to check if the unmappable UniProtKB accession is also in the 6239 GPAD file. If yes, discard; if no, keep. Report on which ones we kept and which ones we discarded.
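
The column 1 rules above could be sketched roughly as below. The sketch assumes the gpi file is tab-delimited with '!' header lines, that column 2 holds the WBGene id and column 9 holds pipe-separated cross-references (as described in the bullets), and that the isoform suffix may be more than one digit. The function names are hypothetical, and the check against the 6239 GPAD file is left to the caller.

```python
import re


def load_gpi_xrefs(gpi_lines):
    """Build a UniProtKB accession -> WB:WBGene lookup from gpi lines."""
    mapping = {}
    for line in gpi_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        gene = cols[1] if cols[1].startswith("WB:") else f"WB:{cols[1]}"
        for xref in cols[8].split("|"):
            if xref.startswith("UniProtKB:"):
                mapping[xref] = gene
    return mapping


def map_annotated_entity(entity, xrefs):
    """Return (WB id, extra comment or None), or None if the line is ignored.

    Raises KeyError for UniProtKB accessions with no WBGene mapping so the
    caller can report them and apply the 6239 GPAD check.
    """
    if entity.startswith(("ComplexPortal:", "RNAcentral:")):
        return None
    base, comment = entity, None
    if re.search(r"-\d+$", entity):
        base = re.sub(r"-\d+$", "", entity)
        comment = f"comment=Original annotation made to {entity}"
    if base not in xrefs:
        raise KeyError(entity)
    return xrefs[base], comment


# Illustrative gpi line; the symbol column is a placeholder.
gpi = ["!gpi-version: 1.2\n",
       "WB\tWBGene00006604\tsymbol\t\t\tprotein\ttaxon:6239\t\tUniProtKB:P34708\n"]
xrefs = load_gpi_xrefs(gpi)
print(map_annotated_entity("UniProtKB:P34708-1", xrefs))
```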

References

  • Map incoming doi or PMID to WBPaper id using the pap_identifier table.
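
A minimal sketch of the reference step, assuming the PMID/DOI to WBPaper pairs have already been pulled from the pap_identifier table into a dictionary. The dictionary shape and the function name add_wbpaper are hypothetical; the query against pap_identifier itself is not shown.

```python
def add_wbpaper(reference: str, paper_lookup: dict) -> str:
    """Append the WB:WBPaper id to a PMID or DOI reference (column 5)."""
    if reference.startswith("GO_REF:"):
        return reference  # GO_REFs are left as is
    if reference not in paper_lookup:
        # Unmapped references are raised so they can be reported.
        raise KeyError(reference)
    return f"{reference}|WB:{paper_lookup[reference]}"


# Stand-in for values loaded from pap_identifier.
lookup = {"PMID:10978280": "WBPaper00004310"}
print(add_wbpaper("PMID:10978280", lookup))
```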

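With/From

  • Use the latest WormBase gpi file to map UniProtKB accessions (column 7 in the incoming GPAD file; column 9 in the WB gpi file) to the corresponding WBGene ID (column 2 in the WB gpi file).
  • Latest WormBase gpi file: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.canonical_bioproject.current_development.gene_product_info.gpi.gz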

Annotation Properties

  • Contributor
    • Most contributors are captured with an orcid.
    • However, Carol and Josh do not have orcids, so we need to populate a GOC abbreviation for their contributor id.
      • If contributor-id is blank and comment=action:Added by Josh Jaffery [Expired account], then populate contributor-id=GOC:jja
      • If contributor-id is blank and comment=action:Added by Carol Bastiani [Expired account], then populate contributor-id=GOC:cab1
  • History
    • Group annotations by id, e.g. id=GOA:2113472118
    • If more than one line with same id, check corresponding date field (column 9) for each line
    • On the line with the most recent date in each group, add creation-date=YYYY-MM-DD taken from the earliest annotation and a modification-date=YYYY-MM-DD for each subsequent date
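
The contributor and history rules above might be implemented along these lines. This is a sketch under assumptions: each GPAD line is already split into a list of 12 columns, dates in column 9 are ISO-style strings that sort correctly as text, and only the most recent line of each group is returned (what should happen to the earlier lines is not specified here, so dropping them is an assumption). All function names are hypothetical.

```python
from collections import defaultdict

FALLBACK_CONTRIBUTORS = {
    "Josh Jaffery": "GOC:jja",
    "Carol Bastiani": "GOC:cab1",
}


def fill_contributor(properties: str) -> str:
    """Add a GOC contributor-id when none is present (column 12)."""
    items = [item for item in properties.split("|") if item]
    if any(i.startswith("contributor-id=") and i != "contributor-id=" for i in items):
        return properties
    for name, goc_id in FALLBACK_CONTRIBUTORS.items():
        if f"comment=action:Added by {name} [Expired account]" in items:
            items = [i for i in items if i != "contributor-id="]
            return "|".join(items + [f"contributor-id={goc_id}"])
    return properties


def annotation_id(properties: str):
    """Return the value of id=... from the properties, or None."""
    for item in properties.split("|"):
        if item.startswith("id="):
            return item[len("id="):]
    return None


def collapse_history(rows):
    """Group rows by id and add history dates to the most recent row."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        # Rows without an id are kept as their own single-row group.
        key = annotation_id(row[11]) or f"no-id-{index}"
        groups[key].append(row)
    result = []
    for group in groups.values():
        group.sort(key=lambda row: row[8])  # column 9, oldest first
        latest = list(group[-1])
        if len(group) > 1:
            dates = [row[8][:10] for row in group]  # keep YYYY-MM-DD only
            extra = [f"creation-date={dates[0]}"]
            extra += [f"modification-date={d}" for d in dates[1:]]
            latest[11] = "|".join([latest[11]] + extra)
        result.append(latest)
    return result


older = ["WB:WBGene00006925", "", "RO:0002327", "GO:0051306", "PMID:10978280",
         "ECO:0000314", "", "", "2006-02-03T12:26", "WB", "", "id=GOA:2113472118"]
newer = older[:8] + ["2010-05-01"] + older[9:]
print(collapse_history([newer, older])[0][11])
```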

Final File for Import

  • The final files for import will be one file of OA annotations and one file of Protein2GO annotations, both in GPAD format
  • We'll need to make the files available for Dustin to pick up somewhere for the import or for me to gzip and email him.