GO entity markup

From WormBaseWiki

Trial runs with linking GO terms in GSA papers

This project will be carried out in parallel with the normal markup pipeline with the following criteria:

  • markup is done on a separate Textpresso machine, dev.textpresso.org (the production machine is textpresso-dev.caltech.edu)
  • the papers will not be sent to DJS (there is no upload form set for this pipeline)
  • only the GO lexicon from AmiGo will be used
  • papers marked up will be papers that come through the normal pipeline (no retroactive markup will occur for now)
  • GO marked up papers will be sent to the GO linking crew: Kimberly, Ranjana, Daniela, Chris, Karen, and Paul
  • as with the normal pipeline a link to an entity table of all entities, the generated URL, and a brief description of the webpage will be included in the alert e-mail (see below)
  • all comments for the papers will be made available through this wiki.
  • all links are formed for WB GO pages, not AmiGo pages

e-mail alert from Arun

Once a paper has been received and run through the GO linking script on dev.textpresso.org, an e-mail message will be sent out to everyone on the GO linking crew. This message includes a link to the paper and to the entity table.


Forwarded message
From: <arunr@wormbase.org>
Date: Wed, Jul 13, 2011 at 10:14 AM
Subject: GSA auto-email: GSA 128421 linked file available
To: arunr@wormbase.org

This is an automatic email sent to you by the GSA pipeline.

ATTENTION: This is not the production file. This is only for testing GO term linking.

Responsible curator: Daniela Raciti

Linked file available for manual QC at
http://dev.textpresso.org/gsa/worm/html//128421.html

The entity table for this first pass/automatically linked article is available at
http://dev.textpresso.org/gsa/worm/first_pass_entity_link_tables/128421.html

Thank you!


known problems

  • Some GO terms do not have pages; for these URLs WB displays a page titled 'Gene Ontology Search'. See, for example:

http://www.wormbase.org/db/ontology/gene?name=GO%3A0001047;class=GO_term

    • these problem links are color-coded grey in the entity table. The URL is live, but the page has no relevant content.
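
The check described above could be sketched roughly as follows. This is only an illustration, not the script that runs on dev.textpresso.org: the function names are hypothetical, and it assumes the 'Gene Ontology Search' title appears in the page's HTML <title> element. The URL pattern is copied from the example URL above.

```python
import re
from urllib.parse import quote

# Title WB shows when a GO term has no page of its own (from the note above).
PLACEHOLDER_TITLE = "Gene Ontology Search"


def wb_go_url(go_id: str) -> str:
    """Build a WB GO page URL following the pattern of the example URL."""
    # quote() with safe="" percent-encodes the colon, e.g. GO:0001047 -> GO%3A0001047.
    return (
        "http://www.wormbase.org/db/ontology/gene?name="
        + quote(go_id, safe="")
        + ";class=GO_term"
    )


def is_placeholder_page(html: str) -> bool:
    """Return True if the page title is the generic search title.

    Assumption: the title is carried in the HTML <title> element.
    """
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return False
    return PLACEHOLDER_TITLE in match.group(1)
```

A linking script could fetch each generated URL, pass the response body to is_placeholder_page, and color the entity-table row grey when it returns True.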

General Thoughts

kimberly

GO links seem to fall into three categories:

  • Those that are correct.
  • Those that are incorrect.
  • Those that aren't necessarily wrong, but either don't quite capture the essence of the entity being discussed in the paper, or are cases where maybe the GO term isn't the best or most informative link to make. This happens, for example, when the linking matches a phrase that is part of a larger concept to a GO term. Some examples of these:
 E2F transcription factors (linked to transcription, DNA-dependent)
 acetylcholine receptor agonist levamisole (linked to acetylcholine receptor activity)
 cell death gene (linked to cell death)
 rab-2 locomotion phenotype (linked to locomotion)


What can we realistically address by manual editing, and how much time would it take? Is it worth the time?

Would there be consistency issues to resolve?

Would it be better to link the overall article to GO terms, rather than link from specific terms in the paper? For example, could GO terms with a minimum number of links be attached to the paper? Is there a way to do this on the Genetics site?
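
As a rough sketch of the article-level idea, the links found in one paper could be counted per GO term and only terms at or above a threshold kept. The function name and the default threshold of 2 are assumptions made for illustration; no threshold has been agreed on.

```python
from collections import Counter


def article_level_terms(linked_go_ids, min_links=2):
    """Return GO IDs linked at least min_links times in one paper.

    linked_go_ids holds one entry per link made in the paper, so a GO ID
    repeats once for every place it was linked. min_links is a
    hypothetical cutoff, not a value chosen by the group.
    """
    counts = Counter(linked_go_ids)
    return sorted(go_id for go_id, n in counts.items() if n >= min_links)
```

For example, a paper with two links to GO:0009790 and one link to GO:0001047 would be attached only to GO:0009790 at the default threshold.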

Should we just link from certain sections, e.g. the abstract?

What options do we currently have for viewing links? Can users select what types of links they want to see, e.g. what branch of the ontology or string-matches vs curated links?

What role could community annotation play here?

The ontology could certainly add more synonyms, add plurals, etc. What's the most efficient way to do this?

Ranjana

  • One question I had was: would we link all text matches, or only link GO terms relevant to the gene/processes being studied in the paper?
  • Would we do materials and methods, discussion, etc.?
  • I looked at only one paper, but could immediately make out that this is a time-intensive effort, especially if we want to do a good-quality job.
  • I feel the cost-benefit might be better if we did only abstracts; we could see how it goes once in production and take it from there.
  • With the linking of GO terms you start getting into the areas of annotation, and you don't want to mislead the user; that's what makes it time-intensive.

Papers that have been linked

Click on the associated links to see the various pages documenting the GO linking of that paper

doi10.1534/genetics.111.128421 00038399 | GO_linked_html | GO_entity_list | WBPaper00038399_GO_linking_comments
doi10.1534/genetics.111.130450 00038523 | GO_linked_html | GO_entity_list | WBPaper00038523_GO_linking_comments
doi10.1534/genetics.111.131227 00038528 | GO_linked_html | GO_entity_list | WBPaper00038528_GO_linking_comments
doi10.1534/genetics.111.131714 00039858 expected

Meeting 8/9/2011

options

  • define an acceptable quality of linking, set boundaries for the linking
  • linking script actions
    • exclusion list
    • synonym classes
    • plural
  • don't actually link the terms, just create an entity list
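
To make the exclusion-list and plural options concrete, here is a minimal sketch of how a string-matching linker might apply them. It is not the actual linking script: the function name, the lexicon layout (lowercase surface form mapped to a GO term name), and the "optional trailing s" plural rule are all assumptions.

```python
import re


def find_links(text, lexicon, exclusions=frozenset()):
    """Return (matched_text, go_term) pairs for lexicon entries found in text.

    lexicon maps a lowercase surface form to a GO term name (toy layout).
    exclusions is a set of lowercase surface forms that are never linked.
    """
    links = []
    # Iterate in sorted order so the output is deterministic.
    for form in sorted(lexicon):
        if form in exclusions:
            continue
        # An optional trailing "s" is a crude stand-in for plural handling.
        pattern = re.compile(r"\b" + re.escape(form) + r"s?\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            links.append((match.group(0), lexicon[form]))
    return links
```

With a toy lexicon containing "cell death" and "constriction", adding "constriction" to the exclusion set suppresses that link while a plural such as "Cell deaths" is still matched. Synonym classes would need a richer lexicon structure than this flat dictionary.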

action items

  • bring in SGD
  • define goal of this linking:
    • mark-up
    • annotation
    • feeding back to GO
  • look specifically at abstracts for GO process terms, check accuracy; 5-10 abstracts

GSA-GO Linking Summary Tables

The following 10 GENETICS article IDs have been marked up for evaluation.
Although the whole paper has been marked up in each case, we are only evaluating
the links in the abstract, which has been colored for clarity:

131227
130450
128421
128389
129064
128512
129486
123323
110338
123992

Please reserve comments on the links in the rest of the paper for later. 

The linked files can be accessed at:
http://dev.textpresso.org/gsa/worm/html/?M=D
 
The corresponding entity tables are at:
http://dev.textpresso.org/gsa/worm/first_pass_entity_link_tables/?M=D

  • look at individually marked up terms and their branches of the ontology to get accuracy rates.
  • have something by November for the GO meeting
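
One possible way to tabulate the per-branch rates is sketched below. The function name and the label strings ("correct", "incorrect", "grey") are assumptions chosen to mirror the three categories discussed on this page; the input is one (branch, label) pair per evaluated link.

```python
from collections import Counter, defaultdict


def rates_by_branch(judgments):
    """Return, for each ontology branch, the fraction of links per label.

    judgments is an iterable of (branch, label) pairs, one per evaluated
    link. Labels are free-form strings; "correct", "incorrect" and "grey"
    are assumed here.
    """
    counts = defaultdict(Counter)
    for branch, label in judgments:
        counts[branch][label] += 1
    return {
        branch: {label: n / sum(c.values()) for label, n in c.items()}
        for branch, c in counts.items()
    }
```

The "correct" fraction for a branch is then its accuracy rate, and the "grey" fraction shows how much of the remainder is a judgment call rather than a clear error.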

Grey areas to address

  • Those that aren't necessarily wrong, but either don't quite capture the essence of the entity being discussed in the paper, or are cases where maybe the GO term isn't the best or most informative link to make. This happens, for example, when the linking matches a phrase that is part of a larger concept to a GO term. Some examples of these:
 E2F transcription factors (linked to transcription, DNA-dependent)
 acetylcholine receptor agonist levamisole (linked to acetylcholine receptor activity)
 cell death gene (linked to cell death)
 rab-2 locomotion phenotype (linked to locomotion)


Inappropriate Links to Remove From the Entity List

These are examples of GO-term links that have been made in some of the test cases that are clearly incorrect and perhaps should be removed (pending approval from the group) entirely from the entity lists:

1)

Paper ID(s): 131227, 128512

Linked Term: "hypersensitivity"

Reason to remove: The GO term "hypersensitivity" has a rather specific meaning/definition that implies an immune response and inflammation; this clearly does not apply to C. elegans


GO definition: An inflammatory response to an exogenous environmental antigen or an endogenous antigen initiated by the adaptive immune system.


Hypersensitivity in the context of C. elegans papers refers to hypersensitivity to things like drugs, neuronal signaling molecules, etc.


Add to exclusion list? Y(kv)


2)

Paper ID(s): 130450

Linked Term: "FPS"

Reason to remove: This acronym came up in reference to "Fast Pharyngeal Pumping Span", whereas the link goes to the GO term "floridoside-phosphate synthase activity"; this is clearly incorrect. Acronyms like this are likely to produce false positives often; perhaps we can leave them for removal at the QC step when the link is inappropriate?


For now, I would be in favor of removing these types of links at the QC step. (kv)


3)

Paper ID(s): 130450

Linked Term: "LS"

Reason to remove: Another acronym. In this paper it refers to "life span", but the link goes to the GO term "lipoate synthase activity". "LS" is likely to mean many things in many contexts.


Same as above - for now, remove at QC step. (kv)


4)

Paper ID(s): 128421

Linked Term: "midbody"

Reason to remove: In reference to the midbody region of the worm, "midbody" links to the GO term that refers to the macromolecular complex involved in cellular mitosis. This is likely to happen often in C. elegans literature.


I would actually keep this term in the list. A Textpresso abstract search for 'midbody' returns 22 abstracts, 21 of which refer to the cellular component. (kv)


5)

Paper ID(s): 128389, 110338, 123992

Linked Term: "core"

Reason to remove: The term "core" is linked to the GO term "viral nucleocapsid" for some strange reason. This term should be removed from the entity list.


'Core' is a broad synonym of 'viral nucleocapsid'. (kv)


6)

Paper ID(s): 128512

Linked Term: "PVC"

Reason to remove: The paper refers to the "PVC" interneuron of C. elegans, but the term links to the GO term "late endosome". This inappropriate linking may be irrelevant once the anatomy linking supersedes the GO term linking.


Given that there is an anatomy term PVC, I'd be fine with adding this to the GO exclusion list. (kv)


7)

Paper ID(s): 129486

Linked Term: "constriction"

Reason to remove: This term links to the GO term "polytene chromosome weak point", even though "constriction" could have many meanings. Should be removed from the entity list


I'd be fine with adding 'constriction' to the exclusion list for C. elegans papers. (kv)


8)

Paper ID(s): 110338

Linked Term: "CA"

Reason to remove: Here "CA" refers to California in an address; it links to the GO term "glutaryl-7-aminocephalosporanic-acid acylase activity"


I'd be fine with adding 'CA' to the exclusion list. (kv)
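
Several of the cases above (FPS, LS, PVC, CA) are short acronyms. As a sketch of how such matches might be set aside for the QC step rather than linked silently, the helper below flags any matched text made up of two to four capital letters. The function names and the length range are assumptions, and the rule is only a heuristic.

```python
import re

# Hypothetical rule: two to four capital letters and nothing else.
ACRONYM = re.compile(r"[A-Z]{2,4}")


def needs_qc(matched_text: str) -> bool:
    """Return True if the matched text looks like a short acronym."""
    return ACRONYM.fullmatch(matched_text) is not None


def split_for_qc(matched_texts):
    """Split matched strings into (ordinary, flagged_for_qc) lists."""
    ordinary = [t for t in matched_texts if not needs_qc(t)]
    flagged = [t for t in matched_texts if needs_qc(t)]
    return ordinary, flagged
```

Terms such as "hypersensitivity", "core" or "constriction" are not caught by this rule and would still need the exclusion list.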


Odd GO Term Links and Errors

These links are probably not what we want and we should consider revising how the linking script handles them in the future.

1)

Paper ID(s): 131227

Linked Term: "synaptic membrane"

Problem: The link points to a WormBase GO term search page

Term in .ace file?


2)

Paper ID(s): 128421

Linked Term: "transactivation"

Problem: The link points to a WormBase GO term search page

Term in .ace file?


3)

Paper ID(s): 128421

Linked Term: "core promoter binding"

Problem: The link points to a WormBase GO term search page

Term in .ace file?


4)

Paper ID(s): 128421, 128389

Linked Term: "dimerization"

Problem: This term should link to the general "protein dimerization activity", but instead links to the more specific "protein homodimerization activity"

Dimerization activity is a broad synonym of protein homodimerization activity. Note that on the last GO conference call (09/13/2011) there was discussion about revising and/or removing the dimerization and multimerization terms, so we may just want to handle this on a case-by-case basis until the issues with these terms get sorted out by GO. (kv)


5)

Paper ID(s): 128389

Linked Term: "E-box binding"

Problem: The link points to a WormBase GO term search page

Term in .ace file?


6)

Paper ID(s): 129064, 129486, 110338

Linked Term: "embryogenesis"

Problem: This term is linked to the GO term "embryonic development ending in seed dormancy" which is too specific and irrelevant to C. elegans. Should be linked to "embryo development" (GO:0009790)


This could also link out to GO:0009792 'embryonic development ending in birth or egg hatching' which is what I typically annotate to for C. elegans papers. Which link gets used might depend upon the context in which the term embryogenesis appears. (kv)


7)

Paper ID(s): 128512, 110338, 123992

Linked Term: "transmembrane"

Problem: The link points to a WormBase GO term search page

Term in .ace file?


8)

Paper ID(s): 129486

Linked Term: "entry into mitosis"

Problem: This term links to the GO term "cell cycle switching, meiotic to mitotic cell cycle" although this could simply be referring to the G2-to-M cell cycle transition.

This link matched a narrow synonym of 'cell cycle switching, meiotic to mitotic cell cycle'. This could be handled at the QC step? (kv)


9)

Paper ID(s): 110338

Linked Term: "meiotic chromosome"

Problem: This term links to the GO term "condensed nuclear chromosome" which is misleading


'Meiotic chromosome' is a related synonym for 'condensed nuclear chromosome'. Correct use of this term may depend upon the context, i.e. what stage of meiosis, if known. (kv)


10)

Paper ID(s): 123992

Linked Term: "cytochrome P450"

Problem: This term links to the GO term "oxygen binding"; is this what we want?

Perhaps link out to 'monooxygenase activity'? This is one of the InterPro2GO mappings for cytochrome P450s. (kv)
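
Many of the odd links above trace back to broad, narrow, or related synonyms. One option, sketched here under assumptions, is to auto-link only matches on the term name or an exact synonym and route the rest to QC. The scope labels EXACT, BROAD, NARROW and RELATED follow the GO synonym scopes; "NAME" is a hypothetical label for a match on the primary term name, and the tuple layout is illustrative only.

```python
# Scopes trusted for automatic linking in this sketch; "NAME" is a
# hypothetical label for a match on the primary term name.
AUTO_LINK_SCOPES = {"NAME", "EXACT"}


def triage(matches):
    """Split matches into (auto_link, review) lists by synonym scope.

    matches is an iterable of (matched_text, go_term, scope) tuples, where
    scope is "NAME", "EXACT", "BROAD", "NARROW" or "RELATED".
    """
    auto_link, review = [], []
    for matched_text, go_term, scope in matches:
        target = auto_link if scope in AUTO_LINK_SCOPES else review
        target.append((matched_text, go_term))
    return auto_link, review
```

Under this rule "core" (a broad synonym of 'viral nucleocapsid') and "meiotic chromosome" (a related synonym of 'condensed nuclear chromosome') would go to review instead of being linked automatically. Whether that trade-off loses too many good links is something the abstract evaluation could measure.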