GO entity markup
Trial runs with linking GO terms in GSA papers
This project will be carried out in parallel with the normal markup pipeline with the following criteria:
- markup is done on a separate Textpresso machine, dev.textpresso.org (the production machine is textpresso-dev.caltech.edu)
- the papers will not be sent to DJS (there is no upload form set for this pipeline)
- only the GO lexicon from AmiGO will be used
- papers marked up will be papers that come through the normal pipeline (no retroactive markup will occur for now)
- GO marked up papers will be sent to the GO linking crew: Kimberly, Ranjana, Daniela, Chris, Karen, and Paul
- as with the normal pipeline, the alert e-mail will include a link to an entity table of all entities, the generated URL, and a brief description of the webpage (see below)
- all comments for the papers will be made available through this wiki.
- all links are formed for WB GO pages, not AmiGO pages
e-mail alert from Arun
Once a paper has been received and run through the GO linking script on dev.textpresso.org, an e-mail message will be sent out to everyone on the GO linking crew. This message includes a link to the paper and to the entity table.
Forwarded message
From: <arunr@wormbase.org>
Date: Wed, Jul 13, 2011 at 10:14 AM
Subject: GSA auto-email: GSA 128421 linked file available
To: arunr@wormbase.org
This is an automatic email sent to you by the GSA pipeline.
ATTENTION: This is not the production file. This is only for testing GO term linking.
Responsible curator: Daniela Raciti
Linked file available for manual QC at
http://dev.textpresso.org/gsa/worm/html//128421.html
The entity table for this first pass/automatically linked article is available at
http://dev.textpresso.org/gsa/worm/first_pass_entity_link_tables/128421.html
Thank you!
known problems
- Some GO terms do not have pages, and WB displays a page titled 'Gene Ontology Search' for these URLs. See, for example:
http://www.wormbase.org/db/ontology/gene?name=GO%3A0001047;class=GO_term
- these problem links have been color-coded in 'grey' in the entity table. The URL is live, but the page has no relevant content.
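The link format used for WB GO pages can be sketched as follows. This is a minimal illustration, not the linking script itself: the function name `wb_go_url` is hypothetical, and treating the example URL above as a template for every GO term is an assumption.

```python
from urllib.parse import quote

# Base path and query layout are copied from the example URL above.
# Assumption: the same layout applies to every GO term.
WB_GO_BASE = "http://www.wormbase.org/db/ontology/gene"


def wb_go_url(go_id: str) -> str:
    """Return a WormBase GO page URL for an ID such as 'GO:0001047'."""
    # quote() with safe="" percent-encodes the colon as %3A.
    return f"{WB_GO_BASE}?name={quote(go_id, safe='')};class=GO_term"


print(wb_go_url("GO:0001047"))
```

A URL built this way is still subject to the problem described above: it can be live and yet lead to the 'Gene Ontology Search' page when WB has no page for the term.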
General Thoughts
kimberly
GO links seem to fall into three categories:
- Those that are correct.
- Those that are incorrect.
- Those that aren't necessarily wrong, but either don't quite capture the essence of the entity being discussed in the paper, or are cases where maybe the GO term isn't the best or most informative link to make. This happens, for example, when the linking matches a phrase that is part of a larger concept to a GO term. Some examples of these:
E2F transcription factors (linked to transcription, DNA-dependent)
acetylcholine receptor agonist levamisole (linked to acetylcholine receptor activity)
cell death gene (linked to cell death)
rab-2 locomotion phenotype (linked to locomotion)
What can we realistically address by manual editing, and how much time would it take? Is it worth the time?
Would there be consistency issues to resolve?
Would it be better to link the overall article to GO terms, rather than link from specific terms in the paper? For example, could GO terms with a minimum number of links be attached to the paper? Is there a way to do this on the Genetics site?
Should we just link from certain sections, e.g. the abstract?
What options do we currently have for viewing links? Can users select what types of links they want to see, e.g. what branch of the ontology or string-matches vs curated links?
What role could community annotation play here?
The ontology could certainly add more synonyms, add plurals, etc. What's the most efficient way to do this?
Ranjana
- One question I had was--would we link all text matches, or only link GO terms relevant to the gene/processes being studied in the paper?
- Would we do materials and methods, discussion, etc?
- I looked at only one paper, but could immediately make out that this is a time-intensive effort, especially if we want to do a good-quality job.
- I feel the cost-benefit might be better if we did only abstracts; we could see how it goes once in production and take it from there.
- With the linking of GO terms you start getting into the areas of annotation, and you don't want to mislead the user; that's what makes it time-intensive.
Papers that have been linked
Click on the associated links to see the various pages documenting the GO linking of that paper
doi10.1534/genetics.111.128421 00038399 | GO_linked_html | GO_entity_list | WBPaper00038399_GO_linking_comments
doi10.1534/genetics.111.130450 00038523 | GO_linked_html | GO_entity_list | WBPaper00038523_GO_linking_comments
doi10.1534/genetics.111.131227 00038528 | GO_linked_html | GO_entity_list | WBPaper00038528_GO_linking_comments
doi10.1534/genetics.111.131714 00039858 expected
Meeting 8/9/2011
options
- define an acceptable quality of linking, set boundaries for the linking
- linking script actions
- exclusion list
- synonym classes
- plural
- don't actually link the terms, just create an entity list
action items
- bring in SGD
- define goal of this linking:
- mark-up
- annotation
- feeding back to GO
- look specifically at abstracts for GO process terms, check accuracy; 5-10 abstracts
GSA-GO Linking Summary Tables
- Please enter results and comments for GO links for the papers below in the appropriate GSA-GO Linking Summary Tables.
The following 10 GENETICS article IDs have been marked up for evaluation. Although the whole paper has been marked up, we are only evaluating the links in the abstract, which has been colored for clarity:
131227 130450 128421 128389 129064 128512 129486 123323 110338 123992
Please reserve comments on the links in the rest of the paper for later.
The linked files can be accessed at: http://dev.textpresso.org/gsa/worm/html/?M=D
The corresponding entity tables are at: http://dev.textpresso.org/gsa/worm/first_pass_entity_link_tables/?M=D
- look at individually marked up terms and their branches of the ontology to get accuracy rates.
- have something by November for the GO meeting
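One way to turn the summary tables into accuracy rates per ontology branch is sketched below. The function name, the input layout, and the verdict labels ('correct', 'incorrect', 'grey', following the three categories under General Thoughts) are assumptions, and the sample rows are made-up placeholders, not evaluation results.

```python
from collections import Counter, defaultdict


def accuracy_by_branch(judgements):
    """Return {branch: fraction of links judged 'correct'}.

    judgements: iterable of (branch, verdict) pairs, where verdict is
    'correct', 'incorrect' or 'grey' (hypothetical labels).
    """
    counts = defaultdict(Counter)
    for branch, verdict in judgements:
        counts[branch][verdict] += 1
    return {
        branch: tally["correct"] / sum(tally.values())
        for branch, tally in counts.items()
    }


# Placeholder rows for illustration only; these are NOT real results.
sample = [
    ("biological_process", "correct"),
    ("biological_process", "correct"),
    ("biological_process", "correct"),
    ("biological_process", "grey"),
    ("cellular_component", "correct"),
    ("cellular_component", "incorrect"),
]
print(accuracy_by_branch(sample))
```

Counting 'grey' links as not correct gives a conservative rate; reporting them separately may be more useful once the group agrees on how to treat them.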
Grey areas to address
- Those that aren't necessarily wrong, but either don't quite capture the essence of the entity being discussed in the paper, or are cases where maybe the GO term isn't the best or most informative link to make. This happens, for example, when the linking matches a phrase that is part of a larger concept to a GO term. Some examples of these:
E2F transcription factors (linked to transcription, DNA-dependent)
acetylcholine receptor agonist levamisole (linked to acetylcholine receptor activity)
cell death gene (linked to cell death)
rab-2 locomotion phenotype (linked to locomotion)
Inappropriate Links to Remove From the Entity List
These are examples of GO-term links that have been made in some of the test cases that are clearly incorrect and perhaps should be removed (pending approval from the group) entirely from the entity lists:
1)
Paper ID(s): 131227, 128512
Linked Term: "hypersensitivity"
Reason to remove: The GO term "hypersensitivity" has a rather specific meaning/definition that implies an immune response and inflammation; this clearly does not apply to C. elegans
GO definition: An inflammatory response to an exogenous environmental antigen or an endogenous antigen initiated by the adaptive immune system.
Hypersensitivity in the context of C. elegans papers refers to hypersensitivity to things like drugs, neuronal signaling molecules, etc.
Add to exclusion list? Y(kv)
2)
Paper ID(s): 130450
Linked Term: "FPS"
Reason to remove: This acronym came up in reference to "Fast Pharyngeal Pumping Span" whereas the GO term links to "floridoside-phosphate synthase activity"; this is clearly incorrect. Acronyms like this are likely to often produce false positives; perhaps we can just leave this for removal at the QC step when it is inappropriate?
For now, I would be in favor of removing these types of links at the QC step. (kv)
3)
Paper ID(s): 130450
Linked Term: "LS"
Reason to remove: Another acronym; referring to "life span" in this paper, the link goes to the GO term "lipoate synthase activity". "LS" is likely to mean many things in many contexts
Same as above - for now, remove at QC step. (kv)
4)
Paper ID(s): 128421
Linked Term: "midbody"
Reason to remove: In reference to the midbody region of the worm, "midbody" links to the GO term that refers to the macromolecular complex involved in cellular mitosis. This is likely to happen often in C. elegans literature.
I would actually keep this term in the list. A Textpresso abstract search for 'midbody' returns 22 abstracts, 21 of which refer to the cellular component. (kv)
5)
Paper ID(s): 128389, 110338, 123992
Linked Term: "core"
Reason to remove: The term "core" is linked to the GO term "viral nucleocapsid" for some strange reason. This term should be removed from the entity list.
'Core' is a broad synonym of 'viral nucleocapsid'. (kv)
6)
Paper ID(s): 128512
Linked Term: "PVC"
Reason to remove: The paper refers to the "PVC" interneuron of C. elegans, but the term links to the GO term "late endosome". This inappropriate linking may be irrelevant once the anatomy linking supersedes the GO term linking.
Given that there is an anatomy term PVC, I'd be fine with adding this to the GO exclusion list. (kv)
7)
Paper ID(s): 129486
Linked Term: "constriction"
Reason to remove: This term links to the GO term "polytene chromosome weak point", even though "constriction" could have many meanings. Should be removed from the entity list
I'd be fine with adding 'constriction' to the exclusion list for C. elegans papers. (kv)
8)
Paper ID(s): 110338
Linked Term: "CA"
Reason to remove: Here "CA" refers to California in an address; it links to the GO term "glutaryl-7-aminocephalosporanic-acid acylase activity"
I'd be fine with adding 'CA' to the exclusion list. (kv)
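The decisions proposed in the (kv) comments above could be applied by the linking script roughly as sketched here. This is only an illustration under stated assumptions: the names `EXCLUSION_LIST`, `QC_REVIEW` and `triage` are hypothetical, matching is done on the exact matched text, and the list contents are still pending approval from the group.

```python
# Terms proposed for the exclusion list in the comments above
# (pending group approval).
EXCLUSION_LIST = {"hypersensitivity", "PVC", "constriction", "CA"}

# Acronyms that, for now, would be left in place and removed by hand
# at the QC step when they are inappropriate.
QC_REVIEW = {"FPS", "LS"}


def triage(matched_text: str) -> str:
    """Classify a matched phrase as 'exclude', 'qc' or 'link'."""
    if matched_text in EXCLUSION_LIST:
        return "exclude"  # never link
    if matched_text in QC_REVIEW:
        return "qc"       # link, but flag for manual review
    return "link"         # link as usual


for text in ["CA", "LS", "midbody", "core"]:
    print(text, triage(text))
```

Matching is case-sensitive here so that an acronym such as "CA" does not suppress unrelated lowercase text; whether that is the right behaviour for the real script is an open question.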
Odd GO Term Links and Errors
These links are probably not what we want and we should consider revising how the linking script handles them in the future.
1)
Paper ID(s): 131227
Linked Term: "synaptic membrane"
Problem: The link leads to a WormBase GO term search page
Term in .ace file?
2)
Paper ID(s): 128421
Linked Term: "transactivation"
Problem: The link leads to a WormBase GO term search page
Term in .ace file?
3)
Paper ID(s): 128421
Linked Term: "core promoter binding"
Problem: The link leads to a WormBase GO term search page
Term in .ace file?
4)
Paper ID(s): 128421, 128389
Linked Term: "dimerization"
Problem: This term should link to the general "protein dimerization activity", but instead links to the more specific "protein homodimerization activity"
Dimerization activity is a broad synonym of protein homodimerization activity. Note that on the last GO conference call (09/13/2011) there was discussion about revising and/or removing the dimerization and multimerization terms, so we may just want to handle this on a case-by-case basis until the issues with these terms get sorted out by GO. (kv)
5)
Paper ID(s): 128389
Linked Term: "E-box binding"
Problem: The link leads to a WormBase GO term search page
Term in .ace file?
6)
Paper ID(s): 129064, 129486, 110338
Linked Term: "embryogenesis"
Problem: This term is linked to the GO term "embryonic development ending in seed dormancy" which is too specific and irrelevant to C. elegans. Should be linked to "embryo development" (GO:0009790)
This could also link out to GO:0009792 'embryonic development ending in birth or egg hatching' which is what I typically annotate to for C. elegans papers. Which link gets used might depend upon the context in which the term embryogenesis appears. (kv)
7)
Paper ID(s): 128512, 110338, 123992
Linked Term: "transmembrane"
Problem: The link leads to a WormBase GO term search page
Term in .ace file?
8)
Paper ID(s): 129486
Linked Term: "entry into mitosis"
Problem: This term links to the GO term "cell cycle switching, meiotic to mitotic cell cycle" although this could simply be referring to the G2-to-M cell cycle transition.
This link matched a narrow synonym of 'cell cycle switching, meiotic to mitotic cell cycle'. This could be handled at the QC step? (kv)
9)
Paper ID(s): 110338
Linked Term: "meiotic chromosome"
Problem: This term links to the GO term "condensed nuclear chromosome" which is misleading
'Meiotic chromosome' is a related synonym for 'condensed nuclear chromosome'. Correct use of this term may depend upon the context, i.e. what stage of meiosis, if known. (kv)
10)
Paper ID(s): 123992
Linked Term: "cytochrome P450"
Problem: This term links to the GO term "oxygen binding"; is this what we want?
Perhaps link out to 'monooxygenase activity'? This is one of the InterPro2GO mappings for cytochrome P450s. (kv)
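Several of the odd links above come from matches on broad, narrow or related synonyms rather than on the term name or an exact synonym. A possible way to surface these for QC is sketched below. The `Link` record, the `needs_qc` helper and the policy of auto-accepting only exact matches are assumptions for illustration; the scopes listed are those given in the (kv) comments.

```python
from typing import NamedTuple


class Link(NamedTuple):
    matched_text: str  # phrase found in the paper
    go_term: str       # GO term name the phrase was linked to
    scope: str         # synonym scope: EXACT, BROAD, NARROW or RELATED


# Scopes as stated in the (kv) comments above.
LINKS = [
    Link("core", "viral nucleocapsid", "BROAD"),
    Link("entry into mitosis",
         "cell cycle switching, meiotic to mitotic cell cycle", "NARROW"),
    Link("meiotic chromosome", "condensed nuclear chromosome", "RELATED"),
]


def needs_qc(link: Link) -> bool:
    # Assumed policy: only exact matches are linked without review.
    return link.scope != "EXACT"


flagged = [link.matched_text for link in LINKS if needs_qc(link)]
print(flagged)
```

Flagging rather than dropping non-exact matches keeps cases like 'midbody', where the GO meaning is usually the intended one, available for the curator to confirm.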