Mf hmm tool

From WormBaseWiki
Intended Use

The mf_hmm tool is used to categorize sentences according to whether they describe enzymatic or transporter activities. The results of the categorization will be used to continually train an HMM to identify sentences for curation from new papers.

Login page

http://textpresso-dev.caltech.edu/cgi-bin/azurebrd/mf_hmm.cgi

The page is password protected.

The front page is a curator login page. This will ensure that subsequent linking out from the form to the paper in the paper editor uses the appropriate curator ID.

Select a curator name from the drop-down menu and click on Login.

List of papers page

The next page contains a list of all papers that potentially contain positive sentences. The papers are sorted according to relative rank (i.e., combined sentence score of the HMM for that paper).
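
The ordering described above can be sketched as follows. This is only an illustration: the page does not say how sentence scores are combined, so summation is an assumption here, as are the descending order, the function name rank_papers, and the input layout.

```python
from typing import Dict, List


def rank_papers(scores_by_paper: Dict[str, List[float]]) -> List[str]:
    """Return paper IDs ordered by combined HMM sentence score.

    Assumption: "combined" means the sum of the per-sentence scores,
    and higher combined scores rank first.
    """
    combined = {paper: sum(scores) for paper, scores in scores_by_paper.items()}
    return sorted(combined, key=lambda paper: combined[paper], reverse=True)
```

If the tool combines scores differently (for example by mean or maximum), only the expression that builds combined would change.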

Each paper listed on this page has one of three curation status tags: Done, Not Done, or Partial.

The curation status tag is based upon the selection of values (more on this below) for each sentence in the paper:

Done = every sentence has a curated, TP, or FP flag.

Not Done = every sentence has a blank flag.

Partial = a mix of flagged (curated, TP, or FP) and blank sentences.
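
The status rules above can be expressed as a small function. This is a minimal sketch, not the tool's actual code: the flag strings ("curated", "TP", "FP", and an empty string for blank) and the function name paper_status are assumptions made for illustration.

```python
from typing import Iterable

# Assumed flag values; the strings the tool actually stores may differ.
CLASSIFIED_FLAGS = {"curated", "TP", "FP"}


def paper_status(flags: Iterable[str]) -> str:
    """Derive a paper's curation status from its per-sentence flags."""
    classified = [flag.strip() in CLASSIFIED_FLAGS for flag in flags]
    if classified and all(classified):
        return "Done"        # every sentence carries a curated/TP/FP flag
    if not any(classified):
        return "Not Done"    # every sentence is still blank
    return "Partial"         # some flagged, some blank
```

A paper with no sentences is reported as Not Done here, which is a design choice of the sketch rather than documented behaviour.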

Clicking on a paper link takes the curator to a page listing the sentences and curation options.

Sentence Classification

Sentences can be classified in one of four ways:

  1. Curate
  2. True positive
  3. False positive
  4. Blank
  • Curate indicates that the curator could confidently make an annotation from that sentence. The bar for selecting a sentence for curation is quite high, since correctly assigning a GO annotation requires:

1) the name of the C. elegans protein or gene product to be annotated

Sentences that refer to proteins or gene products from other species were left blank, but could be classified with the same criteria as for C. elegans, if we want that.

2) an indication of the type of enzymatic or transporter activity (even if this only leads to annotation to a high-level GO term)

3) a reasonable indication that the statement describes an experimental result from that paper, such as including the term Figure or Table, as opposed to a statement that describes previously published work.

This is probably the most difficult piece of information to discern from an isolated statement and may benefit greatly from additional contextual information, such as paper section.

Sometimes the positive annotation can be inferred from, or supported by, evidence of loss or reduction of activity in response to a specific chemical treatment.

For example (WBPaper00005703):

In contrast , DF dramatically reduced aconitase activity in GEI-22 / ACO-1- and pcDNA3-transfected cells .

Controlled vocabulary for classifying this statement:

GEI-22/ACO-1 - GO:0003994, GO:0072350 - inferred from lost or reduced activity


In addition, if a statement includes a reference for the activity, it was marked as a TP.

Statements that run on from the heading of a paper sub-section can often be used to make an annotation.

    WBPaper00026599
    All three TLK-1 proteins were active kinases in vitro, undergoing autophosphorylation and phosphorylating MYBP and ASF-1 (Figure 4A).
  • TP indicates that the sentence discusses enzymatic or transporter activity, but that an annotation could not be made.
    • Included in this category are sentences reporting that an assay was performed, but not what the actual result was.
    WBPaper00026599
    The kinase activity of each TLK-1 protein was examined with MYBP and recombinant ASF-1 as substrates.
    • Also included in this category are sentences that describe the experimental results solely for a mutant version of the protein.
    WBPaper0003093
    DKF-2 Asp925 , Asp929 activity increased further in cells treated with either PMA plus GF103209X (30% increase) or PMA alone (45% increase). 

These sentences are in contrast to sentences that describe activity of mutant proteins in the context of activity of wild-type proteins.

Controlled vocabulary for classifying TP sentences:

1) Experiment was performed



  • FP indicates that the sentence does not describe enzymatic or transporter activity.
    • False positive sentences include statements of other types of experimental data and statements from Materials and Methods. Some examples:

Controlled vocabulary for false positive sentences:

1) Experiment was performed - RNA binding activity

2) RNA binding activity

3) RNA binding activity - NOT


    • False positive sentences may come from 'scrambled' sentences or problems with the pdf-to-text conversion. For example:
    WBPaper00026599 
    Although early tlk-1 (RNAi) embryos did not display chromatin-segregation or chromosome-morphology      
    abnormalities, loss of TLK-1 activity in older embryos (around the 3040 cell stage) resulted in altered nuclear morphology 
    and anaphase TLK-1 Is a Substrate Activator of AIR-2 903 chromosome-segregation defects. 
    
    Note the insert of 'TLK-1 is a Substrate Activator of AIR-2 903' from the abbreviated title and page number in the body.
  • Blank indicates that the sentence was intentionally left unclassified because it refers to a protein from a species other than C. elegans. If needed, these sentences could be analyzed again and classified as Curate, TP, or FP.
  • False negative sentences, if found during the course of analysis or other curation, are being saved in a separate text file to be used for future training.


Thoughts on Curation

The issue of synonyms seems to be particularly critical for enzymology papers. C. elegans proteins are often referred to by a synonym, e.g. Dicer activity, Slicer activity, Dcp1/Dcp2-like activity. This sometimes makes it difficult to determine if an annotation can really be made to the C. elegans protein. --K.

It would be good to have a distribution of the percentages for each type of sentence. --K.

For future curation, how would the form handle symbols like prime, e.g. 5' or 3'? --K.

Curation page

The curation page lists all of the sentences from a given paper that had a score of 9 -6 for the HMM, starting with the highest scoring sentences at the top.

Each sentence also has a number, for curator record keeping. This number does not reflect the sentence number from the marked-up paper.

The top of the page offers curators the option to mark sentences from a paper individually or to mark all of the sentences as a unit. This allows a curator to quickly classify all sentences if it is readily apparent that they all fall into one of the classifications.

Once the curator is done, they can click on Front or click on Next.

  • Clicking on Front will save the results and return to the Front page listing the papers.
  • Clicking on Next will save the results and move on to the next paper in the list.

Curation in the OA

Any annotations that can be made from these sentences are being entered into the OA.

Each annotation coming from this work is being flagged with the text 'mfea_hmm' in the Comments field so we can keep track of the annotations made from this work.

Caveats

Don't use tabs in the sentences of the HMM results or in comments.

Don't overwrite, delete, or make unwritable the papers_done file.

Don't overwrite, delete, or make unwritable any WBPaper######## file that has been curated (these have a done or partial status in the papers_done file).

How code works, in general

Data in directory : /data2/srv/textpresso-dev.caltech.edu/www/docroot/michael/mfea-curation/

The index.html file provides the paper order for the main page.

The papers_done file stores the done / partial / <blank> status of the papers that have been looked at.

The WBPaper######## files hold the HMM mark results. When a sentence is curated, tab-separated fields are appended to its line, so that the line becomes <start line><hmm><tab><status><tab><curator_id><tab><comment><end of line>
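
The line layout described above could be read and written along these lines. This is a hedged sketch rather than the tool's own code: it assumes the <hmm> part is a single tab-free field, that uncurated lines contain only that field, and the names SentenceRecord, parse_line, and format_line are hypothetical.

```python
from typing import NamedTuple


class SentenceRecord(NamedTuple):
    hmm: str          # HMM result text for the sentence
    status: str       # curation status, empty if not yet curated
    curator_id: str   # e.g. "two" followed by a WBPerson number
    comment: str      # free-text comment, empty if none


def parse_line(line: str) -> SentenceRecord:
    """Split one line of a WBPaper######## file into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 4:
        raise ValueError("unexpected extra tab-separated fields")
    # Uncurated lines hold only the HMM text; pad the missing fields.
    fields += [""] * (4 - len(fields))
    return SentenceRecord(*fields)


def format_line(record: SentenceRecord) -> str:
    """Serialize a record, refusing tabs inside any field."""
    for value in record:
        if "\t" in value:
            raise ValueError("tabs are not allowed inside fields")
    return "\t".join(record) + "\n"
```

Rejecting tabs in format_line reflects the caveat that tabs must not appear in HMM sentences or comments, since a stray tab would shift every following field.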

The curator is stored in the format "two"<WBPersonID number>, which matches the postgres format in case the two ever need to talk to each other.
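
As an illustration of the curator ID format, the conversion in both directions might look like the sketch below. The function names are hypothetical and the number used in the usage note is made up; only the "two" prefix followed by the WBPerson number comes from the description above.

```python
import re


def to_curator_id(wbperson_number: int) -> str:
    """Build the stored curator ID from a WBPerson number."""
    return f"two{wbperson_number}"


def from_curator_id(curator_id: str) -> int:
    """Recover the WBPerson number from a stored curator ID."""
    match = re.fullmatch(r"two(\d+)", curator_id)
    if match is None:
        raise ValueError(f"not a valid curator ID: {curator_id!r}")
    return int(match.group(1))
```

For example, to_curator_id(1234) gives "two1234", and from_curator_id("two1234") gives 1234.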