Mf hmm tool

From WormBaseWiki

Intended Use

The mf_hmm tool is used to categorize sentences according to whether they describe enzymatic or transporter activities. The results of the categorization will be used to continually train an HMM that identifies sentences for curation in new papers.

Login page

http://textpresso-dev.caltech.edu/cgi-bin/azurebrd/mf_hmm.cgi

The page is password protected.

The front page is a curator login page. This will ensure that subsequent linking out from the form to the paper in the paper editor uses the appropriate curator ID.

Select a curator name from the drop down and click on Login.

List of papers page

The next page contains a list of all papers that potentially contain positive sentences. The papers are sorted according to relative rank (i.e., combined sentence score of the HMM for that paper).

Each paper listed on this page will have one of three curation status tags: Done, Not Done, or Partial.

The curation status tag is based upon the selection of values (more on this below) for each sentence in the paper:

Done = every sentence has a Curated, TP, or FP flag.

Not Done = every sentence has a blank flag.

Partial = the sentences have a mix of Curated, TP, FP, and blank flags.
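
The three status rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not the tool's actual code: the function name paper_status and the literal flag strings ("curated", "TP", "FP", and an empty string for blank) are assumptions.

```python
from typing import Iterable

# Assumed flag spellings; the values actually stored by the tool may differ.
CLASSIFIED_FLAGS = {"curated", "TP", "FP"}


def paper_status(flags: Iterable[str]) -> str:
    """Return 'Done', 'Not Done', or 'Partial' for one paper's sentence flags."""
    flags = [f.strip() for f in flags]
    unknown = [f for f in flags if f and f not in CLASSIFIED_FLAGS]
    if unknown:
        raise ValueError(f"unrecognized flag(s): {unknown}")
    blank = sum(1 for f in flags if f == "")
    if blank == len(flags):
        # Every sentence is still unclassified (also covers an empty list).
        return "Not Done"
    if blank == 0:
        # Every sentence carries a curated, TP, or FP flag.
        return "Done"
    return "Partial"
```

For example, a paper whose sentences are flagged "TP", "" and "FP" would be reported as Partial under these rules.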

Clicking on a paper link takes the curator to a page listing the sentences and curation options.

Sentence Classification

  • Curated indicates that the curator could confidently make an annotation from that sentence. The bar for selecting a sentence for curation is quite high, since correctly assigning a GO annotation requires:

1) the name of the protein or gene product to be annotated

2) an indication of the type of enzymatic or transporter activity (even if this only leads to annotation to a high level GO term)

3) an indication, such as a reference to a Figure or Table, that the statement describes an experimental result from that paper rather than previously published work

    WBPaper00026599
    All three TLK-1 proteins were active kinases in vitro, undergoing autophosphorylation and phosphorylating MYBP and ASF-1 (Figure 4A).
  • TP indicates that the sentence discusses enzymatic or transporter activity, but that an annotation could not be made.
    • Included in this category are sentences reporting that an assay was performed, but not what the actual result was.
    WBPaper00026599
    The kinase activity of each TLK-1 protein was examined with MYBP and recombinant ASF-1 as substrates.
  • FP indicates that the sentence does not describe enzymatic or transporter activity.
    • False positive sentences include statements of other types of experimental data and statements from Materials and Methods. Some examples:


    • False positive sentences may come from 'scrambled' sentences or problems with the pdf-to-text conversion. For example:
    WBPaper00026599 
    Although early tlk-1 (RNAi) embryos did not display chromatin-segregation or chromosome-morphology      
    abnormalities, loss of TLK-1 activity in older embryos (around the 3040 cell stage) resulted in altered nuclear morphology 
    and anaphase TLK-1 Is a Substrate Activator of AIR-2 903 chromosome-segregation defects. 
    
    Note the insertion of 'TLK-1 Is a Substrate Activator of AIR-2 903', the abbreviated title and page number, into the body text.
  • False negative sentences, if found during the course of analysis or other curation, are being saved in a separate text file to be used for future training.

Curation page

The curation page lists all of the sentences from a given paper that had a score of 9 -6 for the HMM, starting with the highest scoring sentences at the top.

Each sentence also has a number, for curator record keeping. This number does not reflect the sentence number from the marked-up paper.
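
The ordering and numbering described above might be sketched as follows. This is a hypothetical illustration, not the tool's code: the function name rank_sentences and the (score, text) input shape are assumptions, the cutoff is passed in as min_score rather than hard-coded, and treating the cutoff as inclusive is also an assumption.

```python
from typing import List, Tuple


def rank_sentences(
    scored: List[Tuple[float, str]], min_score: float
) -> List[Tuple[int, float, str]]:
    """Keep sentences at or above min_score, highest score first, numbered from 1."""
    kept = [(score, text) for score, text in scored if score >= min_score]
    # Highest-scoring sentences go to the top of the page.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    # The number is a display index for record keeping only; it is not
    # the sentence's position in the marked-up paper.
    return [(n, score, text) for n, (score, text) in enumerate(kept, start=1)]
```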

The top of the page offers curators the option to mark sentences from a paper individually or to mark all of the sentences as a unit. This allows a curator to quickly classify all sentences if it is readily apparent that they all fall into one of the classifications.

Once the curator is done, they can click on Front or click on Next.

  • Clicking on Front will save the results and return to the Front page listing the papers.
  • Clicking on Next will save the results and move on to the next paper in the list.

Curation in the OA

Any annotations that can be made from these sentences are being entered into the OA.

Each annotation coming from this work is being flagged with the text 'mfea_hmm' in the Comments field so we can keep track of the annotations made from this work.

Caveats

Do not use tabs in the sentences of the HMM results or in comments, because tabs separate the fields in the data files.

Do not overwrite, delete, or make unwritable the papers_done file.

Do not overwrite, delete, or make unwritable any WBPaper######## file that has been curated (it will have a done or partial status in the papers_done file).

How code works, in general

The data are in the directory: /data2/srv/textpresso-dev.caltech.edu/www/docroot/michael/mfea-curation/

The index.html file supplies the paper order for the main page.

The papers_done file stores the done / partial / <blank> status of the papers that have been looked at.

The WBPaper######## files hold the HMM markup results. When a sentence is changed through curation, tab-separated fields are appended to its line, so that the line becomes <start line><hmm><tab><status><tab><curator_id><tab><comment><end of line>
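
As a rough sketch of this line layout, the following Python shows how a curated line could be built and read back. The function names append_curation and parse_curated_line are hypothetical and do not come from the tool. The tab check reflects the caveat that tabs must not appear in sentences or comments, since they separate the fields.

```python
def append_curation(hmm_line: str, status: str, curator_id: str, comment: str = "") -> str:
    """Build <hmm><tab><status><tab><curator_id><tab><comment> for one sentence."""
    fields = {
        "hmm_line": hmm_line,
        "status": status,
        "curator_id": curator_id,
        "comment": comment,
    }
    for name, value in fields.items():
        # A stray tab or newline would shift the fields when the line is read back.
        if "\t" in value or "\n" in value:
            raise ValueError(f"{name} must not contain tabs or newlines")
    return "\t".join([hmm_line, status, curator_id, comment])


def parse_curated_line(line: str) -> dict:
    """Split a curated line back into its four tab-separated fields."""
    # Unpacking raises ValueError if the line does not have exactly four fields.
    hmm, status, curator_id, comment = line.rstrip("\n").split("\t")
    return {"hmm": hmm, "status": status, "curator_id": curator_id, "comment": comment}
```

A line that has not yet been curated has no appended fields, so parse_curated_line would reject it; a real reader would need to handle that case separately.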

The curator is stored in the format "two"<WBPersonID number> to conform with the postgres format, should the two ever need to talk to each other.
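
A minimal sketch of that identifier convention, assuming person IDs are written as "WBPerson" followed by digits (for example WBPerson1823). The helper names below are hypothetical and are not part of the tool.

```python
import re


def wbperson_to_two(wbperson_id: str) -> str:
    """Convert e.g. 'WBPerson1823' to the stored form 'two1823'."""
    match = re.fullmatch(r"WBPerson(\d+)", wbperson_id)
    if match is None:
        raise ValueError(f"not a WBPerson ID: {wbperson_id!r}")
    return "two" + match.group(1)


def two_to_wbperson(two_id: str) -> str:
    """Convert the stored form, e.g. 'two1823', back to 'WBPerson1823'."""
    match = re.fullmatch(r"two(\d+)", two_id)
    if match is None:
        raise ValueError(f"not a 'two' curator ID: {two_id!r}")
    return "WBPerson" + match.group(1)
```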