Difference between revisions of "Textpresso Central"

From WormBaseWiki

Revision as of 17:52, 28 November 2011

General considerations: It is important to specify data models, markup languages, and data flow now.


Searching and Category/Ontology Development

  • Control panel: loading papers from existing corpora into a viewer, incorporation of PubMed queries; search results will be used to import full text from PMC or journal site


  • Searches - existing corpora, list of external identifiers, combination of both, exclusion list
    • External identifiers - which ones? PMIDs, DOIs, MOD paper IDs, others?


  • Categories and Keywords
    • Organization of categories by task? e.g. GO curation, Phenotype curation, Expression patterns, etc.
      • Create and display category metadata - source, possible use, version, last updated


  • Search Filters
    • Bibliographic filters - year, journal, paper type, etc.
    • Data Type Flagging - NLP results - data models and storage
      • Development of NLP toolbox: pattern matching, statistics, SVM, HMM, CRF
      • Index of all NLP results for faster querying
      • Textpresso search score cut-off
    • Curation Status - integration with curation databases - which ones?
    • View previously made annotations - source? tie to a sentence where possible?
      • Robust back-end infrastructure with internal Textpresso database holding all annotations
        • Adapt data models and tables from postgres curation database on tazendra?


  • Textpresso Category and Ontology viewer and editor
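
The search scoping listed above (existing corpora, a list of external identifiers, a combination of both, and an exclusion list) could be sketched with plain set operations. This is only an illustration: the `SearchScope` class and its field names are hypothetical, the identifiers are made up, and treating "combination of both" as a union is an assumption that still needs to be decided.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class SearchScope:
    """Hypothetical description of which papers a search should cover."""
    corpus_papers: Set[str] = field(default_factory=set)  # papers from existing corpora
    external_ids: Set[str] = field(default_factory=set)   # e.g. PMIDs, DOIs, MOD paper IDs
    excluded: Set[str] = field(default_factory=set)       # exclusion list

    def resolve(self) -> Set[str]:
        """Union of both sources (an assumption) minus the exclusion list."""
        return (self.corpus_papers | self.external_ids) - self.excluded


# Made-up identifiers, for illustration only.
scope = SearchScope(
    corpus_papers={"PMID:1", "PMID:2"},
    external_ids={"PMID:2", "PMID:3"},
    excluded={"PMID:1"},
)
print(sorted(scope.resolve()))  # ['PMID:2', 'PMID:3']
```

If the combination should instead mean "papers in the corpus that also appear in the identifier list", the `|` in `resolve` would become `&`.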


Viewing Search Results

  • Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new NLP analyses (pattern matching, statistical, machine learning)

This will require a uniform representation of all machine learning results with respect to papers in Textpresso. An annotation markup language comes to mind.

  • Viewing options
    • Sort by score, year, journal (like PubMed)
    • See search results within the context of the paper
      • Paper viewer
        • See existing annotations, if tied to a sentence
        • Additional mark-up options? e.g. alleles, reagents, genes


Annotating and Curating

We would need to develop a markup language (XML) for data flows. This should be as generic as possible.

  • OA and its interaction with TC
    • Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.
  • Flag the paper and/or sentence(s) as curatable, relevant but not curatable, false positive (break down further?), unannotated
  • Click on a sentence and, depending upon the curation needs, the curation tool is pre-populated with relevant entities.
    • Would need to know species (usually OK for elegans, could be trickier for mammalian species)
  • Click on words to add to category
  • Current Textpresso-based curation forms:
    • CCC (Cellular Component Curation) form
      • Pros:
        • sentences are seen on the same page as annotations
        • form pre-populates curation fields with protein names, category terms, and suggested annotations
        • easy to mark sentences if not curatable
      • Cons:
        • duplicating or making multiple annotations is cumbersome
        • don't see term info for proteins or GO terms
        • don't see additional annotations for proteins mentioned in sentences
    • the interaction configuration of the OA

Any others? Ask other WB curators.
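
As a starting point for discussion, the sentence flags and the XML data flow mentioned above might look roughly like the sketch below. Everything in it is an assumption: the `Flag` values simply mirror the four states in the list, and the `sentence` element with its `paper`, `id`, and `flag` attributes is a hypothetical layout, not an agreed markup.

```python
import xml.etree.ElementTree as ET
from enum import Enum


class Flag(Enum):
    """Hypothetical flag values taken from the list above."""
    CURATABLE = "curatable"
    RELEVANT_NOT_CURATABLE = "relevant_not_curatable"
    FALSE_POSITIVE = "false_positive"
    UNANNOTATED = "unannotated"


def sentence_to_xml(paper_id: str, sentence_id: int, text: str,
                    flag: Flag = Flag.UNANNOTATED) -> str:
    """Serialize one flagged sentence as an XML string (assumed element layout)."""
    element = ET.Element(
        "sentence",
        {"paper": paper_id, "id": str(sentence_id), "flag": flag.value},
    )
    element.text = text
    return ET.tostring(element, encoding="unicode")


# Made-up paper ID and sentence, for illustration only.
print(sentence_to_xml("paperX", 12, "A protein localizes to the nucleus.", Flag.CURATABLE))
```

A curation tool such as the OA could parse this string back with `ET.fromstring` and read the flag from the `flag` attribute. A further breakdown of false positives would need additional values or a second attribute.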


Data Models and Flow

  • Integrate Textpresso categories (TCAT), NLP results, and curator annotations (CA) into one big data class
 The model needs the following elements; not all elements are populated at all times:
 - term (TCAT: lexicon entry; NLP: term or sentence identified in paper, if applicable; CA: term manually annotated)
 - annotation (TCAT: category term with possible attributes; NLP: machine-learning ID or descriptive term; CA: manual annotation)
 - paper location: PaperID, SentenceID, PosID
 - allowed lexical variations (plural, tenses)
 - ownership (who can change entry)
 - timestamp
 - what else?
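
One possible shape for such a combined data class is sketched below, assuming optional fields stand in for "not all elements are populated at all times". The class and field names (`Entry`, `PaperLocation`, `Source`) are hypothetical and the sample values are made up; this is not an agreed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Source(Enum):
    """Where an entry comes from."""
    TCAT = "tcat"  # Textpresso category
    NLP = "nlp"    # NLP result
    CA = "ca"      # curator annotation


@dataclass
class PaperLocation:
    paper_id: str
    sentence_id: Optional[int] = None
    pos_id: Optional[int] = None


@dataclass
class Entry:
    source: Source
    term: str                                    # lexicon entry, NLP term, or annotated term
    annotation: Optional[str] = None             # category term, ML ID, or manual annotation
    location: Optional[PaperLocation] = None     # empty for pure lexicon entries
    lexical_variations: List[str] = field(default_factory=list)  # plurals, tenses
    owner: Optional[str] = None                  # who can change the entry
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Made-up values, for illustration only.
lexicon_entry = Entry(Source.TCAT, "nucleus", annotation="cellular component",
                      lexical_variations=["nuclei"])
curated = Entry(Source.CA, "nucleus", annotation="localizes to",
                location=PaperLocation("paperX", 12, 3), owner="curator1")
print(lexicon_entry.location, curated.location.sentence_id)
```

Keeping the source as an explicit field lets one table or class hold all three kinds of entry, at the cost of many fields that stay empty for any given kind.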


  • Data flow / Transaction model
 - Does one big model for all exchanges between all modules work?
 ...