Textpresso Central

From WormBaseWiki
Revision as of 17:58, 28 November 2011 by Vanaukenk (talk | contribs)

General considerations: Specifying the data models, markup languages, and data flow now is important.


Searching and Category/Ontology Development

  • Control panel: load papers from existing corpora into a viewer and incorporate PubMed queries; search results will be used to import full text from PMC or the journal site


  • Searches - existing corpora, list of external identifiers, combination of both, exclusion list
    • External identifiers - which ones? PMIDs, DOIs, MOD paper IDs, others?
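
A minimal sketch of how the search scope could be assembled from an existing corpus, a list of external identifiers, and an exclusion list. The function name, parameter names, and identifiers are hypothetical and used for illustration only; which identifier types to support is still an open question.

```python
def resolve_search_scope(corpus_papers, external_ids=(), exclusions=()):
    """Return the sorted list of paper identifiers to search.

    corpus_papers: identifiers of papers in an existing corpus
    external_ids:  extra identifiers supplied by the user (e.g. PMIDs, DOIs)
    exclusions:    identifiers to drop from the final list
    """
    # Combine both sources; duplicates collapse because these are sets.
    selected = set(corpus_papers) | set(external_ids)
    # Apply the exclusion list last so that it always wins.
    return sorted(selected - set(exclusions))


# Hypothetical identifiers, for illustration only.
scope = resolve_search_scope(
    corpus_papers=["PMID:100", "PMID:200"],
    external_ids=["PMID:200", "doi:10.0000/example"],
    exclusions=["PMID:100"],
)
print(scope)
```

Applying the exclusion list last means an excluded paper is dropped whether it came from the corpus or from the identifier list.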


  • Categories and Keywords
    • Organization of categories by task? e.g. GO curation, Phenotype curation, Expression patterns, etc.
      • Create and display category metadata - source, possible use, version, last updated


  • Search Filters
    • Bibliographic filters - year, journal, paper type, etc.
    • Data Type Flagging - NLP results - data models and storage
      • Development of NLP toolbox: pattern matching, statistics, SVM, HMM, CRF
      • Index of all NLP results for faster querying
      • Textpresso search score cut-off
    • Curation Status - integration with curation databases - which ones?
    • View previously made annotations - source? tie to a sentence where possible?
      • Robust back-end infrastructure with internal Textpresso database holding all annotations
        • Adapt data models and tables from postgres curation database on tazendra?
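
The bibliographic filters and the search score cut-off could be applied to a result list roughly as follows. This is only a sketch: the `SearchHit` fields and the `apply_filters` signature are assumptions, not an existing Textpresso interface.

```python
from dataclasses import dataclass


@dataclass
class SearchHit:
    # Hypothetical fields for one search result.
    paper_id: str
    year: int
    journal: str
    score: float


def apply_filters(hits, min_score=0.0, years=None, journals=None):
    """Keep hits that pass the score cut-off and the bibliographic filters.

    A filter set to None is not applied.
    """
    kept = []
    for hit in hits:
        if hit.score < min_score:
            continue
        if years is not None and hit.year not in years:
            continue
        if journals is not None and hit.journal not in journals:
            continue
        kept.append(hit)
    return kept


# Invented example data.
hits = [
    SearchHit("paper-001", 2009, "Genetics", 0.9),
    SearchHit("paper-002", 2011, "Genetics", 0.2),
    SearchHit("paper-003", 2011, "Development", 0.7),
]
recent_good = apply_filters(hits, min_score=0.5, years={2011})
print([hit.paper_id for hit in recent_good])
```

Filters for data type flags and curation status could follow the same pattern once their data models are settled.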


  • Textpresso Category and Ontology viewer and editor
    • Stand-alone or interface with curation or both
    • Incorporate statistical analyses (word frequency in positive vs negative sentences, how often is a term the only one from a given category, etc.)
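
As an illustration of the first statistic mentioned above, word frequencies in positive versus negative sentences could be compared along these lines. The tokenization rule, the function names, and the example sentences are all assumptions made for this sketch.

```python
import re
from collections import Counter


def word_frequencies(sentences):
    """Count lower-cased word tokens across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Simple tokenizer: runs of letters, digits, and hyphens.
        counts.update(re.findall(r"[a-z0-9-]+", sentence.lower()))
    return counts


def compare_frequencies(positive, negative):
    """Map each word to (count in positive, count in negative)."""
    pos = word_frequencies(positive)
    neg = word_frequencies(negative)
    # A Counter returns 0 for words it has not seen.
    return {word: (pos[word], neg[word]) for word in sorted(set(pos) | set(neg))}


# Invented sentences, for illustration only.
table = compare_frequencies(
    positive=["ABC-1 localizes to the nucleus."],
    negative=["The nucleus was stained."],
)
print(table["localizes"], table["nucleus"])
```

Words that are frequent in positive sentences but rare in negative ones would be candidates for addition to a category.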



Viewing Search Results

  • Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new NLP analyses (pattern matching, statistical, machine learning)

This will require a uniform representation of all machine learning results w.r.t. papers in Textpresso. An annotation markup language comes to mind.

  • Viewing options
    • Sort by score, year, journal (like PubMed)
    • See search results within the context of the paper
      • Paper viewer
        • See existing annotations, if tied to a sentence
        • Additional mark-up options? e.g. alleles, reagents, genes



Annotating and Curating

We would need to develop a markup language (XML) for data flows. This should be as generic as possible.
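
To make the idea concrete, here is a rough sketch of what a single annotation might look like when exchanged as XML, written with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`annotation`, `location`, `term`, `value`, `source`) are placeholders invented for this example; no schema has been decided.

```python
import xml.etree.ElementTree as ET


def annotation_to_xml(paper_id, sentence_id, term, annotation, source):
    """Serialize one annotation to an XML string (hypothetical schema)."""
    root = ET.Element("annotation", {"source": source})
    ET.SubElement(
        root, "location", {"paper": paper_id, "sentence": str(sentence_id)}
    )
    ET.SubElement(root, "term").text = term
    ET.SubElement(root, "value").text = annotation
    return ET.tostring(root, encoding="unicode")


def annotation_from_xml(xml_text):
    """Parse the XML produced by annotation_to_xml back into a dict."""
    root = ET.fromstring(xml_text)
    location = root.find("location")
    return {
        "source": root.get("source"),
        "paper_id": location.get("paper"),
        "sentence_id": int(location.get("sentence")),
        "term": root.findtext("term"),
        "annotation": root.findtext("value"),
    }


# Invented values, for illustration only.
xml_text = annotation_to_xml("paper-001", 12, "nucleus", "cellular component", "CA")
record = annotation_from_xml(xml_text)
print(xml_text)
```

Keeping the source (category, NLP result, or curator annotation) as an attribute would let the same message format carry all three kinds of data.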

  • OA and its interaction with TC
    • Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.
  • Flag the paper and/or sentence(s) as curatable, relevant but not curatable, false positive (break down further?), unannotated
  • Click on a sentence and, depending upon the curation needs, the curation tool is pre-populated with relevant entities.
    • Would need to know the species (usually OK for C. elegans, could be trickier for mammalian species)
  • Click on words to add to category
  • Current Textpresso-based curation forms:
    • CCC (Cellular Component Curation) form
      • Pros:
        • sentences are seen on the same page as annotations
        • form pre-populates curation fields with protein names, category terms, and suggested annotations
        • easy to mark sentences if not curatable
      • Cons:
        • duplicating or making multiple annotations is cumbersome
        • don't see term info for proteins or GO terms
        • don't see additional annotations for proteins mentioned in sentences
    • the interaction configuration of the OA

Any others? Ask other WB curators.

  • Add curator comments to a paper, sentence, term
  • Output of curation
    • to Textpresso database
    • to MOD or other project database (e.g., BioGRID)
    • downloadable file - what formats?


Data Models and Flow

  • Integrate Textpresso categories (TCAT), NLP results and curator annotations (CA) into one big data class
    • The model needs the following elements; not all elements are populated at all times:
      • term (TCAT: lexicon entry; NLP: term, sentence identified in paper if applicable; CA: term manually annotated)
      • annotation (TCAT: category term with possible attributes; NLP: machine-learningID or describing term; CA: manual annotation)
      • paper location: PaperID, SentenceID, PosID
      • allowed lexical variations (plural, tenses)
      • ownership (who can change entry)
      • what else?
      • timestamp
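
One possible shape for such a combined data class, sketched with Python dataclasses. The class and field names are assumptions chosen to mirror the elements listed above; optional fields reflect that not all elements are populated at all times.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Source(Enum):
    TCAT = "Textpresso category"
    NLP = "NLP result"
    CA = "curator annotation"


@dataclass
class PaperLocation:
    paper_id: str
    sentence_id: Optional[int] = None  # not every entry is tied to a sentence
    pos_id: Optional[int] = None       # position within the sentence


@dataclass
class Entry:
    source: Source
    term: str
    annotation: str
    location: Optional[PaperLocation] = None  # empty for pure lexicon entries
    lexical_variations: List[str] = field(default_factory=list)
    owner: Optional[str] = None               # who can change the entry
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Invented values, for illustration only.
lexicon_entry = Entry(Source.TCAT, "nucleus", "cellular component",
                      lexical_variations=["nuclei"])
curated = Entry(Source.CA, "nucleus", "cellular component",
                location=PaperLocation("paper-001", sentence_id=12),
                owner="curator-1")
print(curated.location.paper_id, curated.source.name)
```

Whether one class can serve all three sources, or whether they should share only a common base, depends on the answer to "what else?" above.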



  • Data flow / Transaction model
    • Does one big model for all exchanges between all modules work?
 ...