Difference between revisions of "Textpresso Central"

From WormBaseWiki

Latest revision as of 19:21, 29 November 2011

General considerations: It is important to specify data models, markup languages, and data flow now.


Searching and Category/Ontology Development

  • Control panel: loading papers from existing corpora into a viewer, incorporation of PubMed queries; search results will be used to import full text from PMC or journal site


  • Searches - existing corpora, list of external identifiers, combination of both
    • External identifiers - which ones? PMIDs, DOIs, MOD paper IDs, others?
      • Use case: Perform a PubMed search and then port the resulting IDs to a Textpresso search
    • Paper Exclusion List - user supplied, from external data file, e.g. Gene Ontology Annotation File (gaf)
      • Use cases: TAIR CCC black list and filtering based upon papers already annotated with Component term in gaf file
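
The exclusion-list use case could be sketched as follows. This is a minimal illustration, not an existing Textpresso function: the function names and sample rows are hypothetical. It assumes the standard tab-delimited GAF layout, in which column 6 holds the references (separated by `|`) and column 9 holds the aspect (`C` for Cellular Component).

```python
def pmids_with_component_annotation(gaf_lines):
    """Collect PMIDs cited by Cellular Component (aspect 'C') rows of a GAF file."""
    excluded = set()
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[8] != "C":
            continue  # malformed row, or not a Component annotation
        # Column 6 may hold several references separated by '|'
        for ref in cols[5].split("|"):
            if ref.startswith("PMID:"):
                excluded.add(ref[len("PMID:"):])
    return excluded


def apply_exclusion(candidate_pmids, excluded):
    """Drop candidate papers that appear in the exclusion set, keeping order."""
    return [p for p in candidate_pmids if p not in excluded]


# Made-up GAF rows, for illustration only
rows = [
    "!gaf-version: 2.2",
    "\t".join(["WB", "WBGene1", "abc-1", "", "GO:0005737", "PMID:111", "IDA", "", "C"]),
    "\t".join(["WB", "WBGene2", "abc-2", "", "GO:0006412", "PMID:222", "IMP", "", "P"]),
    "\t".join(["WB", "WBGene3", "abc-3", "", "GO:0005634", "WB_REF:X|PMID:333", "IDA", "", "C"]),
]
excluded = pmids_with_component_annotation(rows)
remaining = apply_exclusion(["111", "222", "333", "444"], excluded)
```

A user-supplied black list (such as the TAIR CCC one) could be merged into `excluded` with a plain set union before filtering.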


  • Categories and Keywords
    • Organization of categories by task? e.g. GO curation, Phenotype curation, Expression patterns, etc.
    • Create and display category metadata - source, possible use, version, last updated
    • Restrict search to a subset of a category
      • Use case: FlyBase CCC search using only gene names that start with CG
    • How quickly could new searches be performed with modified categories?
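
Restricting a search to a subset of a category might look like the sketch below. The function names, gene names, and sentences are hypothetical, and the whole-token matching is a simplification of whatever matching Textpresso actually performs.

```python
import re


def restrict_category(terms, prefix):
    """Keep only the category terms that start with the given prefix."""
    return [t for t in terms if t.startswith(prefix)]


def sentences_with_terms(sentences, terms):
    """Return the sentences that contain at least one term as a whole token."""
    wanted = set(terms)
    hits = []
    for s in sentences:
        tokens = set(re.findall(r"[A-Za-z0-9_\-]+", s))
        if tokens & wanted:
            hits.append(s)
    return hits


gene_names = ["CG1234", "dpp", "CG9999", "wg"]  # made-up category entries
cg_only = restrict_category(gene_names, "CG")
sentences = [
    "CG1234 localizes to the plasma membrane.",
    "dpp is expressed in the wing disc.",
]
matched = sentences_with_terms(sentences, cg_only)
```

Because the subset is computed from the full term list at query time, no separate category has to be built or re-indexed for this kind of restriction.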


  • Search Filters
    • Bibliographic filters - year, journal, paper type, etc.
    • Data Type Flagging - NLP results - data models and storage
      • Development of NLP toolbox: pattern matching, statistics, SVM, HMM, CRF
      • Index of all NLP results for faster querying
      • Display most current precision and recall statistics so users can assess the accuracy of an NLP tool
    • Data Type Flagging - author or curator flags
      • Information stored in postgres on tazendra
    • Textpresso search score cut-off - view scores (mean, range) for curatable papers (this would depend on search criteria)
      • Use case: search all SVM predicted negatives and return those papers with a score >n
    • Curation Status - integration with curation databases - which ones?
    • View previously made annotations - source? tie to a sentence where possible?
      • Robust back-end infrastructure with internal Textpresso database holding all annotations
        • Adapt data models and tables from postgres curation database on tazendra?
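
The score cut-off use case can be illustrated with a small sketch. The `PaperHit` record and its field names are assumptions made for this example; they do not describe an existing Textpresso or tazendra schema.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PaperHit:
    paper_id: str          # hypothetical paper identifier
    search_score: float    # Textpresso search score for the current query
    svm_positive: bool     # SVM prediction for the data type of interest


def negatives_above(hits, n):
    """Return SVM-predicted negatives whose search score exceeds n."""
    return [h for h in hits if not h.svm_positive and h.search_score > n]


def score_summary(hits):
    """Mean and range of search scores, e.g. for a set of curatable papers."""
    scores = [h.search_score for h in hits]
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}


hits = [
    PaperHit("paperA", 12.0, False),
    PaperHit("paperB", 3.0, False),
    PaperHit("paperC", 21.0, True),
]
rescued = negatives_above(hits, 5)
summary = score_summary(hits)
```

Showing `summary` for papers already known to be curatable would give users a basis for choosing n, since sensible cut-offs depend on the search criteria.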


  • Textpresso Category and Ontology viewer and editor
    • Stand-alone or interface with curation or both
    • Incorporate statistical analyses (word frequency in positive vs negative sentences, how often is a term the only one from a given category, etc.)
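
One of the statistical analyses mentioned above, word frequency in positive versus negative sentences, could be sketched as follows. The tokenizer is a deliberately simple assumption (lower-cased runs of letters, digits, and hyphens), and the example sentences are made up.

```python
import re
from collections import Counter


def word_counts(sentences):
    """Count lower-cased tokens across a list of sentences."""
    counts = Counter()
    for s in sentences:
        counts.update(re.findall(r"[a-z0-9\-]+", s.lower()))
    return counts


def frequency_table(positive, negative):
    """Map each word to (count in positive sentences, count in negative sentences)."""
    pos, neg = word_counts(positive), word_counts(negative)
    return {w: (pos[w], neg[w]) for w in sorted(set(pos) | set(neg))}


positive = ["UNC-1 localizes to the nucleus"]
negative = ["the nucleus was stained"]
table = frequency_table(positive, negative)
```

Words with a high positive count and a low negative count are candidates for addition to a category; words with similar counts in both columns carry little signal.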



Viewing Search Results

  • Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new NLP analyses (pattern matching, statistical, machine learning)

This will require a uniform representation of all machine learning results with respect to papers in Textpresso. An annotation markup language comes to mind.

  • Viewing options
    • Sort by score, year, journal (like PubMed)
    • See search results within the context of the paper
      • Paper viewer
        • See existing annotations, if tied to a sentence
        • Additional mark-up options? e.g. alleles, reagents, genes



Annotating and Curating

We would need to develop a markup language (XML) for data flows. This should be as generic as possible.

  • OA and its interaction with TC
    • Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.
  • Flag the paper and/or sentence(s) as curatable, relevant but not curatable (sentence only), false positive (break down further?), unannotated - basically TP, FP, FN, TN
    • Have ability to annotate NLP results in bulk
  • Click on a sentence and, depending upon the curation needs, the curation tool is pre-populated with relevant entities.
    • Would need to know species (usually OK for C. elegans, could be trickier for mammalian species)
  • Click on words to add to category
  • Current Textpresso-based curation forms:
    • CCC (Cellular Component Curation) form
      • Pros:
        • sentences are seen on the same page as annotations
        • form pre-populates curation fields with protein names, category terms, and suggested annotations
        • easy to mark sentences if not curatable
      • Cons:
        • duplicating or making multiple annotations is cumbersome
        • don't see term info for proteins or GO terms
        • don't see additional annotations for proteins mentioned in sentences
    • the interaction configuration of the OA
    • Any others? Ask other WB curators.
  • Add curator comments to a paper, sentence, term
  • Output of curation
    • to Textpresso database
    • to MOD or other project database (e.g., BioGRID)
    • downloadable file - what formats?
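
If curators flag sentences as TP, FP, FN, or TN as described above, precision and recall for an NLP tool follow directly from the flag counts. The sketch below assumes the flags are available as a flat list of strings; that storage format is an assumption of this example.

```python
from collections import Counter


def precision_recall(flags):
    """Compute (precision, recall) from curator flags 'TP', 'FP', 'FN', 'TN'.

    Returns None for a measure whose denominator is zero.
    """
    c = Counter(flags)
    tp, fp, fn = c["TP"], c["FP"], c["FN"]
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


flags = ["TP", "TP", "TP", "FP", "FN", "TN"]  # made-up curator flags
precision, recall = precision_recall(flags)
```

Recomputing these values whenever new flags are saved would keep the displayed statistics current for each NLP tool.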


Data Models and Flow

  • Integrate Textpresso categories (TCAT), NLP results and curator annotation (CA) into one big data class
 The model needs the following elements; not all elements are populated at all times
 - term (TCAT: lexicon entry; NLP: term, sentence identified in paper if applicable; CA: term manually annotated)
 - annotation (TCAT: category term with possible attributes; NLP: machine-learningID or describing term; CA: manual annotation)
 - paper location: PaperID, SentenceID, PosID
 - allowed lexical variations (plural, tenses)
 - ownership (who can change entry)
 - what else?
 - timestamp
 - version
 - source
 - comment
 - possible use
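
As a starting point for discussion, the element list above could be written down as a single record type. This is only a sketch: the class names, field names, and types are assumptions, and the open "what else?" item is left open.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    TCAT = "TCAT"  # Textpresso category
    NLP = "NLP"    # result of an NLP tool
    CA = "CA"      # curator annotation


@dataclass
class PaperLocation:
    paper_id: str
    sentence_id: Optional[int] = None
    pos_id: Optional[int] = None


@dataclass
class AnnotationRecord:
    origin: Origin
    term: str                                   # lexicon entry, NLP term, or annotated term
    annotation: str                             # category term, ML identifier, or manual annotation
    location: Optional[PaperLocation] = None    # not populated for plain lexicon entries
    lexical_variations: List[str] = field(default_factory=list)
    owner: Optional[str] = None                 # who can change the entry
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    version: int = 1
    source: Optional[str] = None
    comment: Optional[str] = None
    possible_use: Optional[str] = None


# Made-up example values
lexicon_entry = AnnotationRecord(Origin.TCAT, "nucleus", "cellular component",
                                 lexical_variations=["nuclei"])
curated = AnnotationRecord(Origin.CA, "UNC-1", "GO:0005634",
                           location=PaperLocation("paperA", sentence_id=12, pos_id=3),
                           owner="curator1")
```

Optional fields with defaults reflect the note that not all elements are populated at all times; a category entry, for instance, has no paper location.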


  • Data flow / Transaction model
 - does one big model work for all exchanges between all modules?
 ...

Action Items 2011-11-29

  • Develop a controlled vocabulary for all items that need to be queryable:
    • Data type
    • Curation status
    • Possible use
    • Source
    • Type of Annotation
    • Lexical variation
  • Make a first version of the annotation data model
  • Think about how to track changes in tokenization of papers in annotation database
    • When papers get reformatted (e.g., sentence identification improves), how do we port old annotations?
  • Manifestation of Data Model in Textpresso database
    • Import current SVM results
    • Import current HMM results
    • Import current CCC results
    • Import current gene-gene interaction results
    • Import current Molecules results
    • How will category markup go into Textpresso database?
  • Design (and later implement) first version of Textpresso Curator interface
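
For the tokenization-tracking item, one possible approach (an assumption of this sketch, not a decided design) is to re-anchor each sentence-level annotation by fuzzy-matching the old sentence text against the re-tokenized sentences. The example uses `difflib.get_close_matches` from the Python standard library; the function name, the data layout, and the 0.8 cut-off are hypothetical.

```python
import difflib


def port_annotations(old_sentences, new_sentences, annotations, cutoff=0.8):
    """Re-anchor annotations after a paper has been re-tokenized.

    annotations maps an old sentence index to an annotation string.
    Returns (ported, orphaned): ported maps new sentence indexes to lists of
    annotations; orphaned lists old indexes with no sufficiently similar match.
    """
    ported, orphaned = {}, []
    for old_idx, ann in annotations.items():
        matches = difflib.get_close_matches(
            old_sentences[old_idx], new_sentences, n=1, cutoff=cutoff
        )
        if matches:
            new_idx = new_sentences.index(matches[0])
            ported.setdefault(new_idx, []).append(ann)
        else:
            orphaned.append(old_idx)
    return ported, orphaned


# Made-up sentences and annotations
old = ["UNC-1 localizes to the nucleus.", "Worms were grown at 20C."]
new = ["Introduction.", "UNC-1 localizes to the nucleus .", "Methods were standard."]
ported, orphaned = port_annotations(old, new, {0: "GO:0005634", 1: "flagged"})
```

Orphaned annotations would need manual review, which is one reason to store the original sentence text (or a version identifier for the tokenization) alongside each annotation.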