Textpresso Central

From WormBaseWiki
Latest revision as of 19:21, 29 November 2011

General considerations: Specifying the data models, markup languages, and data flow is important at this stage.


Searching and Category/Ontology Development

  • Control panel: loading papers from existing corpora into a viewer, incorporation of PubMed queries; search results will be used to import full text from PMC or journal site


  • Searches - existing corpora, list of external identifiers, combination of both
    • External identifiers - which ones? PMIDs, DOIs, MOD paper IDs, others?
      • Use case: Perform a PubMed search and then port the resulting IDs to a Textpresso search
    • Paper Exclusion List - user supplied, from external data file, e.g. Gene Ontology Annotation File (gaf)
      • Use cases: TAIR CCC black list and filtering based upon papers already annotated with Component term in gaf file
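
The exclusion-list use case above could be sketched roughly as follows. This is only an illustration, not an existing Textpresso function: the helper names are hypothetical, the example rows are made up, and it assumes the standard tab-separated GAF layout in which column 6 holds the reference (e.g. PMID:...) and column 9 holds the aspect (C for Cellular Component).

```python
def pmids_with_component_annotation(gaf_lines):
    """Collect PMIDs cited by Cellular Component (aspect 'C') rows of a GAF file."""
    excluded = set()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[8] != "C":
            continue  # keep only Component annotations
        for ref in cols[5].split("|"):  # a row may cite several references
            if ref.startswith("PMID:"):
                excluded.add(ref[len("PMID:"):])
    return excluded


def filter_candidates(candidate_pmids, excluded_pmids):
    """Keep the candidate order, dropping excluded IDs and duplicates."""
    seen = set()
    kept = []
    for pmid in candidate_pmids:
        if pmid in excluded_pmids or pmid in seen:
            continue
        seen.add(pmid)
        kept.append(pmid)
    return kept


# Made-up GAF rows for illustration only (not real annotations).
gaf_example = [
    "!gaf-version: 2.0",
    "DB\tID0001\tgeneA\t\tGO:0005737\tPMID:111\tIDA\t\tC\t\t\tgene\ttaxon:6239\t20110101\tDB",
    "DB\tID0002\tgeneB\t\tGO:0006915\tPMID:222\tIMP\t\tP\t\t\tgene\ttaxon:6239\t20110101\tDB",
]
excluded = pmids_with_component_annotation(gaf_example)
# Candidate IDs as they might come back from a PubMed search.
print(filter_candidates(["111", "222", "333", "222"], excluded))  # ['222', '333']
```

The filtered ID list is what would then be handed to a Textpresso search restricted to those papers.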


  • Categories and Keywords
    • Organization of categories by task? e.g. GO curation, Phenotype curation, Expression patterns, etc.
    • Create and display category metadata - source, possible use, version, last updated
    • Restrict search to a subset of a category
      • Use case: FlyBase CCC search using only gene names that start with CG
    • How quickly could new searches be performed with modified categories?
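
Restricting a search to a subset of a category, as in the FlyBase use case above, might look something like this. It is a minimal sketch: the category is modeled as a plain set of strings, the gene identifiers are placeholders, and `restrict_category` and `find_sentences` are hypothetical names rather than Textpresso functions.

```python
import re


def restrict_category(terms, prefix):
    """Return only the category terms that start with the given prefix."""
    return sorted(t for t in terms if t.startswith(prefix))


def find_sentences(sentences, terms):
    """Return (sentence, matched_terms) pairs for sentences that mention
    any of the terms as a whole word."""
    if not terms:
        return []
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    hits = []
    for sentence in sentences:
        matched = sorted(set(pattern.findall(sentence)))
        if matched:
            hits.append((sentence, matched))
    return hits


# Placeholder category and sentences for illustration.
gene_category = {"CG1234", "CG5678", "dpp", "wg"}
sentences = [
    "CG1234 localizes to the nucleus.",
    "dpp is expressed in the wing disc.",
    "Loss of CG5678 and CG1234 alters mitochondria.",
]

cg_only = restrict_category(gene_category, "CG")
for sentence, matched in find_sentences(sentences, cg_only):
    print(matched, sentence)
```

Because the subset is computed at query time here, no category would need to be rebuilt; whether that holds for an indexed corpus is one of the open questions listed above.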


  • Search Filters
    • Bibliographic filters - year, journal, paper type, etc.
    • Data Type Flagging - NLP results - data models and storage
      • Development of NLP toolbox: pattern matching, statistics, SVM, HMM, CRF
      • Index of all NLP results for faster querying
      • Display most current precision and recall statistics so users can assess the accuracy of an NLP tool
    • Data Type Flagging - author or curator flags
      • Information stored in postgres on tazendra
    • Textpresso search score cut-off - view scores (mean, range) for curatable papers (this would depend on search criteria)
      • Use case: search all SVM-predicted negatives and return those papers with a score > n
    • Curation Status - integration with curation databases - which ones?
    • View previously made annotations - source? tie to a sentence where possible?
      • Robust back-end infrastructure with internal Textpresso database holding all annotations
        • Adapt data models and tables from postgres curation database on tazendra?
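
The score cut-off filter and its use case could be sketched as below. The record layout (`PaperResult` with an SVM label and a search score) is an assumption made for illustration; the real storage would depend on the data model still to be defined.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PaperResult:
    paper_id: str
    svm_label: str       # assumed values: "positive" or "negative"
    search_score: float  # Textpresso search score for the current query


def score_summary(results):
    """Mean and range of search scores, e.g. for papers known to be curatable."""
    scores = [r.search_score for r in results]
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}


def negatives_above(results, cutoff):
    """IDs of SVM-predicted negatives whose search score exceeds the cut-off."""
    return [
        r.paper_id
        for r in results
        if r.svm_label == "negative" and r.search_score > cutoff
    ]


# Invented example values.
results = [
    PaperResult("P1", "positive", 12.0),
    PaperResult("P2", "negative", 8.0),
    PaperResult("P3", "negative", 3.0),
    PaperResult("P4", "positive", 6.0),
]

positives = [r for r in results if r.svm_label == "positive"]
print(score_summary(positives))       # {'mean': 9.0, 'min': 6.0, 'max': 12.0}
print(negatives_above(results, 5.0))  # ['P2']
```

Showing the summary for curatable papers next to the filter would let a curator pick n from the observed score range instead of guessing.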


  • Textpresso Category and Ontology viewer and editor
    • Stand-alone or interface with curation or both
    • Incorporate statistical analyses (word frequency in positive vs negative sentences, how often is a term the only one from a given category, etc.)
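
The two statistics mentioned above could be computed along these lines. The tokenizer is a deliberately crude regular expression and the sentences are invented; this is a sketch of the idea, not a proposal for the actual implementation.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9-]+")


def word_counts(sentences):
    """Count lower-cased word occurrences across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(TOKEN.findall(sentence.lower()))
    return counts


def sole_term_fraction(term, category, sentences):
    """Among sentences containing `term`, the fraction in which it is the
    only term from `category` that is present."""
    containing = 0
    sole = 0
    for sentence in sentences:
        present = set(TOKEN.findall(sentence.lower())) & category
        if term in present:
            containing += 1
            if present == {term}:
                sole += 1
    return sole / containing if containing else 0.0


# Invented example data.
category = {"nucleus", "cytoplasm", "membrane"}
positive = [
    "The protein localizes to the nucleus.",
    "Found in the nucleus and cytoplasm.",
]
negative = ["The nucleus was stained with DAPI."]

print(word_counts(positive)["nucleus"], word_counts(negative)["nucleus"])  # 2 1
print(sole_term_fraction("nucleus", category, positive))                   # 0.5
```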



Viewing Search Results

  • Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new NLP analyses (pattern matching, statistical, machine learning)

This will require a uniform representation of all machine learning results with respect to papers in Textpresso; an annotation markup language comes to mind.

  • Viewing options
    • Sort by score, year, journal (like PubMed)
    • See search results within the context of the paper
      • Paper viewer
        • See existing annotations, if tied to a sentence
        • Additional mark-up options? e.g. alleles, reagents, genes



Annotating and Curating

We would need to develop a markup language (XML) for data flows. This should be as generic as possible.

  • OA and its interaction with TC
    • Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.
  • Flag the paper and/or sentence(s) as curatable, relevant but not curatable (sentence only), false positive (break down further?), unannotated - basically TP, FP, FN, TN
    • Have ability to annotate NLP results in bulk
  • Click on a sentence and, depending upon the curation needs, the curation tool is pre-populated with relevant entities.
    • Would need to know species (usually OK for elegans, could be trickier for mammalian species)
  • Click on words to add to category
  • Current Textpresso-based curation forms:
    • CCC (Cellular Component Curation) form
      • Pros:
        • sentences are seen on the same page as annotations
        • form pre-populates curation fields with protein names, category terms, and suggested annotations
        • easy to mark sentences if not curatable
      • Cons:
        • duplicating or making multiple annotations is cumbersome
        • don't see term info for proteins or GO terms
        • don't see additional annotations for proteins mentioned in sentences
    • the interaction configuration of the OA
    • Any others? Ask other WB curators.
  • Add curator comments to a paper, sentence, term
  • Output of curation
    • to Textpresso database
    • to MOD or other project database (e.g., BioGRID)
    • downloadable file - what formats?
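
If curators flag NLP results as TP, FP, FN, or TN as described above, the precision and recall figures to be displayed for an NLP tool follow directly from the flag counts. A minimal sketch, assuming the flags are available as a simple list of strings:

```python
from collections import Counter


def precision_recall(flags):
    """Compute (precision, recall) from curator flags 'TP', 'FP', 'FN', 'TN'.

    Returns None for a value whose denominator would be zero.
    """
    counts = Counter(flags)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Invented flags for seven sentences.
flags = ["TP", "TP", "TP", "FP", "FN", "FN", "TN"]
print(precision_recall(flags))  # (0.75, 0.6)
```

Recomputing these values whenever flags are added is what would keep the displayed statistics current.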


Data Models and Flow

  • Integrate Textpresso categories (TCAT), NLP results and curator annotation (CA) into one big data class
 The model needs the following elements; not all elements are populated at all times
 - term (TCAT: lexicon entry; NLP: term, sentence identified in paper if applicable; CA: term manually annotated)
 - annotation (TCAT: category term with possible attributes; NLP: machine-learningID or describing term; CA: manual annotation)
 - paper location: PaperID, SentenceID, PosID
 - allowed lexical variations (plural, tenses)
 - ownership (who can change entry)
 - what else?
 - timestamp
 - version
 - source
 - comment
 - possible use
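
One way to picture the combined data class is a single record type with optional fields, so that category entries, NLP results, and curator annotations share one shape. All class and field names below are hypothetical and merely mirror the element list above; the open "what else?" item is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class RecordKind(Enum):
    TCAT = "Textpresso category"
    NLP = "NLP result"
    CA = "curator annotation"


@dataclass
class PaperLocation:
    paper_id: str
    sentence_id: Optional[int] = None
    pos_id: Optional[int] = None


@dataclass
class AnnotationRecord:
    kind: RecordKind
    term: str          # lexicon entry, NLP-identified term, or curated term
    annotation: str    # category term, machine-learning ID, or manual annotation
    location: Optional[PaperLocation] = None  # empty for pure lexicon entries
    lexical_variations: List[str] = field(default_factory=list)
    owner: Optional[str] = None               # who can change the entry
    timestamp: Optional[datetime] = None
    version: Optional[str] = None
    source: Optional[str] = None
    comment: Optional[str] = None
    possible_use: Optional[str] = None


# A category entry has no paper location; an NLP result does.
lexicon_entry = AnnotationRecord(
    kind=RecordKind.TCAT,
    term="nucleus",
    annotation="cellular component",
    lexical_variations=["nuclei"],
)
nlp_hit = AnnotationRecord(
    kind=RecordKind.NLP,
    term="nucleus",
    annotation="svm-example-model",  # placeholder machine-learning ID
    location=PaperLocation(paper_id="paper-0001", sentence_id=12, pos_id=5),
)
print(lexicon_entry.location is None, nlp_hit.location.sentence_id)
```

Keeping every field except the first three optional reflects the note that not all elements are populated at all times.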


  • Data flow / Transaction model
 - Does one big model for all exchanges between all modules work?
 ...

Action Items 2011-11-29

  • Develop a controlled vocabulary for all items that need to be queryable:
    • Data type
    • Curation status
    • Possible use
    • Source
    • Type of Annotation
    • Lexical variation
  • Make a first version of the annotation data model
  • Think about how to track changes in tokenization of papers in annotation database
    • When papers get reformatted (sentence identification improves), how should old annotations be ported?
  • Manifestation of Data Model in Textpresso database
    • Import current SVM results
    • Import current HMM results
    • Import current CCC results
    • Import current gene-gene interaction results
    • Import current Molecules results
    • How will category markup go into Textpresso database?
  • Design (and later implement) first version of Textpresso Curator interface
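
For the tokenization item above, one possible approach (an assumption, not a decision recorded here) is to convert each sentence-relative annotation position into an absolute character offset in the paper text, then look up that offset in the new sentence segmentation. The sketch assumes the underlying text is unchanged and only the sentence boundaries move; spans are (start, end) character offsets with an exclusive end.

```python
from bisect import bisect_right


def to_absolute(old_spans, sentence_id, pos, length):
    """Convert a sentence-relative position into absolute character offsets."""
    start = old_spans[sentence_id][0] + pos
    return start, start + length


def to_sentence_relative(new_spans, abs_start, abs_end):
    """Find the new sentence containing the span; return (sentence_id, pos).

    Returns None if the span falls outside a single new sentence, in which
    case the annotation would need manual review.
    """
    starts = [start for start, _ in new_spans]
    idx = bisect_right(starts, abs_start) - 1
    if idx < 0:
        return None
    start, end = new_spans[idx]
    if abs_end > end:
        return None
    return idx, abs_start - start


text = "Fig. 1 shows ABC-1 in nuclei. It is absent from cytoplasm."
# Old segmentation wrongly split after "Fig."; the new one does not.
old_spans = [(0, 4), (5, 29), (30, 58)]
new_spans = [(0, 29), (30, 58)]

# Annotation on "ABC-1": old sentence 1, position 8, length 5.
abs_start, abs_end = to_absolute(old_spans, 1, 8, 5)
print(text[abs_start:abs_end])                               # ABC-1
print(to_sentence_relative(new_spans, abs_start, abs_end))   # (0, 13)
```

If reformatting also changes the text itself (for example, a new PDF conversion), offsets alone would not be enough and some form of text alignment would be needed.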