Difference between revisions of "Textpresso Central"

From WormBaseWiki

Revision as of 21:45, 10 November 2011

General considerations: Specifying the data models, markup languages, and data flow is important at this stage.


Searching and Category/Ontology Development

  • Control panel: loading papers from existing corpora into a viewer, incorporation of PubMed queries; search results will be used to import full text from PMC or journal site
  • Development of NLP toolbox: pattern matching, statistics, svm, hmm, crf
  • Index of all NLP results for faster querying
  • Textpresso Ontology viewer and editor
  • Ontology development
  • Robust back-end infrastructure with internal Textpresso database holding all annotations
  • Querying the curation status of papers
  • Add searching for previously made annotations in papers to search capabilities


Viewing

  • Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new NLP analyses (pattern matching, statistical, machine learning)

This will require a uniform representation of all machine learning results with respect to papers in Textpresso. An annotation markup language comes to mind.
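
As a rough illustration of what such a uniform representation could look like, here is a minimal sketch that serializes one NLP result to XML using Python's standard library. The element and attribute names (annotation, location, label, text, source, method) and the sample values are assumptions made up for this example; no schema has been decided.

```python
import xml.etree.ElementTree as ET


def nlp_result_to_xml(paper_id, sentence_id, method, label, text):
    """Serialize one NLP result as an <annotation> element.

    The schema here is hypothetical: one element per result, with the
    paper location kept in a child element so other tools can reuse it.
    """
    ann = ET.Element("annotation", {"source": "NLP", "method": method})
    loc = ET.SubElement(ann, "location")
    loc.set("paper", paper_id)
    loc.set("sentence", str(sentence_id))
    ET.SubElement(ann, "label").text = label
    ET.SubElement(ann, "text").text = text
    return ET.tostring(ann, encoding="unicode")


# Made-up sample values, only to show the round trip.
xml_string = nlp_result_to_xml(
    "paper-0001", 12, "svm", "cellular component", "Example sentence."
)
parsed = ET.fromstring(xml_string)
print(xml_string)
```

The same element could carry results from pattern matching, statistical, or machine learning tools by changing the method attribute, which is the point of having one representation for all of them.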



Annotating and Curating

  • OA and its interaction with TC

Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.

We would need to develop a markup language (XML) for data flows. This should be as generic as possible.

Currently, for doing this we have:

1) the CCC (Cellular Component Curation) form

Pros:
 - sentences are seen on the same page as annotations
 - form pre-populates curation fields with protein names, category terms, and suggested annotations
 - easy to mark sentences if not curatable

Cons:
 - duplicating or making multiple annotations is cumbersome
 - don't see term info for proteins or GO terms
 - don't see additional annotations for proteins mentioned in sentences

2) the interaction configuration of the OA

Any others? Ask other WB curators.

  • Robust back-end infrastructure with internal Textpresso database holding all annotations

Adapt data models and tables from postgres curation database on tazendra?


Data Models and Flow

  • Integrate Textpresso categories (TCAT), NLP results, and curator annotations (CA) into one big data class
 The model needs the following elements; not all elements are populated at all times
 - term (TCAT: lexicon entry; NLP: term, sentence identified in paper if applicable; CA: term manually annotated)
 - annotation (TCAT: category term with possible attributes; NLP: machine-learning ID or descriptive term; CA: manual annotation)
 - paper location: PaperID, SentenceID, PosID
 - allowed lexical variations (plural, tenses)
 - ownership (who can change entry)
 - timestamp
 - what else?
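
The element list above could be sketched as a single Python data class. This is only an illustration under stated assumptions: the class and field names (Source, PaperLocation, AnnotationRecord, owner, and so on) are hypothetical, the sample values are made up, and most fields are optional because not all elements are populated at all times.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Source(Enum):
    """Where a record comes from (abbreviations as used in the notes)."""
    TCAT = "TCAT"  # Textpresso category / lexicon entry
    NLP = "NLP"    # result of an NLP tool
    CA = "CA"      # curator annotation


@dataclass
class PaperLocation:
    paper_id: str
    sentence_id: Optional[int] = None
    pos_id: Optional[int] = None


@dataclass
class AnnotationRecord:
    source: Source
    term: Optional[str] = None
    annotation: Optional[str] = None
    location: Optional[PaperLocation] = None  # empty for pure lexicon entries
    lexical_variations: List[str] = field(default_factory=list)
    owner: Optional[str] = None  # who can change the entry
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# A lexicon entry has no paper location; a curator annotation does.
lexicon_entry = AnnotationRecord(
    source=Source.TCAT,
    term="nucleus",
    annotation="cellular component",
    lexical_variations=["nuclei"],
)
curated = AnnotationRecord(
    source=Source.CA,
    term="nucleus",
    annotation="manual annotation text",
    location=PaperLocation("paper-0001", sentence_id=12, pos_id=3),
    owner="curator1",
)
```

Keeping the source as an explicit field lets one class cover all three cases, at the cost of fields that are empty for some sources; whether that trade-off holds for all exchanges between modules is the open question raised below.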


  • Data flow / Transaction model
 - would one big model for all exchanges between all modules work?
 ...