Difference between revisions of "Textpresso Central"

Latest revision as of 19:21, 29 November 2011

General considerations: It is important to specify the data models, markup languages, and data flow now.

Searching and Category/Ontology Development

Control panel: loading papers from existing corpora into a viewer, incorporation of PubMed queries; search results will be used to import full text from PMC or journal site

Searches - existing corpora, list of external identifiers, combination of both
- External identifiers - which ones? PMIDs, DOIs, MOD paper IDs, others?
  - Use case: Perform a PubMed search and then port the resulting IDs to a Textpresso search
- Paper Exclusion List - user supplied, from external data file, e.g. Gene Ontology Annotation File (gaf)
  - Use cases: TAIR CCC black list and filtering based upon papers already annotated with Component term in gaf file
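
The exclusion-list use case above could be sketched as follows. This is only an illustration, assuming the exclusion list comes from a tab-separated GAF file in which column 6 holds the reference (e.g. `PMID:...`) and column 9 holds the aspect (`C` for Component). The function names are hypothetical and not part of Textpresso.

```python
def pmids_with_component_annotation(gaf_lines):
    """Collect PMIDs of papers that already carry a Component (aspect C) annotation."""
    excluded = set()
    for line in gaf_lines:
        # Skip GAF header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[8] != "C":
            continue
        # The reference column may hold several pipe-separated identifiers.
        for ref in cols[5].split("|"):
            if ref.startswith("PMID:"):
                excluded.add(ref[len("PMID:"):])
    return excluded


def filter_search_ids(candidate_pmids, excluded):
    """Drop excluded papers from an ID list, e.g. one ported from a PubMed search."""
    return [pmid for pmid in candidate_pmids if pmid not in excluded]
```

A user-supplied black list could be merged into the same `excluded` set before filtering.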

Categories and Keywords
- Organization of categories by task? e.g. GO curation, Phenotype curation, Expression patterns, etc.
- Create and display category metadata - source, possible use, version, last updated
- Restrict search to a subset of a category
  - Use case: FlyBase CCC search using only gene names that start with CG
- How quickly could new searches be performed with modified categories?
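
Restricting a search to a subset of a category, as in the FlyBase use case above, might look like the sketch below. It assumes a category is available as a plain list of terms. `restrict_category` and the sample gene names are made up for illustration.

```python
def restrict_category(terms, predicate):
    """Return only those category terms for which the predicate holds."""
    return [term for term in terms if predicate(term)]


# Hypothetical gene-name category; keep only names that start with "CG".
gene_names = ["CG1234", "dpp", "CG9999", "wg"]
cg_only = restrict_category(gene_names, lambda name: name.startswith("CG"))
```

Passing the restriction as a predicate keeps the category itself unchanged, so the same category can serve several searches.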

Search Filters
- Bibliographic filters - year, journal, paper type, etc.
- Data Type Flagging - NLP results - data models and storage
  - Development of NLP toolbox: pattern matching, statistics, SVM, HMM, CRF
  - Index of all NLP results for faster querying
  - Display most current precision and recall statistics so users can assess the accuracy of an NLP tool
- Data Type Flagging - author or curator flags
  - Information stored in postgres on tazendra
- Textpresso search score cut-off - view scores (mean, range) for curatable papers (this would depend on search criteria)
  - Use case: search all SVM-predicted negatives and return those papers with a score > n
- Curation Status - integration with curation databases - which ones?
- View previously made annotations - source? tie to a sentence where possible?
  - Robust back-end infrastructure with internal Textpresso database holding all annotations
    - Adapt data models and tables from postgres curation database on tazendra?
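
The score cut-off filter and its use case above could be sketched as follows, assuming each search hit carries a Textpresso search score and an SVM prediction. The `PaperResult` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PaperResult:
    paper_id: str
    search_score: float
    svm_positive: bool  # True if the SVM flagged the paper as positive


def score_summary(results):
    """Mean and range of search scores, e.g. for papers known to be curatable."""
    scores = [r.search_score for r in results]
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}


def negatives_above(results, cutoff):
    """IDs of SVM-predicted negatives whose search score exceeds the cut-off."""
    return [r.paper_id for r in results
            if not r.svm_positive and r.search_score > cutoff]
```

Showing the summary next to the cut-off control would let a curator pick n relative to the scores of papers already known to be curatable.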

Textpresso Category and Ontology viewer and editor
- Stand-alone or interface with curation or both
- Incorporate statistical analyses (word frequency in positive vs negative sentences, how often is a term the only one from a given category, etc.)
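
One of the statistical analyses mentioned above, word frequency in positive versus negative sentences, could be prototyped as below. This is a sketch that assumes sentences are plain strings and uses a naive lowercase tokenizer. Real Textpresso tokenization would differ.

```python
import re
from collections import Counter


def word_counts(sentences):
    """Count lowercase alphanumeric tokens across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z0-9]+", sentence.lower()))
    return counts


def frequency_contrast(positive, negative):
    """Map each word to (count in positive sentences, count in negative sentences)."""
    pos, neg = word_counts(positive), word_counts(negative)
    return {word: (pos[word], neg[word]) for word in set(pos) | set(neg)}
```

Words with a high positive count and a low negative count would be candidates for addition to a category.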

Viewing Search Results

Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new NLP analyses (pattern matching, statistical, machine learning)

This will require a uniform representation of all machine learning results with respect to papers in Textpresso. An annotation markup language comes to mind.

Viewing options
- Sort by score, year, journal (like PubMed)
- See search results within the context of the paper
  - Paper viewer
    - See existing annotations, if tied to a sentence
    - Additional mark-up options? e.g. alleles, reagents, genes

Annotating and Curating

We would need to develop a markup language (XML) for data flows. This should be as generic as possible.
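
As a rough illustration of what such an XML exchange format could look like, the sketch below serializes one annotation and reads it back using Python's standard `xml.etree.ElementTree`. The element and attribute names (`annotation`, `location`, `term`, `value`) are invented here. The actual markup language is still to be designed.

```python
import xml.etree.ElementTree as ET


def annotation_to_xml(paper_id, sentence_id, term, value, source):
    """Serialize one annotation into a small XML fragment (hypothetical schema)."""
    root = ET.Element("annotation", {"source": source})
    ET.SubElement(root, "location",
                  {"paper": paper_id, "sentence": str(sentence_id)})
    ET.SubElement(root, "term").text = term
    ET.SubElement(root, "value").text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_annotation(xml_text):
    """Parse the fragment back into a plain dictionary."""
    root = ET.fromstring(xml_text)
    location = root.find("location")
    return {
        "paper_id": location.get("paper"),
        "sentence_id": int(location.get("sentence")),
        "term": root.findtext("term"),
        "value": root.findtext("value"),
        "source": root.get("source"),
    }
```

Keeping the fragment this small leaves room for module-specific extensions without changing the core elements.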

OA and its interaction with TC
- Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.

Flag the paper and/or sentence(s) as curatable, relevant but not curatable (sentence only), false positive (break down further?), unannotated - basically TP, FP, FN, TN
- Have ability to annotate NLP results in bulk
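
Once sentences or papers are flagged as TP, FP, FN or TN, precision and recall for an NLP tool follow directly from the counts. A minimal sketch, assuming flags are stored as plain strings (the function name is hypothetical):

```python
from collections import Counter


def precision_recall(flags):
    """Compute (precision, recall) from a sequence of 'TP'/'FP'/'FN'/'TN' flags.

    Either value is None when its denominator is zero.
    """
    counts = Counter(flags)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall
```

Recomputing these values whenever curators add flags would keep the displayed statistics current.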

Click on a sentence and, depending upon the curation needs, the curation tool is pre-populated with relevant entities.
- Would need to know species (usually OK for elegans, could be trickier for mammalian species)

Click on words to add to category

Current Textpresso-based curation forms:
- CCC (Cellular Component Curation) form
  - Pros:
    - sentences are seen on the same page as annotations
    - form pre-populates curation fields with protein names, category terms, and suggested annotations
    - easy to mark sentences if not curatable
  - Cons:
    - duplicating or making multiple annotations is cumbersome
    - don't see term info for proteins or GO terms
    - don't see additional annotations for proteins mentioned in sentences
- the interaction configuration of the OA
- Any others? Ask other WB curators.

Add curator comments to a paper, sentence, term

Output of curation
- to Textpresso database
- to MOD or other project database (e.g., BioGRID)
- downloadable file - what formats?

Data Models and Flow

Integrate Textpresso categories (TCAT), NLP results and curator annotation (CA) into one big data class

 The model needs the following elements; not all elements are populated at all times
 - term (TCAT: lexicon entry; NLP: term, sentence identified in paper if applicable; CA: term manually annotated)
 - annotation (TCAT: category term with possible attributes; NLP: machine-learningID or describing term; CA: manual annotation)
 - paper location: PaperID, SentenceID, PosID
 - allowed lexical variations (plural, tenses)
 - ownership (who can change entry)
 - what else?
 - timestamp
 - version
 - source
 - comment
 - possible use
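
A first draft of the combined data class listed above could look like the sketch below. The field names are one possible reading of the element list, and all fields except the origin are optional because not all elements are populated at all times. None of these names is an agreed design, and the paper ID in the usage example is made up.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    TCAT = "Textpresso category"
    NLP = "NLP result"
    CA = "curator annotation"


@dataclass
class PaperLocation:
    paper_id: str
    sentence_id: Optional[int] = None
    pos_id: Optional[int] = None


@dataclass
class AnnotationRecord:
    origin: Origin
    term: Optional[str] = None            # lexicon entry, NLP term, or curated term
    annotation: Optional[str] = None      # category term, ML identifier, or manual annotation
    location: Optional[PaperLocation] = None
    lexical_variations: List[str] = field(default_factory=list)
    owner: Optional[str] = None           # who can change the entry
    timestamp: Optional[datetime] = None
    version: Optional[str] = None
    source: Optional[str] = None
    comment: Optional[str] = None
    possible_use: Optional[str] = None
```

For example, a category entry has no paper location, while a curator annotation usually does:

```python
category_entry = AnnotationRecord(Origin.TCAT, term="nucleus",
                                  annotation="cellular component")
curated = AnnotationRecord(Origin.CA, term="DAF-16", annotation="GO:0005634",
                           location=PaperLocation("WBPaper00000001", 12, 3))
```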

Data flow / Transaction model

 - Does one big model work for all exchanges between all modules?
 ...

Action Items 2011-11-29

Develop a controlled vocabulary for all items that need to be queryable:
- Data type
- Curation status
- Possible use
- Source
- Type of Annotation
- Lexical variation

Make a first version of the annotation data model

Think about how to track changes in tokenization of papers in annotation database
- When papers get reformatted (sentence identification improves), how do we port old annotations?
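
One possible approach to porting annotations after re-tokenization, offered only as a starting point for discussion: re-attach each annotation to the new sentence whose text is most similar to the old one, and set aside annotations with no sufficiently similar match for manual review. The sketch uses `difflib.SequenceMatcher` from the Python standard library. The function name and the 0.8 threshold are arbitrary assumptions.

```python
import difflib


def port_annotations(old_sentences, new_sentences, annotations, min_ratio=0.8):
    """Re-attach annotations to re-tokenized sentences by text similarity.

    old_sentences, new_sentences: dict mapping sentence ID to sentence text.
    annotations: list of (old_sentence_id, annotation) pairs.
    Returns (ported, orphaned); orphaned entries keep their old sentence ID.
    """
    ported, orphaned = [], []
    for old_id, note in annotations:
        old_text = old_sentences[old_id]
        best_id, best_ratio = None, 0.0
        for new_id, new_text in new_sentences.items():
            ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
            if ratio > best_ratio:
                best_id, best_ratio = new_id, ratio
        if best_ratio >= min_ratio:
            ported.append((best_id, note))
        else:
            orphaned.append((old_id, note))
    return ported, orphaned
```

Comparing every old sentence with every new one is quadratic per paper, so a production version would likely restrict candidates to nearby sentence positions.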

Manifestation of Data Model in Textpresso database

- Import current SVM results
- Import current HMM results
- Import current CCC results
- Import current gene-gene interaction results
- Import current Molecules results
- How will category markup go into Textpresso database?

Design (and later implement) first version of Textpresso Curator interface

@@ Line 4: / Line 4: @@
 '''Searching and Category/Ontology Development'''
-*(+) Control panel: loading papers from existing corpora into a viewer, incorporation of PubMed queries; search results will be used to import full text from PMC or journal site
+* Control panel: loading papers from existing corpora into a viewer, incorporation of PubMed queries; search results will be used to import full text from PMC or journal site
-*(+) Development of NLP toolbox: pattern matching, statistics, svm, hmm, crf
-*(+) Index of all NLP results for faster querying
+*Searches - existing corpora, list of external identifiers, combination of both
+**External identifiers - which ones? PMIDs, doi's, MOD paper IDs, others?
+***Use case: Perform a PubMed search and then port the resulting IDs to a Textpresso search
+**Paper Exclusion List - user supplied, from external data file, e.g. Gene Ontology Annotation File (gaf)
+***Use cases: TAIR CCC black list and filtering based upon papers already annotated with Component term in gaf file
-*(+) Textpresso Ontology viewer and editor
-*(+) Ontology development
+* Categories and Keywords
+**Organization of categories by task?  e.g. GO curation, Phenotype curation, Expression patterns, etc.
+**Create and display category metadata - source, possible use, version, last updated
+**Restrict search to a subset of a category
+***Use case: FlyBase CCC search using only gene names that start with CG
+**How quickly could new searches be performed with modified categories
-*(+) Robust back-end infrastructure with internal Textpresso database holding all annotations
-*(+) Querying the curation status of papers
-*(+) Add searching for previously made annotations in papers to search capabilities
+*Search Filters
+**Bibliographic filters - year, journal, paper type, etc.
+**Data Type Flagging - NLP results - data models and storage
+***Development of NLP toolbox: pattern matching, statistics, svm, hmm, crf
+***Index of all NLP results for faster querying
+***Display most current precision and recall statistics so users can assess the accuracy of an NLP tool
+**Data Type Flagging - author or curator flags
+***Information stored in postgres on tazendra
+**Textpresso search score cut-off - view scores (mean, range) for curatable papers (this would depend on search criteria)
+****Use case: search all SVM predicted negatives and return those papers with a score >n
+**Curation Status - integration with curation databases - which ones?
+**View previously made annotations - source?  tie to a sentence where possible?
+***Robust back-end infrastructure with internal Textpresso database holding all annotations
+****Adapt data models and tables from postgres curation database on tazendra?
+* Textpresso Category and Ontology viewer and editor
+**Stand-alone or interface with curation or both
+**Incorporate statistical analyses (word frequency in positive vs negative sentences, how often is a term the only one from a given category, etc.)
-'''Viewing'''
-*(+) Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new   NLP analyses (pattern matching, statistical, machine learning)
+'''Viewing Search Results'''
+* Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new   NLP analyses (pattern matching, statistical, machine learning)
+This will require a uniform representation of all machine learning results w.r.t. papers in Textpresso. Annotation markup language comes to my mind.
+*Viewing options
+**Sort by score, year, journal (like PubMed)
+**See search results within the context of the paper
+***Paper viewer
+****See existing annotations, if tied to a sentence
+****Additional mark-up options?  e.g. alleles, reagents, genes
@@ Line 30: / Line 64: @@
 '''Annotating and Curating'''
-*(+) OA and its interaction with TC
+We would need to develop a markup language (XML) for data flows. This should be a generic as possible.
+* OA and its interaction with TC
+**Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.
+*Flag the paper and/or sentence(s) as curatable, relevant but not curatable (sentence only), false positive (break down further?), unannotated - basically TP, FP, FN, TN
+**Have ability to annotate NLP results in bulk
+*Click on a sentence and, depending upon the curation needs, the curation tool is pre-populated with relevant entities.
+**Would need to know species (usually OK for elegans, could be trickier for mammalian species)
+*Click on words to add to category
+*Current Textpresso-based curation forms:
+**CCC (Cellular Component Curation) form
+***Pros:
+****sentences are seen on the same page as annotations
+****form pre-populates curation fields with protein names, category terms, and suggested annotations
+****easy to mark sentences if not curatable
+***Cons:
+****duplicating or making multiple annotations is cumbersome
+****don't see term info for proteins or GO terms
+****don't see additional annotations for proteins mentioned in sentences
+**the interaction configuration of the OA
+**Any others?  Ask other WB curators.
+*Add curator comments to a paper, sentence, term
+*Output of curation
+**to Textpresso database
+**to MOD or other project database (e.g., BioGRID)
+**downloadable file - what formats?
+'''Data Models and Flow'''
+* Integrate Textpresso categories (TCAT), NLP results and curator annotation (CA) into one big data class
+  Model needs following elements; not all elements are populated at all times
+  - term (TCAT: lexicon entry; NLP: term, sentence identified in paper if applicable; CA: term manually annotated)
+  - annotation (TCAT: category term with possible attributes; NLP: machine-learningID or describing term; CA: manual annotation)
+  - paper location: PaperID, SentenceID, PosID
+  - allowed lexical variations (plural, tenses)
+  - ownership (who can change entry)
+  - what else?
+  - timestamp
+  - version
+  - source
+  - comment
+  - possible use
+* Data flow / Transaction model
+  - does one big model for all exchanges between all module work?
+  ...
+'''Action Items 2011-11-29'''
+* Develop a controlled vocabulary for all items that needs to be query-able:
+** Data type
+** Curation status
+** Possible use
+** Source
+** Type of Annotation
+** Lexical variation
-Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.
+* Make a first version of the annotation data model
-Currently, for doing this we have:
+* Think about how to track changes in tokenization of papers in annotation database
+** When papers get reformatted (sentence identification improve) how to port old annotations
-) the CCC (Cellular Component Curation) form
+* Manifestation of Data Model in Textpresso database
-) the interaction configuration of the OA
+** Import current SVM results
+** Import current HMM results
+** Import current CCC results
+** Import current gene-gene interaction results
+** Import current Molecules results
+** How will category markup go into Textpresso database?
-*(+) Robust back-end infrastructure with internal Textpresso database holding all annotations.
+* Design (and later implement) first version of Textpresso Curator interface
