Textpresso Central
From WormBaseWiki
Latest revision as of 19:21, 29 November 2011
General considerations: Specifying data models, markup languages, and data flow now is important.
Searching and Category/Ontology Development
- Control panel: loading papers from existing corpora into a viewer, incorporation of PubMed queries; search results will be used to import full text from PMC or journal site
- Searches - existing corpora, list of external identifiers, combination of both
  - External identifiers - which ones? PMIDs, DOIs, MOD paper IDs, others?
    - Use case: Perform a PubMed search and then port the resulting IDs to a Textpresso search
  - Paper Exclusion List - user supplied or from an external data file, e.g. Gene Ontology Annotation File (GAF)
    - Use cases: TAIR CCC blacklist, and filtering based upon papers already annotated with a Component term in a GAF file
- Categories and Keywords
  - Organization of categories by task? e.g. GO curation, Phenotype curation, Expression patterns, etc.
  - Create and display category metadata - source, possible use, version, last updated
  - Restrict search to a subset of a category
    - Use case: FlyBase CCC search using only gene names that start with CG
  - How quickly could new searches be performed with modified categories?
- Search Filters
  - Bibliographic filters - year, journal, paper type, etc.
  - Data Type Flagging - NLP results - data models and storage
    - Development of NLP toolbox: pattern matching, statistics, SVM, HMM, CRF
    - Index of all NLP results for faster querying
    - Display most current precision and recall statistics so users can assess the accuracy of an NLP tool
  - Data Type Flagging - author or curator flags
    - Information stored in postgres on tazendra
  - Textpresso search score cut-off - view scores (mean, range) for curatable papers (this would depend on search criteria)
    - Use case: search all SVM-predicted negatives and return those papers with a score >n
  - Curation Status - integration with curation databases - which ones?
  - View previously made annotations - source? tie to a sentence where possible?
    - Robust back-end infrastructure with internal Textpresso database holding all annotations
      - Adapt data models and tables from postgres curation database on tazendra?
- Textpresso Category and Ontology viewer and editor
  - Stand-alone or interface with curation or both
  - Incorporate statistical analyses (word frequency in positive vs negative sentences, how often is a term the only one from a given category, etc.)
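The exclusion-list, category-subset, and score cut-off use cases above could be combined into one post-search filter. The following is a minimal sketch, not a description of how Textpresso works: the `Hit` record, `component_papers_from_gaf`, and `filter_hits` are hypothetical names, and the only outside fact relied on is the standard tab-separated GAF layout (column 6 holds the reference, column 9 the aspect, and lines starting with `!` are comments).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit: one paper with its score and matched terms."""
    paper_id: str          # e.g. "PMID:12345"
    score: float
    matched_terms: tuple   # category terms matched in the paper


def component_papers_from_gaf(lines):
    """Collect PMIDs of papers already annotated with a Component term."""
    excluded = set()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip GAF header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[8] != "C":
            continue  # column 9 is the aspect: P, F or C
        for ref in cols[5].split("|"):  # column 6 may hold several references
            if ref.startswith("PMID:"):
                excluded.add(ref)
    return excluded


def filter_hits(hits, excluded=frozenset(), min_score=None, term_prefix=None):
    """Drop excluded papers, low scores, and hits without a matching term."""
    kept = []
    for hit in hits:
        if hit.paper_id in excluded:
            continue
        if min_score is not None and hit.score <= min_score:
            continue  # keep only papers with a score > n
        if term_prefix is not None and not any(
            term.startswith(term_prefix) for term in hit.matched_terms
        ):
            continue  # e.g. only gene names that start with "CG"
        kept.append(hit)
    return sorted(kept, key=lambda h: h.score, reverse=True)
```

Keeping the three filters as optional arguments would let the same function serve the TAIR and FlyBase use cases with different settings.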
Viewing Search Results
- Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new NLP analyses (pattern matching, statistical, machine learning)
This will require a uniform representation of all machine learning results w.r.t. papers in Textpresso. An annotation markup language comes to mind.
- Viewing options
  - Sort by score, year, journal (like PubMed)
  - See search results within the context of the paper
    - Paper viewer
      - See existing annotations, if tied to a sentence
      - Additional mark-up options? e.g. alleles, reagents, genes
Annotating and Curating
We would need to develop a markup language (XML) for data flows. This should be as generic as possible.
- OA and its interaction with TC
  - Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.
- Flag the paper and/or sentence(s) as curatable, relevant but not curatable (sentence only), false positive (break down further?), unannotated - basically TP, FP, FN, TN
  - Have ability to annotate NLP results in bulk
- Click on a sentence and, depending upon the curation needs, the curation tool is pre-populated with relevant entities.
  - Would need to know species (usually OK for elegans, could be trickier for mammalian species)
- Click on words to add to category
- Current Textpresso-based curation forms:
  - CCC (Cellular Component Curation) form
    - Pros:
      - sentences are seen on the same page as annotations
      - form pre-populates curation fields with protein names, category terms, and suggested annotations
      - easy to mark sentences if not curatable
    - Cons:
      - duplicating or making multiple annotations is cumbersome
      - don't see term info for proteins or GO terms
      - don't see additional annotations for proteins mentioned in sentences
  - the interaction configuration of the OA
  - Any others? Ask other WB curators.
- Add curator comments to a paper, sentence, term
- Output of curation
  - to Textpresso database
  - to MOD or other project database (e.g., BioGRID)
  - downloadable file - what formats?
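Curator flags on NLP results could feed the precision and recall statistics mentioned under Search Filters. The sketch below assumes each flagged sentence can be reduced to a pair of booleans (what the NLP tool predicted, what the curator decided); how the four flag values in the list above map onto those booleans is left open here, and the function names are hypothetical.

```python
from collections import Counter


def confusion_counts(judgements):
    """Tally TP/FP/FN/TN from (predicted_positive, curator_positive) pairs."""
    counts = Counter()
    for predicted, actual in judgements:
        if predicted and actual:
            counts["TP"] += 1
        elif predicted and not actual:
            counts["FP"] += 1
        elif not predicted and actual:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); None where the denominator is zero."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall
```

Because the counts are recomputed from the stored flags, the displayed statistics would always reflect the most recent curator decisions.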
Data Models and Flow
- Integrate Textpresso categories (TCAT), NLP results and curator annotation (CA) into one big data class
  Model needs the following elements; not all elements are populated at all times:
  - term (TCAT: lexicon entry; NLP: term, sentence identified in paper if applicable; CA: term manually annotated)
  - annotation (TCAT: category term with possible attributes; NLP: machine-learningID or describing term; CA: manual annotation)
  - paper location: PaperID, SentenceID, PosID
  - allowed lexical variations (plural, tenses)
  - ownership (who can change entry)
  - what else?
  - timestamp
  - version
  - source
  - comment
  - possible use
- Data flow / Transaction model
  - Does one big model for all exchanges between all modules work? ...
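One way to picture the "one big data class" is a single record type whose optional fields stay empty when they do not apply, plus an XML serialization for data flows. This is only a sketch under stated assumptions: the class names, field names, and XML element names below are invented for illustration and are not an agreed schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class RecordKind(Enum):
    TCAT = "TCAT"  # Textpresso category entry
    NLP = "NLP"    # result of an NLP tool
    CA = "CA"      # curator annotation


@dataclass
class Location:
    paper_id: str
    sentence_id: Optional[int] = None
    pos_id: Optional[int] = None


@dataclass
class AnnotationRecord:
    kind: RecordKind
    term: str
    annotation: str
    location: Optional[Location] = None      # empty for pure lexicon entries
    lexical_variations: list = field(default_factory=list)
    owner: Optional[str] = None              # who can change the entry
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    version: int = 1
    source: Optional[str] = None
    comment: Optional[str] = None
    possible_use: Optional[str] = None

    def to_xml(self) -> ET.Element:
        """Serialize to XML, omitting elements that are not populated."""
        elem = ET.Element("record", {"kind": self.kind.value, "version": str(self.version)})
        ET.SubElement(elem, "term").text = self.term
        ET.SubElement(elem, "annotation").text = self.annotation
        if self.location is not None:
            attrs = {"paper": self.location.paper_id}
            if self.location.sentence_id is not None:
                attrs["sentence"] = str(self.location.sentence_id)
            if self.location.pos_id is not None:
                attrs["pos"] = str(self.location.pos_id)
            ET.SubElement(elem, "location", attrs)
        for variation in self.lexical_variations:
            ET.SubElement(elem, "variation").text = variation
        for name in ("owner", "source", "comment", "possible_use"):
            value = getattr(self, name)
            if value is not None:
                ET.SubElement(elem, name).text = value
        ET.SubElement(elem, "timestamp").text = self.timestamp.isoformat()
        return elem
```

A curator annotation might then be built as `AnnotationRecord(RecordKind.CA, "nucleus", "GO:0005634", Location("WBPaper00000001", 12, 3))`, where the paper ID is a made-up example, and passed through `ET.tostring(record.to_xml(), encoding="unicode")` for exchange between modules.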
Action Items 2011-11-29
- Develop a controlled vocabulary for all items that need to be queryable:
  - Data type
  - Curation status
  - Possible use
  - Source
  - Type of Annotation
  - Lexical variation
- Make a first version of the annotation data model
- Think about how to track changes in tokenization of papers in annotation database
  - When papers get reformatted (sentence identification improves), how to port old annotations
- Manifestation of Data Model in Textpresso database
  - Import current SVM results
  - Import current HMM results
  - Import current CCC results
  - Import current gene-gene interaction results
  - Import current Molecules results
  - How will category markup go into Textpresso database?
- Design (and later implement) first version of Textpresso Curator interface
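For the action item on porting old annotations after re-tokenization, one possible approach (an assumption, not a decided design) is to re-attach each annotation to the new sentence whose text matches the old one, first exactly and then approximately. The function name, the dictionary inputs, and the 0.8 similarity threshold below are all illustrative choices; the similarity measure is `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def port_annotations(old_sentences, new_sentences, annotations, min_ratio=0.8):
    """Re-attach sentence-level annotations after a paper is re-segmented.

    old_sentences / new_sentences: dict of sentence ID -> sentence text
    annotations: dict of old sentence ID -> annotation
    Returns (ported, unresolved): new sentence ID -> list of annotations,
    and the old sentence IDs that found no sufficiently similar sentence.
    """
    by_text = {text: sid for sid, text in new_sentences.items()}
    ported, unresolved = {}, []
    for old_id, annotation in annotations.items():
        text = old_sentences[old_id]
        new_id = by_text.get(text)  # exact text match first
        if new_id is None:
            # Fall back to the most similar new sentence, if similar enough.
            best_id, best_ratio = None, 0.0
            for sid, candidate in new_sentences.items():
                ratio = SequenceMatcher(None, text, candidate).ratio()
                if ratio > best_ratio:
                    best_id, best_ratio = sid, ratio
            if best_ratio >= min_ratio:
                new_id = best_id
        if new_id is None:
            unresolved.append(old_id)  # left for manual review
        else:
            ported.setdefault(new_id, []).append(annotation)
    return ported, unresolved
```

Annotations that cannot be ported automatically are returned rather than dropped, so a curator could review them by hand.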