Textpresso Central
From WormBaseWiki
General considerations: Specifying the data models, markup languages, and data flow is important at this stage.
Searching and Category/Ontology Development
- Control panel: loading papers from existing corpora into a viewer, incorporation of PubMed queries; search results will be used to import full text from PMC or journal site
- Searches - existing corpora, list of external identifiers, combination of both, exclusion list
- External identifiers - which ones? PMIDs, DOIs, MOD paper IDs, others?
- Categories and Keywords
- Organization of categories by task? e.g. GO curation, Phenotype curation, Expression patterns, etc.
- Create and display category metadata - source, possible use, version, last updated
- Search Filters
- Bibliographic filters - year, journal, paper type, etc.
- Data Type Flagging - NLP results - data models and storage
- Development of NLP toolbox: pattern matching, statistics, SVM, HMM, CRF
- Index of all NLP results for faster querying
- Textpresso search score cut-off
- Curation Status - integration with curation databases - which ones?
- View previously made annotations - source? tie to a sentence where possible?
- Robust back-end infrastructure with internal Textpresso database holding all annotations
- Adapt data models and tables from postgres curation database on tazendra?
- Textpresso Category and Ontology viewer and editor
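The search options above (identifier lists, exclusion list, bibliographic filters, score cut-off) could be combined roughly as in the following Python sketch. The `Paper` fields, the `select_papers` name, and all parameter names are hypothetical and chosen only for illustration; nothing here reflects an existing Textpresso API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Paper:
    # All fields are assumptions made for this sketch.
    paper_id: str   # external identifier, e.g. a PMID, DOI, or MOD paper ID
    year: int
    journal: str
    score: float    # search score assigned to the paper


def select_papers(
    corpus: Iterable[Paper],
    include_ids: Optional[Set[str]] = None,
    exclude_ids: Set[str] = frozenset(),
    min_score: float = 0.0,
    year_range: Optional[Tuple[int, int]] = None,
    journals: Optional[Set[str]] = None,
) -> List[Paper]:
    """Return the papers in `corpus` that pass every active filter."""
    selected = []
    for paper in corpus:
        # Restrict to the identifier list, if one was supplied.
        if include_ids is not None and paper.paper_id not in include_ids:
            continue
        # The exclusion list always wins over the inclusion list.
        if paper.paper_id in exclude_ids:
            continue
        # Score cut-off.
        if paper.score < min_score:
            continue
        # Bibliographic filters: inclusive year range and journal set.
        if year_range is not None and not (year_range[0] <= paper.year <= year_range[1]):
            continue
        if journals is not None and paper.journal not in journals:
            continue
        selected.append(paper)
    return selected
```

Passing only `include_ids` corresponds to a search over a list of external identifiers; passing a corpus plus `exclude_ids` corresponds to a corpus search with an exclusion list.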
Viewing Search Results
- Viewer: selecting terms, importing them into OA, prepopulating entries of forms; display results from NLP tools; initiate new NLP analyses (pattern matching, statistical, machine learning)
This will require a uniform representation of all machine learning results with respect to the papers in Textpresso. An annotation markup language comes to mind.
- Viewing options
- Sort by score, year, journal (like PubMed)
- See search results within the context of the paper
- Paper viewer
- See existing annotations, if tied to a sentence
- Additional mark-up options? e.g. alleles, reagents, genes
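As a rough illustration of the viewing options above, the sketch below sorts search hits by score, year, or journal and returns a sentence together with its neighbours for in-context display. The hit dictionaries, the sort directions (highest score first, newest year first, journals alphabetical), and both function names are assumptions made for this example.

```python
from operator import itemgetter

# Assumed sort directions: (field name, descending?)
SORT_KEYS = {
    "score": ("score", True),
    "year": ("year", True),
    "journal": ("journal", False),
}


def sort_hits(hits, by="score"):
    """Return a new list of hit dicts ordered by the chosen field."""
    field, descending = SORT_KEYS[by]
    return sorted(hits, key=itemgetter(field), reverse=descending)


def sentence_context(sentences, index, window=1):
    """Return the sentence at `index` plus up to `window` neighbours on each side."""
    start = max(0, index - window)
    return sentences[start:index + window + 1]
```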
Annotating and Curating
We would need to develop a markup language (XML) for data flows. This should be as generic as possible.
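To make the idea of an XML markup for data flows more concrete, here is a minimal Python sketch that serializes one annotation. The element and attribute names (`annotation`, `location`, `term`, `value`, `source`, `status`) are invented for illustration and are not a proposed schema; only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET


def annotation_to_xml(paper_id, sentence_id, term, annotation, source, status):
    """Serialize one annotation as an XML string (hypothetical element names)."""
    # `source` might be "TCAT", "NLP", or "CA"; `status` might be one of the
    # curation flags such as "curatable" or "false positive".
    root = ET.Element("annotation", {"source": source, "status": status})
    ET.SubElement(root, "location", {"paper": paper_id, "sentence": str(sentence_id)})
    ET.SubElement(root, "term").text = term
    ET.SubElement(root, "value").text = annotation
    return ET.tostring(root, encoding="unicode")
```

A call such as `annotation_to_xml("paper-1", 12, "nucleus", "GO:0005634", "CA", "curatable")` (with a made-up paper ID) yields a single `<annotation>` element that any module could parse back with `ET.fromstring`.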
- OA and its interaction with TC
- Curators would like to be able to view the search results while curating and make annotations from the true positive sentences.
- Flag the paper and/or sentence(s) as curatable, relevant but not curatable, false positive (break down further?), unannotated
- Click on a sentence and, depending upon the curation needs, the curation tool is pre-populated with relevant entities.
- Would need to know species (usually OK for C. elegans, could be trickier for mammalian species)
- Click on words to add to category
- Current Textpresso-based curation forms:
- CCC (Cellular Component Curation) form
- Pros: sentences are seen on the same page as annotations; form pre-populates curation fields with protein names, category terms, and suggested annotations; easy to mark sentences if not curatable
- Cons: duplicating or making multiple annotations is cumbersome; don't see term info for proteins or GO terms; don't see additional annotations for proteins mentioned in sentences
- the interaction configuration of the OA
Any others? Ask other WB curators.
Data Models and Flow
- Integrate Textpresso categories (TCAT), NLP results and curator annotation (CA) into one big data class
Model needs following elements; not all elements are populated at all times:
- term (TCAT: lexicon entry; NLP: term, sentence identified in paper if applicable; CA: term manually annotated)
- annotation (TCAT: category term with possible attributes; NLP: machine-learningID or describing term; CA: manual annotation)
- paper location: PaperID, SentenceID, PosID
- allowed lexical variations (plural, tenses)
- ownership (who can change entry)
- what else?
- timestamp
- Data flow / Transaction model
- does one big model for all exchanges between all modules work? ...
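One possible shape for the "one big data class" is sketched below in Python. The class and field names are hypothetical; the rule that an entry without an owner can be changed by anyone is an assumption made for the sketch, not something the notes decide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Source(Enum):
    TCAT = "Textpresso category"
    NLP = "NLP result"
    CA = "curator annotation"


@dataclass
class Location:
    paper_id: str
    sentence_id: Optional[int] = None   # not every entry is tied to a sentence
    pos_id: Optional[int] = None


@dataclass
class Record:
    source: Source
    term: str                           # lexicon entry, NLP term, or annotated term
    annotation: Optional[str] = None    # category term, ML identifier, or manual annotation
    location: Optional[Location] = None
    lexical_variations: List[str] = field(default_factory=list)  # plurals, tenses
    owner: Optional[str] = None         # who may change the entry
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def can_edit(self, user: str) -> bool:
        # Assumption: unowned entries are editable by anyone.
        return self.owner is None or self.owner == user
```

Optional fields default to empty values, which mirrors the note that not all elements are populated at all times: a lexicon entry can omit the paper location, while a curator annotation would normally carry one.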