Lexicon Development Tool
The Lexicon Development Tool (LDT) is built on top of the existing Textpresso search functionality. It uses a three-tier architecture: a front-end web interface written as Perl CGI, a middle tier of functional process logic, and a Postgres relational database for back-end data storage. Briefly, it allows a curator to log in (see screenshot 1_CE_Login), set up a project (see screenshot 2_CE_Project) and submit a Textpresso search (see screenshot 3_CE_Search); the search results (sentences) are stored in the back-end database. The curator can retrieve these sentences and display them on a web page. The curator can then select sentences of interest and either make categories based on word frequency, or save words and phrases, judged by their context in the sentence, to the appropriate categories or to an exclusion list (see screenshot 4_CE_SentenceAnnotation). The newly made categories can be included in a new round of search, so the process can be repeated iteratively.
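The word-frequency step described above can be illustrated with a minimal sketch. It is not LDT's actual code (LDT is written in Perl): the function name `candidate_terms`, its parameters and the simple tokenizing rule are all assumptions made for illustration.

```python
import re
from collections import Counter


def candidate_terms(sentences, exclusion_list=(), top_n=10):
    """Rank words in the selected sentences by frequency.

    Hypothetical helper: words on the exclusion list are skipped, and the
    most frequent remaining words are returned as (word, count) pairs that
    a curator could review as candidate category members.
    """
    excluded = {word.lower() for word in exclusion_list}
    counts = Counter()
    for sentence in sentences:
        # Assumed tokenization: a letter followed by letters, digits or hyphens.
        for word in re.findall(r"[a-z][a-z0-9-]*", sentence.lower()):
            if word not in excluded:
                counts[word] += 1
    return counts.most_common(top_n)


selected = [
    "The mutant shows a lethal phenotype.",
    "Lethal phenotype was observed in the mutant.",
]
print(candidate_terms(selected, exclusion_list=["the", "a", "was", "in"], top_n=3))
```

In this sketch the exclusion list plays the same role as in the annotation screen: words placed on it no longer compete for a place in a category.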
This utility can be applied to both datatype identification and automatic extraction of text for curation. In datatype identification, the curator can upload into LDT a small training set containing 50 positive and negative IDs for a datatype, and optimize the search query based on recall and precision measurements. Once satisfactory recall and precision values are obtained, the batch-mode version of LDT (e.g. the search module) can be run in the automatic pipeline, applying the optimized search query to new paper IDs; the IDs identified as positive for the datatype can then be deposited into the tracking database.
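As a sketch of how recall and precision could be measured against such a training set, the example below uses the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name `precision_recall` and the example IDs are hypothetical, and the source does not state how LDT itself computes these values.

```python
def precision_recall(retrieved_ids, positive_ids, negative_ids):
    """Score a search query against a labeled training set.

    retrieved_ids: IDs of papers returned by the search query (assumed input).
    positive_ids / negative_ids: the curator's labeled training IDs.
    Returns (precision, recall); a ratio with a zero denominator is 0.0.
    """
    retrieved = set(retrieved_ids)
    positives = set(positive_ids)
    negatives = set(negative_ids)

    true_pos = len(retrieved & positives)    # positives the query found
    false_pos = len(retrieved & negatives)   # negatives the query found
    false_neg = len(positives - retrieved)   # positives the query missed

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Made-up IDs for illustration only.
positives = ["p1", "p2", "p3", "p4"]
negatives = ["n1", "n2", "n3", "n4"]
retrieved = ["p1", "p2", "p3", "n1"]
print(precision_recall(retrieved, positives, negatives))  # (0.75, 0.75)
```

A query revision that raises one value while lowering the other can be compared across iterations by recording both numbers for each round of search.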
In automatic extraction of text for curation, the curator can upload a single article of a datatype and repeat the same iterative cycle of search, category making and search until the query is specific enough that only sentences containing relevant curation information are returned. The same search query can then be applied to all the other papers of that datatype that need to be curated. The selected sentences can be exported to a text file in the user's folder (see screenshot 5_CE_Results), in a format that can be imported directly into a curation tool such as Phenote for further curation.