WormBase Paper Categorization
Revision as of 19:07, 19 July 2013
Statement of Purpose
To improve curation efficiency, provide distinct curation milestones, and create a common curation goal for curators, we would like to classify the C. elegans corpus into biologically relevant categories. These categories could represent different biological processes, molecular functions, anatomy terms, disease relevance, signaling pathways, phenotypes, or any other categorical distinction we see as appropriate as we investigate this organizational scheme. This Wiki page is intended to collect and organize our thoughts and proposals on how best to organize the C. elegans corpus (or the nematode corpus in general), including but not limited to:
- What fundamental categories and/or hierarchies we can create to categorize papers
- Methods or approaches for assigning individual papers to these categories
- How we go about choosing a category as a curation priority
- Determining what our goals should be for tackling a category or topic
- Distributing the curation efforts among curators
Categorical Schemes
Categories could be devised or arranged in a number of ways. Current or proposed approaches include:
- Using WormBook chapters as a basis for categories
- Pathways (Wnt, MAPK, TGF, Ras, etc.)
- Processes (Aging, Sex Determination, Meiosis)
- Phenotypes
- Commonly referenced gene sets
- Topics commonly studied in C. elegans due to its advantages as a model system
Methods & Approaches for Paper Categorization
- Collecting (manually) lists of relevant keywords, and performing (manual) Textpresso or PubMed searches
- Running Textpresso scripts
- SVM-based (support vector machine) classification: supervised vs. unsupervised, keyword-based
- Requires positives and negatives for training
- Chris is in the process of assembling all WormBook articles for an initial training round
- Collecting common keywords from papers in our Author First Pass list of papers
- Yuling has run scripts to determine word frequencies among these papers
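The keyword-based approaches above (word frequencies across a set of papers, then matching keyword lists against paper text) can be sketched roughly as follows. This is only an illustration, not the scripts Yuling or Chris actually ran: the function names, the stop-word list, the example keyword lists, and the sample sentence are all assumptions made up for this sketch, and it does not use Textpresso or an SVM.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would need a fuller one.
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "is", "for", "by", "with"}

# Tokens are lowercase runs of letters/digits, allowing internal hyphens (e.g. "daf-16").
TOKEN_RE = re.compile(r"[a-z0-9][a-z0-9\-]*")

def word_frequencies(texts):
    """Count how often each non-stop word appears across a list of texts."""
    counts = Counter()
    for text in texts:
        words = TOKEN_RE.findall(text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

def assign_categories(text, category_keywords, min_hits=1):
    """Return {category: hit count} for categories with at least min_hits keyword hits."""
    words = Counter(TOKEN_RE.findall(text.lower()))
    matched = {}
    for category, keywords in category_keywords.items():
        hits = sum(words[k.lower()] for k in keywords)
        if hits >= min_hits:
            matched[category] = hits
    return matched

# Illustrative keyword lists only (assumed for this sketch, not a curated vocabulary).
category_keywords = {
    "Wnt pathway": ["wnt", "bar-1", "frizzled"],
    "Aging": ["lifespan", "daf-16", "aging"],
}

# Made-up sentence standing in for a paper abstract.
abstract = "Loss of daf-16 shortens lifespan in C. elegans."

frequencies = word_frequencies([abstract])
categories = assign_categories(abstract, category_keywords)
print(frequencies.most_common(3))
print(categories)
```

A simple keyword count like this needs no training data, which is its main appeal compared with the SVM approach; the trade-off is that it depends entirely on the quality of the manually collected keyword lists.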
Choosing a Curation Priority
Once we are satisfied with one or more categorization schemes, we may want to select a single category on which curators can focus their efforts. Our criteria for choosing a category may depend on a number of factors:
- Number of papers in each data type backlog
- Number of papers in a category
- Distribution of data types (or required curator effort) for papers in a category
- Current representation of category topic in WormBase
- Highly represented: we may want to "polish off" what we have to generate a complete picture
- Poorly represented: may be low-hanging fruit for covering new topics
- Current representation of gene function for genes represented in category
- We could focus on genes with little or no known function
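As one way to make these criteria concrete, the sketch below ranks categories by how much of each has already been curated. Everything here is an assumption for illustration: the `CategoryStats` fields, the "polish" and "new" strategy names, and the paper counts are invented, and a real prioritization would likely also weigh data types, curator effort, and gene function coverage.

```python
from dataclasses import dataclass

@dataclass
class CategoryStats:
    name: str
    papers: int   # papers assigned to the category
    backlog: int  # of those, papers still awaiting curation

def coverage(stats):
    """Fraction of a category's papers already curated (0.0 if it has no papers)."""
    if stats.papers == 0:
        return 0.0
    return 1 - stats.backlog / stats.papers

def rank_categories(stats_list, strategy="polish"):
    """Order category names by curation coverage.

    strategy="polish": most-covered first (finish what is nearly complete).
    strategy="new":    least-covered first (low-hanging fruit for new topics).
    """
    if strategy not in ("polish", "new"):
        raise ValueError("strategy must be 'polish' or 'new'")
    ordered = sorted(stats_list, key=coverage, reverse=(strategy == "polish"))
    return [s.name for s in ordered]

# Invented numbers, for illustration only.
stats = [
    CategoryStats("Aging", papers=200, backlog=20),
    CategoryStats("Meiosis", papers=100, backlog=80),
    CategoryStats("Sex Determination", papers=50, backlog=25),
]

polish_order = rank_categories(stats, strategy="polish")
new_order = rank_categories(stats, strategy="new")
print(polish_order)
print(new_order)
```

Keeping the two strategies as an explicit parameter mirrors the choice described above between "polishing off" a highly represented topic and starting on a poorly represented one.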
Goals for a Curation Milestone
Some goals that have been discussed are:
- Completing curation backlog for all data types for a given category
- Completing curation of human disease-relevance for a given category
- Generating (or filling out) a WormBase Process page and WikiPathway
- Goals could be set for each curation upload (roughly every two months)
- We can post the results of the milestone on the WormBase homepage/blog