Difference between revisions of "WormBase Paper Categorization"

From WormBaseWiki
Revision as of 19:02, 19 July 2013

WormBase Paper Categorization

Statement of Purpose

In an attempt to improve curation efficiency, provide distinct curation milestones, and create a common curation goal for curators, we would like to classify the C. elegans corpus into biologically relevant categories. These categories could represent different biological processes, molecular functions, anatomy terms, disease relevance, signaling pathways, phenotypes or any other categorical distinction we see as appropriate as we investigate this organizational scheme. This Wiki page is intended to collect and organize our thoughts and proposals as to how to best organize the C. elegans corpus (or for nematodes in general), including but not limited to:

  1. What fundamental categories and/or hierarchies we can create to categorize papers
  2. Methods or approaches for assigning individual papers to these categories
  3. How we go about choosing a category as a curation priority
  4. Determining what our goals should be for tackling a category or topic
  5. Distributing the curation efforts among curators

Categorical Schemes

Categories could be devised or arranged in a number of ways. Current or proposed approaches include:

  • Using WormBook chapters as a basis for categories
  • Pathways (Wnt, MAPK, TGF, Ras, etc.)
  • Processes (Aging, Sex Determination, Meiosis)
  • Phenotypes
  • Commonly referenced gene sets

Methods & Approaches for Paper Categorization

  • Collecting (manually) lists of relevant keywords, and performing (manual) Textpresso or PubMed searches
  • Running Textpresso scripts
    • Options: SVM-based classification, supervised vs. unsupervised methods, keyword matching
    • Supervised methods require positive and negative example papers for training
  • Collecting common keywords from papers in our Author First Pass list of papers
    • Yuling has run scripts to determine word frequencies among these papers
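
The keyword-based approaches above could be prototyped along the following lines. This is only a minimal sketch, not the Textpresso scripts or Yuling's scripts: the keyword lists, the `min_hits` threshold, the function names, and the two example texts are all hypothetical and stand in for curator-collected keywords and real paper text.

```python
import re
from collections import Counter

# Hypothetical keyword lists; real lists would be collected by curators.
CATEGORY_KEYWORDS = {
    "Wnt signaling": {"wnt", "frizzled", "catenin"},
    "Aging": {"lifespan", "longevity", "aging"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(papers):
    """Count how often each word occurs across a collection of paper texts."""
    counts = Counter()
    for text in papers.values():
        counts.update(tokenize(text))
    return counts


def assign_categories(text, min_hits=2):
    """Return categories whose keywords occur at least min_hits times in text."""
    counts = Counter(tokenize(text))
    matches = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(counts[word] for word in keywords)
        if hits >= min_hits:
            matches.append(category)
    return matches


# Made-up example texts, for illustration only.
papers = {
    "paper1": "Wnt ligands bind Frizzled receptors and stabilize beta-catenin.",
    "paper2": "Reduced insulin signaling extends lifespan and longevity.",
}

if __name__ == "__main__":
    for paper_id, text in papers.items():
        print(paper_id, assign_categories(text))
    print(word_frequencies(papers).most_common(5))
```

A simple hit count like this needs no training data, unlike a supervised SVM classifier, but it is sensitive to the choice of keywords and threshold.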


Choosing a Curation Priority

Once we are satisfied with one or more categorization schemes, we may want to select a single category on which curators focus their efforts. Our criteria for choosing a category may depend on a number of factors:

  • Number of papers in each data type backlog
  • Number of papers in a category
  • Distribution of data types (or required curator effort) for papers in a category
  • Current representation of category topic in WormBase
    • Highly represented: we may want to "polish off" what we have to generate a complete picture
    • Poorly represented: may be low-hanging fruit for covering new topics
  • Current representation of gene function for genes represented in category
    • We could focus on genes with little or no known function
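
As a rough illustration of how the criteria above might be compared side by side, the sketch below separates "polish off" candidates from new-topic candidates. Everything in it is an assumption for illustration: the `CategoryStats` fields, the `coverage` estimate (a 0 to 1 guess at how well a topic is already represented in WormBase), the thresholds, and the placeholder numbers are not taken from WormBase data.

```python
from dataclasses import dataclass


@dataclass
class CategoryStats:
    name: str
    papers: int      # papers assigned to the category
    backlog: int     # papers still awaiting curation for some data type
    coverage: float  # hypothetical 0-1 estimate of current representation


def polish_candidates(stats, min_coverage=0.7):
    """Highly represented categories, smallest remaining backlog first."""
    chosen = [s for s in stats if s.coverage >= min_coverage]
    return sorted(chosen, key=lambda s: s.backlog)


def new_topic_candidates(stats, max_coverage=0.3):
    """Poorly represented categories, largest paper count first."""
    chosen = [s for s in stats if s.coverage <= max_coverage]
    return sorted(chosen, key=lambda s: s.papers, reverse=True)


# Placeholder numbers, not real counts.
stats = [
    CategoryStats("Category A", papers=400, backlog=40, coverage=0.9),
    CategoryStats("Category B", papers=250, backlog=200, coverage=0.2),
    CategoryStats("Category C", papers=120, backlog=15, coverage=0.8),
    CategoryStats("Category D", papers=300, backlog=280, coverage=0.1),
]

if __name__ == "__main__":
    print([s.name for s in polish_candidates(stats)])
    print([s.name for s in new_topic_candidates(stats)])
```

Keeping the two lists separate avoids collapsing competing criteria into a single score before curators have agreed on how to weigh them.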


Goals for a Curation Milestone

Some goals that have been discussed are:

  • Completing curation backlog for all data types for a given category
  • Completing curation of human disease-relevance for a given category
  • Generating (or filling out) a WormBase Process page and WikiPathway
  • Goals could be set for each curation upload (every ~two months)


Distributing Curation Efforts