WormBase Paper Categorization

From WormBaseWiki

Statement of Purpose

To improve curation efficiency, provide distinct curation milestones, and create a common goal for curators, we would like to classify the C. elegans corpus into biologically relevant categories. These categories could represent biological processes, molecular functions, anatomy terms, disease relevance, signaling pathways, phenotypes, or any other categorical distinction we find appropriate as we investigate this organizational scheme. This Wiki page is intended to collect and organize our thoughts and proposals on how best to organize the C. elegans corpus (or the nematode literature in general), including but not limited to:

  1. What fundamental categories and/or hierarchies we can create to categorize papers
  2. Methods or approaches for assigning individual papers to these categories
  3. How we go about choosing a category as a curation priority
  4. Determining what our goals should be for tackling a category or topic
  5. Distributing the curation efforts among curators


Categorical Schemes

Categories could be devised or arranged in a number of ways. Current or proposed approaches include:

  • Using WormBook chapters as a basis for categories
  • Pathways (Wnt, MAPK, TGF, Ras, etc.)
  • Processes (Aging, Sex Determination, Meiosis)
  • Phenotypes
  • Commonly referenced gene sets
  • Topics commonly studied in C. elegans due to its advantages as a model system


Pilot Category Tree

WormBook Chapter categories:

Genetics and Genomics

  • Genetics
    • Complementation
    • Essential genes
    • Gene duplications and genetic redundancy
  • Genomics

Developmental Control

Signal Transduction

Molecular Biology

Post-embryonic Development

Neurobiology and Behavior

Biochemistry

Sex Determination

Evolution and Ecology

Cell Biology

The Germ Line

Disease Models and Drug Discovery
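
One way the pilot tree could be held in a script is as a nested mapping, so that every category gets a full path from its WormBook section. The sketch below is only an illustration: the names `CATEGORY_TREE` and `category_paths` are hypothetical, and only the categories already listed on this page are included.

```python
# Pilot category tree as a nested dict (hypothetical representation).
# An empty dict means no subcategories have been proposed yet.
CATEGORY_TREE = {
    "Genetics and Genomics": {
        "Genetics": {
            "Complementation": {},
            "Essential genes": {},
            "Gene duplications and genetic redundancy": {},
        },
        "Genomics": {},
    },
    "Developmental Control": {},
    "Signal Transduction": {},
    "Molecular Biology": {},
    "Post-embryonic Development": {},
    "Neurobiology and Behavior": {},
    "Biochemistry": {},
    "Sex Determination": {},
    "Evolution and Ecology": {},
    "Cell Biology": {},
    "The Germ Line": {},
    "Disease Models and Drug Discovery": {},
}


def category_paths(tree, prefix=()):
    """Yield every category as a tuple path from the root, parents before children."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from category_paths(children, path)


if __name__ == "__main__":
    for path in category_paths(CATEGORY_TREE):
        print(" > ".join(path))
```

A paper could then be tagged with one or more of these paths, which keeps the parent section recoverable from any subcategory.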

Methods & Approaches for Paper Categorization

  • Collecting (manually) lists of relevant keywords, and performing (manual) Textpresso or PubMed searches
  • Running Textpresso scripts
    • Options include SVM-based classification, supervised vs. unsupervised approaches, and keyword-based searches
    • Supervised training requires positive and negative example papers
    • Chris is in the process of assembling all WormBook articles for an initial training round
  • Collecting common keywords from papers in our Author First Pass list of papers
    • Yuling has run scripts to determine word frequencies among these papers
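
As a rough illustration of the keyword and word-frequency approaches listed above, the sketch below counts word frequencies across a set of texts and assigns a paper to categories by keyword hits. It is not the Textpresso scripts or Yuling's scripts. The keyword lists, the stop-word list, the `min_hits` threshold, and all function names are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical keyword lists; real lists would be assembled by curators.
CATEGORY_KEYWORDS = {
    "Aging": {"lifespan", "longevity", "aging"},
    "Sex Determination": {"hermaphrodite", "masculinization", "feminization"},
    "Meiosis": {"meiosis", "meiotic", "synapsis"},
}

# Tiny illustrative stop-word list (an assumption, not a curated resource).
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "is", "that", "for", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(texts):
    """Count non-stop-word tokens across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in STOP_WORDS)
    return counts


def assign_categories(text, min_hits=2):
    """Return, sorted by name, categories whose keywords occur at least min_hits times."""
    counts = Counter(tokenize(text))
    matched = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(counts[k] for k in keywords)
        if hits >= min_hits:
            matched.append(category)
    return sorted(matched)


if __name__ == "__main__":
    abstract = "Mutations in daf-2 extend lifespan; longevity requires daf-16."
    print(assign_categories(abstract))
    print(word_frequencies([abstract]).most_common(3))
```

Keyword matching of this kind needs no training data, which is its main advantage over a supervised classifier. The frequency counts could in turn suggest candidate keywords for each category.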


Choosing a Curation Priority

Once we are satisfied with one or more categorization schemes, we may want to select a single category on which curators focus their efforts. Our criteria for choosing a category may depend on a number of factors:

  • Number of papers in each data type backlog
  • Number of papers in a category
  • Distribution of data types (or required curator effort) for papers in a category
  • Current representation of category topic in WormBase
    • Highly represented: we may want to "polish off" what we have to generate a complete picture
    • Poorly represented: may be low-hanging fruit for covering new topics
  • Current representation of gene function for genes represented in category
    • We could focus on genes with little or no known function
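
If we wanted to compare categories on several of these factors at once, one option is a simple weighted score. The sketch below is purely hypothetical: the fields of `CategoryStats`, the scoring formula, and the default weights are assumptions for illustration, not an agreed method, and the numbers in the usage example are invented.

```python
from dataclasses import dataclass


@dataclass
class CategoryStats:
    name: str
    papers: int                 # papers assigned to the category
    backlog_papers: int         # of those, papers still awaiting curation
    uncharacterized_genes: int  # genes with little or no known function


def priority_score(stats, w_backlog=1.0, w_genes=0.5):
    """Hypothetical score: backlog fraction plus weighted uncharacterized genes per paper."""
    if stats.papers == 0:
        return 0.0
    backlog_fraction = stats.backlog_papers / stats.papers
    gene_density = stats.uncharacterized_genes / stats.papers
    return w_backlog * backlog_fraction + w_genes * gene_density


def rank_categories(categories, **weights):
    """Sort categories from highest to lowest priority score."""
    return sorted(categories, key=lambda c: priority_score(c, **weights), reverse=True)


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    candidates = [
        CategoryStats("Category A", papers=100, backlog_papers=20, uncharacterized_genes=10),
        CategoryStats("Category B", papers=50, backlog_papers=40, uncharacterized_genes=5),
    ]
    for c in rank_categories(candidates):
        print(c.name, round(priority_score(c), 3))
```

Changing the weights is a quick way to see how sensitive the ranking is to each factor before committing to a category.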


Goals for a Curation Milestone

Some goals that have been discussed are:

  • Completing curation backlog for all data types for a given category
  • Completing curation of human disease-relevance for a given category
  • Generating (or filling out) a WormBase Process page and WikiPathway
  • Goals could be set for each curation upload (every ~two months)
  • We can post the results of the milestone on the WormBase homepage/blog


Distributing Curation Efforts