Processing Gene and Protein Names for Searches and Curation

From WormBaseWiki
Revision as of 17:59, 8 August 2011 by Vanaukenk (talk | contribs)

General Issue

Effective Textpresso searches, and subsequent curation, require the most complete lists of gene and protein names possible.

Model organism databases (MODs) can supply a list of the gene (and possibly protein) names in their databases, mapped to database gene IDs, but the literature often uses variants of those names that are not on the list.

Common variations include changes in case and the addition of species prefixes to gene or protein names, e.g. At (Arabidopsis thaliana) or Ce (Caenorhabditis elegans).
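As a rough illustration of these two kinds of variation, the sketch below turns a name as it might appear in a paper into lowercase lookup keys, once as written and once with a leading prefix removed. The function name candidate_keys and the prefix tuple are hypothetical and are not part of Textpresso; only the prefixes At and Ce come from the text above. Prefix stripping can over-generate (a name that merely happens to start with "At" is also stripped), so the keys are candidates to check against the MOD list, not confirmed matches.

```python
# Prefixes named in the text above; a real list would come from the curators.
SPECIES_PREFIXES = ("At", "Ce")


def candidate_keys(mention, prefixes=SPECIES_PREFIXES):
    """Return lowercase lookup keys for a gene/protein name found in a paper.

    Hypothetical helper: the first key is the mention itself, lowercased;
    further keys are the mention with a leading species prefix removed.
    """
    keys = [mention.lower()]
    for prefix in prefixes:
        # Strip only when something is left over after the prefix.
        if mention.startswith(prefix) and len(mention) > len(prefix):
            key = mention[len(prefix):].lower()
            if key not in keys:
                keys.append(key)
    return keys


if __name__ == "__main__":
    for mention in ("AtFT", "Ft", "FT"):
        print(mention, candidate_keys(mention))
```

Lowercasing both the MOD names and the literature mention before comparing is one simple way to absorb case differences; whether that is too permissive for a given organism's nomenclature is a curation decision.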

Steps for Gene and Protein Name Processing

  • Get a mappings file from the MOD or database that maps each database ID to a list of gene or protein names and synonyms

An example of what we got from TAIR:

AT1G65480 FT FLOWERING LOCUS T

  • Process the list of gene or protein names and synonyms to include variations in case and, where needed, strip prefixes
  • From the processed list, create a Textpresso category for searching
  • From the processed list, create an expanded mappings file for the CCC curation form
  • Determine a schedule for updating the list based on how often the MOD or database revises its gene or protein name list, e.g. daily, weekly, or with each new database release
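The processing steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the script actually used: it assumes the mappings file is tab-separated with the database ID in the first column and names or synonyms in the remaining columns (the TAIR example does not show the delimiter), and the function names and the prefix tuple are hypothetical.

```python
from collections import defaultdict

# Prefixes named earlier on this page; assumed here for illustration only.
PREFIXES = ("At", "Ce")


def parse_mapping_line(line):
    """Split 'ID<TAB>name<TAB>name...' into (database_id, [names]).

    Assumes tab-separated fields; empty fields are dropped.
    """
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    fields = [f for f in fields if f]
    return fields[0], fields[1:]


def expand_name(name, prefixes=PREFIXES):
    """Return the name plus its case variants and prefix-stripped forms."""
    variants = {name, name.lower(), name.upper()}
    for prefix in prefixes:
        # Strip only when something is left over after the prefix.
        if name.startswith(prefix) and len(name) > len(prefix):
            bare = name[len(prefix):]
            variants.update({bare, bare.lower(), bare.upper()})
    return variants


def build_expanded_mappings(lines):
    """Map each database ID to the expanded set of names and synonyms."""
    expanded = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        database_id, names = parse_mapping_line(line)
        for name in names:
            expanded[database_id].update(expand_name(name))
    return expanded


if __name__ == "__main__":
    sample = ["AT1G65480\tFT\tFLOWERING LOCUS T\n"]
    for database_id, names in build_expanded_mappings(sample).items():
        print(database_id, sorted(names))
```

The same expanded sets could feed both outputs: the union of all names for the Textpresso category, and the per-ID sets for the expanded mappings file. Short names produced by prefix stripping may collide with ordinary words or with other genes, so the expanded list would likely need review before use.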

Adding Synonyms

How does each of the MODs handle curation of gene and protein synonyms?

Are there efficient ways of adding synonyms and getting them propagated to the gene mappings file?

