Processing Gene and Protein Names for Searches and Curation

From WormBaseWiki
Revision as of 18:39, 17 August 2011 by Arunr (talk | contribs)

General Issue

Effective Textpresso searches, and subsequent curation, require the most complete lists of gene and protein names possible.

MODs can supply a list of gene (and possibly protein) names they have in their database, mapped to database gene IDs, but variations on gene and protein names can still appear in the literature.

Common variations include changes in case and the addition of species prefixes, e.g. At or Ce, to gene or protein names.

Steps for Gene and Protein Name Processing

Will the gene and protein name processing be done by each MOD or is that something we will do?

Based on my GSA experience: we need to work closely with each MOD to learn where to get their entities and how to filter them. For example, for WormBase we get all classes except Gene from acedb, and the Gene list comes from postgres. SGD and FlyBase provide files on the web, which our scripts download every week. For FlyBase, the download script has to work out the latest file dynamically from the current month and year. Each MOD has its own rules for filtering and case variations, which were decided by working closely with the MOD curators and testing the linking on a few test articles (as we are doing for GSA GO linking). --Arunr 18:39, 17 August 2011 (UTC)

If we did the processing:

  • Get a mappings file from the MOD or database that maps each database ID to a list of gene or protein names and synonyms

An example of what we got from TAIR's ftp site:

 locus_name    symbol    full_name
 AT1G65480     FT        FLOWERING LOCUS T

It would be best if this file always had a standard name, e.g. gene_aliases.latest.txt
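As a rough illustration, a mappings file like the TAIR example could be read into a dictionary keyed by database ID. This is only a sketch: it assumes the file is tab-delimited with a single header row and that every column after the first holds a name or synonym, which may not match what a given MOD actually provides. The function name parse_alias_file is hypothetical.

```python
import csv
import io


def parse_alias_file(handle):
    """Map each database ID to the list of names found on its rows.

    Assumes tab-delimited input with one header row (an assumption,
    not a documented MOD format).
    """
    mapping = {}
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        fields = [field.strip() for field in row]
        if not fields or not fields[0]:
            continue  # ignore blank lines and rows without an ID
        bucket = mapping.setdefault(fields[0], [])
        for name in fields[1:]:
            if name and name not in bucket:
                bucket.append(name)
    return mapping


# The sample row from the TAIR example above, written with tabs.
sample = "locus_name\tsymbol\tfull_name\nAT1G65480\tFT\tFLOWERING LOCUS T\n"
aliases = parse_alias_file(io.StringIO(sample))
print(aliases)
```

In practice the handle would be the downloaded file opened in text mode rather than an in-memory string.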

  • Process the list of gene or protein names and synonyms to include variations in case and, where needed, strip prefixes
  • From the processed list, create a Textpresso category for searching
  • From the processed list, create an expanded mappings file for the CCC curation form
 This processed file also needs a standard file name
  • Determine a schedule for updating the list based upon how often the MOD or database revises their gene or protein name list, e.g. daily, weekly, with each new database release, etc.
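The processing steps in this list might look something like the sketch below, which generates case variants and prefix-stripped forms for each name and then inverts the result into an expanded mapping from every variant to its database IDs. The prefix list, the choice of variants, and the function names are all assumptions for illustration; the real rules would need to be agreed with each MOD. The prefix stripping here is deliberately naive and would also shorten an ordinary name that happens to begin with "At" or "Ce".

```python
PREFIXES = ("At", "Ce")  # prefixes mentioned on this page; extend per MOD


def name_variants(name, prefixes=PREFIXES):
    """Return the name, its prefix-stripped forms, and case variants."""
    bases = {name}
    for prefix in prefixes:
        if name.startswith(prefix) and len(name) > len(prefix):
            # Also drop a joining hyphen, e.g. a hypothetical "Ce-xyz".
            bases.add(name[len(prefix):].lstrip("-"))
    variants = set()
    for base in bases:
        variants.update({base, base.lower(), base.upper()})
    variants.discard("")
    return variants


def expand_mapping(id_to_names):
    """Invert ID -> names into variant -> sorted list of IDs."""
    expanded = {}
    for gene_id, names in id_to_names.items():
        for name in names:
            for variant in name_variants(name):
                expanded.setdefault(variant, set()).add(gene_id)
    return {variant: sorted(ids) for variant, ids in expanded.items()}


# "AtFT" is a made-up prefixed synonym used only to exercise the code.
expanded = expand_mapping({"AT1G65480": ["FT", "AtFT"]})
print(sorted(expanded))
```

Keeping a list of IDs per variant means a name shared by two genes stays visible in the expanded file instead of silently overwriting one of them.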

Removing Problematic Gene Names

Some gene names generate many false-positive matches. For dicty, examples included the gene names 'actin', '2C', '3B', and '7E'.

These gene names can be removed from the list prior to processing.
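A minimal sketch of that removal step, assuming the list is held as a dictionary of database IDs to names. The blocked names are the dicty examples above; the IDs and the other gene name in the usage example are placeholders, and comparing names case-insensitively is an assumption rather than an agreed rule.

```python
PROBLEM_NAMES = {"actin", "2C", "3B", "7E"}  # dicty examples from this page


def remove_problem_names(id_to_names, blocked=PROBLEM_NAMES):
    """Drop blocked names; drop an ID entirely if no names remain."""
    blocked_lower = {name.lower() for name in blocked}
    cleaned = {}
    for gene_id, names in id_to_names.items():
        kept = [name for name in names if name.lower() not in blocked_lower]
        if kept:
            cleaned[gene_id] = kept
    return cleaned


# Placeholder IDs and names, for illustration only.
raw = {"ID_1": ["actin", "geneX"], "ID_2": ["2C"]}
cleaned = remove_problem_names(raw)
print(cleaned)
```

Running the filter before variant generation keeps the blocked names out of both the Textpresso category and the expanded mappings file.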

Adding Synonyms

How does each of the MODs handle curation of gene and protein synonyms?

Are there efficient ways of adding synonyms and getting them propagated to the gene mappings file?

