Processing Gene and Protein Names for Searches and Curation
Effective Textpresso searches, and subsequent curation, require the most complete lists of gene and protein names possible.
MODs can supply a list of gene (and possibly protein) names they have in their database, mapped to database gene IDs, but variations on gene and protein names can still appear in the literature.
Common variations include changes in case and additions of prefixes to gene or protein names, e.g. At or Ce.
In addition, some types of curation may require matching gene names only, others protein names only, and some both.
Steps for Gene and Protein Name Processing
Will the gene and protein name processing be done by each MOD or is that something we will do?
Based on my GSA experience: we need to work closely with each MOD to learn where to get their entities and how to filter them. For example, for WormBase, we get all classes except Gene from acedb, and the Gene list comes from postgres. SGD and FlyBase provide files on the web, which our scripts download every week. For FlyBase, the download script has to determine the latest file dynamically using the current month and year. Each MOD has its own rules for filtering and case variations, which were decided by working closely with the MOD curators and running the linking on a few test articles (as we are doing for GSA GO linking). --Arunr 18:39, 17 August 2011 (UTC)
If we did the processing:
- Get a mappings file from the MOD or database that maps each database ID to a list of gene or protein names and synonyms
An example of what we got from TAIR's ftp site:
locus_name symbol full_name
AT1G65480 FT FLOWERING LOCUS T
It would be best if this file always had a standard name, e.g. gene_aliases.latest.txt
- Process the list of gene or protein names and synonyms to include variations in case and, where needed, strip prefixes
- From processed list, create a Textpresso category for searching
- From processed list, create an expanded mappings file for the CCC curation form
Tanya suggested that, from next time, TAIR will do the prefix stripping and add the variations themselves. The CCC curation form also needs gene_aliases.latest.txt to populate its columns, so it is better to keep a single copy of this file, maintained by TAIR, rather than adding to or modifying it on our end and ending up with divergent copies. YLi 08/22/2011
We also need a standard file name for this processed file
- Determine a schedule for updating the list based upon how often the MOD or database revises their gene or protein name list, e.g. daily, weekly, with each new database release, etc.
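The processing steps above can be sketched as follows. This is a minimal illustration, not the actual Textpresso or MOD script: it assumes the mappings file is tab-separated with the database ID in the first column and names in the remaining columns, and the function names, the `PREFIXES` tuple, and the header check are all hypothetical. Real rules for case variation and prefix stripping would have to be agreed with each MOD.

```python
from collections import defaultdict

# Assumed species prefixes (taken from the "At" and "Ce" examples above).
PREFIXES = ("At", "Ce")


def name_variants(name, prefixes=PREFIXES):
    """Return the name plus simple case variants and a prefix-stripped form."""
    variants = {name, name.upper(), name.lower(), name.capitalize()}
    for prefix in prefixes:
        # Strip a species prefix only when at least two characters remain,
        # to avoid producing very short, ambiguous names.
        if name.startswith(prefix) and len(name) > len(prefix) + 1:
            variants.add(name[len(prefix):])
    return variants


def expand_mappings(lines, prefixes=PREFIXES):
    """Map every name variant to the set of database IDs it may refer to."""
    expanded = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed rows and the (assumed) header row.
        if len(fields) < 2 or fields[0] == "locus_name":
            continue
        locus_id = fields[0]
        for name in (f for f in fields[1:] if f):
            for variant in name_variants(name, prefixes):
                expanded[variant].add(locus_id)
    return expanded


if __name__ == "__main__":
    rows = [
        "locus_name\tsymbol\tfull_name",
        "AT1G65480\tFT\tFLOWERING LOCUS T",
    ]
    for variant, ids in sorted(expand_mappings(rows).items()):
        print(variant, sorted(ids))
```

The keys of the expanded dictionary could feed the Textpresso category, and the full variant-to-ID mapping could serve as the expanded mappings file for the curation form. Note that naive prefix stripping can create false positives for names that merely begin with the same letters as a prefix, so the stripped forms are worth reviewing with curators.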
Removing Problematic Gene Names
Some gene names may generate a lot of false positive returns. For dicty, examples of these included gene names 'actin', '2C', '3B', and '7E'.
These gene names can be removed from the list prior to processing.
In the GSA project, this is done with an exclusion list. Once the entities are downloaded and filtered using MOD-specific rules, the lexicon former script excludes any entity on the exclusion list. The exclusion list is easy to maintain, and entities can always be added to or deleted from it. --Arunr 18:48, 17 August 2011 (UTC)
TAIR has an exclusion list on the CCC wiki page too. YLi 08/22/2011
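As a rough sketch of the exclusion-list approach described above: the snippet below assumes the exclusion list is a plain text file with one name per line, with blank lines and `#` comment lines ignored. The file layout and the function names are assumptions for illustration; this is not the GSA lexicon former script.

```python
def load_exclusions(lines):
    """Read one excluded name per line, skipping blanks and '#' comments."""
    exclusions = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            exclusions.add(entry)
    return exclusions


def apply_exclusions(names, exclusions):
    """Drop excluded names (exact match), preserving the input order."""
    return [name for name in names if name not in exclusions]


if __name__ == "__main__":
    # The dicty examples from the text; "geneX" is a made-up placeholder.
    exclusions = load_exclusions(["# dicty false positives", "actin", "2C", "3B", "7E"])
    print(apply_exclusions(["actin", "geneX", "2C", "7E"], exclusions))
```

Matching here is exact and case-sensitive, so whether exclusion runs before or after case variants are generated changes the result. Excluding before processing, as suggested above, keeps the list short.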
How does each of the MODs handle curation of gene and protein synonyms?
Are there efficient ways of adding synonyms and getting them propagated to the gene mappings file?
I believe most MODs keep a list of synonyms for each entity, so we may be able to use them as static strings. If we look for a set of generic rules for all MODs and try pattern matching (like case-insensitive matching), we may run into difficulties with MODs like FlyBase. --Arunr 18:48, 17 August 2011 (UTC)
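To illustrate the concern about case-insensitive matching: Drosophila gene symbols are case-sensitive, so that, for example, Dl (Delta) and dl (dorsal) are different genes. A quick check like the hypothetical one below could flag symbols that would collide under case folding before a generic rule is applied to a MOD's list. The function name and the placeholder IDs are invented for the example.

```python
from collections import defaultdict


def case_collisions(symbol_to_id):
    """Return groups of symbols that differ only by case but map to different IDs."""
    groups = defaultdict(dict)
    for symbol, gene_id in symbol_to_id.items():
        groups[symbol.lower()][symbol] = gene_id
    # Keep only case-folded keys shared by more than one distinct gene ID.
    return {
        folded: members
        for folded, members in groups.items()
        if len(set(members.values())) > 1
    }


if __name__ == "__main__":
    # Placeholder IDs, not real FlyBase identifiers.
    symbols = {"Dl": "GENE_A", "dl": "GENE_B", "w": "GENE_C"}
    print(case_collisions(symbols))
```

If a MOD's list produces no collisions, case-insensitive matching may be safe for it; where collisions exist, using the curated synonyms as static strings avoids merging distinct genes.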