Difference between revisions of "Pipeline for identifying papers with disease or disease gene ortholog"

From WormBaseWiki

Revision as of 21:44, 7 May 2012

Aim of Project: To use Textpresso to identify C. elegans papers that describe either orthologs of human disease genes or a C. elegans model of a human disease.

This method uses one or more categories and keywords to identify sentences in the text of the paper.

Keywords:

1. The keywords ‘ortholog’, ‘homolog’, ‘similar’, ‘relate’ and ‘model’ and various forms of the words as they may occur in the following example phrases:

  • is the ortholog
  • is an ortholog
  • is orthologous
  • is homologous
  • is the homolog
  • is similar to

The script looks for the following patterns (a trailing asterisk denotes a wildcard):

  • ortholog*
  • homolog*
  • similar
  • relate*
  • model*
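The wildcard patterns above can be sketched as regular expressions. This is only an illustration in Python, not the Textpresso script itself; the names `KEYWORD_PATTERNS`, `pattern_to_regex` and `find_keywords` are hypothetical. It assumes that each keyword must start at a word boundary, which also keeps 'relate' from matching inside 'correlated' or 'unrelated'.

```python
import re

# Keyword patterns as listed above; a trailing "*" is a wildcard.
KEYWORD_PATTERNS = ["ortholog*", "homolog*", "similar", "relate*", "model*"]


def pattern_to_regex(pattern):
    """Turn e.g. 'ortholog*' into a regex fragment anchored at a word start.

    Assumption: "*" stands for any run of word characters, and a pattern
    without "*" must match a whole word.
    """
    if pattern.endswith("*"):
        return r"\b" + re.escape(pattern[:-1]) + r"\w*"
    return r"\b" + re.escape(pattern) + r"\b"


KEYWORD_RE = re.compile(
    "|".join(pattern_to_regex(p) for p in KEYWORD_PATTERNS),
    re.IGNORECASE,
)


def find_keywords(sentence):
    """Return every keyword occurrence found in the sentence, in order."""
    return KEYWORD_RE.findall(sentence)
```

Under these assumptions 'relate*' matches 'relate', 'related' and 'relates', but not 'relating' or 'relation', since those do not begin with the literal string 'relate'.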

2. The keywords ‘C. elegans’ or ‘elegans’

Example sentence: We used C. elegans as a model system for <human disease>, and other variations.

Categories:

Category 1: C. elegans gene or protein

Requirement: The gene (or protein) list needs to be kept up to date with the model organism database.

Category 2: Human disease

The following sources were used to build the lexicon for the human disease category:

1. Human disease list from Neuroscience - NIFSTD owl file for class NIF-Dysfunction located at http://ontology.neuinfo.org/NIF/Dysfunction/NIF-Dysfunction.owl

2. Human disease ontology file from http://www.obofoundry.org/cgi-bin/detail.cgi?id=disease_ontology

Pick ‘term names’ and ‘synonyms’ from this ontology. The more term variations the ‘master’ list has, the better it will be at picking up terms.

3. Textpresso category file ‘disease (h. sapiens)’

4. OMIM morbid map disease terms (the first column from ftp://ftp.ncbi.nih.gov/repository/OMIM/morbidmap)

Rules for Script:

Example: The term ‘Spinal Muscular Atrophy’ would have been missed because it exists in the Neuroscience list only as ‘Scapuloperoneal Spinal Muscular Atrophy’, but it can be found in the human disease ontology mentioned above. As of now we need the following variations for the disease terms (note the capitalization of different letters at the beginning of each word in the term):

  • Spinal Muscular Atrophy
  • Spinal muscular atrophy
  • spinal muscular atrophy (probably not used this way, but safe to include)

To deal with different forms of capitalization, a case-insensitive search is done. Plurals are formed using heuristics such as:

  • myopia -> myopias
  • atrophy -> atrophies

The Textpresso script cap+plural.pl handles this reasonably well.
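The capitalization and plural handling can be illustrated with a short Python sketch. It is not a port of cap+plural.pl, whose actual rules are not described here; the functions `pluralize` and `term_regex` are hypothetical, and the suffix rules are a guess at the kind of heuristic meant by the two examples above.

```python
import re


def pluralize(term):
    """Naive English plural of a term (heuristic only, not cap+plural.pl)."""
    if re.search(r"[^aeiou]y$", term):
        return term[:-1] + "ies"   # atrophy -> atrophies
    if re.search(r"(s|x|z|ch|sh)$", term):
        return term + "es"         # reflux -> refluxes
    return term + "s"              # myopia -> myopias


def term_regex(term):
    """Case-insensitive regex matching the singular or plural form of a term."""
    # Longest form first so the plural is preferred when both could match.
    forms = sorted({term, pluralize(term)}, key=len, reverse=True)
    body = "|".join(re.escape(form) for form in forms)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)
```

Because the compiled pattern ignores case, a single lexicon entry such as 'Spinal Muscular Atrophy' covers all three capitalization variants listed above.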


Exclusion list for Disease Lexicon:

These words are placed in an exclusion list. Removing these terms decreases the number of false positives for elegans literature. The exclusion list will be different for a different organism.

(Variations on capitalization and plural are included; check the plural handling with James.)

Agitation, Hyperactivity, Hypersensitivity, Infection, Amelia, Amended, Bends,
Confused, Confusion, Corn, Corns, Dependence, Fit, Hermaphroditism, Interferon,
Interferons, Intestinal, Intestine, Intestines, Longevities, Longevity, Orf,
Overdose, Paralysed, Paralysis, Prolapse, Recruitment, Reflux, Restlessness,
Rupture, Scar, Scarring, Starvation, Suppression, Tag, Tear, boil,
ectodermal, feminization, intestine, leanness, leannesses, longevities, longevity,
overdose, recruitment, rupture, scar, starvation, temp, trauma,
infertility, infertile, volvulus, ganglion, locally, fracture,
disorder, deformity, DNA fragmentation, plaque, roundworm, cysts,
anatomical abnormality, unconscious, tumor morphology,
vaccinia, DNA damage, convulsions, seizure, seizures,
hyperactive behavior, complex I, complex II, acyl-CoA dehydrogenases,
disease


Other Exclusions: In addition to the above terms, exclude:

  • any entry with a string length of 4 or less
  • any entry that is a C. elegans gene name, i.e. that matches the regex [a-z]{3}-\d+
  • words inside brackets, which are generally thrown away, e.g. (disorder), (disease), [disorder], [disease]
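A minimal Python sketch of these lexicon-cleaning rules follows. The function `clean_lexicon` and its arguments are hypothetical. It makes three assumptions that the text does not settle: bracketed qualifiers are stripped before the length check, the gene regex must match the whole entry, and the exclusion list is compared case-insensitively.

```python
import re

GENE_NAME_RE = re.compile(r"[a-z]{3}-\d+")             # regex quoted in the text
BRACKETED_RE = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]")  # "(disorder)", "[disease]", ...


def clean_lexicon(entries, exclusion_list):
    """Apply the exclusion rules to raw disease-lexicon entries."""
    excluded = {term.lower() for term in exclusion_list}
    kept = []
    for entry in entries:
        # Throw away bracketed qualifiers such as "(disorder)".
        term = BRACKETED_RE.sub("", entry).strip()
        if len(term) <= 4:
            continue  # string length 4 or less
        if GENE_NAME_RE.fullmatch(term):
            continue  # looks like a C. elegans gene name
        if term.lower() in excluded:
            continue  # on the exclusion list
        kept.append(term)
    return kept
```

For example, an entry such as 'Parkinson disease (disorder)' would be kept as 'Parkinson disease', while 'gout' (too short) and 'unc-22' (gene-like) would be dropped.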

Sentences with these types of phrases are excluded:

  • Cancer Research Center
  • Cancer Center
  • Center for Childhood Diabetes
  • Center for Diabetes

These phrases are excluded:

  • term followed by 'cells' or 'cell line'

[e.g., Similarly, ectopic expression of CED-4 in 293T human embryonic kidney cells and MCF7 breast carcinoma cells induced rapid apoptosis (Fig. 1), even though a proteolytic activity has not been ascribed to CED-4.]

  • term followed by 'toxin', e.g., diphtheria toxin
  • term followed by 'virus'
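The "term followed by" rules can be expressed with a negative lookahead. The Python sketch below is an illustration only; `disease_mentions` is a hypothetical helper, and matching the follower words as prefixes (so that 'cell lines' and 'viruses' are also caught) is an assumption rather than something stated above.

```python
import re

# Words that, when they directly follow a disease term, mark a non-disease use.
# Matched as prefixes (assumption), so "cell lines" and "toxins" are covered too.
EXCLUDED_FOLLOWERS = r"(?:cells|cell\s+line|toxin|virus)"


def disease_mentions(sentence, terms):
    """Return the terms found in the sentence that are not followed by an excluded word."""
    hits = []
    for term in terms:
        pattern = re.compile(
            r"\b" + re.escape(term) + r"\b(?!\s+" + EXCLUDED_FOLLOWERS + r")",
            re.IGNORECASE,
        )
        if pattern.search(sentence):
            hits.append(term)
    return hits
```

A term is still reported if it occurs elsewhere in the same sentence without one of the follower words, since only the individual occurrence is rejected.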

These are also excluded:

1. Articles of type: ‘meeting_abstract’ and 'Congresses'

2. Supplementary material: Doc IDs of the form WBPaper00037683.sup.1

3. ‘Materials and Methods’ section of paper

4. When matching sentences come from the ‘References’ section, classify the paper into a separate group so that it is easier for the curator to go through. The script has a default mode, in which it searches for relevant sentences in all fields other than references, and a references mode, in which it runs only on sentences from references.

5. Whole-genome papers, e.g., Doc ID WBPaper00030997, Title: Draft genome of the filarial nematode parasite Brugia malayi. This is implemented by excluding lines that are very long. Currently, I have set this limit to 250 words, i.e. if a line has more than 250 words, it is excluded.

6. Keywords inside longer words should not be picked up: 'relate' in 'correlated', 'model' in 'remodelling', 'relate' in 'unrelated'.
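Exclusions 1 to 5 amount to a per-line filter, which could be sketched in Python as follows. The function `keep_line`, its argument names, and the plain-string section labels are all hypothetical; how the real script represents article types and sections is not described here.

```python
MAX_WORDS = 250  # limit quoted in the text
EXCLUDED_TYPES = {"meeting_abstract", "Congresses"}


def is_supplementary(doc_id):
    """Supplementary material has Doc IDs such as WBPaper00037683.sup.1."""
    return ".sup." in doc_id


def keep_line(doc_id, article_type, section, line, references_mode=False):
    """Decide whether a line of text should be searched at all."""
    if article_type in EXCLUDED_TYPES or is_supplementary(doc_id):
        return False
    if section == "Materials and Methods":
        return False
    # Default mode skips references; references mode keeps only references.
    if (section == "References") != references_mode:
        return False
    # Very long lines are treated as whole-genome listings and dropped.
    return len(line.split()) <= MAX_WORDS
```

Running the filter twice, once with `references_mode=False` and once with `references_mode=True`, would give the two separate groups of papers described in item 4.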


Other strategies:

1. Used the OMIM morbid map to extract the human gene list: ftp://ftp.ncbi.nih.gov/repository/OMIM/morbidmap A preliminary run was done with the human gene list along with the disease lexicon to extract sentences that have both a human disease and a human gene. This method resulted in a large number of false positives. For now, this strategy will not be pursued further.

Precision:

Out of 100 papers, there were 29 false positives, putting the precision at 71%.

Recall:

Out of 100 random papers selected from the elegans literature, there were 2 papers that the script missed. (Note that these papers belong to the second class of papers of interest, in which the ‘elegans gene’ category and the keywords ‘ortholog’ and ‘homolog’ are missing.)