Difference between revisions of "Pipeline for identifying papers with disease or disease gene ortholog"
Revision as of 21:05, 9 September 2013
Method
Aim of Project: To use Textpresso to identify C. elegans papers that describe either orthologs of human disease genes or a model for the disease in C. elegans.
This method uses one or more categories and keywords to identify sentences in the text of the paper.
Keywords
1. The keywords ‘ortholog’, ‘homolog’, ‘similar’, ‘relate’ and ‘model’ and various forms of the words as they may occur in the following example phrases:
- is the ortholog
- is an ortholog
- is orthologous
- is homologous
- is the homolog
- is similar to
Script looks for (an asterisk at the end denotes wild-card):
- ortholog*
- homolog*
- similar
- relate*
- model*
2. The keywords ‘C. elegans’ or ‘elegans’
Example sentence: We used C. elegans as a model system for <human disease>, and other variations.
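The keyword patterns above can be sketched as regular expressions. This is a minimal illustration, not the code of extractRelevantSent.pl: the names KEYWORD_RE, ORGANISM_RE, find_keywords and mentions_elegans are hypothetical, and the leading word boundary is an assumption made so that words such as 'correlated' or 'remodelling' do not match.

```python
import re

# Wildcard stems from the list above; "similar" carries no wildcard.
# The leading \b keeps stems inside longer words (e.g. "correlated") from matching.
KEYWORD_RE = re.compile(
    r"\b(?:ortholog\w*|homolog\w*|similar|relate\w*|model\w*)\b",
    re.IGNORECASE,
)

# Matches "elegans" alone or preceded by "C." (assumed spelling variants).
ORGANISM_RE = re.compile(r"\b(?:C\.\s*)?elegans\b", re.IGNORECASE)


def find_keywords(sentence):
    """Return every keyword hit in the sentence, in order of appearance."""
    return [m.group(0) for m in KEYWORD_RE.finditer(sentence)]


def mentions_elegans(sentence):
    """True if the sentence names C. elegans."""
    return ORGANISM_RE.search(sentence) is not None
```

A sentence would then qualify on the keyword side only if find_keywords returns at least one hit; how this is combined with the gene and disease categories is left to the script.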
Categories
Category 1: C. elegans gene or protein
Requirement: The gene (or protein list) needs to be kept up to date with the model organism database.
Category 2: Human disease
The following sources were used to build the lexicon for the human disease category:
1. http://www.berkeleybop.org/ontologies/doid.obo
2. Human disease list from Neuroscience - NIFSTD owl file for class NIF-Dysfunction located at http://ontology.neuinfo.org/NIF/Dysfunction/NIF-Dysfunction.owl
Pick ‘term names’ and ‘synonyms’ from this ontology. The more term variations that the ‘master’ list has the better it will be for picking up terms.
3. Textpresso category file ‘disease (h. sapiens)’
4. OMIM morbid map disease terms (the first column from ftp://ftp.ncbi.nih.gov/repository/OMIM/morbidmap)
Rules for Script
Example: The term ‘Spinal Muscular Atrophy’ would have been missed because the Neuroscience list contains it only as ‘Scapuloperoneal Spinal Muscular Atrophy’; it could, however, be found in the human disease ontology mentioned above. As of now we need the following variations for the disease terms (note the different capitalization of the first letter of each word in the term):
Spinal Muscular Atrophy
Spinal muscular atrophy
spinal muscular atrophy (probably not used this way but safe to include)
To deal with different forms of capitalization, a case-insensitive search is done. Plurals are formed using heuristics such as myopia -> myopias and atrophy -> atrophies. The Textpresso script cap+plural.pl handles this reasonably well.
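A minimal sketch of the case-insensitive search and the plural heuristic follows. It is not a port of cap+plural.pl; the function names plural_forms and build_term_regex and the exact suffix rules are assumptions chosen to reproduce the two examples given (myopia -> myopias, atrophy -> atrophies).

```python
import re


def plural_forms(term):
    """Return the term plus one heuristic plural of its last word."""
    if term.endswith("y") and len(term) > 1 and term[-2].lower() not in "aeiou":
        plural = term[:-1] + "ies"        # atrophy -> atrophies
    elif term.endswith(("s", "x", "z", "ch", "sh")):
        plural = term + "es"              # assumed rule for sibilant endings
    else:
        plural = term + "s"               # myopia -> myopias
    return [term, plural]


def build_term_regex(term):
    """Compile a case-insensitive pattern matching the term or its plural."""
    alternatives = "|".join(re.escape(form) for form in plural_forms(term))
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
```

With a case-insensitive pattern, the three capitalization variants listed above collapse into a single lexicon entry.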
Exclusion list for disease lexicon
Exclusion list file is at: /data1/Users/liyuling/Curator_related/disease/exclusions/disease.txt
The words below are placed in an exclusion list. Removing these terms decreases the number of false positives for the elegans literature. The exclusion list will be different for a different organism.
Variations in capitalization and plurals are included (check this with James):
Agitation Hyperactivity Hypersensitivity Infection Amelia Amended Bends Confused Confusion Corn Corns Dependence Fit Hermaphroditism Hyperactivity Hypersensitivity Infection Interferon Interferons Intestinal Intestine Intestines Longevities Longevity Orf Overdose Paralysed Paralysis Prolapse Recruitment Reflux Restlessness Rupture Scar Scarring Starvation Suppression Tag Tear boil ectodermal feminization intestine leanness leannesses longevities longevity overdose recruitment rupture scar starvation temp trauma infertility infertile volvulus ganglion locally fracture disorder deformity DNA fragmentation plaque roundworm cysts anatomical abnormality unconscious tumor morphology vaccinia DNA damage convulsions seizure seizures hyperactive behavior complex I complex II acyl-CoA dehydrogenases disease
Other Exclusions:
In addition to the above terms, exclude
- any entry with string length 4 or less
- any entry that is a C. elegans gene i.e. that matches the regex [a-z]{3}-\d+
- generally words inside brackets will be thrown away:
(disorder) (disease) [disorder] [disease]
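The exclusion rules above could be applied to the lexicon roughly as follows. This is a sketch under stated assumptions: clean_lexicon is a hypothetical helper, the gene regex is applied to the whole entry, and the bracketed qualifier is stripped before the length check.

```python
import re

GENE_RE = re.compile(r"[a-z]{3}-\d+")   # regex quoted in the text above
# Bracketed qualifiers that are thrown away: (disorder) (disease) [disorder] [disease]
BRACKETED_RE = re.compile(r"\s*[\(\[](?:disorder|disease)[\)\]]", re.IGNORECASE)


def clean_lexicon(entries, exclusion_list):
    """Drop excluded, too-short and gene-like entries; strip bracketed qualifiers."""
    excluded = {word.lower() for word in exclusion_list}
    kept = []
    for entry in entries:
        term = BRACKETED_RE.sub("", entry).strip()
        if len(term) <= 4:                 # string length 4 or less
            continue
        if GENE_RE.fullmatch(term):        # looks like a C. elegans gene name
            continue
        if term.lower() in excluded:       # on the curated exclusion list
            continue
        kept.append(term)
    return kept
```

Matching the exclusion list in lower case means the capitalization variants in the list need not be enumerated in code.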
Sentences with these types of phrases are excluded:
- Cancer Research Center
- Cancer Center
- Center for Childhood Diabetes
- Center for Diabetes
These phrases are excluded:
- term followed by 'cells' or 'cell line'
[e.g., Similarly , ectopic expression of CED-4 in 293T human embryonic kidney cells and MCF7 breast carcinoma cells induced rapid apoptosis ( Fig . 1 ) , even though a proteolytic activity has not been ascribed to CED-4 .]
- term followed by 'toxin', e.g., diphtheria toxin
- term followed by 'virus'
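The sentence and phrase exclusions above can be sketched with a negative lookahead. The helper names are hypothetical, and matching the followers as prefixes (so that 'cell lines' or 'viruses' are also caught) is an assumption rather than documented behaviour of the script.

```python
import re

# Sentences containing these institution names are skipped entirely.
INSTITUTION_PHRASES = [
    "Cancer Research Center",
    "Cancer Center",
    "Center for Childhood Diabetes",
    "Center for Diabetes",
]

# A disease term directly followed by one of these words is not counted.
# Prefix match: "cell line" also covers "cell lines", "virus" covers "viruses".
FOLLOWERS = r"(?:cells|cell\s+line|toxin|virus)"


def is_institution_sentence(sentence):
    lowered = sentence.lower()
    return any(phrase.lower() in lowered for phrase in INSTITUTION_PHRASES)


def disease_mentions(sentence, term):
    """Return hits for the term that survive the exclusion rules."""
    if is_institution_sentence(sentence):
        return []
    pattern = re.compile(
        r"\b" + re.escape(term) + r"\b(?!\s+" + FOLLOWERS + r")",
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(sentence)]
```

For the example sentence above, 'breast carcinoma' is followed by 'cells', so the function returns no hits for it.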
These are also excluded:
1. Articles of type: ‘meeting_abstract’ and 'Congresses'
2. Supplementary material: Doc IDs of the form WBPaper00037683.sup.1
3. ‘Materials and Methods’ section of paper
4. When matching sentences come from the ‘References’ section, the paper is classified into a separate group so that it is easier for the curator to go through. --The script has a default mode, in which it searches for relevant sentences in all sections other than References, and a references mode, in which it runs only on sentences from References.
5. Whole-genome papers, e.g., Doc ID WBPaper00030997, Title: Draft genome of the filarial nematode parasite Brugia malayi. --Implemented by excluding lines that are very long. The limit is currently set to 250 words, i.e. a line with more than 250 words is excluded.
6. Should not be picked up: ‘relate’ in 'correlated', ‘model’ in 'remodelling', ‘relate’ in 'unrelated'.
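Rules 1 to 5 above amount to a per-sentence filter, sketched below. The function keep_sentence, its arguments and the literal section and type strings are assumptions about how the corpus metadata might be exposed; only the 250-word limit, the '.sup.' ID form and the two modes come from the text.

```python
import re

EXCLUDED_TYPES = {"meeting_abstract", "Congresses"}
EXCLUDED_SECTIONS = {"Materials and Methods"}
SUPPLEMENT_RE = re.compile(r"^WBPaper\d+\.sup\.\d+$")   # e.g. WBPaper00037683.sup.1
MAX_WORDS = 250   # longer lines are treated as whole-genome listings


def keep_sentence(doc_id, article_type, section, sentence, mode="default"):
    """Decide whether a sentence should be searched in the given mode."""
    if article_type in EXCLUDED_TYPES:
        return False
    if SUPPLEMENT_RE.match(doc_id):
        return False
    if section in EXCLUDED_SECTIONS:
        return False
    if len(sentence.split()) > MAX_WORDS:
        return False
    in_references = section == "References"
    # Default mode skips References; references mode looks only there.
    return in_references if mode == "references" else not in_references
```

Rule 6 is a property of the keyword patterns rather than of this filter; anchoring each keyword at a word boundary is one way to avoid hits inside 'correlated', 'remodelling' and 'unrelated'.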
Sources for disease lexicon, precision and recall
1. The OMIM morbid map (ftp://ftp.ncbi.nih.gov/repository/OMIM/morbidmap) was used to extract the human gene list. A preliminary run was done with the human gene list along with the disease lexicon to extract sentences that have both a human disease and a human gene. This method resulted in a large number of false positives, so for now this strategy will not be pursued further.
Precision:
Out of 100 papers, there were 29 false positives, putting the precision at 71%.
Recall:
Out of 100 random papers selected from the elegans literature, the script missed 2. (Note that these papers belong to the second class of papers of interest, in which either the category ‘elegans gene’ or the keywords ‘ortholog’, ‘homolog’ are missing.)
Location of scripts and files on Textpresso
1. extractRelevantSent.pl: This is the main script; it works with the keywords, the categories and the exclusion list (82 words). Location: /data1/Users/liyuling/Curator_related/disease on Textpresso dev
2. getRelevantDocids.pl: Picks up all the positive IDs and writes them to a text file called positive_docids.txt. We have a total of 412 papers as of Aug 6th, 2013.
Location: /data1/Users/liyuling/Curator_related/disease/output on Textpresso dev
3. Lexicon files are at: /data1/Users/liyuling/Curator_related/disease/lexicon
The script runs on the 6th of every month and e-mails the curator the location of the files:
From: Hans-Michael Muller
Subject: Auto-email (textpresso):new disease ortholog results available
Date: May 6, 2012 2:26:11 AM PDT
To: Ranjana Kishore
This is an automatic email sent to you by the textpresso script that runs every month
Results for new papers are available at http://textpresso-dev.caltech.edu/disease/output/5-6-2012.html .
Old results can be accessed at http://textpresso-dev.caltech.edu/disease/output/ .
All the webpages that map to the output are also in this directory, /output
To Do, Aug 2013
- Need to add terms and synonyms from the Disease Ontology (DO), source of file: OBO-foundry: http://www.berkeleybop.org/ontologies/doid.obo
- Use Name and Synonym from this file
For Synonym:
--Take everything within double quotes, match pattern without double quotes
--Ignore everything after the last double quotes
--If words are comma-separated, ignore the word after the comma
--For the word (disorder) remove parentheses
--Ignore '(morphologic abnormality)'
- In the Aug run, I see 2 Worm Breeder's Gazette articles. Meeting abstracts and Worm Breeder's Gazette articles should be removed from the corpus that the script works on. Gazette articles can be recognized by the tag 'Type Gazette_article', e.g. WBPaper00014072
- We need to adopt a standard color scheme for terms:
I suggest genes = green
"elegans", C. elegans = red
disease term = blue
keywords ‘ortholog’, ‘homolog’, ‘similar’, ‘relate’ and ‘model’ = purple or brown?
- A new run on the entire elegans corpus will be done and the papers will be categorized by year.
- Compare this list with the current positive ids (412) and generate a list of missing positive IDs.
1. A first run with the DO additions to the disease lexicon gave too many false positives because of the following terms, which come from the Synonym field of the obo file:
can as was or als total defect deficiency flu mec ad cm
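The synonym rules in the To Do list, together with the noisy terms just listed, could be sketched as below. This assumes OBO synonym lines of the usual form synonym: "text" SCOPE [xrefs]. The helper names are hypothetical, and the reading of the '(disorder)' rule (keep the word, drop the parentheses) is an assumption, as is treating the noisy terms as a stop list.

```python
# Terms reported above as causing false positives after the first DO run.
NOISY_SYNONYMS = {
    "can", "as", "was", "or", "als", "total",
    "defect", "deficiency", "flu", "mec", "ad", "cm",
}


def parse_synonym(line):
    """Extract a lexicon term from one OBO 'synonym:' line, or return None."""
    if not line.startswith("synonym:"):
        return None
    first, last = line.find('"'), line.rfind('"')
    if first == -1 or first == last:
        return None
    term = line[first + 1:last]            # text between the quotes; rest ignored
    term = term.replace("(morphologic abnormality)", "")
    term = term.replace("(disorder)", "disorder")   # drop parentheses only
    term = term.split(",")[0]              # ignore what follows a comma
    term = " ".join(term.split())          # normalise whitespace
    return term or None


def extract_synonyms(obo_lines):
    """Collect cleaned synonyms, skipping the known noisy ones."""
    terms = []
    for line in obo_lines:
        term = parse_synonym(line.strip())
        if term and term.lower() not in NOISY_SYNONYMS:
            terms.append(term)
    return terms
```

Several of the noisy terms are also four characters or shorter, so the length rule under Other Exclusions would remove them as well; the explicit stop list covers the longer ones such as 'deficiency'.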