Pipeline for identifying papers with disease or disease gene ortholog

From WormBaseWiki
Revision as of 05:17, 25 September 2014 by Rkishore (talk | contribs) (→‎Number of Papers flagged and curated for disease)


Aim of Project: To use Textpresso to identify C. elegans papers that describe either orthologs of human disease genes or a C. elegans model of a human disease.

This method uses one or more categories and keywords to identify sentences in the text of the paper.


1. The keywords ‘ortholog’, ‘homolog’, ‘similar’, ‘relate’ and ‘model’ and various forms of the words as they may occur in the following example phrases:

  • is the ortholog
  • is an ortholog
  • is orthologous
  • is homologous
  • is the homolog
  • is similar to

Script looks for (an asterisk at the end denotes wild-card):

  • ortholog*
  • homolog*
  • similar
  • relate*
  • model*

2. The keywords ‘C. elegans’ or ‘elegans’

Example sentence: We used C. elegans as a model system for <human disease>, and other variations.
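The keyword rules above can be sketched as regular expressions. This is a minimal Python illustration, not the pipeline's own Perl script; the names `KEYWORD_RE`, `ORGANISM_RE`, `keyword_hits` and `mentions_elegans` are hypothetical. Anchoring each prefix at a word boundary is an assumption that happens to keep 'relate' from matching inside 'correlated' or 'unrelated', and 'model' from matching inside 'remodelling'.

```python
import re

# Prefix wildcards from the list above; "similar" carries no wildcard in the source.
# The leading \b is an assumption: it stops "relate" matching inside "correlated".
KEYWORD_RE = re.compile(
    r"\b(?:ortholog\w*|homolog\w*|similar|relate\w*|model\w*)\b",
    re.IGNORECASE,
)

# 'C. elegans' or plain 'elegans', case-insensitive.
ORGANISM_RE = re.compile(r"\b(?:C\.\s*elegans|elegans)\b", re.IGNORECASE)


def keyword_hits(sentence):
    """Return every keyword form found in the sentence, in order."""
    return [m.group(0) for m in KEYWORD_RE.finditer(sentence)]


def mentions_elegans(sentence):
    """True if the sentence names the organism."""
    return ORGANISM_RE.search(sentence) is not None
```

A sentence would then be a candidate only if both `keyword_hits` and `mentions_elegans` succeed, together with the category matches described below.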


Category 1: C. elegans gene or protein

Requirement: The gene (or protein list) needs to be kept up to date with the model organism database.

Category 2: Human disease

The following sources were used to build the lexicon for the human disease category:

1. http://www.berkeleybop.org/ontologies/doid.obo

2. Human disease list from Neuroscience - NIFSTD owl file for class NIF-Dysfunction located at http://ontology.neuinfo.org/NIF/Dysfunction/NIF-Dysfunction.owl

Pick ‘term names’ and ‘synonyms’ from this ontology. The more term variations that the ‘master’ list has the better it will be for picking up terms.

3. Textpresso category file ‘disease (h. sapiens)’

4. OMIM morbid map disease terms (the first column from ftp://ftp.ncbi.nih.gov/repository/OMIM/morbidmap)

Rules for Script

Example: The term ‘Spinal muscular Atrophy’ would have been missed because it exists in the Neuroscience list as ‘Scapuloperoneal Spinal Muscular Atrophy’ but could be found in the human disease ontology mentioned above. As of now we need the following variations for the disease terms (note the capitalization of different letters at the beginning of each word in the term):

Spinal Muscular Atrophy
Spinal muscular atrophy
spinal muscular atrophy (probably not used this way but safe to include)

To deal with different forms of capitalization, a case-insensitive search is done. Plurals are formed by using heuristics like:

myopia -> myopias
atrophy -> atrophies

The Textpresso script cap+plural.pl handles this reasonably well.
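As a rough illustration of the capitalization and plural handling, the sketch below builds a case-insensitive pattern for a term and its plural. It is written in Python and covers only the two heuristics given as examples; the actual rules inside cap+plural.pl are not documented here, and `pluralize` and `term_pattern` are hypothetical names.

```python
import re


def pluralize(term):
    """Naive plural: consonant + 'y' -> 'ies', otherwise append 's'."""
    if re.search(r"[^aeiou]y$", term, re.IGNORECASE):
        return term[:-1] + "ies"   # atrophy -> atrophies
    return term + "s"              # myopia -> myopias


def term_pattern(term):
    """Case-insensitive pattern matching the term or its naive plural."""
    # Longest variant first so the plural is preferred when both could match.
    variants = sorted({term, pluralize(term)}, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in variants)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
```

With a case-insensitive pattern, a single lexicon entry such as 'Spinal muscular atrophy' covers all three capitalization variants listed above.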

Additional filter to capture only disease model papers, May 2014

This filter will be run on the papers already flagged by the disease pipeline.

1. Search for papers that have all three in one sentence

  • the word 'C. elegans' (all variations) or 'worm'
  • the word 'model'
  • any disease from the disease category

The gene category is optional.

2. Search for papers that have the above three within a range of +5 and -5 sentences, i.e., identify the paragraph.
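The two searches differ only in window size, which the following Python sketch makes explicit. It assumes the paper is already split into sentences and reads "+5 and -5 sentences" as a window centred on each sentence in turn; that reading, the substring matching of disease terms, and all function names are assumptions.

```python
import re

# Organism and keyword patterns for the disease-model filter (assumed forms).
ORGANISM_RE = re.compile(r"\b(?:C\.\s*elegans|elegans|worm)\b", re.IGNORECASE)
MODEL_RE = re.compile(r"\bmodel\w*", re.IGNORECASE)


def sentence_flags(sentence, disease_terms):
    """Return (has_organism, has_model, has_disease) for one sentence."""
    lowered = sentence.lower()
    return (
        ORGANISM_RE.search(sentence) is not None,
        MODEL_RE.search(sentence) is not None,
        any(term.lower() in lowered for term in disease_terms),
    )


def has_cooccurrence(sentences, disease_terms, window=0):
    """True if all three elements occur within `window` sentences of some sentence.

    window=0 is the one-sentence search; window=5 is the +/-5 sentence search.
    """
    flags = [sentence_flags(s, disease_terms) for s in sentences]
    for i in range(len(flags)):
        span = flags[max(0, i - window): i + window + 1]
        if all(any(f[k] for f in span) for k in range(3)):
            return True
    return False
```

Because any one-sentence hit also satisfies the wider window, the five-sentence result set always contains the one-sentence set.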

Results of a first run: from the 979 disease result papers we have up to now, with the criteria “elegans” + “model” + [some disease] ('worm' and 'nematode' not included):

If within one sentence: 287 papers are hit

If within five sentences range(+- 5 sentences): 681 papers are hit

Tasks: Rerun script including the 'worm' or 'nematode'.

Compare these papers with the papers in the Disease OA tables that have already been curated.

The table names are dis_paperexpmod and dis_paperdisrel. Both are multivalue tables, so compare against all values.

We need: Unique set of papers from TP pipeline=

Unique set of papers from Postgres disease data=

Overlap between the two=

Papers missed by TP pipeline (but in the OA)=

Papers missed by curator (not in OA, but in TP pipeline)= (either true or false positives)
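The five quantities listed above reduce to set operations once both ID lists are in hand. The Python sketch below assumes the curated IDs have already been pulled from both Postgres tables and merged into one list; `compare_paper_sets` and its key names are hypothetical.

```python
def compare_paper_sets(pipeline_ids, curated_ids):
    """Compare Textpresso (TP) pipeline paper IDs with curated OA paper IDs."""
    pipeline = set(pipeline_ids)   # de-duplicate pipeline hits
    curated = set(curated_ids)     # de-duplicate values from both multivalue tables
    return {
        "pipeline_count": len(pipeline),
        "curated_count": len(curated),
        "overlap": sorted(pipeline & curated),
        "missed_by_pipeline": sorted(curated - pipeline),  # in OA, not in TP
        "missed_by_curator": sorted(pipeline - curated),   # in TP, not in OA
    }
```

The 'missed_by_curator' list still needs manual review, since it mixes true and false positives.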

From Yuling's email, May 20th, 2014


one_sentence_hit is within one-sentence hit: 308 hits by adding “nematode” and “worm”, from 287

five_sentence_hit is within five-sentence hit: 692 hits by adding “nematode” and “worm”, from 681

pipeline_unique_one means “unique IDs from pipeline by looking at one sentence only”, 258 ids

pipeline_unique_five means “unique IDs from pipeline by looking at five sentence range”, 610 ids

postgres_unique_one means “unique IDs from curated forms compared with one-sentence only” 183 ids

postgres_unique_five means “unique IDs from curated forms compared with five sentence range”, 151 ids

pp_postgres_overlap_one means “overlap IDs from curated forms with one-sentence only” 50 ids

pp_postgres_overlap_five means “overlap IDs from curated forms with five sentence range” 82 ids.

Exclusion list for disease lexicon

Exclusion list file is at: /data1/Users/liyuling/Curator_related/disease/exclusions/disease.txt

These words are placed in an exclusion list. Removing these terms decreases the number of false positives for elegans literature. The exclusion list will be different for a different organism.

(variations on capitalization and plural (check this with James) included):

Agitation         Hyperactivity      Hypersensitivity          
Infection                            Amelia                 Amended     Bends              
Confused          Confusion          Corn                   Corns 
Dependence        Fit
Hermaphroditism   Hyperactivity      Hypersensitivity       Infection   Interferon   
Interferons       Intestinal         Intestine 
Intestines        Longevities        Longevity              Orf         Overdose   Paralysed  Paralysis 
Prolapse          Recruitment
Reflux            Restlessness       Rupture                Scar        Scarring 
Starvation        Suppression        Tag                    Tear        boil
ectodermal        feminization       intestine              leanness    leannesses 
longevities       longevity
overdose          recruitment        rupture                scar        starvation   temp trauma 
infertility       infertile          volvulus               ganglion    locally      fracture  
disorder          deformity          DNA fragmentation      plaque      roundworm    cysts       
anatomical        abnormality        unconscious            tumor       morphology
vaccinia          DNA damage         convulsions            seizure     seizures
hyperactive       behavior           complex I              complex II  acyl-CoA dehydrogenases 

Other Exclusions: In addition to the above terms, exclude

  • any entry with string length 4 or less
  • any entry that is a C. elegans gene i.e. that matches the regex [a-z]{3}-\d+
  • generally words inside brackets will be thrown away:

(disorder) (disease) [disorder] [disease]
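These three rules can be applied to each lexicon entry before matching. The Python sketch below is one possible reading: it strips only the four bracketed words shown above, and it treats an entry as a gene name only when the whole entry matches the regex. Both choices are assumptions, as is the name `clean_entry`.

```python
import re

# Gene-name pattern quoted in the rule above.
GENE_RE = re.compile(r"[a-z]{3}-\d+")
# The four bracketed words listed above, with any preceding whitespace.
BRACKETED_RE = re.compile(r"\s*[\(\[](?:disorder|disease)[\)\]]", re.IGNORECASE)


def clean_entry(entry):
    """Return the cleaned lexicon entry, or None if it should be excluded."""
    stripped = BRACKETED_RE.sub("", entry).strip()
    if len(stripped) <= 4:            # string length 4 or less
        return None
    if GENE_RE.fullmatch(stripped):   # looks like a C. elegans gene name
        return None
    return stripped
```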

Sentences with these types of phrases are excluded:

  • Cancer Research Center
  • Cancer Center
  • Center for Childhood Diabetes
  • Center for Diabetes

These phrases are excluded:

  • term followed by 'cells' or 'cell line'

[e.g., Similarly, ectopic expression of CED-4 in 293T human embryonic kidney cells and MCF7 breast carcinoma cells induced rapid apoptosis (Fig. 1), even though a proteolytic activity has not been ascribed to CED-4.]

  • term followed by 'toxin' eg., diphtheria toxin
  • term followed by 'virus'
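One way to express the 'term followed by' exclusions is a negative lookahead after the disease term. The Python sketch below is illustrative only; accepting the plural 'cell lines' is an assumption, and `term_hits` is a hypothetical name.

```python
import re

# Words that disqualify a disease term when they directly follow it.
FOLLOWERS = r"(?:cells|cell\s+lines?|toxin|virus)"


def term_hits(sentence, term):
    """True if `term` occurs and is not directly followed by an excluded word."""
    pattern = re.compile(
        r"\b" + re.escape(term) + r"\b(?!\s+" + FOLLOWERS + r"\b)",
        re.IGNORECASE,
    )
    return pattern.search(sentence) is not None
```

In the CED-4 example above, 'breast carcinoma' is followed by 'cells', so the sentence would not count as a disease hit for that term.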

These are also excluded:

1. Articles of type: ‘meeting_abstract’ and 'Congresses'

2. Supplementary material: Doc IDs of type:WBPaper00037683.sup.1

3. ‘Materials and Methods’ section of paper

4. When matching sentences are from section ‘References’ classify paper into a separate group, so it becomes easier for curator to go through. --The script has a default mode in which it searches for relevant sentences in all fields other than references. The script has a references mode in which it runs only on sentences from references.

5. Whole genome papers, e.g., Doc ID WBPaper00030997, Title: Draft genome of the filarial nematode parasite Brugia malayi. --Implemented by excluding lines that are very long. Currently, I have set this limit to 250 words, i.e., if a line has more than 250 words, it is excluded.

6. The following should not be picked up:

  • 'relate' in 'correlated'
  • 'model' in 'remodelling'
  • 'relate' in 'unrelated'
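Two of the exclusions above are simple enough to sketch directly: recognizing supplementary-material Doc IDs and dropping overly long lines. The Python below is a hedged illustration; the ID pattern is generalized from the single example WBPaper00037683.sup.1, and the function names are hypothetical.

```python
import re

# Generalized from the one example ID in the text (an assumption).
SUPPLEMENT_RE = re.compile(r"^WBPaper\d+\.sup\.\d+$")
MAX_WORDS = 250  # limit stated in the text


def is_supplement(doc_id):
    """True for Doc IDs shaped like WBPaper00037683.sup.1."""
    return SUPPLEMENT_RE.match(doc_id) is not None


def is_too_long(line):
    """True if the line has more than MAX_WORDS whitespace-separated words."""
    return len(line.split()) > MAX_WORDS
```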

Sources for disease lexicon, precision and recall

1. Used the OMIM morbid map, to extract the human gene list: ftp://ftp.ncbi.nih.gov/repository/OMIM/morbidmap A preliminary run was done with the human gene list along with the disease lexicon to extract sentences that have both a human disease and a human gene. This method results in a large number of false positives. For now, this strategy will not be pursued further.


Out of 100 papers, there were 29 false positives, putting the precision at 71%.


Out of 100 random papers selected from the elegans literature, there were 2 papers that the script missed. (Note that these papers belong to the second class of papers of interest, in which the ‘elegans gene’ category and the keywords ‘ortholog’ and ‘homolog’ are missing.)

Location of scripts and files on Textpresso


1. extractRelevantSent.pl:

  • This is the major script that works with the keywords and categories and the exclusion list (82 words).

Location: /data1/Users/liyuling/Curator_related/disease on Textpresso dev

2. getRelevantDocids.pl:

  • Picks up all the positive IDs and outputs to a text file called positive_docids.txt

We have a total of 412 papers as of Aug 6th, 2013.

Location: /data1/Users/liyuling/Curator_related/disease/output on Textpresso dev

3. Lexicon files at: /data1/Users/liyuling/Curator_related/disease/lexicon

Location of files:

As of Sept, 2014, location of files: http://textpresso-dev.caltech.edu/disease/script_runs/20140908/

Adding the DO.obo file to the lexicon

  • Use Name and Synonym from this file

For Synonym:

--Take everything within double quotes, match pattern without double quotes

--Ignore everything after the last double quotes

--If words are comma-separated, ignore the word after the comma

--For the word (disorder) remove parentheses

--Ignore '(morphologic abnormality)'
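The synonym rules above might be applied to lines of the DO.obo file roughly as follows. This Python sketch assumes the usual OBO layout of `name: ...` and `synonym: "..." SCOPE [refs]` lines, and it reads "remove parentheses" literally, keeping the word 'disorder'. Both are assumptions, and `clean_synonym` and `term_from_line` are hypothetical names.

```python
import re

# Greedy match: everything from the first to the last double quote on the line.
QUOTED_RE = re.compile(r'"(.*)"')


def clean_synonym(text):
    """Apply the synonym rules to the text found inside the double quotes."""
    text = text.replace("(morphologic abnormality)", "")  # ignored entirely
    text = text.replace("(disorder)", "disorder")         # parentheses removed
    text = text.split(",")[0]                             # drop words after a comma
    return " ".join(text.split()) or None


def term_from_line(line):
    """Return a lexicon term from a 'name:' or 'synonym:' line, else None."""
    if line.startswith("name:"):
        return line[len("name:"):].strip() or None
    if line.startswith("synonym:"):
        match = QUOTED_RE.search(line)
        return clean_synonym(match.group(1)) if match else None
    return None
```

Anything after the last double quote (the scope and the reference list) is dropped because only the quoted text is captured.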

  • In the Aug run, I see 2 Worm Breeder's Gazette articles. Meeting abstracts and Worm Breeder's Gazette articles should be removed from the corpus that the script works on. Gazette articles can be recognized by the tag 'Type Gazette_article', e.g., WBPaper00014072.
  • We need to adopt a standard color scheme for terms:

I suggest genes = green

"elegans", and "C. elegans" = red

disease term = blue

keywords ‘ortholog’, ‘homolog’, ‘similar’, ‘relate’ and ‘model’ = brown

  • A new run on the entire elegans corpus will be done and the papers will be categorized by year.
  • Compare this list with the current positive ids (412) and generate a list of missing positive IDs.

1. From doing a first run with the addition (DO) to the disease lexicon:

Results had too many false positives because of: (coming from the Synonym field of obo file)

can  as  was  or  als  total  defect  deficiency  flu  mec  ad  cm

2. Script will be rerun with the above rules, but on 100 papers, starting from recent papers, going backwards.

--Results looked good, color scheme works, we need to add the following to the exclusion list:

AFD  male infertility  cf  DIC  AML  Noma  homocysteine  adhesions  fructose-1  

--'elegans' not being picked up by script, see WBPaper00044069, 44057, 44069, 40364.

3. Script rerun with another 100 papers, 09.12.2013:

--Discard sentences that begin with the phrase 'KEY WORDS' or 'Key Words', eg. WBPaper00043986,

--Include the following exclusion terms

pme  PPR  aip  cold  TIA 

--Add the exclusion terms (10.07.2013 and 10.21.2013):

Group B  cystic  tremor  homocysteine  CRF  WS  wee  LCM


--For disease term, script will not pick up 3 or 2 letter acronyms--capital or small or a mixture. => done. skipped all terms < 4 in length

--WBPaper00040689: only 'syndrome' gets picked up, not Wiscott-Aldrich => "syndrome" is a term which is picked up before "Wiscott-Aldrich"

--WBPaper00040724: Hutchinson Gilford progeria syndrome: only 'progeria' gets picked up. => "hutchinson-gilford progeria" term has a hyphen in lexicon

--WBPaper00041076: Cystic Fibrosis: only 'Fibrosis' gets picked up. => in sentence it is "cystic-fibrosis-associated", not existing in lexicon


--Add the following to exclusion list:

ACNE   dish   ALPS   GIST   ACNE   Gerd    PLAT   mole   DOPS   HGPS

--Exclude disease terms that are 4 letters or less, case insensitive

--If a disease term co-occurs in the same sentence as Center, Centre, Institute, Journal, (case insensitive) it should be ignored

--In the lexicon file, if the disease term has a hyphen, add another copy replacing the hyphen with spaces, eg. Hutchinson-progeria syndrome, also add Hutchinson progeria syndrome

--Script runs on entire corpus

--Results are organized by year

--Print count of papers/year and the total number of papers each time script runs
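Three of the rules in this list are mechanical and can be sketched together: the length cutoff, the institution-word filter, and the hyphen expansion of lexicon terms. The Python below is an illustration under those stated rules; the function names are hypothetical.

```python
import re

# Words that mark a sentence as naming an institution or journal.
INSTITUTION_RE = re.compile(r"\b(?:Center|Centre|Institute|Journal)\b", re.IGNORECASE)


def keep_term(term):
    """Drop disease terms of 4 letters or less, regardless of case."""
    return len(term) > 4


def sentence_allowed(sentence):
    """False if the sentence contains Center, Centre, Institute or Journal."""
    return INSTITUTION_RE.search(sentence) is None


def expand_hyphens(terms):
    """For each hyphenated term, also add a copy with hyphens replaced by spaces."""
    expanded = []
    for term in terms:
        expanded.append(term)
        if "-" in term:
            expanded.append(term.replace("-", " "))
    return expanded
```

The hyphen expansion addresses the WBPaper00040724 case above, where the lexicon held the hyphenated form but the sentence used spaces.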

Changes Aug/Sept 2014

  • Add to exclusion list if these words/phrases occur exactly as below
    • Retinoblastoma
    • cancer/s
    • DNA repair deficiency
    • BRCA1
    • BRCA2
    • obesity
    • inflammatory responses
    • viral infection/s
    • growth retardation
    • male infertility
    • acyl-CoA-dehydrogenase (with or without hyphens)
    • carcinoma/s
    • anomaly
    • abnormal movement/s
    • infectious disease
    • complex V
    • alkaline phosphatase
    • sex reversal
    • genetic disease
    • X inactivation
    • Muscular Dystrophy Association
  • Group these papers for terms, in a different folder called 'Non-disease conditions'(this is a separate paper list)
    • Nicotine addiction
    • Nicotine dependence
    • Neurotoxicity
    • Obesity
    • Heat stroke
    • alcohol sensitivity
    • alcohol addiction
  • Generate a single list of all disease WBPaperIds for curation status form


  • If the paper has at least one good sentence, it still falls in the 'disease' category
  • If the paper has only bad sentences, it falls in the nondisease category
  • If the title or abstract of a paper has a disease term, then it falls in the 'disease' category, meaning this rule supersedes all other rules
  • At the top of each directory the output should look like:
    • Number of total papers=
    • Number of papers without Type 'Review' (non-review)= X; also output list of WBPaperIds
    • Number of papers with the Type 'Review'; also output list of WBPaperIds
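The classification rules and the directory summary can be sketched as two small functions. This Python illustration assumes each flagged paper comes with a list of per-sentence good/bad judgements and a set of paper types; that input shape and the names `classify_paper` and `summarize` are assumptions.

```python
def classify_paper(good_sentence_flags, title_or_abstract_has_disease=False):
    """Classify a flagged paper as 'disease' or 'nondisease'.

    good_sentence_flags: one boolean per matching sentence, True if 'good'.
    A disease term in the title or abstract overrides the sentence rules.
    """
    if title_or_abstract_has_disease:
        return "disease"
    return "disease" if any(good_sentence_flags) else "nondisease"


def summarize(paper_types):
    """Build the per-directory summary from a {paper_id: set_of_types} mapping."""
    reviews = sorted(p for p, t in paper_types.items() if "Review" in t)
    non_reviews = sorted(p for p, t in paper_types.items() if "Review" not in t)
    return {
        "total": len(paper_types),
        "non_review_count": len(non_reviews),
        "non_review_ids": non_reviews,
        "review_count": len(reviews),
        "review_ids": reviews,
    }
```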

Number of Papers flagged and curated for disease

Location of directories: http://textpresso-dev.caltech.edu/disease/script_runs/20140908/Disease/

Location of Juancarlos's script for counting curated disease papers: On Tazendra: /home/acedb/ranjana/human_disease

script name: count_disease.pl

We need, for 2011, 2012, 2013, 2014:

  • Number of TP-flagged non-reviews
  • Number of TP-flagged reviews
  • Number of TP-flagged non-reviews curated
  • Number of TP-flagged reviews curated

Also, output curated papers not in TP flags (non-reviews)
