Molecule

From WormBaseWiki

Links to relevant pages:

  • Caltech documentation
  • Example molecule pages
  • Molecule model build


Molecule Curation

Molecule curation will capture chemical and drug entities that have been shown to affect the biology of the worm, and will allow users to link out to other databases that describe these molecules in greater detail.

  • What we mean by small molecule:
    • drug
    • metabolite (primary and secondary)
    • monomers or very small oligomers of nucleic acids, proteins, and polysaccharides
    • "Large collections of small molecules (molecular weight about 600 or less), of similar or diverse nature which are used for high-throughput screening analysis of the gene function, protein interaction, cellular processing, biochemical pathways, or other chemical interactions." (from nlm.nih.gov and Wikipedia)

Drug-phenotype curation

Molecules will be linked to genes based on how they influence gene activity that has been altered by variation, overexpression, or RNAi-based knockdown.

Drug-gene interactions

Molecules will also be linked to genes when they influence gene activity directly, through gene regulation interactions.

Molecule databases

Molecule IDs will be provided, when available, for the following databases:

  • Database "NLM_MeSH" "UID"
  • Database "CTD" "ChemicalID"
  • Database "ChemIDplus" using the CasRN
  • Database "ChEBI" "CHEBI_ID"
  • Database "KEGG COMPOUND" "ACCESSION_NUMBER"

Molecule list

Initially, we will use MeSH UIDs, assigned by the NLM, as the IDs for molecules in our database. The NLM was judged a good starting point for this project because its molecule coverage is more comprehensive and its funding more stable. The list we are starting with is a pared-down list of NLM molecules created by the Comparative Toxicogenomics Database (CTD); it contains over 130,000 terms. For each term, the list gives a term name, a CTD ID, a MeSH UID, and, where available, a CAS Registry Number (CasRN). Using the CasRNs, we extracted the ChEBI ID from the Chemical Entities of Biological Interest (ChEBI) entity list, where one existed, along with any KEGG Compound accession number.
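
The CasRN-based lookup described above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the function name, the dictionary keys, and the in-memory lookup table are all assumptions, and the second row is a made-up placeholder for a term that has no CasRN. Only the wortmannin values come from this page.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def attach_cross_references(ctd_rows: List[Row],
                            casrn_index: Dict[str, Dict[str, str]]) -> List[Row]:
    """Return copies of ctd_rows with ChEBI and KEGG IDs added where the CasRN matches.

    casrn_index is a hypothetical lookup table keyed by CasRN; terms without
    a CasRN, or with a CasRN not in the table, get None for both IDs.
    """
    merged = []
    for row in ctd_rows:
        casrn = row.get("casrn")
        xrefs = casrn_index.get(casrn, {}) if casrn else {}
        merged.append({
            **row,
            "chebi_id": xrefs.get("chebi_id"),
            "kegg_compound": xrefs.get("kegg_compound"),
        })
    return merged


# One real term from this page, plus one hypothetical term lacking a CasRN.
ctd_rows = [
    {"name": "wortmannin", "ctd_id": "C009687",
     "mesh_uid": "C009687", "casrn": "19545-26-7"},
    {"name": "placeholder term (hypothetical)", "ctd_id": "X000000",
     "mesh_uid": "X000000", "casrn": None},
]
casrn_index = {"19545-26-7": {"chebi_id": "52289", "kegg_compound": "C15181"}}

merged = attach_cross_references(ctd_rows, casrn_index)
```

Terms with no CasRN keep their MeSH and CTD IDs and simply carry no ChEBI or KEGG cross-reference, which matches the "when available" wording used for the database IDs.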

A sample molecule is:

Molecule : "C009687"
Public_name "wortmannin"
Database "NLM_MeSH" "UID" "C009687"
Database "CTD"  "ChemicalID" "C009687"
Database "ChemIDplus"  "19545-26-7"
Database "ChEBI" "CHEBI_ID" "52289"
Database "KEGG COMPOUND" "ACCESSION_NUMBER" "C15181"
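
As an illustration of how a record like the one above could be produced from a set of IDs, here is a small sketch. The function name and its parameters are hypothetical; the tag and field names are copied from the sample, including the ChemIDplus line, which carries no field name there. The sketch assumes that values contain no double-quote characters and emits single spaces between fields.

```python
from typing import Optional


def format_molecule_ace(mesh_uid: str,
                        public_name: str,
                        ctd_id: Optional[str] = None,
                        casrn: Optional[str] = None,
                        chebi_id: Optional[str] = None,
                        kegg_id: Optional[str] = None) -> str:
    """Build a Molecule record in the layout of the sample above.

    Cross-reference lines are emitted only for the IDs that are supplied.
    """
    lines = [
        f'Molecule : "{mesh_uid}"',
        f'Public_name "{public_name}"',
        f'Database "NLM_MeSH" "UID" "{mesh_uid}"',
    ]
    if ctd_id:
        lines.append(f'Database "CTD" "ChemicalID" "{ctd_id}"')
    if casrn:
        # The sample record lists the CasRN without a field name.
        lines.append(f'Database "ChemIDplus" "{casrn}"')
    if chebi_id:
        lines.append(f'Database "ChEBI" "CHEBI_ID" "{chebi_id}"')
    if kegg_id:
        lines.append(f'Database "KEGG COMPOUND" "ACCESSION_NUMBER" "{kegg_id}"')
    return "\n".join(lines)


record = format_molecule_ace("C009687", "wortmannin",
                             ctd_id="C009687", casrn="19545-26-7",
                             chebi_id="52289", kegg_id="C15181")
minimal = format_molecule_ace("C009687", "wortmannin")
print(record)
```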

To make a working list of reference molecules for the various curation efforts, we used Textpresso to scan for all terms on the list that appear in the published C. elegans corpus. The resulting list has fewer than 6,000 terms. The terms identified in the corpus are available in two forms:

  • http://textpresso-dev.caltech.edu/michael/molecule-obo-analysis/By-Frequency/ is a directory of files of terms, grouped by the number of times each term appears in the corpus.
  • http://textpresso-dev.caltech.edu/michael/molecule-obo-analysis/By-Frequency/all is a single list of all terms, concatenated from the files in that directory.

The concatenated file is being used as the starting file for molecule look-up by WB curators.
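
The idea of keeping only terms that occur in the corpus, ranked by frequency, can be sketched in plain Python. This is not Textpresso and does not use its interface; it is a simple stand-in based on case-insensitive whole-term matching. The function name, the example sentences, and the extra drug names are illustrative assumptions, and a loop like this would be slow on a list of over 130,000 terms.

```python
import re
from collections import Counter
from typing import Iterable, List


def count_term_mentions(terms: Iterable[str], documents: List[str]) -> Counter:
    """Count case-insensitive whole-term matches; terms with no match are omitted."""
    counts: Counter = Counter()
    for term in terms:
        # Lookarounds keep the term from matching inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)",
                             re.IGNORECASE)
        total = sum(len(pattern.findall(doc)) for doc in documents)
        if total:
            counts[term] = total
    return counts


# Illustrative inputs only: made-up sentences, not text from the real corpus.
terms = ["wortmannin", "aldicarb", "levamisole"]
documents = [
    "Worms were treated with wortmannin; wortmannin-treated animals were scored.",
    "Aldicarb resistance was assayed on plates.",
]

counts = count_term_mentions(terms, documents)
by_frequency = counts.most_common()  # most frequent terms first
```

Terms that never appear (here, levamisole) drop out of the result, which is how the working list ends up much shorter than the starting list.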

Caveats and notes:

  • The list is now small enough to load into WB if we wanted to, and every term on it has at least some (though unverified) relevance to the literature.
  • The list is small enough to be edited with ontology editors such as OBO-Edit (even though it is not an ontology).
  • We do not have definitions for the terms, nor are the terms arranged in any hierarchy; other databases do provide these, however, and we link to those websites when an ID is available.
  • Terms and their synonyms will be added as needed. This curation process still needs to be worked out; ideally, the list will be incorporated as a selection list in whatever curation tool a curator is using.