WormBase SVMs

From WormBaseWiki

Storing SVM Results in postgres on tazendra

What to store

  • SVM prediction
    • Positive - high confidence
    • Positive - medium confidence
    • Positive - low confidence
    • Negative
  • Paper ID
  • Data type
    • antibody
    • geneint
    • geneprod_GO
    • genereg
    • newmutant
    • otherexpr
    • overexpr
    • rnai
    • seqchange
    • structcorr
  • SVM version
  • Date the SVM was run? It is not clear yet whether this is critical.
  • Curator assessment
    • Positive
    • Negative
  • Curator ID
  • Curator comment
    • A curator comment will be tied to a curator assessment
    • Curator comments will use a controlled vocabulary, chosen from a drop-down list
  • Timestamp
  • Do we also want to store if a paper was used in an SVM training set?
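
The fields listed above could be sketched as a single record type. This is only a rough illustration, not an agreed schema: the class name, field names, and the short codes for the prediction levels are assumptions made for the example, and the two fields still under discussion (SVM date, training-set flag) are left optional.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Allowed values, taken from the lists above. The short codes for the
# prediction levels are hypothetical names chosen for this sketch.
SVM_PREDICTIONS = ("positive_high", "positive_medium", "positive_low", "negative")
DATA_TYPES = ("antibody", "geneint", "geneprod_GO", "genereg", "newmutant",
              "otherexpr", "overexpr", "rnai", "seqchange", "structcorr")
CURATOR_ASSESSMENTS = ("positive", "negative")


@dataclass
class SvmResult:
    """One SVM prediction for one paper and data type (hypothetical layout)."""
    paper_id: str
    data_type: str
    svm_prediction: str
    svm_version: str
    svm_date: Optional[str] = None            # still an open question above
    curator_assessment: Optional[str] = None  # None = not assessed yet
    curator_id: Optional[str] = None
    curator_comment: Optional[str] = None
    timestamp: Optional[datetime] = None
    in_training_set: Optional[bool] = None    # still an open question above

    def __post_init__(self) -> None:
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type!r}")
        if self.svm_prediction not in SVM_PREDICTIONS:
            raise ValueError(f"unknown SVM prediction: {self.svm_prediction!r}")
        if (self.curator_assessment is not None
                and self.curator_assessment not in CURATOR_ASSESSMENTS):
            raise ValueError(f"unknown assessment: {self.curator_assessment!r}")
        # A curator comment is tied to a curator assessment.
        if self.curator_comment is not None and self.curator_assessment is None:
            raise ValueError("a curator comment requires a curator assessment")
```

A record with no curator fields filled in would then stand for a prediction that nobody has looked at yet.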

How to populate

  • SVM results could come directly from the SVM output (talk with Yuling)
  • Curator assessment will depend on the data type and where the curated data is stored
    • Data curated into postgres
      • If data type is curated in postgres, then curation of an SVM positive would indicate curator positive (i.e., true positive)
      • If data type is curated in postgres, then curation of an SVM negative would indicate curator positive (i.e., false negative)
      • Distinguishing false positives from true negatives, though, will require curators to mark papers explicitly. Unless a curator specifically marks a predicted positive as a false positive, there is no way to tell it apart from a paper that just hasn't been curated yet (i.e., is still in the to-do pile). Likewise, a predicted negative can only be marked as a true negative once a curator has looked at the paper and confirmed the prediction. In practice, most predicted negatives will likely stay as predictions without curator assessment.
    • Data not curated into postgres (e.g., geneace, BioGRID)
      • In theory, these data could be populated from WS or from the file we'll get from BioGRID.
      • In practice, what we do for these data types will depend on what the individual curators for them want to do.
      • One option would be to populate from a web form, i.e., curators would have check boxes for individual paper IDs, or large text boxes for uploading lists of papers according to their classification. For this option, we would like to display all four classifications: True Positive, False Positive, True Negative, False Negative. The reason is pragmatic: this is how curators think about the papers while they're looking at the SVM results.
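
The rules above for combining an SVM prediction with a curator assessment can be sketched as a small function. This is a minimal illustration only; the function name and the string codes ("positive_high", "negative", and so on) are assumptions made for the example, not names from any existing script.

```python
from typing import Optional


def classify(svm_prediction: str, curator_assessment: Optional[str]) -> Optional[str]:
    """Combine an SVM prediction with a curator assessment.

    Returns None when no curator has assessed the paper, because an
    uncurated paper cannot be told apart from a false positive or a
    true negative. All string codes here are hypothetical.
    """
    if curator_assessment is None:
        return None
    if curator_assessment not in ("positive", "negative"):
        raise ValueError(f"unknown assessment: {curator_assessment!r}")

    # All three confidence levels count as an SVM positive.
    svm_positive = svm_prediction.startswith("positive")
    if not svm_positive and svm_prediction != "negative":
        raise ValueError(f"unknown SVM prediction: {svm_prediction!r}")

    if curator_assessment == "positive":
        return "True Positive" if svm_positive else "False Negative"
    return "False Positive" if svm_positive else "True Negative"
```

Returning None for unassessed papers keeps the to-do pile visibly separate from the four classifications.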

How to visualize

  • We will want to visualize data on a web display
    • Curators may want the option to search by data type, by paper ID, or by the time period over which the papers were classified (i.e., the SVM dates that we're now all used to seeing, e.g., 051812_042012_antibody)
    • Curators may want the ability to filter results, if possible
    • Curators may want the ability to sort particular "columns", if possible
    • If a paper has been tested by several different SVM models, we would probably want the option to see the results from each of the models
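
As a rough sketch of the search, filter, and sort behaviour described above, the function below works on plain dictionaries. Everything in it is an assumption made for illustration: the function name, the row keys, and the example rows (including the paper IDs and the second run label) are made up and do not come from real SVM output.

```python
from typing import Dict, List, Optional

# Made-up rows for illustration only; keys and values are placeholders.
EXAMPLE_ROWS: List[Dict[str, str]] = [
    {"paper_id": "WBPaper00000002", "data_type": "antibody",
     "svm_run": "051812_042012_antibody", "svm_prediction": "negative"},
    {"paper_id": "WBPaper00000001", "data_type": "antibody",
     "svm_run": "051812_042012_antibody", "svm_prediction": "positive_high"},
    {"paper_id": "WBPaper00000001", "data_type": "rnai",
     "svm_run": "051812_042012_rnai", "svm_prediction": "positive_low"},
]


def search_results(rows: List[Dict[str, str]],
                   data_type: Optional[str] = None,
                   paper_id: Optional[str] = None,
                   svm_run: Optional[str] = None,
                   sort_by: Optional[str] = None,
                   descending: bool = False) -> List[Dict[str, str]]:
    """Return the rows matching every given criterion, optionally sorted."""
    criteria = {"data_type": data_type, "paper_id": paper_id, "svm_run": svm_run}
    wanted = {key: value for key, value in criteria.items() if value is not None}

    # Keep a row only if it matches all of the criteria that were supplied.
    matches = [row for row in rows
               if all(row.get(key) == value for key, value in wanted.items())]

    # Sort on one "column"; rows missing that key sort as empty strings.
    if sort_by is not None:
        matches.sort(key=lambda row: row.get(sort_by, ""), reverse=descending)
    return matches
```

Searching by paper ID alone returns one row per data type and SVM run, which is one way results from several models for the same paper could be shown side by side.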

How to query

  • This refers to how to easily get Yuling the data he needs.
    • SQL queries?
    • Download feature of the web form?


Training Set



seqchange and genesymbol

Training Set



SVM status form

A form intended for analyzing SVM data has been developed: http://mangolassi.caltech.edu/~postgres/cgi-bin/svm_results.cgi

Before moving on with the SVM form, curators have to reach a consensus on how the analysis should be done. Should we keep track of SVM analysis on single documents or on combined documents? In other words, do we keep the supplementary material separate, or do we merge everything into a combined PDF file that will then be scanned by the SVM?

Please add/edit below the pros and cons of having the analysis performed on single documents (i.e., .main, .sup1, .sup2, ...)

  • Pros
    • Specificity: curators don't have to scan through all the supplementary materials but can open just the relevant document
  • Cons
    • We cannot perform statistics on what has been done so far
    • Curators have to record in the OA which document contained the information (e.g., by picking .sup1 from a dropdown). Is that really time consuming?

In October 2012 we decided to move on and perform SVM on combined documents rather than single documents (e.g., .main, .sup1, .sup2 will now become .concat). As a result, the SVM form will not be used as such.

From Juancarlos:

svm_results.cgi now uses the cur_svmdocs table instead of cur_svmdata, to keep the SVM results for individual documents separate from the SVM results for papers. This form is done until someone wants to do something with it; it is only on the sandbox and never went live.

  • To make it live, it needs to be documented along with the script:
    • /home/postgres/work/pgpopulation/cur_curation/cur_svmdocs/populate_svm_result.pl
  • It also needs these scripts / files:
    • /home/postgres/work/pgpopulation/cur_curation/cur_svmdocs/
      • main_only
      • create_svm_tables.pl
      • populate_svm_result.pl
