SVM Guidelines

From WormBaseWiki

Latest revision as of 22:31, 16 August 2010

What are the training sets?

    To be trained, SVMs require positive and negative training sets.  In the case of first pass curation, this means 
    positive and negative papers, i.e. papers that do and do not have the data type of interest.
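
    As a rough illustration of why both sets are required, the sketch below trains a toy linear SVM on bag-of-words
    features using Pegasos-style sub-gradient updates.  It is not the pipeline used for WormBase first pass curation; the
    function names, the parameters lam and epochs, and the example sentences are all assumptions made up for this example.

```python
from collections import Counter

def featurize(text):
    """Bag-of-words counts for one paper (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def train_linear_svm(papers, labels, lam=0.01, epochs=20):
    """Pegasos-style sub-gradient training of a linear SVM (hinge loss).

    papers: list of text strings; labels: +1 (positive) or -1 (negative).
    Returns a dict mapping each word to its learned weight.
    """
    if set(labels) != {1, -1}:
        raise ValueError("need both positive (+1) and negative (-1) papers")
    examples = [(featurize(p), y) for p, y in zip(papers, labels)]
    weights = {}
    t = 0
    for _ in range(epochs):
        for features, y in examples:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(weights.get(w, 0.0) * c for w, c in features.items())
            shrink = 1.0 - 1.0 / t  # equals 1 - eta * lam (regularization step)
            for w in weights:
                weights[w] *= shrink
            if margin < 1.0:  # hinge loss is active: move toward this example
                for w, c in features.items():
                    weights[w] = weights.get(w, 0.0) + eta * y * c
    return weights

def predict(weights, text):
    """Return +1 if the paper scores as positive, otherwise -1."""
    score = sum(weights.get(w, 0.0) * c for w, c in featurize(text).items())
    return 1 if score > 0 else -1

# Hypothetical toy "papers": two positive and two negative for one data type.
papers = [
    "rnai knockdown caused embryonic lethal phenotype",
    "rnai screen revealed sterile phenotype",
    "crystal structure of the protein complex",
    "protein structure solved by crystallography",
]
labels = [1, 1, -1, -1]
weights = train_linear_svm(papers, labels)
print(predict(weights, "rnai phenotype observed"))
```

    Training refuses to run when only one class is supplied, which mirrors the requirement stated above.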

What papers should we use for training sets?

    Because there is sometimes a discrepancy between what has been flagged and what actually gets curated, the ideal
    positive training set is the set of papers for which the data type of interest either has been or will be curated.
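
    One way to assemble such a set, and to see where flagging and curation disagree, is with plain set operations.  This
    is only a sketch: the function names and the placeholder paper identifiers are hypothetical, and where the curated
    and flagged lists actually come from is not specified here.

```python
def build_positive_set(curated, to_be_curated):
    """Positive training set: papers already curated or slated for curation."""
    return set(curated) | set(to_be_curated)

def flag_discrepancies(flagged, positives):
    """Report where first pass flags and the positive set disagree."""
    flagged = set(flagged)
    positives = set(positives)
    return {
        "flagged_not_curated": flagged - positives,
        "curated_not_flagged": positives - flagged,
    }

# Placeholder identifiers, not real papers.
positives = build_positive_set(["paper_A", "paper_B"], ["paper_C"])
report = flag_discrepancies(["paper_B", "paper_D"], positives)
print(sorted(positives))
print(report)
```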

How many papers are enough?

    The answer to this question depends, in part, on the data type.  Previous experience with SVMs for first pass curation 
    has indicated that 400 papers is a good starting point, but this number is a guideline, not a fixed rule.  Reasonable 
    results may be achieved with fewer papers if the features of a particular data type are distinct; more papers may be 
    needed if the features are less distinct.

What if we don't have enough papers?

    Not all data types have 400 curated papers; some have far fewer.  In these cases, the SVM approach may not be the 
    best tactic.  Other options include development of Textpresso categories (see
    http://www.wormbase.org/wiki/index.php?title=How_to_make_a_new_Textpresso_category) and author curation.

Do we need the full text?

    The full text of research articles is needed to run SVMs for first pass curation.  If the full text is not available
    from an existing Textpresso implementation, Ruihua will need the PMIDs of the relevant papers so that she can retrieve
    the full text.
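
    A list of PMIDs to pass along can be derived by comparing the training papers against those whose full text is
    already on hand.  The sketch below assumes both lists are available as PMID strings; the function name and the
    example PMIDs are placeholders, not real identifiers.

```python
def pmids_needing_retrieval(training_pmids, pmids_with_full_text):
    """Return training PMIDs that lack full text, in order, without duplicates."""
    available = set(pmids_with_full_text)
    seen = set()
    missing = []
    for pmid in training_pmids:
        if pmid not in available and pmid not in seen:
            seen.add(pmid)
            missing.append(pmid)
    return missing

# Placeholder PMIDs for illustration only.
to_request = pmids_needing_retrieval(["1001", "1002", "1003", "1001"], ["1002"])
print(to_request)
```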


Caltech documentation