Difference between revisions of "SVM Guidelines"

From WormBaseWiki

Revision as of 13:19, 20 November 2009

To be trained, SVMs require positive and negative training sets.

Because what has been flagged sometimes differs from what has actually been curated, the ideal positive training set is the set of papers for which the data type of interest has actually been curated, rather than the set of papers that were merely flagged.
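The selection described above can be sketched with plain set operations. This is only an illustration under stated assumptions: the function name and paper IDs are hypothetical, and the choice to set aside flagged-but-not-curated papers (rather than treat them as negatives) is an assumption, not something this page prescribes.

```python
def build_training_sets(all_papers, flagged, curated):
    """Split paper IDs into positive, negative, and ambiguous groups."""
    all_papers, flagged, curated = set(all_papers), set(flagged), set(curated)

    # Positives: papers actually curated for the data type of interest.
    positives = curated & all_papers

    # Assumption: flagged but never curated is ambiguous, so keep these
    # papers out of both training sets.
    ambiguous = (flagged & all_papers) - positives

    # Assumption: everything else can serve as the negative training set.
    negatives = all_papers - positives - ambiguous
    return positives, negatives, ambiguous


# Hypothetical paper IDs, for illustration only.
all_papers = {"paper1", "paper2", "paper3", "paper4", "paper5"}
flagged = {"paper1", "paper2", "paper3"}
curated = {"paper1", "paper2", "paper4"}

positives, negatives, ambiguous = build_training_sets(all_papers, flagged, curated)
```

Here "paper4" lands in the positive set even though it was never flagged, and "paper3" is held out because it was flagged but not curated.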

How many papers are enough?

    The answer to this question depends, in part, on the data type.  Previous experience with SVMs for first pass curation indicates that 400 papers is a good starting point, but this number is a guideline, not a fixed rule.  Reasonable results may be achieved with fewer papers if the features of a particular data type are distinct; more papers may be needed if the features are less distinct.

What if we don't have enough papers?

    Not all data types have 400 curated papers.  Some have far fewer.  In these cases, the SVM approach may not be the best tactic.  Other options include developing Textpresso categories and using author flags.
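As a rough aid for spotting such cases, the sketch below lists data types whose curated paper count falls under the 400-paper guideline. The data type names and counts are made up, the function is hypothetical, and the threshold is a parameter because the guideline is not a fixed rule.

```python
GUIDELINE_PAPERS = 400  # starting point from past first pass curation, not a hard cutoff


def below_guideline(curated_counts, guideline=GUIDELINE_PAPERS):
    """Return (data_type, count) pairs with fewer curated papers than the guideline."""
    return sorted(
        (name, count)
        for name, count in curated_counts.items()
        if count < guideline
    )


# Hypothetical counts of curated papers per data type.
curated_counts = {"data_type_a": 650, "data_type_b": 120, "data_type_c": 400}

# Candidates for Textpresso categories or author flags instead of an SVM.
short = below_guideline(curated_counts)
```

A data type that appears in this list is not automatically unsuitable for an SVM; if its features are distinct, fewer papers may still give reasonable results.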