Difference between revisions of "November 25, 2009 - Sequence Curation Flags"

From WormBaseWiki
 
(40 intermediate revisions by one other user not shown)
Line 3: Line 3:
 
==Call information==
 
==Call information==
  
4:30pm GMT | 11:30am EST | 10:30am CST | 8:30am PST
+
In the future (thank you, Raymond):
  
US: 1-877-384-2311, +1-480-629-1629
+
starting the next conference call in December, please follow the new instructions:
  
UK: 0800-358-3475, +44-207-154-0025
+
8:30 AM Pacific Time
  
Canada: 1-866-243-1291
+
* EFFECTIVE DEC 2009
  
participant access code: 822114
+
*tel. USA/Canada 866-528-2256
  
Location: wherever you are
+
*tel. UK 0808-234-3475
 +
 
 +
*tel. (caller paid) 216-706-7052
 +
 
 +
*participant code 714646
 +
 
 +
Location: wherever you choose
 +
 
 +
Contact Person: Raymond Lee
 +
 
 +
Contact Email: raymond@caltech.edu, 626-310-1144
  
 
==Participants==
 
==Participants==
 +
Caltech - Karen, Kimberly, Paul, Ruihua
 +
 +
Sanger - Paul, Gary
 +
 +
WashU - John, Phil, Tamberlyn
  
 
==Review sequence-related first pass flags==
 
==Review sequence-related first pass flags==
Line 23: Line 38:
  
 
==Pipelines and options for flagging==
 
==Pipelines and options for flagging==
 +
Data may come into WormBase via various pipelines (e.g. Genbank, Knockout Consortia, user submissions) but for flagging data in published papers, here are the current pipelines:
 +
 
[http://www.wormbase.org/wiki/index.php/Caltech_documentation#SVMs_for_First_Pass_Curation SVMs]
 
[http://www.wormbase.org/wiki/index.php/Caltech_documentation#SVMs_for_First_Pass_Curation SVMs]
 
*Curated papers are the best training set.  Flagged papers can be used, if flagging was generally consistent.
 
*Curated papers are the best training set.  Flagged papers can be used, if flagging was generally consistent.
 
*[http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi Curation status form] has lists, but not completely up-to-date.
 
*[http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi Curation status form] has lists, but not completely up-to-date.
 +
*How to evaluate the [http://caprica.caltech.edu/celegans/svm_results/ results] of the SVMs
 +
**Precision - Of the returned positives, how many are true positives?
 +
**Recall - Of the true positives, how many were returned? (need to look at negatives for particular data type)
 +
**What precision and recall values are acceptable?
 
*Is there any way for curators to see a list of features for each SVM?  May help with understanding false positives.  
 
*Is there any way for curators to see a list of features for each SVM?  May help with understanding false positives.  
*How to evaluate the [http://caprica.caltech.edu/celegans/svm_results/ results] of the SVMs
+
 
  
 
[http://www.wormbase.org/wiki/index.php/Caltech_documentation#Textpresso Textpresso]
 
[http://www.wormbase.org/wiki/index.php/Caltech_documentation#Textpresso Textpresso]
Line 33: Line 54:
 
*category searches
 
*category searches
  
Author flags
+
 
 +
[http://www.wormbase.org/wiki/index.php/Texpresso/Author/Curator_interim_form Author flags]
 
*Curators need to tell Juancarlos they'd like to receive emails when authors flag a data type.
 
*Curators need to tell Juancarlos they'd like to receive emails when authors flag a data type.
 
*Caltech needs to supply list of papers flagged since September 2009.
 
*Caltech needs to supply list of papers flagged since September 2009.
Line 52: Line 74:
 
|-
 
|-
 
! Flag name
 
! Flag name
 +
! Flag information
 
! Number of papers flagged manually (from curation status form)
 
! Number of papers flagged manually (from curation status form)
 
! Flag email (from first pass form)
 
! Flag email (from first pass form)
 
! Getting author flags?
 
! Getting author flags?
! Current approach
+
! Approach
 
! Curator(s)
 
! Curator(s)
 
! Comments
 
! Comments
Line 61: Line 84:
 
|-
 
|-
 
! gene symbol
 
! gene symbol
 +
! newly cloned gene or new name for previously known gene
 
! 342
 
! 342
 
! genenames, vanauken
 
! genenames, vanauken
Line 67: Line 91:
 
! Kimberly, Mary Ann?
 
! Kimberly, Mary Ann?
 
! Currently being combined with seqchange.  Could possibly employ secondary screen with categories.
 
! Currently being combined with seqchange.  Could possibly employ secondary screen with categories.
!
+
! Still being assessed.
 
|-
 
|-
 
! mapping data
 
! mapping data
 +
! genetic mapping data
 
! 194
 
! 194
 
! genenames
 
! genenames
 
!
 
!
!  
+
! possibly keywords or Textpresso categories
 
!  
 
!  
 
!
 
!
Line 79: Line 104:
 
|-
 
|-
 
! sequence features
 
! sequence features
 +
! regulatory sequence features, includes promoters, enhancer, elements in mRNA
 
! 248
 
! 248
 
! worm-bug, stlouis, xiaodong (xdwang)
 
! worm-bug, stlouis, xiaodong (xdwang)
 
!
 
!
 +
! possibly keywords or Textpresso categories
 
!
 
!
 
!
 
!
!
+
! SVM tried, but recall was low
!
 
 
!
 
!
 
|-
 
|-
 
! mass spectrometry
 
! mass spectrometry
 +
! mass spec analysis
 
! 65
 
! 65
 
! gw3, worm-bug
 
! gw3, worm-bug
 
!
 
!
! Textpresso categories
+
! keywords or Textpresso categories
! Ruihua, Gary?
+
! Ruihua, Gary
!
 
 
!
 
!
 +
! will likely be a small number (10?) of papers each year
 
!
 
!
 
|-
 
|-
 
! structure correction
 
! structure correction
 +
! gene structure corrections (see comments)
 
! 333
 
! 333
 
! worm-ticket, worm-bug
 
! worm-ticket, worm-bug
 
!
 
!
! tried SVM
+
! re-try SVM with new training set provided by Paul Davis
 
! Gary, Paul Davis
 
! Gary, Paul Davis
 
! Ideally divided into four categories: a change in a gene's structure, the addition of an isoform, a change to one of the SL1/SL2 or polyA site features, a sequence correction in the N2 reference genome
 
! Ideally divided into four categories: a change in a gene's structure, the addition of an isoform, a change to one of the SL1/SL2 or polyA site features, a sequence correction in the N2 reference genome
Line 108: Line 136:
 
|-
 
|-
 
! sequence change
 
! sequence change
 +
! sequence of mutant alleles
 
! 981
 
! 981
 
! genenames
 
! genenames
 
!
 
!
 
! SVM
 
! SVM
!
+
! Kimberly has assessed some, will put on wiki, anyone from Sanger interested in checking?
!
+
!  
!
+
! Being assessed.
 
|-
 
|-
 
! new SNPs
 
! new SNPs
 +
! new polymorphisms
 
! 50
 
! 50
 
! tbieri
 
! tbieri
Line 122: Line 152:
 
!
 
!
 
!
 
!
!
+
! Most new SNP information comes via user submission emails.  Can remove this data type from the pipeline.
!
 
|-
 
! new mutant - alleles
 
! 1372
 
! Erich, Gary, Jolene
 
!
 
!
 
!
 
!
 
 
!
 
!
 
|-
 
|-
Line 137: Line 158:
 
|}
 
|}
  
==Minutes, Action Items==
+
==Action Items==
 +
 
 +
'''Author-generated emails'''
 +
*Sanger and St. Louis should now be getting emails for any author flags for structcorr. 
 +
Below are the other addresses that got emails from [http://tazendra.caltech.edu/~postgres/cgi-bin/curator_first_pass.cgi manual first pass curation].
 +
If you'd like to get emails when authors flag this data type, please confirm by putting Yes or No next to the email name.
 +
 
 +
*genesymbol - genenames
 +
*mapping data - genenames
 +
*sequence features - worm-bug, stlouis
 +
*mass spec - gw3, worm-bug
 +
*seqchange - genenames
 +
 
 +
'''Author-flagged papers since September'''
 +
 
 +
*Get list of papers flagged for sequence-related data by authors since September from Juancarlos.
 +
From Juancarlos:
 +
*genesymbol [[(10 papers)]]
 +
*mapping data [[(2 papers)]]
 +
*sequence features [[(2 papers)]]
 +
*mass spec [[(1 paper)]]
 +
*structcorr [[(4 papers)]]
 +
*seqchange [[(11 papers)]]
 +
*new SNP  - no author flags
 +
 
 +
 
 +
 
 +
Back to [[Caltech documentation]]
 +
 
 +
[[Category:Curation]]

Latest revision as of 18:57, 10 August 2010

Back to Caltech documentation

Call information

For future calls (thank you, Raymond): starting with the next conference call in December, please follow the new instructions:

8:30 AM Pacific Time

  • EFFECTIVE DEC 2009
  • tel. USA/Canada 866-528-2256
  • tel. UK 0808-234-3475
  • tel. (caller paid) 216-706-7052
  • participant code 714646

Location: wherever you choose

Contact Person: Raymond Lee

Contact Email/Phone: raymond@caltech.edu, 626-310-1144

Participants

Caltech - Karen, Kimberly, Paul, Ruihua

Sanger - Paul, Gary

WashU - John, Phil, Tamberlyn

Review sequence-related first pass flags

http://tazendra.caltech.edu/~postgres/cgi-bin/curator_first_pass.cgi

What type of data is going into each flag?

Pipelines and options for flagging

Data may come into WormBase via various pipelines (e.g. Genbank, Knockout Consortia, user submissions) but for flagging data in published papers, here are the current pipelines:

SVMs

  • Curated papers are the best training set. Flagged papers can be used, if flagging was generally consistent.
  • Curation status form has lists, but they are not completely up to date.
  • How to evaluate the results of the SVMs
    • Precision - Of the returned positives, how many are true positives?
    • Recall - Of all papers that truly contain the data type, how many were returned? (need to look at negatives for the particular data type to find the missed papers)
    • What precision and recall values are acceptable?
  • Is there any way for curators to see a list of features for each SVM? May help with understanding false positives.
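
The precision and recall questions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper IDs are made up, and `flagged` and `truly_positive` are hypothetical names for the set of papers an SVM returned and the set a curator confirmed for one data type. It is not the scoring code used for the SVM results pages.

```python
def precision_recall(flagged, truly_positive):
    """Return (precision, recall, false_positives, false_negatives).

    flagged        -- set of paper IDs the classifier returned as positive
    truly_positive -- set of paper IDs a curator confirmed contain the data type
    """
    true_pos = flagged & truly_positive
    false_pos = flagged - truly_positive   # returned, but not real hits
    false_neg = truly_positive - flagged   # real hits the classifier missed
    precision = len(true_pos) / len(flagged) if flagged else 0.0
    recall = len(true_pos) / len(truly_positive) if truly_positive else 0.0
    return precision, recall, sorted(false_pos), sorted(false_neg)


# Hypothetical paper IDs, for illustration only.
flagged = {"paper01", "paper02", "paper03", "paper04"}
truly_positive = {"paper02", "paper03", "paper05", "paper06", "paper07"}

precision, recall, false_pos, false_neg = precision_recall(flagged, truly_positive)
print(f"precision={precision:.2f} recall={recall:.2f}")
print("false positives:", false_pos)
print("false negatives:", false_neg)
```

Listing the false negatives is what requires looking at the negatives for a data type: recall cannot be computed from the returned positives alone.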


Textpresso

  • pattern matching
  • category searches
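
As a rough idea of what a keyword-style screen could look like, here is a minimal Python sketch using plain regular expressions. It does not use Textpresso or its categories; the keyword list, `build_pattern`, and `keyword_hits` are all hypothetical and chosen only for illustration (here for a mass spectrometry screen).

```python
import re

# Hypothetical keyword list; not an actual Textpresso category.
MASS_SPEC_KEYWORDS = ["mass spectrometry", "mass spec", "LC-MS/MS", "MALDI-TOF"]


def build_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole term."""
    # Longest keywords first, so "mass spectrometry" is preferred over "mass spec".
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def keyword_hits(text, pattern):
    """Return the matched keywords in the order they appear in the text."""
    return [m.group(0) for m in pattern.finditer(text)]


pattern = build_pattern(MASS_SPEC_KEYWORDS)
sentence = "Proteins were identified by LC-MS/MS and mass spectrometry analysis."
hits = keyword_hits(sentence, pattern)
print(hits)
```

A paper with at least one hit would be a candidate for the flag; how many hits should count, and which keywords belong in the list, would be for the curators of each data type to decide.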


Author flags

  • Curators need to tell Juancarlos they'd like to receive emails when authors flag a data type.
  • Caltech needs to supply list of papers flagged since September 2009.
  • Stats on return rates as of November 12, 2009 (supplied by Juancarlos):

Since Sept 1st, we have sent out 195 requests, and gotten back 72 results (36.9%).

Since Oct 1st, we have sent out 147 requests, and gotten back 52 results (35.3%).

Since Nov 1st, we have sent out 18 requests, and gotten back 7 results (38.9%).
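
The three figures above are cumulative, so each earlier window appears to include the later ones. Assuming that (all windows running up to November 12), a short Python sketch can recompute the quoted rates and split them into per-period counts. The variable names are illustrative only.

```python
# Cumulative counts quoted above: (window start, requests sent, results returned).
cumulative = [
    ("Sept 1", 195, 72),
    ("Oct 1", 147, 52),
    ("Nov 1", 18, 7),
]


def rate(requests, results):
    """Return rate as a percentage."""
    return 100.0 * results / requests


# Assumption: windows are nested, so subtracting the next window's counts
# leaves the counts for the period between the two start dates.
per_period = []
for i, (start, req, res) in enumerate(cumulative):
    if i + 1 < len(cumulative):
        _, next_req, next_res = cumulative[i + 1]
        req, res = req - next_req, res - next_res
    per_period.append((start, req, res, rate(req, res)))

for start, req, res in cumulative:
    print(f"since {start}: {res}/{req} = {rate(req, res):.2f}%")
for start, req, res, pct in per_period:
    print(f"period starting {start}: {res}/{req} = {pct:.2f}%")
```

Under that assumption, requests sent in September had more time to be answered than those sent in November, which is worth keeping in mind when comparing the periods.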

Current status of each sequence-related flag

{|
! Flag name !! Flag information !! Number of papers flagged manually (from curation status form) !! Flag email (from first pass form) !! Getting author flags? !! Approach !! Curator(s) !! Comments !! Current pipeline sufficient?
|-
| gene symbol || newly cloned gene or new name for previously known gene || 342 || genenames, vanauken || no-vanauken || SVM (see comments) || Kimberly, Mary Ann? || Currently being combined with seqchange. Could possibly employ secondary screen with categories. || Still being assessed.
|-
| mapping data || genetic mapping data || 194 || genenames || || possibly keywords or Textpresso categories || || ||
|-
| sequence features || regulatory sequence features, includes promoters, enhancer, elements in mRNA || 248 || worm-bug, stlouis, xiaodong (xdwang) || || possibly keywords or Textpresso categories || || || SVM tried, but recall was low
|-
| mass spectrometry || mass spec analysis || 65 || gw3, worm-bug || || keywords or Textpresso categories || Ruihua, Gary || || will likely be a small number (10?) of papers each year
|-
| structure correction || gene structure corrections (see comments) || 333 || worm-ticket, worm-bug || || re-try SVM with new training set provided by Paul Davis || Gary, Paul Davis || Ideally divided into four categories: a change in a gene's structure, the addition of an isoform, a change to one of the SL1/SL2 or polyA site features, a sequence correction in the N2 reference genome ||
|-
| sequence change || sequence of mutant alleles || 981 || genenames || || SVM || Kimberly has assessed some, will put on wiki, anyone from Sanger interested in checking? || || Being assessed.
|-
| new SNPs || new polymorphisms || 50 || tbieri || || || || Most new SNP information comes via user submission emails. Can remove this data type from the pipeline. ||
|}

Action Items

Author-generated emails

  • Sanger and St. Louis should now be getting emails for any author flags for structcorr.

Below are the other addresses that got emails from manual first pass curation. If you'd like to get emails when authors flag this data type, please confirm by putting Yes or No next to the email name.

  • genesymbol - genenames
  • mapping data - genenames
  • sequence features - worm-bug, stlouis
  • mass spec - gw3, worm-bug
  • seqchange - genenames

Author-flagged papers since September

  • Get list of papers flagged for sequence-related data by authors since September from Juancarlos.

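From Juancarlos:

  • genesymbol (10 papers)
  • mapping data (2 papers)
  • sequence features (2 papers)
  • mass spec (1 paper)
  • structcorr (4 papers)
  • seqchange (11 papers)
  • new SNP - no author flags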

