Difference between revisions of "First-pass flagging pipelines"

From WormBaseWiki
 
(78 intermediate revisions by 4 users not shown)
Line 1: Line 1:
=Flagging a paper vs. alerting a data curator=
+
[[Caltech documentation]]<br>
At this moment, papers are flagged for specific data types through two different tables, the curator first-pass table and the Textpresso first-pass table.  A third table, SVM first-pass, will be implemented shortly. <br>
+
[[First-pass to Curation]]<br>
 +
[[First-pass schedule, instructions, automation]]
  
Data curators are alerted to a paper containing data relevant to their data type by the presence of data in three different tables, the cfp, tfp, and the author first-pass table.  
+
=Flagging mechanisms=
 +
Papers are flagged for specific data types through four different methods:
 +
*curator flagging - a WB curator reads through the paper and manually enters a check mark or comment for a data type in the curator first-pass form; data are stored in the cfp table.
 +
*author/journal flagging - authors flag their own papers: they are sent a link to either the author first-pass form or a journal first-pass form (for GSA and G3 journals only) and enter a check mark or comment for each data type; data are stored in the afp table. 
 +
*Textpresso regular expression - curators work with Textpresso developers to search the full text of papers for key expressions (words/phrases from categories, the co-occurrence of these words/phrases in sentences, etc.) that flag a paper as positive for a data type; data are stored in the tfp table; see [http://wiki.wormbase.org/index.php/Textpresso Textpresso flagging pipelines]
 +
*Support Vector Machine algorithms - an SVM is set up for a specific data type: based on a known positive and a known negative set of papers, a reasonable balance between recall and precision is determined, and the most recent papers indexed by Textpresso are then flagged automatically. Papers are ranked as high, medium, low, or negative with regard to the probability of being a true positive for a given data type. Data are stored on caprica in data indexed folders. <strike>The flagging results should be available here: http://tazendra.caltech.edu/~postgres/cgi-bin/svm_results.cgi</strike> The sets of papers that have been SVMed, as .concat (all files as one concatenated file), are accessible in postgres in the cur_svmdata tables.
  
=Three ways a paper can be flagged=
 
''Name of the postgres table is noted in the parentheses.''
 
  
 +
==Alerting the data type curator==
 +
*Alerts from the cfp are sent when a first-pass curator flags a paper using the cfp_form and only if the "send" checkbox is on.
 +
*Alerts from the afp tables are sent when the form has been submitted by an author, if the curator has agreed to be alerted.
 +
*Alerts from tfp are set up by each individual curator with the Textpresso group and Juancarlos.
 +
 +
=Curator flagging=
 
==Curator first-pass (cfp) form==
 
This form is accessed by clicking "curate!" for a paper on the WBPaperEditor first-pass checkout UI.<br>
+
This form has three purposes
This form contains columns for all the different first-pass tables, currently, the Textpresso first-pass table, the author first-pass table, and the curator first-pass table.  
+
*for curators to manually flag a paper
 +
*to collect and display flags from all other methods of flagging papers
 +
*to evaluate the input from authors and other automated flagging methods
 +
 
 +
This form is accessed by clicking "Curate !" for a paper on the cfp cgi<br>
 +
This form contains columns for the different first-pass tables, currently, the Textpresso first-pass table, the author first-pass table, and the curator first-pass table.
 +
 
 +
Textpresso FP results are currently not being displayed in the cfp_form for those data types that have been considered "good" (sufficiently automated) by curators.  These data fields have been removed from the cfp_form,  so they would have to be added back if we want to see them.  <br>
 +
These data types are:
 +
*antibody
 +
*transgene
 +
*extvariation (new alleles)
 +
n.b. because these data types have been removed from the cfp table, first-pass curator flags are not being counted for them any longer.
  
====Curator first-pass table (cfp)====
+
===Curator first-pass table (cfp)===
Data is entered directly into postgres through this table.  The curator uses the text boxes to enter data based on their own paper reading or to agree with or modify data entered by authors, textpresso, or SVM (not implemented yet).
+
Data is entered directly into postgres through this table.  The curator uses the text boxes to enter data based on their own paper reading or to agree with or modify data entered by authors, textpresso (obsoleted?)  
  
Upon hitting 'flag!' data entered into the cfp, afp, or tfp is sent to the e-mail that corresponds with the data field that contains data.     
+
Upon hitting 'Flag !' data entered into the cfp, afp, and/or tfp is sent to the e-mail that corresponds with the curator associated with that data type.     
  
 
However, for the purposes of the curation status form, only those data types with entries in the cfp table are counted as flagged.  So for the paper to be considered flagged for the curation status form, the first-pass data curator must merge the data from the author data or textpresso data into the cfp box, or else type something else in the box.   
 
  
 +
==Using the curator first-pass form==
 +
===<i>FP curators are expected to approve or reject author-entered data</i>===
 +
*Author-entered data is set by default as 'approved'. 
 +
*To reject the author-entered data, the curator should uncheck the check box.
 +
*<b>Approved and rejected author data is stored in postgres and can be queried by looking at the afp_ table, which has paperID, data, author timestamp, curatorID, approve/reject, curator timestamp</b>
 +
 +
===<i>Approving author-entered data alone DOES NOT flag the paper in a statistical way</i>===
 +
When a fp curator hits "flag!" an e-mail will be sent to the data curator if there is any data in any author-entered data field, so the data curator will be alerted to the presence of the paper. <br>
 +
However, the paper will not be counted as flagged in the cfp for that data type unless the curator enters data into the fp curator column (i.e., into the cfp table); therefore,  <b>approved author data needs to be manually entered into the curator first-pass table.</b><br>
 +
Some actions a fp curator can take with author-entered data:<br>
 +
*If you agree with everything the author says, click "merge" to enter the data into the cfp.
 +
*If you just think "yes", type "yes" in the cfp
 +
*If you partially agree with author-entered data, merge it and edit it in the cfp<br> <i>The point of the merge link was to save clicks in copy-pasting when a curator partially agreed with an author -- Juancarlos</i>
 +
*If you rejected the author-entered data, and you think the paper should not be flagged for that data type,  do not enter anything in the cfp.
 +
 +
==Curation_status cgi==
 +
*http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi
 +
*Only flags from the cfp_ tables are considered flagged
 +
**someone should be looking at afp flags and moving the afp_ data to cfp_ if it is correct.
 +
**afp_ tables are only being looked at for newmutant and overexpr to change the color in the curation_status.cgi for those datatypes if they're already in cfp_
 +
**SVM data is not automatically added to the cgi
 +
 +
=Author flagging=
 +
==Author first-pass form==
 +
An email is sent to authors containing a password-protected link to an author first-pass form generated by scripts in /home/postgres/work/pgpopulation/afp_papers.  The email is sent by a cronjob that runs /home/postgres/work/pgpopulation/afp_papers/assign_passwd.pl at 1pm on Thursdays.
 +
 +
 +
This form is sent to the first e-mail address that Textpresso extracts from a paper. Textpresso scans the paper for an e-mail address based on the '@'. The first e-mail it finds is used as the author contact. The first line containing an '@' is stored in an output file here : http://textpresso-dev.caltech.edu/azurebrd/grep_output
 +
 +
The addresses are parsed into email addresses stored on tazendra here : /home/postgres/work/pgpopulation/afp_papers/textpresso_emails (the first few addresses are junk, but this list is the source of the paper and e-mail connection)
 +
 +
The papers with textpresso body are here : http://textpresso-dev.caltech.edu/azurebrd/textpresso_has_body
 +
 +
This extraction doesn't work with all journals :
 +
* BMC journals - there may be a systematic problem with a consistent pattern of having a "* - " right in front of the corresponding author's e-mail. (6/20/09 The code was changed to kludge this particular problem to look for the "* -")
 +
 +
Continuing problems, which result in no author e-mail sent:
 +
* the email isn't correct because of tokenizing (e.g. in 00032111 and 00032933 the tokenizer splits the sentence in the middle of the email because of a dot; seen with e.g. BMC journals)
 +
* the PDF is a provisional PDF
 +
 +
Articles whose authors are not requested for first-pass include:
 +
* any article with no e-mail contact information
 +
* old articles
 +
* book chapters
  
====Author first-pass table (afp)====
 
  
This form is sent to the first e-mail address of a paper recognized by script as described here, [[Author first pass requests]]<br>
+
E-mails are sent out weekly, every Thursday.  If no e-mail address is recognized, no link is sent; however, if an e-mail address becomes available for an author who has later verified the paper as theirs, the link will be sent at that time. E-mails are sent out in batches of no more than 50 a week. These data are stored in the afp tables.
The e-mails are sent out on a weekly basis every Thursday, in batches of no more than 50 papers.   
 
  
This form contains all the same data fields as the curator first-pass table.<br>
+
==Author first-pass form revisions==
  
First-pass curators are expected to approve or reject author data by a check box. The check box is checked on by default, meaning the author data is set by default as 'approved'. To reject the author-entered data, the curator should uncheck the check box. <b>Approved and rejected author data is stored in Postgres and can be queried by looking at the afp_ table, which has paperID, data, author timestamp, curatorID, approve/reject, curator timestamp</b>
+
    '''The afp form contains all the same data fields as the cfp.  The afp form is currently under revision.'''<br>
  
Although an e-mail alert for author data entered will be sent to the data curator, the paper will not be considered flagged for that data type unless the curator enters the author data (or any data that reflects what the curator thinks) into the curator column (i.e. into the cfp table); therefore, <b>approved author data needs to be manually entered into the curator first-pass table, by clicking "merge" for the paper to be added to the 'flagged' number on the curation status form</b>. (If you agree with it, if you just think ``yes'', type ``yes''.  If you partially agree with it, merge it and edit it. The point of the merge link was to save clicks in copy-pasting when a curator partially agreed with an author -- Juancarlos)
+
    '''AFP SOP:'''
 +
    1. Authors will be emailed a form to verify flagged data (via Textpresso Central pipelines), add data flags, and submit annotations.
  
==Textpresso automated (tfp) scripts==
+
    2. Submitted annotations will use existing forms on: http://www.wormbase.org/about/userguide/submit_data#01--10
Data entered here is fed directly to the data curator. <br>
 
Data also shows up in the tfp column on the curator first-pass form. <br>
 
*If the textpresso results are correct, the fpcurator merges the data into the cfp table entry box.
 
*If the textpresso results are incorrect, the fpcurator should write "false positive" in the cfp table entry box.  
 
  
<i>not implemented yet</i>The number of papers flagged for a data type by Textpresso will be noted on the curation status pages in its own column and added to the total number of flags.  
+
    3. Check with Juancarlos about code that excludes Genetics and G3 papers from the afp pipeline, since authors will no longer be asked for flagging via the mark-up pipeline.
  
 +
{|class="wikitable sortable"
 +
|-
 +
! Submission Form
 +
! Currently Used?
 +
! Responsible staff member
 +
! Comment
 +
! Future action for afp form
 +
|-
 +
| Phenotype Data Submission
 +
| Yes
 +
| Chris, Gary S.
 +
|
 +
| Link out
 +
|-
 +
| Allele Sequence Data Submission
 +
| Yes
 +
| Paul Davis
 +
|
 +
| Link out
 +
|-
 +
| Expression Data Micropublication
 +
| Yes
 +
| Daniela
 +
| This is different from data submission from already published paper.
 +
| Keep separate.
 +
|-
 +
| Gene Description
 +
| Yes
 +
| Ranjana
 +
| Continue to use?  Remove from afp form, but keep on WB page?
 +
| Do not link out from afp.
 +
|-
 +
| Updating contact information
 +
| Yes
 +
| Cecilia
 +
|
 +
| Link out (pre-populate with what we have in postgres for confirmation)
 +
|-
 +
| Updating intellectual lineage information
 +
| Yes
 +
| Cecilia
 +
|
 +
| Link out
 +
|-
 +
| Gene/Sequence Links (present three times on the WB page)
 +
| Yes
 +
| Paul Davis, Tim Schedl
 +
| emails wormbase-genenames
 +
| Link out
 +
|-
 +
| Gene Expression and Regulation
 +
| Yes
 +
| Daniela
 +
| Rename just 'Gene Expression'?
 +
| Link out
 +
|-
 +
| Breakpoint Data
 +
| Testing
 +
| Paul Davis, Gary Williams
 +
| emails wormbase-genenames
 +
| Retire?
 +
|-
 +
| Deletion/Duplication Data
 +
| Testing
 +
| Paul Davis, Gary Williams
 +
| emails wormbase-genenames
 +
| Retire?
 +
|-
 +
| Multipoint Cross Data
 +
| Testing
 +
| Paul Davis, Gary Williams
 +
| emails wormbase-genenames
 +
| Retire?
 +
|-
 +
| Two-point Cross Data
 +
| Testing
 +
| Paul Davis, Gary Williams
 +
| emails wormbase-genenames
 +
| Retire?
 +
|-
 +
| New/Altered Rearrangement Data
 +
| Testing
 +
| Paul Davis, Gary Williams
 +
| emails wormbase-genenames
 +
| Retire?
 +
|-
 +
| New Gene Class Name Proposal
 +
| Testing
 +
| Paul Davis, Tim Schedl
 +
| emails wormbase-genenames
 +
| Link out
 +
|-
 +
| Wild Isolate Strain Data
 +
| Yes
 +
| Paul Davis, Gary Williams
 +
| emails wormbase-genenames
 +
| used recently, but largely by just one submitter
 +
|-
 +
|}
  
 +
Authors can resubmit the form as often as they want, as long as they use the link provided in the original e-mail. Resubmitted data will overwrite the existing data in the afp tables.
  
==SVM (sfp)...not implemented yet==
+
Note that the author email address is always populated at the bottom of the form, but that this field is exempt from the reminder about previously submitted data.
Data entered here is fed directly to the data curator. <br>
 
Data also shows up in the sfp column on the curator first-pass form. <br>
 
*If the SVM results are correct, the fpcurator merges the data into the cfp table entry box.
 
*If the SVM results are incorrect, the fpcurator should write "false positive" in the cfp table entry box.  
 
  
The number of papers flagged for a data type by SVM will be noted on the curation status pages in its own column and added to the total number of flags.  
+
Articles marked as review by the author are entered as "checked" in the afp_nocuratable table. Curators can disagree with this, or any author flagged value, through the curator first pass form by unchecking "yes" and not merging the information into the cfp_table accessed here: http://tazendra.caltech.edu/~postgres/cgi-bin/curator_first_pass.cgi.
  
 +
==Journal first-pass form==
 +
A link to this form is sent to authors by Genetics editors. This form is a much shorter version of the afp, and only contains those data fields corresponding to the data types that are marked up in the GSA mark-up pipeline. The purpose of this form is to alert the GSA markup QC curators that there are new objects in the paper that do not exist in WB yet. The flagging done by this form is not complete, as it does not ask the authors to alert curators for all data types that need to be curated.  These data are stored in the afp tables rather than in a table of their own.
  
=Other ways data objects are being collected=
+
Objects are collected for the following data types:  
==Journal first-pass (jfp), through GENETICS only==
 
This table contains data objects entered directly by Genetics authors from a URL generated by Tracey De Pellegin Connelly through the doi ticket form.  The data objects collected in this table are objects that do not exist in WB already for the following data types:  
 
 
*genesymbol
 
*extravariation
+
*extvariation
 
*newstrains
 
 
*newbalancers
 
Line 61: Line 218:
 
*newcell
 
  
This data is being collected so that Arun can mark-up the Genetics paper and provide links for objects that are not in WB yet, but will be in the future.<br>
+
None of these fields, except genesymbol, are shown on the normal afp_form.  We opted to make a hybrid of the afp_form for the Genetics authors so that they would not be requested to fill out another WB generated form for us after their paper was published and because we needed this extra information from them asap.  It is also my understanding that these authors would be required to fill out the form as part of the publication process, so this was an opportunity to have 100% author feedback for paper flagging.
 +
 
 +
When the journal first-pass form is submitted, all data entered are automatically added to the lexica that the markup scripts use for entity recognition and linking.
 +
*the QC curator is alerted by e-mail so that bad entries can be silenced (a '~' is placed before the text to be silenced).  This silencing is necessary when data are entered in a bad format.  The QC curator needs to make sure entities are entered in a way that is consistent with the known entities. 
 +
*the curator responsible for the data type entered is alerted by e-mail.  They are sent an e-mail that lets them know the information is coming from the GSA markup pipeline.  These curators do not have access to the paper right away as the paper is not available in PubMed and has not been indexed by Textpresso yet.
 +
 
 +
=Reporting/removing false positive flags=
 +
If a flag received by any of the forms turns out to be a false flag, that is, the paper does not actually contain the indicated data, you should remove the flag.
 +
*Go to the paper editor http://tazendra.caltech.edu/~postgres/cgi-bin/paper_editor.cgi
 +
*Scroll to the bottom
 +
*Click "Flag False Positives", you will be taken to a new page
 +
*From the drop down menu select the data type that was incorrectly flagged
 +
*Enter the WBPaperID (can enter just the numbers)
 +
*Click "Enter False Positive"
 +
Note, you can also see all false positives for a given data type by clicking "Show False Positives"
 +
=Textpresso automated (tfp) scripts=
 +
Data entered here is fed directly to the data curator. <br>
 +
Data also shows up in the tfp column on the cfp_form. <br>
 +
*If the fp curator agrees with the textpresso results, they should merge the data into the cfp entry box.
 +
*If the fp curator does not agree with the textpresso results they should leave the cfp table entry box blank or correct the info.  If something is written in the cfp column, the paper will be counted as a positive flag for that data type on the curator status form. <br>
 +
 
 +
<i>not implemented yet:  The number of papers flagged for a data type by Textpresso will be noted on the curation status pages in its own column and added to the total number of flags.</i>
  
The form that authors use to enter data into the jfp table combines these data fields with the normal author first-pass form. In effect, the author is entering data into two separate tables using one form.  <br>
+
=SVM (sfp) see [http://www.wormbase.org/wiki/index.php/Caltech_documentation SVMs for First Pass Curation]=
 +
<i>Data entered here is fed directly to the data curator. <br>
 +
Data also shows up in the sfp column on the cfp_form. <br>
 +
* If the fp curator agrees with the results, they should merge the data into the cfp entry box or type yes.
 +
* If the fp curator does not agree with the results they should leave the cfp table entry box blank or correct the info. If something is written in the cfp column, the paper will be counted as a positive flag for that data type on the curator status form.
 +
   
 +
The number of papers flagged for a data type by SVM will be noted on the curation status pages in its own column and added to the total number of flags.
 +
</i>
 +
==update==
 +
SVM results are here: <br>
 +
http://131.215.52.209/celegans/svm_results/<br>
 +
and are called on to create this form: <br>
 +
http://tazendra.caltech.edu/~postgres/cgi-bin/svm_results.cgi
  
The pipeline for alerting data curators from this table still needs to be worked out, right now, it is dealt with manually.
+
--[[User:Kyook|kjy]] 20:43, 8 March 2012 (UTC)
  
==First-pass details==
+
=First-pass details=
 
[[First-pass schedule, instructions, automation]]
 
 +
 +
 +
[[Category:Curation]]
 +
[[Category:First Pass]]

Latest revision as of 16:42, 11 January 2018

Caltech documentation
First-pass to Curation
First-pass schedule, instructions, automation

Flagging mechanisms

Papers are flagged for specific data types through four different methods:

  • curator flagging - a WB curator reads through the paper and manually enters a check mark or comment for a data type in the curator first-pass form; data are stored in the cfp table.
  • author/journal flagging - authors flag their own papers: they are sent a link to either the author first-pass form or a journal first-pass form (for GSA and G3 journals only) and enter a check mark or comment for each data type; data are stored in the afp table.
  • Textpresso regular expression - curators work with Textpresso developers to search the full text of papers for key expressions (words/phrases from categories, the co-occurrence of these words/phrases in sentences, etc.) that flag a paper as positive for a data type; data are stored in the tfp table; see Textpresso flagging pipelines.
  • Support Vector Machine algorithms - an SVM is set up for a specific data type: based on a known positive and a known negative set of papers, a reasonable balance between recall and precision is determined, and the most recent papers indexed by Textpresso are then flagged automatically. Papers are ranked as high, medium, low, or negative with regard to the probability of being a true positive for a given data type. Data are stored on caprica in data indexed folders. The sets of papers that have been SVMed, as .concat (all files as one concatenated file), are accessible in postgres in the cur_svmdata tables. (An earlier pointer to the flagging results at http://tazendra.caltech.edu/~postgres/cgi-bin/svm_results.cgi is struck out on the wiki page.)


Alerting the data type curator

  • Alerts from the cfp are sent when a first-pass curator flags a paper using the cfp_form and only if the "send" checkbox is on.
  • Alerts from the afp tables are sent when the form has been submitted by an author, if the curator has agreed to be alerted.
  • Alerts from tfp are set up by each individual curator with the Textpresso group and Juancarlos.
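
The three alert rules above can be read as one decision per flag source. The following is a minimal Python sketch of that reading, not the code that runs on tazendra. The class FlagEvent, its field names, and the function should_alert are hypothetical names chosen for illustration, and the tfp rule is reduced to a per-curator switch because the real arrangement is made individually with the Textpresso group.

```python
from dataclasses import dataclass


@dataclass
class FlagEvent:
    """One flagging event for a data type (all field names are illustrative)."""
    source: str                     # "cfp", "afp", or "tfp"
    send_checked: bool = False      # cfp: state of the "send" checkbox on the cfp_form
    form_submitted: bool = False    # afp: the author has submitted the form
    curator_opted_in: bool = False  # afp/tfp: the data curator has arranged to be alerted


def should_alert(event: FlagEvent) -> bool:
    """Return True if the data type curator would be e-mailed for this event."""
    if event.source == "cfp":
        # Alerts go out only when the first-pass curator leaves "send" on.
        return event.send_checked
    if event.source == "afp":
        # Author submission alone is not enough; the curator must have agreed.
        return event.form_submitted and event.curator_opted_in
    if event.source == "tfp":
        # Assumption: model the individually arranged tfp alerts as a simple switch.
        return event.curator_opted_in
    raise ValueError(f"unknown flag source: {event.source}")


print(should_alert(FlagEvent("cfp", send_checked=True)))
print(should_alert(FlagEvent("afp", form_submitted=True)))
```

In this sketch an author submission without curator opt-in produces no alert, which mirrors the "if the curator has agreed to be alerted" condition.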

Curator flagging

Curator first-pass (cfp) form

This form has three purposes

  • for curators to manually flag a paper
  • to collect and display flags from all other methods of flagging papers
  • to evaluate the input from authors and other automated flagging methods

This form is accessed by clicking "Curate !" for a paper on the cfp cgi
This form contains columns for the different first-pass tables, currently, the Textpresso first-pass table, the author first-pass table, and the curator first-pass table.

Textpresso FP results are currently not being displayed in the cfp_form for those data types that have been considered "good" (sufficiently automated) by curators. These data fields have been removed from the cfp_form, so they would have to be added back if we want to see them.
These data types are:

  • antibody
  • transgene
  • extvariation (new alleles)

n.b. because these data types have been removed from the cfp table, first-pass curator flags are not being counted for them any longer.

Curator first-pass table (cfp)

Data is entered directly into postgres through this table. The curator uses the text boxes to enter data based on their own paper reading or to agree with or modify data entered by authors, textpresso (obsoleted?)

Upon hitting 'Flag !' data entered into the cfp, afp, and/or tfp is sent to the e-mail that corresponds with the curator associated with that data type.

However, for the purposes of the curation status form, only those data types with entries in the cfp table are counted as flagged. So for the paper to be considered flagged for the curation status form, the first-pass data curator must merge the data from the author data or textpresso data into the cfp box, or else type something else in the box.

Using the curator first-pass form

FP curators are expected to approve or reject author-entered data

  • Author-entered data is set by default as 'approved'.
  • To reject the author-entered data, the curator should uncheck the check box.
  • Approved and rejected author data is stored in postgres and can be queried by looking at the afp_ table, which has paperID, data, author timestamp, curatorID, approve/reject, curator timestamp

Approving author-entered data alone DOES NOT flag the paper in a statistical way

When a fp curator hits "flag!" an e-mail will be sent to the data curator if there is any data in any author-entered data field, so the data curator will be alerted to the presence of the paper.
However, the paper will not be counted as flagged in the cfp for that data type unless the curator enters data into the fp curator column (i.e., into the cfp table); therefore, approved author data needs to be manually entered into the curator first-pass table.
Some actions a fp curator can take with author-entered data:

  • If you agree with everything the author says, click "merge" to enter the data into the cfp.
  • If you just think "yes", type "yes" in the cfp
  • If you partially agree with author-entered data, merge it and edit it in the cfp
    The point of the merge link was to save clicks in copy-pasting when a curator partially agreed with an author -- Juancarlos
  • If you rejected the author-entered data, and you think the paper should not be flagged for that data type, do not enter anything in the cfp.
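
The four curator actions above all come down to what ends up in the cfp box. The sketch below illustrates that mapping in Python, under the assumption that an empty box is represented as None. The function name resolve_cfp_entry and the action labels are hypothetical and do not correspond to identifiers in the cfp_form code.

```python
from typing import Optional


def resolve_cfp_entry(action: str, author_data: str,
                      edited_text: Optional[str] = None) -> Optional[str]:
    """Return the text to store in the cfp box, or None to leave it empty."""
    if action == "merge":
        # Full agreement: copy the author-entered data as is.
        return author_data
    if action == "yes":
        # The curator simply agrees that the data type is present.
        return "yes"
    if action == "merge_and_edit":
        # Partial agreement: start from the author text, store the edited version.
        if edited_text is None:
            raise ValueError("merge_and_edit needs the curator's edited text")
        return edited_text
    if action == "reject":
        # Nothing is entered, so the paper is not counted as flagged.
        return None
    raise ValueError(f"unknown action: {action}")


print(resolve_cfp_entry("merge", "new allele described in Figure 2"))
print(resolve_cfp_entry("reject", "new allele described in Figure 2"))
```

Only the first three actions leave text in the cfp box, so only those would count toward the flagged number on the curation status form.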

Curation_status cgi

  • http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi
  • Only flags from the cfp_ tables are considered flagged
    • someone should be looking at afp flags and moving the afp_ data to cfp_ if it is correct.
    • afp_ tables are only being looked at for newmutant and overexpr to change the color in the curation_status.cgi for those datatypes if they're already in cfp_
    • SVM data is not automatically added to the cgi
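
As a rough illustration of the counting rule above, the Python sketch below tallies flagged papers per data type from cfp rows only. The row layout (paper ID, data type, entered text) and the function name flagged_counts are assumptions made for this example. The real curation_status.cgi reads the cfp_ postgres tables and may differ in detail.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumed row shape for illustration: (paper_id, datatype, entered_text)
Row = Tuple[str, str, str]


def flagged_counts(cfp_rows: Iterable[Row]) -> Counter:
    """Count papers per data type that have a non-empty cfp entry."""
    # A set removes duplicates so a paper is counted once per data type.
    flagged = {(paper, datatype)
               for paper, datatype, text in cfp_rows
               if text and text.strip()}
    return Counter(datatype for _, datatype in flagged)


cfp = [
    ("00032111", "newmutant", "yes"),
    ("00032933", "newmutant", ""),          # empty box: not counted
    ("00032933", "overexpr", "merged author text"),
]
print(flagged_counts(cfp))
```

Author (afp) rows are deliberately not an input here: under the rule above they only count once a curator has moved them into cfp.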

Author flagging

Author first-pass form

An email is sent to authors containing a password-protected link to an author first-pass form generated by scripts in /home/postgres/work/pgpopulation/afp_papers. The email is sent by a cronjob that runs /home/postgres/work/pgpopulation/afp_papers/assign_passwd.pl at 1pm on Thursdays.


This form is sent to the first e-mail address that Textpresso extracts from a paper. Textpresso scans the paper for an e-mail address based on the '@'. The first e-mail it finds is used as the author contact. The first line containing an '@' is stored in an output file here : http://textpresso-dev.caltech.edu/azurebrd/grep_output

The addresses are parsed into email addresses stored on tazendra here : /home/postgres/work/pgpopulation/afp_papers/textpresso_emails (the first few addresses are junk, but this list is the source of the paper and e-mail connection)

The papers with textpresso body are here : http://textpresso-dev.caltech.edu/azurebrd/textpresso_has_body

This extraction doesn't work with all journals :

  • BMC journals - there may be a systematic problem with a consistent pattern of having a "* - " right in front of the corresponding author's e-mail. (6/20/09 The code was changed to kludge this particular problem to look for the "* -")

Continuing problems, which result in no author e-mail sent:

  • the email isn't correct because of tokenizing (e.g. in 00032111 and 00032933 the tokenizer splits the sentence in the middle of the email because of a dot; seen with e.g. BMC journals)
  • the PDF is a provisional PDF
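
To make the '@'-based extraction and its failure mode concrete, here is a minimal Python sketch. It is an illustration only and not the Textpresso or afp_papers code: the regular expression, the function name first_email, and the choice to keep scanning later lines when a line with '@' yields no usable address are all assumptions.

```python
import re
from typing import Iterable, Optional

# Deliberately simple address pattern, chosen for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def first_email(lines: Iterable[str]) -> Optional[str]:
    """Return the first e-mail address found in the text lines, or None."""
    for line in lines:
        if "@" not in line:
            continue  # only lines containing '@' are candidates
        match = EMAIL_RE.search(line)
        if match:
            # A search skips leading markers such as "* - " before the address.
            return match.group(0)
    return None


paper = ["Background", "* - jane.doe@example.org", "second.author@example.org"]
print(first_email(paper))

# A tokenizer that splits at the dot breaks the address, so nothing usable is found.
print(first_email(["Correspondence: jane.doe@example. org"]))
```

The second call shows why a split in the middle of an address results in no author e-mail being sent: the remaining fragment no longer looks like a complete address.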

Articles whose authors are not requested for first-pass include:

  • any article with no e-mail contact information
  • old articles
  • book chapters


E-mails are sent out weekly, every Thursday. If no e-mail address is recognized, no link is sent; however, if an e-mail address becomes available for an author who has later verified the paper as theirs, the link will be sent at that time. E-mails are sent out in batches of no more than 50 a week. These data are stored in the afp tables.
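
The weekly cap can be pictured as splitting the queue of papers awaiting an author e-mail into groups of at most 50. The Python sketch below shows that idea only. The function name weekly_batches and the assumption that papers left over simply roll into the following week are illustrative, since the page does not describe how the real cronjob carries papers over.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def weekly_batches(paper_ids: Iterable[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield successive batches of at most batch_size papers, one per week."""
    queue = iter(paper_ids)
    while True:
        batch = list(islice(queue, batch_size))
        if not batch:
            return  # queue exhausted
        yield batch


pending = [f"{n:08d}" for n in range(1, 121)]  # 120 made-up paper numbers
for week, batch in enumerate(weekly_batches(pending), start=1):
    print(week, len(batch))
```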

Author first-pass form revisions

    The afp form contains all the same data fields as the cfp.  The afp form is currently under revision.
    AFP SOP:
    1. Authors will be emailed a form to verify flagged data (via Textpresso Central pipelines), add data flags, and submit annotations.
    2. Submitted annotations will use existing forms on: http://www.wormbase.org/about/userguide/submit_data#01--10
    3. Check with Juancarlos about code that excludes Genetics and G3 papers from the afp pipeline, since authors will no longer be asked for flagging via the mark-up pipeline.
Submission Form | Currently Used? | Responsible staff member | Comment | Future action for afp form
Phenotype Data Submission | Yes | Chris, Gary S. | | Link out
Allele Sequence Data Submission | Yes | Paul Davis | | Link out
Expression Data Micropublication | Yes | Daniela | This is different from data submission from already published paper. | Keep separate.
Gene Description | Yes | Ranjana | Continue to use? Remove from afp form, but keep on WB page? | Do not link out from afp.
Updating contact information | Yes | Cecilia | | Link out (pre-populate with what we have in postgres for confirmation)
Updating intellectual lineage information | Yes | Cecilia | | Link out
Gene/Sequence Links (present three times on the WB page) | Yes | Paul Davis, Tim Schedl | emails wormbase-genenames | Link out
Gene Expression and Regulation | Yes | Daniela | Rename just 'Gene Expression'? | Link out
Breakpoint Data | Testing | Paul Davis, Gary Williams | emails wormbase-genenames | Retire?
Deletion/Duplication Data | Testing | Paul Davis, Gary Williams | emails wormbase-genenames | Retire?
Multipoint Cross Data | Testing | Paul Davis, Gary Williams | emails wormbase-genenames | Retire?
Two-point Cross Data | Testing | Paul Davis, Gary Williams | emails wormbase-genenames | Retire?
New/Altered Rearrangement Data | Testing | Paul Davis, Gary Williams | emails wormbase-genenames | Retire?
New Gene Class Name Proposal | Testing | Paul Davis, Tim Schedl | emails wormbase-genenames | Link out
Wild Isolate Strain Data | Yes | Paul Davis, Gary Williams | emails wormbase-genenames | used recently, but largely by just one submitter

Authors can resubmit the form as often as they want, as long as they use the link provided in the original e-mail. Resubmitted data will overwrite the existing data in the afp tables.

Note that the author email address is always populated at the bottom of the form, but that this field is exempt from the reminder about previously submitted data.

Articles marked as review by the author are entered as "checked" in the afp_nocuratable table. Curators can override this, or any other author-flagged value, in the curator first-pass form (http://tazendra.caltech.edu/~postgres/cgi-bin/curator_first_pass.cgi) by unchecking "yes" and not merging the information into the cfp_table.

Journal first-pass form

A link to this form is sent to authors by Genetics editors. This form is a much shorter version of the afp form and contains only the data fields corresponding to the data types that are marked up in the GSA mark-up pipeline. Its purpose is to alert the GSA markup QC curators that the paper contains new objects that do not yet exist in WB. The flagging done through this form is not complete, as it does not ask the authors to alert curators for all data types that need to be curated. These data are stored in the afp tables rather than in a table of their own.

Objects are collected for the following data types:

  • genesymbol
  • extvariation
  • newstrains
  • newbalancers
  • antibody
  • transgene
  • newsnp
  • newcell

None of these fields except genesymbol appears on the normal afp_form. We opted to make a hybrid of the afp_form for the Genetics authors so that they would not be asked to fill out another WB-generated form after their paper was published, and because we needed this extra information from them as soon as possible. It is also my understanding that these authors would be required to fill out the form as part of the publication process, so this was an opportunity to have 100% author feedback for paper flagging.

When the journal first pass form is submitted, all data entered are automatically added to the lexica that the markup scripts use for entity recognition and linking.

  • the QC curator is alerted by e-mail so that bad entries can be silenced (a '~' is placed before the text to be silenced). This silencing is necessary when data are entered in a bad format. The QC curator needs to make sure entities are entered in a way that is consistent with the known entities.
  • the curator responsible for the data type entered is alerted by e-mail. They are sent an e-mail that lets them know the information is coming from the GSA markup pipeline. These curators do not have access to the paper right away as the paper is not available in PubMed and has not been indexed by Textpresso yet.
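The '~' silencing convention described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes that the markup scripts simply skip any lexicon entry that begins with '~', which may not match how the real scripts read the lexica.

```python
def silence(entry):
    """Mark a lexicon entry as silenced by placing '~' before it.

    Already-silenced entries are returned unchanged.
    """
    return entry if entry.startswith("~") else "~" + entry

def active_entries(lexicon_entries):
    """Return only the entries that are not silenced (no leading '~')."""
    return [e for e in lexicon_entries if not e.startswith("~")]

# Made-up example: the second entry was typed in a bad format by an author.
lexicon = ["good-entry", "badly formatted entry!!"]
lexicon[1] = silence(lexicon[1])
usable = active_entries(lexicon)
```

Keeping the silenced text in the list, rather than deleting it, leaves a record of what the author originally entered.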

Reporting/removing false positive flags

If a flag received through any of the forms turns out to be a false positive, that is, the paper does not actually contain the indicated data, you should remove the flag.

  • Go to the paper editor http://tazendra.caltech.edu/~postgres/cgi-bin/paper_editor.cgi
  • Scroll to the bottom
  • Click "Flag False Positives", you will be taken to a new page
  • From the drop down menu select the data type that was incorrectly flagged
  • Enter the WBPaperID (can enter just the numbers)
  • Click "Enter False Positive"

Note, you can also see all false positives for a given data type by clicking "Show False Positives"
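Because the false positive page accepts either a full WBPaperID or just its numbers, the two input styles can be thought of as normalizing to one identifier. The sketch below is a hypothetical helper, not the paper editor's code; it assumes the usual "WBPaper" prefix followed by eight zero-padded digits.

```python
import re

def normalize_wbpaper_id(text):
    """Turn 'WBPaper00012345' or a bare number like '12345' into a full ID.

    Assumption: IDs are 'WBPaper' plus eight zero-padded digits.
    """
    match = re.fullmatch(r"\s*(?:WBPaper)?(\d{1,8})\s*", text)
    if match is None:
        raise ValueError(f"not a recognizable WBPaperID: {text!r}")
    return "WBPaper" + match.group(1).zfill(8)

# Both input styles resolve to the same identifier.
full_id = normalize_wbpaper_id("WBPaper00012345")
from_digits = normalize_wbpaper_id("12345")
```

Rejecting anything that does not match keeps a mistyped ID from being recorded against the wrong paper.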

Textpresso automated (tfp) scripts

Data entered here is fed directly to the data curator.
Data also shows up in the tfp column on the cfp_form.

  • If the fp curator agrees with the textpresso results, they should merge the data into the cfp entry box.
  • If the fp curator does not agree with the textpresso results they should leave the cfp table entry box blank or correct the info. If something is written in the cfp column, the paper will be counted as a positive flag for that data type on the curator status form.
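The counting rule in the second bullet (anything written in the cfp column counts as a positive flag) can be sketched like this. The function names and the sample data are hypothetical, and treating whitespace-only entries as blank is an assumption rather than documented behaviour of the curator status form.

```python
def is_positive_flag(cfp_value):
    """A paper counts as flagged when its cfp entry holds any non-blank text."""
    return bool(cfp_value and cfp_value.strip())

def count_positive_flags(cfp_column):
    """Count flagged papers in a {paper_id: cfp_value} mapping for one data type."""
    return sum(1 for value in cfp_column.values() if is_positive_flag(value))

# Made-up cfp column for one data type.
cfp_column = {
    "WBPaper00000001": "merged from tfp",  # curator agreed and merged
    "WBPaper00000002": "",                 # curator disagreed, left blank
    "WBPaper00000003": None,               # never filled in
    "WBPaper00000004": "corrected value",  # curator corrected the info
}
positives = count_positive_flags(cfp_column)
```

Note that under this rule a corrected entry still counts as positive, so a curator who disagrees entirely should leave the box blank.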

not implemented yet: The number of papers flagged for a data type by Textpresso will be noted on the curation status pages in its own column and added to the total number of flags.

SVM (sfp) see SVMs for First Pass Curation

Data entered here is fed directly to the data curator.
Data also shows up in the sfp column on the cfp_form.

  • If the fp curator agrees with the results, they should merge the data into the cfp entry box or type yes.
  • If the fp curator does not agree with the results they should leave the cfp table entry box blank or correct the info. If something is written in the cfp column, the paper will be counted as a positive flag for that data type on the curator status form.

The number of papers flagged for a data type by SVM will be noted on the curation status pages in its own column and added to the total number of flags.

update

SVM results are here:
http://131.215.52.209/celegans/svm_results/
and are called on to create this form:
http://tazendra.caltech.edu/~postgres/cgi-bin/svm_results.cgi

--kjy 20:43, 8 March 2012 (UTC)

First-pass details

First-pass schedule, instructions, automation