Difference between revisions of "GSA Markup SOP"

From WormBaseWiki
 
(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
[http://wiki.wormbase.org/index.php/Caltech_documentation Caltech_documentation]
 
  
[[GSA linking pipeline]]
+
[[GSA linking pipeline | Visual overviews of linking pipeline]]<br>
 +
[http://textpresso-dev.caltech.edu/gsa/worm/html/ Marked-up papers]
  
===GENETICS Editors duties===
+
=GENETICS Editors duties=
*Retrieve WBPaperID- GSA Editors enter the approved paper doi into the WBPaperID ticketer [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi here] to retrieve a WBPaperID and link to a Journal First Pass form (JFP) for the paper. Note: this step can be inadvertently missed by the editors. The ticketer captures the last six digits of the doi of the article, which is used as an identifier throughout the linking and QC process. Mappings between doi and WBPaperIDs can be found [http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi here].
+
Retrieving/Assigning a WBPaperID
*GSA Editors send a notice of approval to the author along with the link to the JFP.
+
*GSA Editors enter the approved paper doi into the WBPaperID [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi ticketer form].  
'''Ask Genetics editors if they can submit the paper using the WBPaperID or send more info on the paper'''
+
**The ticketer captures the last six digits of the doi of the article, which is used as an identifier throughout the linking and QC process.  
+
**The ticketer assigns an official WBPaperID, which is needed to track the paper through WB curation paths
===WBPaperID ticketer===
+
* Retrieve the link to a Journal First Pass form (JFP) from the ticketer. ''Note: this step is sometimes missed by the GSA editors, causing confusion when the paper comes through the normal WB paper pipeline.''
*The ticketer generates a link to a Journal First Pass form (JFP); the link for each paper is accessible [http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi here]. The GOid mappings were put in place for GO linking and are not currently used.
+
* Editors send a notice of approval to the author along with the link to the JFP.
*The WBPaperID ticketer enters the paper into postgres as a doi object only. All information about the paper (title, authors, abstract, PMID, etc.) is entered when the paper comes through the normal paper pipeline.  
+
**GSA editors control the message to the authors - to make changes to the message, contact GSA editors. The message reads as follows:
*The ticketer sends an alert to the QCFast curator and WBPaper curator (Kimberly) that a WBPaperID was assigned.  
+
<pre>
*The WBPaper curator (Kimberly) uses the doi to manually identify the paper during the normal pubmed paper pipeline.
+
Dear Dr. XXXX:
*The paper should show up in PubMed within a couple of days or weeks, depending on when the doi was entered and where the paper is in the processing stage.
 
  
*In the normal paper pipeline, a cron job fetches PMIDs from PubMed. Note: the cron job doesn't look for doi's; it only looks for PMIDs. When a new Genetics or G3 paper shows up in the paper editor, Kimberly manually looks for the corresponding doi to make sure the paper has already been entered, and then manually adds the PMID. Once the PMID is added, Kimberly approves the paper through the regular paper pipeline.
+
DOI: 10.1534/genetics.11X.XXXXXX
 +
WBPaper: X
  
====Journal author first pass form (JFP)====
+
The Genetics Society of America is working with textpresso (www.textpresso.org) and WormBase (www.wormbase.org) to create links between genetic and genomic objects in your article to the appropriate page in WormBase. These links will be included in both the online full text and PDF versions of the published article.
 +
 
 +
If you want any genes, alleles, transgenes, CGC-destined strains, anatomy terms, etc., previously known or new/described for the first time in your article to be linked to WormBase please enter the names of these objects using the form at the following link:
 +
 
 +
http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_first_pass.cgi?action=Curate&paper=X&passwd=X.X
 +
 
 +
 
 +
Because your submitted data will be processed automatically, please follow the examples carefully. If you have questions, or want to upload a file instead, contact Karen Yook, Wormbase Curator, at karen@wormbase.org.
 +
 
 +
 
 +
Thank you.
 +
 
 +
XXXX XXXX
 +
Editorial Assistant
 +
</pre>
 +
 
 +
 
 +
Mappings between doi and WBPaperIDs can be found [http://tazendra.caltech.edu/~postgres/cgi-bin/author_fp_display.cgi?afp_jfp=jfp here].
 +
 
 +
=WBPaperID ticketer=
 +
*GSA editors need to enter the paper into the [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi Journal_paper_ticketer] to retrieve a link to the journal first pass form.
 +
*GSA editors enter the paper doi into the [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi Journal_paper_ticketer]; full doi expressions need to be used, e.g., 10.1534/genetics.113.157685; the link for each paper is accessible on the [http://tazendra.caltech.edu/~postgres/cgi-bin/author_fp_display.cgi?afp_jfp=jfp Journal Data Display], see below.
 +
*The ticketer enters the paper into postgres as a doi object only. All information about the paper (title, authors, abstract, PMID, etc.) is entered when the paper comes through the normal paper pipeline.
 +
*The ticketer sends an alert to the QCFast curator and WB Paper curator (Kimberly) that a WBPaperID was assigned.
 +
*The paper should show up in PubMed within a couple of days or weeks, depending on when the doi was entered and where the paper is in the processing stage.
 +
*The WBPaper curator (Kimberly) uses the doi to manually identify the paper during the normal paper pipeline. In the normal paper pipeline, a cron job fetches PMIDs from PubMed. Note: the cron job doesn't look for doi's; it only looks for PMIDs. When a new Genetics or G3 paper shows up in the paper editor, Kimberly manually looks for the corresponding doi to make sure the paper has already been entered, and then manually adds the PMID. Once the PMID is added, Kimberly approves the paper through the regular paper pipeline.
 +
*
 +
 
 +
==Journal author first pass form (JFP)==
 
*An alert is sent to the QCFast curator when an author submits data through the JFP. The QCFast curator must check the entered data to make sure it has been entered in a useful format, see following. Alerts (flags) are also sent to the appropriate WB curator to alert them to new objects.
 
  
Line 42: Line 71:
  
 
*Format for data entered through the form:
 
**Data need to be comma-separated and need to be in the standard format for each data type and all extraneous information hidden (see below).
+
**Data need to be comma-separated and in the standard format for each data type, all extraneous information hidden (see below).
 
**Gene
 
***public_name
Line 55: Line 84:
 
**Alleles 
 
***currently only elegans alleles;  priority on curating alleles of other species is very low
**Strains
+
**Strains (only those accepted by the CGC for deposit into the permanent collection)
 
***currently only marking up known C. elegans strains, mainly available through the CGC; priority on curating other species is very low
 
  
 +
*Changes to the form need to go through Juancarlos
  
===Entity linking scripts===
+
==Journal Data Display==
[[GSA linking pipeline]]<br>
+
This site displays information about ticketed papers<br>
[[Linking script pipeline]]
+
http://tazendra.caltech.edu/~postgres/cgi-bin/author_fp_display.cgi?afp_jfp=jfp<br>
 +
columns include:
 +
*DOI - entered through [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi Journal_paper_ticketer]
 +
*WBPaper - auto generated by ticketer, links to [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/paper_display.cgi?data_number=00050589&action=Search+%21 paper display]
 +
*data link - links to jfp, 'link' signifies data was entered, 'no data' indicates author didn't submit any information
 +
*textpresso html - links to html view of paper from species specific [http://textpresso-dev.caltech.edu/gsa/worm/html/037184.html html directory]
 +
*proofs - links to paper proof uploaded manually to species [http://textpresso-dev.caltech.edu/gsa/worm/proofs/ proof directory] on textpresso-dev
 +
**to upload proofs
 +
''download proofs from DJS into a local directory 'proofs', copy to textpresso-dev''
 +
>scp -r ./proofs kyook@textpresso-dev.caltech.edu:/home/kyook
 +
>ssh kyook@textpresso-dev.caltech.edu
 +
>sudo cp ./proofs/* /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/
 +
''make all pdfs readable''
 +
>sudo chmod a+r -R /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/
 +
 
 +
Proofs should be available on Journal Data Display
 +
 
 +
*pdf link
 +
*temp pdf
 +
*go link
 +
 
 +
=Linking steps=
 +
[[GSA linking pipeline | Visual overviews of linking pipeline]]<br>
 +
[[Linking script pipeline | Linking scripts]]<br>
 +
[[Markup excerpts]]<br>
  
 
Dartmouth Journal Services (DJS), the GSA journal publisher, uploads the file, as NML-formatted XML, by ftp to the textpresso server.<br>
Line 71: Line 125:
  
 
The QCFast curator goes through the marked-up html view to make sure the links are correct and no objects are missed. <br>
 
*If there are multiple objects not linked, the QCFast curator can add all missing objects to the JFP for that paper and rerun the linking script. Otherwise the QCFast curator can add and delete links through the QCFast interace itself.
+
*The QCFast curator can add or remove links through the QCFast interface.
*If links are not able to be made, e.g., due to XML formatting, the QCFast curator can download the file and edit the links manually. The name of the file must remane the same as the original file or uploading the manually edited file will result in an error.   
+
*If there are multiple objects not linked, the QCFast curator can opt to add all missing objects to the JFP for that paper and rerun the linking script (WB interface only).
 +
*If links are not able to be made, e.g., due to XML formatting, the QCFast curator can download the file and edit the links manually. The name of the file must remain the same as the original file or uploading the manually edited file will result in an error.   
  
The final edited linked file, as an html, is uploaded by ftp to the DJS servers.  
+
The final edited linked file, as an html, is uploaded, by ftp, to the DJS server.  
 
*A final e-mail is sent to the QCFast curator that confirms the upload. 
 
*Uploading the file will also launch a script that creates a final entity link table. 
 
*This final e-mail should be received within an hour of the upload.
  
===Proofreading===
+
See [[Entity classes and problems]] for linking issues and how to deal with them.
 +
 
 +
=ftp to DJS=
 +
05ftpAndEmailDjs.pl <br>
 +
 
 +
The QCFast interfaces for all MODs have a button to FTP the edited file to DJS. This button launches the 05ftpAndEmailDjs.pl script, which:
 +
*creates a final entity table in /data1/Users/arunr/gsa/worm/entity_link_tables
 +
*makes a log entry in /data1/Users/arunr/gsa/worm/logs
 +
*deposits the linked file in /data1/Users/arunr/gsa/worm/linked_xml
 +
*creates a record in /data1/Users/arunr/gsa/worm/done
 +
*e-mails a link to the final entity_link_table to the MOD curator
 +
*e-mails DJS an alert about the file on the ftp server
 +
 
 +
To re-ftp a file, both the files created by the script and the file deposited on the ftp server need to be removed. The DJS server is password protected, and an account and access to the server need to be assigned by DJS.
 +
 
 +
=Proofreading=
 
*DJS sends WormBase the pdf formatted proof before sending the article back to the author. 
 
*DJS requests a 24 hour turnaround.
Line 86: Line 156:
 
*If the author gets back to DJS about errors or suggested links, DJS consults the curator about the feedback and the curator recommends an action (remove the link or send a corrected URL).
  
==Linking/Proofing Issues==
+
=Project issues=
 
+
==Ambiguous names==
*[[Entity classes and problems]]
+
This problem refers to links being made to the wrong entity based on a name used to describe multiple objects within a species or between many species. 
*Examples
+
*Intraspecies issue: In WB some clone names and strain names are identical. This problem has not been able to be resolved at the script level; these ambiguous names are put on an exclusion list instead.
**[http://textpresso-dev.caltech.edu/gsa/worm/html/ Marked-up papers]
+
*Interspecies issue:  Gene names, including aliases, from SGD, Flybase, and MGI should be collected and compared with our own gene lists, including all synonyms.   Arun will be making a splash page on textpresso dev to house all these ambiguous names, to direct readers if a proper URL cannot be resolved (2/18/10 - last action taken)
**[[Markup excerpts]]
+
**Example: general anatomy terms can be linked to the wrong species <br>
+
Problem: in [http://textpresso-dev.caltech.edu/gsa/worm/html/GEN115188fin_WB.html this] ''C. elegans'' paper, 'epithelium'  was linked to the elegans epithelium page <br>
===Entities don't exist in WB yet===
 
New entities being described for the first time can still be linked to pages in WB. These links remain silent until the objects are created in subsequent releases.  These new objects need to be added to entity lists via the JFP form, see above. If authors submit the form as soon as they receive the link to the form, the objects are added to the entity list before the XML is received and the linking scripts will link the new objects.  One problem is that authors sometimes enter information on the form in such a way that the scripts cannot read it. A curator must visually check the data from the JFP to make sure the information was entered properly.  The QCFast curator is alerted when an author submits a form, and that alert presents a summary of the information.  Bad entries need to be manually blocked from being read by the JFP scripts by placing two tildes "~~" before the entry. 
 
 
 
====Gene names don't exist in WB====
 
Gene names need to be approved by WB nomenclature curators.  In most cases, the gene name not only has been approved, but already exists on the ace server and so has already been added to the gene list. Authors should declare any new gene names before the linking scripts are run, but authors do not always follow through.  One action is to link the name to the official WBGene sequence ID.  This requires that the authors have explicitly stated what the gene sequence is. Linking the name to the gene should not affect what happens in acedb; the link lives within the online paper alone.  If an author is not explicit, the best option is to contact the author; if there is no response within a reasonable time, no link will be created.  This situation is not frequent.
 
 
 
For example a paper has a couple gene names, "cbr-dpy-1" and "cpy-4", both referring to the same briggsae sequence, but neither name existed in acedb. In this case make a link to the WBGene page rather than create a link with cbr-dpy-1 or cpy-4 as the public name, as the other links are made. 
 
 
 
===Formatting problems break recognition===
 
Entity recognition relies on exact string matches and XML formatting cues.  If entities are not formatted correctly in the XML, they may not be recognized and linked. QCFast does not allow links to be made that do not follow XML formatting restrictions; to link these entities, it is necessary to download the html, manually add the link, then upload the fixed html.
 
 
 
* Case (a): <nowiki><i>Cbr-dpy-1</i>  instead of  Cbr-<i>dpy-1</i></nowiki> <br>
 
::Otherwise the <nowiki><i></nowiki> tag becomes part of the string and string matching fails on Cbr-dpy-1, even if we have it in the wormbase entity list.
 
 
 
* Case (b): <nowiki><i>unc-119</i> instead of <i>unc</i>-<i>119</i></nowiki>
 
 
 
* Case (c): Parentheses shouldn't be italicized since they are used as an indicator of string starts and stops for entity recognition. <br>
 
::<nowiki> <i>cbr-unc-119</i>(<i>st20000</i>) instead of <i>cbr-unc-119(st20000)</i></nowiki>
 
 
 
===combined entity expressions===
 
* double mutants, double alleles
 
* dpy-11/12
 
 
 
==Project issues==
 
===Ambiguous name lists===
 
''We are still working out the details for this issue''<br>
 
From a meeting on 2/18/10, we are moving forward on dealing with ambiguous entity names, that is, names in a paper that may be used to describe multiple objects and that may result in the creation of a faulty link.  This is a problem within a single-species paper, and one we will also need to address in order to expand the project to other species and to multi-species papers.  
 
*Intraspecies issue: This problem is pertinent to entity lists within a single-species paper.  We have run into this problem with clone names and strain names being identical (for example) and have not been able to resolve this discrepancy at the script level.  Our initial response to these ambiguous names has been to put these names on an exclusion list.
 
*Interspecies issue:  This is particularly pertinent when dealing with a paper that contains gene names from different species.  We are collecting gene names, including aliases, from SGD and Flybase, and will compare them with our own gene lists, including all synonyms. We will expand the list comparison to mouse gene names.  Arun will be making a splash page on textpresso dev to house all these ambiguous names, and to direct readers if a proper URL cannot be resolved.
 
**General anatomy terms can be linked to the wrong species <br>
 
Problem: in [http://textpresso-dev.caltech.edu/gsa/worm/html/GEN115188fin_WB.html this] C. elegans paper, 'epithelium'  was linked to the elegans epithelium page <br>
 
 
"T cells migrate toward (Fong et al. 2002). Additionally, loss of GRK3, which is highly expressed in mouse olfactory epithelium (Schleicher et al. 1993), "<br>  
 
Action: "epithelium" were added to the anatomy exclusion list. From this paper, we also added, 'head', 'tail', 'neuron', 'sensory neuron' to the exclusion list as well.
+
Action: "epithelium" was added to the anatomy exclusion list. From this paper, we also added, 'head', 'tail', 'neuron', 'sensory neuron' to the exclusion list as well.
  
===Longevity of links===
+
==Longevity of links==
Thing for consideration
+
Things for consideration
 
*How can we ensure the links are maintained?
 
*How can we ensure non-WB links are maintained?
Line 140: Line 179:
 
The script that forms the entity table can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.
 
  
===Entity Sources===  
+
==People==
Entity lists are available [http://textpresso-dev.caltech.edu/gsa/worm/known_entities/ here]
+
'''Caltech''' Karen Yook, Hans-Michael Müller, Paul Sternberg. ''Alum'': Arun Rangarajan, Chris Grove, Daniela Raciti, James Done, Juancarlos Chan <br>
 +
'''GSA:''' Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson, Virginia Ingerson<br>
 +
'''DJS (Dartmouth Journal Services):'''  Sharon Faelten, Diana Schaeffer, Michelle Kerns. ''Alum'': Lolly Otis, Stephen Haenel <br>
 +
'''SGD''' Marek Skrzypek, Rob Nash<br>
 +
'''Flybase''' Steven Marygold,  Aoife Larkin. ''Alum'': Raymund Stefancsik
  
01downloadModEntities.pl downloads entities<br>
+
==Alerts==
From acedb on spica
+
E-mail alerts are sent when
*Anatomy_name
+
* a paper is entered through the ticketer  Subject line: "GENETICS: WormBase data for accepted paper GENETICS/2015/185272"
*Clone
+
* a jfp form becomes active for the author Subject line: "new paper ticket DOI created"
*Rearrangement
+
* an article is entered into the markup pipeline Subject line: "GSA auto-email: WB 179705 article received from GSA."
*Strain
+
* the linked article is ready for QC Subject line: "GSA auto-email: WB 179705 linked file available"
*Variation
+
* linked and QC'd article has been ftp'd to DJS server Subject line: "GSA auto-email: WB 179705 article FTPed"
From postgres on tazendra
+
* a paper goes through any of the MOD markup pipelines.
*Genes
 
*Transgenes
 
  
02formSortedLexicon.pl forms the lexicon from the known_entities.
+
E-mail addresses are stored in the emails folder for each MOD instance.<br>
 +
Email folders contain files with the respective addresses; use sudo to change/update the addresses.
  
===URL constructors===
+
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/emails
http://www.wormbase.org/wiki/index.php/Linking_To_WormBase
+
*allele_tickets.txt -> worm-bug@sanger.ac.uk
  http://www.wormbase.org/db/get?name=X;class=Y
+
*alleles.txt ->  worm-bug@sanger.ac.u karen@wormbase.org, mueller@caltech.ed
 +
*allele_developers.txt -> mueller@caltech.edu, karen@wormbase.org
 +
*developers.txt -> karen@wormbase.org, mueller@caltech.edu
 +
*curators.txt -> karen@wormbase.org, mueller@caltech.edu, pws@caltech.edu, vanauken@caltech.edu
 +
*final.txt -> karen@wormbase.org, mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, pws@caltech.edu, vanauken@caltech.edu
  
Currently we are forming URLs based on public names versus object class ID's. These links are redirected to the correct page, which is the URL based on object ID. By using the public name for the linking instead of the ID, we can avoid problems that may result in links dying due to objects getting merged or killed.
+
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/fly/emails/
 +
*developers.txt -> karen@wormbase.org, mueller@caltech.edu
 +
*curators.txt -> al806@cam.ac.uk, sjm41@cam.ac.uk, karen@wormbase.org, mueller@caltech.edu
 +
*final.txt -> mueller@caltech.edu, sjm41@cam.ac.uk, al806@cam.ac.uk, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, karen@wormbase.org
  
===Entities===
 
Object -> ''source'' ; web page link <br>
 
<example object pages><br>
 
  
'''Gene''' -> ''ace server (more current than dev)'' ; Gene Summary page URL uses public gene name, which is redirected to WBGeneID<br>
+
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/yeast/emails/
<http://www.wormbase.org/db/get?name=mpk-1;class=Gene><br>
+
*curators.txt -> mueller@caltech.edu, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.org
<http://www.wormbase.org/db/get?name=F43C1.1;class=Gene><br>
+
*developers.txt -> karen@wormbase.org, mueller@caltech.edu
<br>
+
*final.txt -> mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.or
 
 
'''Protein''' -> ''ace server'' ; Gene Summary page:<br>
 
<http://www.wormbase.org/db/get?name=egl-4;class=Gene><br>
 
<br>
 
 
 
'''Allele/SNP''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' ; Variation Report page<br>
 
<http://www.wormbase.org/db/get?name=ga117;class=Variation> <br>
 
<http://www.wormbase.org/db/get?name=hw42941;class=Variation><br>
 
<br>
 
 
 
'''Transgene''' (inserted transgenes only) -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' ;  Transgene Summary page <br>
 
<http://www.wormbase.org/db/get?name=oxIs12;class=Transgene><br>
 
<br>
 
 
 
'''Rearrangement''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' <br>
 
<br>
 
 
 
'''Strain''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb''; Strain Report page<br>
 
<http://www.wormbase.org/db/get?name=MH37;class=Strain ><br>
 
<br>
 
 
 
'''Clone''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb''; Sequence Summary page<br>
 
<http://www.wormbase.org/db/get?name=yk1106g06.3;class=Sequence><br>
 
<http://www.wormbase.org/db/get?name=OSTR153G5_1;class=Sequence ><br>
 
<br>
 
 
 
'''suppress for now''' Anatomy/cell -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' with corrections by Raymond ; Summary of anatomy ontology term<br>
 
<http://www.wormbase.org/db/get?name=HSN;class=Anatomy_term><br>
 
<http://www.wormbase.org/db/get?name=Z1.ppp;class=Anatomy_term><br>
 
<br>
 
 
 
'''suppress for now''' Author -> ''The links are formed using the URL pattern <br>
 
http://www.wormbase.org/db/misc/person?name=$url_encoded_name;paper=$wbpaper_id, <br>
 
where $url_encoded_name is the full name of an author with the following modifications: <br>
 
i) periods after middle names removed and <br>
 
ii) spaces converted to %20, which is the URL-encoded equivalent of a space. <br>
 
 
 
Examples:<br>
 
Name: Eve W. L. Chow <br>
 
Link: http://www.wormbase.org/db/misc/person?name=Eve%20W%20L%20Chow;paper=WBPaper00038245
 
 
 
Name: I. Russel Lee <br>
 
Link: http://www.wormbase.org/db/misc/person?name=I%20Russel%20Lee;paper=WBPaper00038245
 
 
 
(There may also be other special characters with accents, umlauts, etc. The best thing to do would be to check WormBase and use the correct link if the author name already exists. Otherwise, the English equivalent of the special character can be determined visually, a new author link formed, and Cecilia notified.)
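
The two URL patterns described in this section (the <code>db/get?name=X;class=Y</code> form built from public names, and the author link form) can be sketched as below. This is an illustrative sketch only, not the linking scripts themselves; the function names <code>entity_url</code> and <code>author_url</code> are hypothetical, and the handling of accented characters is left to the curator as described above.

```python
from urllib.parse import quote

# Base taken from the example links in this SOP.
WB_BASE = "http://www.wormbase.org/db"


def entity_url(public_name, object_class):
    """Build a link from a public name and class (hypothetical helper).

    The name is used as-is, matching the example links in this SOP.
    """
    return f"{WB_BASE}/get?name={public_name};class={object_class}"


def author_url(full_name, wbpaper_id):
    """Build an author link (hypothetical helper).

    i) periods after initials are removed
    ii) spaces are converted to %20
    Names with accents or umlauts should be checked against WormBase
    by hand; quote() would percent-encode them rather than map them
    to an English equivalent.
    """
    name_without_periods = full_name.replace(".", "")
    encoded_name = quote(name_without_periods)  # space -> %20
    return f"{WB_BASE}/misc/person?name={encoded_name};paper={wbpaper_id}"


print(entity_url("mpk-1", "Gene"))
print(author_url("Eve W. L. Chow", "WBPaper00038245"))
```

Calling <code>author_url("I. Russel Lee", "WBPaper00038245")</code> reproduces the second example link above.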
 
 
 
--[[User:Kyook|kjy]] 22:04, 19 April 2012 (UTC)
 
 
 
==People==
 
'''Caltech''': Arun Rangarajan, Karen Yook, Juancarlos Chan, Paul Sternberg, Hans-Michael Müller, Chris Grove, Daniela Raciti, James Done<br>
 
'''GSA:''' Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson<br>
 
'''DJS (Dartmouth Journal Services):''' Stephen Haenel, Sharon Faelten, Lolly Otis<br>
 
  
 +
email files are also on the sandbox:
 +
/data2/srv/textpresso-dev.caltech.edu/www/docroot/karen/gsa/
  
  
 +
--[[User:Kyook|Kyook]] ([[User talk:Kyook|talk]]) 19:53, 26 February 2016 (UTC)
 
[[Category:Curation]]
 
[[Category:GSA_markup]]

Latest revision as of 20:50, 10 January 2017

Caltech_documentation

Visual overviews of linking pipeline
Marked-up papers

GENETICS Editors duties

Retrieving/Assigning a WBPaperID

  • GSA Editors enter the approved paper doi into the WBPaperID ticketer form.
    • The ticketer captures the last six digits of the doi of the article, which is used as an identifier throughout the linking and QC process.
    • The ticketer assigns an official WBPaperID, which is needed to track the paper through WB curation paths
  • Retrieve the link to a Journal First Pass form (JFP) from the ticketer. Note: this step is sometimes missed by the GSA editors causing confusion when the paper comes through the normal WB paper pipeline.
  • Editors send a notice of approval to the author along with the link to the JFP.
    • GSA editors control the message to the authors - to make changes to the message, contact GSA editors. The message reads as follows:
Dear Dr. XXXX:

DOI: 10.1534/genetics.11X.XXXXXX
WBPaper: X

The Genetics Society of America is working with textpresso (www.textpresso.org) and WormBase (www.wormbase.org) to create links between genetic and genomic objects in your article to the appropriate page in WormBase. These links will be included in both the online full text and PDF versions of the published article.

If you want any genes, alleles, transgenes, CGC-destined strains, anatomy terms, etc., previously known or new/described for the first time in your article to be linked to WormBase please enter the names of these objects using the form at the following link:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_first_pass.cgi?action=Curate&paper=X&passwd=X.X


Because your submitted data will be processed automatically, please follow the examples carefully. If you have questions, or want to upload a file instead, contact Karen Yook, Wormbase Curator, at karen@wormbase.org.


Thank you.

XXXX XXXX
Editorial Assistant


Mappings between doi and WBPaperIDs can be found here.

WBPaperID ticketer

  • GSA editors need to enter the paper into the Journal_paper_ticketer to retrieve a link to the journal first pass form.
  • GSA editors enter the paper doi into the Journal_paper_ticketer; full doi expressions need to be used, e.g., 10.1534/genetics.113.157685; the link for each paper is accessible on the Journal Data Display, see below.
  • The ticketer enters the paper into postgres as a doi object only. All information about the paper (title, authors, abstract, PMID, etc.) is entered when the paper comes through the normal paper pipeline.
  • The ticketer sends an alert to the QCFast curator and WB Paper curator (Kimberly) that a WBPaperID was assigned.
  • The paper should show up in PubMed within a couple of days or weeks, depending on when the doi was entered and where the paper is in the processing stage.
  • The WBPaper curator (Kimberly) uses the doi to manually identify the paper during the normal paper pipeline. In the normal paper pipeline, a cron job fetches PMIDs from PubMed. Note: the cron job doesn't look for doi's; it only looks for PMIDs. When a new Genetics or G3 paper shows up in the paper editor, Kimberly manually looks for the corresponding doi to make sure the paper has already been entered, and then manually adds the PMID. Once the PMID is added, Kimberly approves the paper through the regular paper pipeline.
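
As an illustration of the six-digit identifier that the ticketer takes from the doi, the following minimal sketch extracts it from a full doi expression. The function name doi_short_id is hypothetical, and the real ticketer may validate the doi differently; this only shows the idea, assuming the doi ends in six digits as in the example above.

```python
import re


def doi_short_id(doi):
    """Return the last six digits of a full doi expression.

    Hypothetical helper: assumes the doi ends in six digits, as in
    10.1534/genetics.113.157685.
    """
    match = re.search(r"(\d{6})\s*$", doi)
    if match is None:
        raise ValueError(f"doi does not end in six digits: {doi!r}")
    return match.group(1)


print(doi_short_id("10.1534/genetics.113.157685"))
```

A partial or placeholder doi such as 10.1534/genetics.11X.XXXXXX raises an error in this sketch, which mirrors the requirement that full doi expressions be entered.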

Journal author first pass form (JFP)

  • An alert is sent to the QCFast curator when an author submits data through the JFP. The QCFast curator must check the entered data to make sure it has been entered in a useful format, see following. Alerts (flags) are also sent to the appropriate WB curator to alert them to new objects.
  • The JFP differs from the author first pass form sent out to authors of publications in other journals; although all the datatypes on the form are the same, the JFP highlights those data types that are linked and requests the authors to declare new objects that don't exist in WB at the time.
  • Journal author data entered into linked classes of the form are added automatically to the corresponding class entity list.
  • Since the linking scripts perform exact string match (and formatting cue recognition), if data is not entered correctly into the JFP,
    • the object may not be recognized in the paper
    • extraneous information will be added to the entity list and used to mark up objects in the paper
  • The QCFast curator needs to check the JFP entries for a given paper to make sure everything is entered correctly, or to silence strings not meant to be marked up. Silencing is achieved by placing two tildes, "~~", before any text not meant to be added to an entity list.

For example

"ZK1128.2, mett-10, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."    
to
"ZK1128.2, mett-10~~, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."

In the first case every word would have been added to the gene entity list (as it was entered in the genesymbol data field). The "~~" hides everything that comes after it, so that only the gene names are added (although in this case ZK1128.2 does not necessarily need to be added, as it should already be on the gene list).
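The effect of the "~~" marker can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's own parser: the function name `entity_strings` is hypothetical, and the sketch assumes that entries are comma-separated and that everything after the first "~~" is hidden.

```python
SILENCE_MARKER = "~~"


def entity_strings(field_value: str) -> list[str]:
    """Return the comma-separated entries that precede the first '~~'."""
    # partition() keeps the whole string as 'visible' when no marker is present.
    visible, _, _hidden = field_value.partition(SILENCE_MARKER)
    return [item.strip() for item in visible.split(",") if item.strip()]


if __name__ == "__main__":
    entry = (
        "ZK1128.2, mett-10~~, which is orthologous to the human "
        "methyltransferase-10 domain containing protein (METT10D), "
        "and corresponds to ZK1128.2."
    )
    print(entity_strings(entry))
```

Without the marker, the same call would also return the free-text fragments after "mett-10" as entity strings, which is the extraneous information the QCFast curator needs to catch.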

  • To see author-submitted data for all JFPs
    • access the JFP index here
    • choose "textpresso" from the drop down menu in the upper left corner of the page
    • click "Show Data"
    • data are listed by paper and data type
  • Format for data entered through the form:
    • Data need to be comma-separated and in the standard format for each data type, all extraneous information hidden (see below).
    • Gene
      • public_name
      • CDS
      • unapproved gene names need to be cleared by genenames@wormbase
    • Transgene
      • integrated transgene nomenclature format only
      • 'Ex' objects that exist in WormBase
      • entities within transgenes/construct expressions are linked, e.g. in the expression 'sur-5::GFP', 'sur-5' is linked.
    • Antibody - no longer linked
    • Proteins - link will be to gene page
    • Alleles
      • currently only elegans alleles; priority on curating alleles of other species is very low
    • Strains (only those accepted by the CGC for deposit into the permanent collection)
      • currently only marking up known C. elegans strains, mainly available through the CGC; priority on curating other species is very low
  • Changes to the form need to go through Juancarlos

Journal Data Display

This site displays information about ticketed papers
http://tazendra.caltech.edu/~postgres/cgi-bin/author_fp_display.cgi?afp_jfp=jfp
columns include:

download proofs from DJS into a local directory 'proofs', copy to textpresso-dev
>scp -r ./proofs kyook@textpresso-dev.caltech.edu:/home/kyook 
>ssh kyook@textpresso-dev.caltech.edu
>sudo cp ./proofs/* /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/
make all pdfs readable
>sudo chmod a+r -R /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/

Proofs should be available on Journal Data Display

  • pdf link
  • temp pdf
  • go link

Linking steps

Visual overviews of linking pipeline
Linking scripts
Markup excerpts

Dartmouth Journal Services (DJS), the GSA journal publisher, uploads the file, as NLM-formatted XML, by ftp to the textpresso server.

  • DJS requests 48 hours to receive back the linked file.
  • The QCFast Curator receives an e-mail alert that a paper has been uploaded and to expect the linked paper soon.
  • The paper goes through the linking script and the create_entity_table script. URLs to the linked html and to the entity list table are e-mailed to the curator.
  • When there are no problems with the file and linking, the curator can expect the follow-up e-mail with the links no later than an hour after the initial e-mail. If there is a problem with the file, the process is stopped until the QCFast developer can fix the pipeline.
  • n.b. there is a bug in the linking script that stalls the process on certain characters. Unfortunately, the bug is frequent enough to be noticeable, but not frequent enough to warrant a full debugging.

The QCFast curator goes through the marked-up html view to make sure the links are correct and no objects are missed.

  • The QCFast curator can add or remove links through the QCFast interface.
  • If there are multiple objects not linked, the QCFast curator can opt to add all missing objects to the JFP for that paper and rerun the linking script (WB interface only).
  • If links cannot be made, e.g., due to XML formatting, the QCFast curator can download the file and edit the links manually. The name of the file must remain the same as the original file, or uploading the manually edited file will result in an error.

The final edited linked file, as an html, is uploaded, by ftp, to the DJS server.

  • A final e-mail is sent to the QCFast curator that confirms the upload.
  • Uploading the file will also launch a script that creates a final entity link table.
  • This final e-mail should be received within an hour of the upload.

See Entity classes and problems for linking issues and how to deal with them.

ftp to DJS

05ftpAndEmailDjs.pl

QCFast for every MOD has a button to FTP the edited file to DJS. This button launches the 05ftpAndEmailDjs.pl script, which:

  • creates a final entity table in /data1/Users/arunr/gsa/worm/entity_link_tables
  • makes a log entry in /data1/Users/arunr/gsa/worm/logs
  • deposits the linked file in /data1/Users/arunr/gsa/worm/linked_xml
  • creates a record in /data1/Users/arunr/gsa/worm/done
  • e-mails a link to the final entity_link_table to the MOD curator
  • e-mails DJS an alert about the file on the ftp server

To re-ftp a file, the files created by the script AND the file deposited on the ftp server need to be removed. The DJS server is password protected by DJS; an account and access to the server need to be assigned by them.
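Before re-ftping, it can help to list what the script left behind for a paper. The sketch below only lists candidate files and deletes nothing. It assumes, hypothetically, that file names in the script's output directories contain the paper's identifier; the function name `files_to_clear` is also an assumption. The log directory may hold shared log files rather than one file per paper, so each hit should be checked by hand, and the copy on the DJS ftp server still has to be removed separately.

```python
from pathlib import Path

# Subdirectories that 05ftpAndEmailDjs.pl writes to, per the list above.
OUTPUT_DIRS = ("entity_link_tables", "logs", "linked_xml", "done")


def files_to_clear(base_dir, paper_id):
    """List files in the output directories whose names contain paper_id.

    Assumption: file names include the paper identifier. Nothing is deleted.
    """
    found = []
    for name in OUTPUT_DIRS:
        directory = Path(base_dir) / name
        if directory.is_dir():
            found.extend(
                sorted(
                    path
                    for path in directory.iterdir()
                    if path.is_file() and paper_id in path.name
                )
            )
    return found


if __name__ == "__main__":
    for path in files_to_clear("/data1/Users/arunr/gsa/worm", "179705"):
        print(path)
```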

Proofreading

  • DJS sends WormBase the pdf formatted proof before sending the article back to the author.
  • DJS requests a 24 hour turnaround.
  • QCFast curator relays all suggested changes to DJS, as the curator cannot make any changes to the doc itself.
  • DJS sends proof to author.
  • If the author gets back to DJS about errors or suggested links, DJS consults the curator about the feedback and the curator recommends an action (remove the link or send a corrected URL).

Project issues

Ambiguous names

This problem refers to links being made to the wrong entity because a name is used for multiple objects within a species or across species.

  • Intraspecies issue: In WB some clone names and strain names are identical. This problem could not be resolved at the script level; these ambiguous names are put on an exclusion list instead.
  • Interspecies issue: Gene names, including aliases, from SGD, Flybase, and MGI should be collected and compared with our own gene lists, including all synonyms. Arun will be making a splash page on textpresso dev to house all these ambiguous names, to direct readers if a proper URL cannot be resolved (2/18/10 - last action taken)
    • Example: general anatomy terms can be linked to the wrong species

Problem: in this C. elegans paper, 'epithelium' was linked to the elegans epithelium page
"T cells migrate toward (Fong et al. 2002). Additionally, loss of GRK3, which is highly expressed in mouse olfactory epithelium (Schleicher et al. 1993), "
Action: "epithelium" was added to the anatomy exclusion list. From this paper, we also added, 'head', 'tail', 'neuron', 'sensory neuron' to the exclusion list as well.
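How an exclusion list interacts with exact string matching can be sketched as follows. This is a simplified illustration, not the linking scripts themselves: the function name `find_links` and the word-boundary rule are assumptions, and the real scripts also use formatting cues that are not modeled here.

```python
import re


def find_links(text, entity_names, exclusion_list):
    """Return (start, end, name) for exact matches of non-excluded names."""
    excluded = set(exclusion_list)
    allowed = [name for name in entity_names if name not in excluded]
    if not allowed:
        return []
    # Longest names first, so a longer name wins over a name it contains.
    allowed.sort(key=len, reverse=True)
    # Assumed boundary rule: a match may not touch a word character or hyphen.
    pattern = re.compile(
        r"(?<![\w-])("
        + "|".join(re.escape(name) for name in allowed)
        + r")(?![\w-])"
    )
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sentence = "GRK3, which is highly expressed in mouse olfactory epithelium"
    anatomy_terms = ["epithelium", "pharynx"]
    exclusions = ["epithelium", "head", "tail", "neuron", "sensory neuron"]
    print(find_links(sentence, anatomy_terms, []))          # linked
    print(find_links(sentence, anatomy_terms, exclusions))  # suppressed
```

The trade-off is that an excluded term is never linked, even where it does refer to the C. elegans structure, so such cases have to be added by the curator during QC.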

Longevity of links

Things for consideration

  • How can we ensure the links are maintained?
  • How can we ensure non-WB links are maintained?
  • Link checking can be used to make sure silent links have been dealt with.

At WormBase we will be addressing the stability of our links by collecting the URLs we use/create and periodically running a link checker to make sure the links are still alive. The link checker will point out any links that have gone dead; however, it will not check if the page is relevant. We will be working out the specifics of this aspect of the project.

All the linked entities in a paper, along with their link status at the time of sending the linked XML back to DJS, are available here

The script that forms the entity table can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.

People

Caltech Karen Yook, Hans-Michael Müller, Paul Sternberg. Alum: Arun Rangarajan, Chris Grove, Daniela Raciti, James Done, Juancarlos Chan
GSA: Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson, Virginia Ingerson
DJS (Dartmouth Journal Services): Sharon Faelten, Diana Schaeffer, Michelle Kerns. Alum: Lolly Otis, Stephen Haenel
SGD Marek Skrzypek, Rob Nash
Flybase Steven Marygold, Aoife Larkin. Alum: Raymund Stefancsik

Alerts

E-mail alerts are sent when

  • a paper is entered through the ticketer Subject line: "GENETICS: WormBase data for accepted paper GENETICS/2015/185272"
  • a jfp form becomes active for the author Subject line: "new paper ticket DOI created"
  • an article is entered into the markup pipeline Subject line: "GSA auto-email: WB 179705 article received from GSA."
  • the linked article is ready for QC Subject line: "GSA auto-email: WB 179705 linked file available"
  • linked and QC'd article has been ftp'd to DJS server Subject line: "GSA auto-email: WB 179705 article FTPed"
  • a paper goes through any of the MOD markup pipelines.
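Because the pipeline alerts share a fixed subject-line shape, a curator could sort them by paper and stage with a small parser. The sketch below is based only on the three "GSA auto-email" subject lines quoted above. The function name `parse_alert` and the stage labels are hypothetical, and the pattern is an assumption that may not cover subject lines from other MOD pipelines.

```python
import re

# Pattern inferred from the quoted subject lines, e.g.
# "GSA auto-email: WB 179705 linked file available".
SUBJECT_PATTERN = re.compile(
    r"^GSA auto-email: (?P<mod>\w+) (?P<paper>\d+) (?P<event>.+?)\.?$"
)

# Hypothetical stage labels for the three events listed above.
STAGES = {
    "article received from GSA": "received",
    "linked file available": "ready_for_qc",
    "article FTPed": "ftped",
}


def parse_alert(subject):
    """Return (mod, paper_id, stage), or None for unrecognized subjects."""
    match = SUBJECT_PATTERN.match(subject.strip())
    if match is None:
        return None
    stage = STAGES.get(match.group("event"))
    if stage is None:
        return None
    return match.group("mod"), match.group("paper"), stage


if __name__ == "__main__":
    print(parse_alert("GSA auto-email: WB 179705 linked file available"))
```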

E-mail addresses are stored in the emails folder for each MOD instance.
Email folders contain files with the respective addresses. Use sudo to change/update the addresses.

/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/emails
  • allele_tickets.txt -> worm-bug@sanger.ac.uk
  • alleles.txt -> worm-bug@sanger.ac.u karen@wormbase.org, mueller@caltech.ed
  • allele_developers.txt -> mueller@caltech.edu, karen@wormbase.org
  • developers.txt -> karen@wormbase.org, mueller@caltech.edu
  • curators.txt -> karen@wormbase.org, mueller@caltech.edu, pws@caltech.edu, vanauken@caltech.edu
  • final.txt -> karen@wormbase.org, mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, pws@caltech.edu, vanauken@caltech.edu
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/fly/emails/
  • developers.txt -> karen@wormbase.org, mueller@caltech.edu
  • curators.txt -> al806@cam.ac.uk, sjm41@cam.ac.uk, karen@wormbase.org, mueller@caltech.edu
  • final.txt -> mueller@caltech.edu, sjm41@cam.ac.uk, al806@cam.ac.uk, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, karen@wormbase.org


/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/yeast/emails/
  • curators.txt -> mueller@caltech.edu, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.org
  • developers.txt -> karen@wormbase.org, mueller@caltech.edu
  • final.txt -> mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.or

Email files are also on the sandbox:

/data2/srv/textpresso-dev.caltech.edu/www/docroot/karen/gsa/


--Kyook (talk) 19:53, 26 February 2016 (UTC)