Difference between revisions of "GSA Markup SOP"

From WormBaseWiki
 
(31 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[http://www.wormbase.org/wiki/index.php/Caltech_documentation Caltech_documentation]
+
[http://wiki.wormbase.org/index.php/Caltech_documentation Caltech_documentation]
==GSA markup==
 
*[[GSA linking pipeline]]
 
*[[Entity classes and problems]]
 
*Examples
 
**[http://www.wormbase.org/wiki/index.php/Papers_that_have_been_markedup Marked up papers]
 
**[[Markup excerpts]]
 
  
 +
[[GSA linking pipeline | Visual overviews of linking pipeline]]<br>
 +
[http://textpresso-dev.caltech.edu/gsa/worm/html/ Marked-up papers]
  
==SOP for linking==
+
=GENETICS Editors duties=
'''Once approved for publication, Genetics enters the paper doi into Juancarlos's WBPaperID ticketer to retrieve a WBPaperID'''. http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi<br>
+
Retrieving/Assigning a WBPaperID
'''''(Where is the ticketer script?)'''''
+
*GSA Editors enter the approved paper doi into the WBPaperID [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi ticketer form].
*This ticketer captures the doi of the article; the last 6 digits are used as an identifier throughout the linking and QC process.
+
**The ticketer captures the last six digits of the doi of the article, which are used as an identifier throughout the linking and QC process.
*The doi is also used to identify the paper during the normal pubmed paper pipeline.  
+
**The ticketer assigns an official WBPaperID, which is needed to track the paper through WB curation paths
*The assigned WBPaperID is used to track and map the journal first pass form through the first pass process.
+
* Retrieve the link to a Journal First Pass form (JFP) from the ticketer. ''Note: this step is sometimes missed by the GSA editors, causing confusion when the paper comes through the normal WB paper pipeline.''
 +
* Editors send a notice of approval to the author along with the link to the JFP.
 +
**GSA editors control the message to the authors - to make changes to the message, contact GSA editors. The message reads as follows:
 +
<pre>
 +
Dear Dr. XXXX:
  
At this time the paper has been entered into our local database (postgres) as a doi object only. There is no other information about the paper in postgres.  Information about the paper (title, authors, abstract, PMID, etc.) comes in when the paper comes through the normal pubmed pipeline.  
+
DOI: 10.1534/genetics.11X.XXXXXX
*The doi and WBPaperID mappings can be accessed on http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi<br> This form also has links to the journal first pass form (first 'link' after the WBPaperID) and the GO term to GOid mappings (second 'link').
+
WBPaper: X
  
'''Entering of the doi and WBPaperID assignment by Genetics leads to an alert being sent to me and to Kimberly'''.<br> 
+
The Genetics Society of America is working with textpresso (www.textpresso.org) and WormBase (www.wormbase.org) to create links between genetic and genomic objects in your article to the appropriate page in WormBase. These links will be included in both the online full text and PDF versions of the published article.
The paper should show up in Pubmed within a couple of days, depending on when the doi was entered and where the paper is in the processing stage. Juancarlos' script '''''(is this part of the paper pipeline?? )''''' identifies the doi as it comes through pubmed to avoid the paper getting assigned a second WBPaperID, which would result in the duplication of the paper in postgres. <br>
 
  
'''Genetics sends a notice of approval to the author along with a link to a journal author first pass form.'''<br>
+
If you want any genes, alleles, transgenes, CGC-destined strains, anatomy terms, etc., previously known or new/described for the first time in your article to be linked to WormBase please enter the names of these objects using the form at the following link:
A link to the journal author form for each paper is accessible, in line, for each paper from http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi.
 
This form differs from the standard author first pass form that is sent out to elegans authors of publications in other journals;  although all the datatypes on the form are the same, the journal author form highlights a select few data types that the authors should fill out for us, primarily those data types that we provide links to with an emphasis on requesting the authors to tell us about every new object that is in their paper and doesn't exist in WB yet.<br>
 
  
'''Journal author data entered into these select fields are added automatically to Arun's entity lists.'''<br>
+
http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_first_pass.cgi?action=Curate&paper=X&passwd=X.X
  
'''''ANYTHING IN THE PAPER THAT HAS A MATCH ON AN ENTITY LIST, AND HAS NOT BEEN PUT ON AN EXCLUSION LIST, WILL BE MARKED UP; therefore submitted data need to be manually checked by a curator'''''<br>
 
Arun's scripts perform exact string match (and formatting cue recognition). If data is not entered correctly, the object may not be recognized in the paper, and extraneous information will be added to the entity list and used to mark up objects in the paper. When an author submits data through the journal author first pass form, an e-mail alert is sent to me.  I check the entered data to make sure it has been entered in a useful format. <br>
 
  
All author-submitted data can be accessed on tazendra:
+
Because your submitted data will be processed automatically, please follow the examples carefully. If you have questions, or want to upload a file instead, contact Karen Yook, Wormbase Curator, at karen@wormbase.org.
http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi
 
choose "textpresso" from the drop down menu in the upper left corner of the page
 
click "Show Data"
 
data are listed by paper and data type
 
  
  
'''Data need to be comma-separated and need to be in the standard format for each data type and all extraneous information hidden (see below).'''
+
Thank you.
*'''Gene''' (genesymbol)
 
**public name, CDS
 
**unapproved gene names need to be added to the entity list but Arun needs to be told to link it to a known gene ID  ''Do we need a temporary entity list for this data type?''<br>
 
*'''Transgene'''
 
**in integrated transgene nomenclature format only
 
**no unapproved 'Ex' objects
 
**no genomic expressions; entities within transgene/construct expressions are omitted from linking at the script level, e.g. in the expression 'sur-5::GFP', 'sur-5' will not be linked, as 'sur-5::' is recognized by Arun's script and processed as an entity to omit from linking.
 
*'''Antibody'''
 
**only protein name (link will be to gene page)
 
*'''Alleles''' (extvariation)
 
**currently only elegans alleles;  priority on curating alleles of other species is very low
 
*'''Strains''' (newstrains)
 
**currently only marking up known C. elegans strains, mainly available through the CGC; priority on curating other species is very low
 
<br>
 
  
'''Changes to the author data are made in the journal author first pass form, available through the paper-associated link on http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi.''' <br>
+
XXXX XXXX
To hide extraneous information, preface the information with two tildes, that is, reorganize the data so that objects that need to be added to the entity lists are comma-separated and all other information is listed after "~~".
+
Editorial Assistant
change<br>
+
</pre>
"ZK1128.2, mett-10, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."    <br>
 
to<br>
 
"ZK1128.2, mett-10~~, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."  <br>
 
  
In the first case, every word would have been added to the gene entity list (as it was entered in the genesymbol data field); the "~~" hides everything that comes after it so that only the gene names are added (although in this case ZK1128.2 does not necessarily need to be added, as it should already be on the gene list).<br>
 
  
===Entity linking scripts are run===
+
Mappings between doi and WBPaperIDs can be found [http://tazendra.caltech.edu/~postgres/cgi-bin/author_fp_display.cgi?afp_jfp=jfp here].
[[Linking script pipeline]] <br>
 
The XML is sent via e-mail to Arun, and Arun runs his scripts for linking entities to WB pages.<br>
 
''Ideally the author has entered the data and it is sitting correctly in our database before we receive the XML from DJS; usually, though, authors do not submit any data. Karen goes through the paper after the linking script is run and identifies any new objects.''  <br>
 
  
'''Arun and I go through the marked-up XML to make sure the links are correct and no objects are missed.''' <br>
+
=WBPaperID ticketer=
All objects that were missed by the linking scripts are added to the journal author first pass form by me (which, once submitted, also alerts the appropriate curator to new objects). It takes 20-40 minutes to manually go through a paper with a lot of genes.<br>
+
*GSA editors need to enter the paper into the [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi Journal_paper_ticketer] to retrieve a link to the journal first pass form.
 +
*GSA editors enter the paper doi into the [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi Journal_paper_ticketer]; full doi expressions need to be used, e.g., 10.1534/genetics.113.157685; the link for each paper is accessible on the [http://tazendra.caltech.edu/~postgres/cgi-bin/author_fp_display.cgi?afp_jfp=jfp Journal Data Display], see below.
 +
*The ticketer enters the paper into postgres as a doi object only. All information about the paper (title, authors, abstract, PMID, etc.) is entered when the paper comes through the normal paper pipeline.
 +
*The ticketer sends an alert to the QCFast curator and WB Paper curator (Kimberly) that a WBPaperID was assigned.
 +
*The paper should show up in Pubmed within a couple days/weeks, depending on when the doi was entered and where the paper is in the processing stage.
 +
*The WBPaper curator (Kimberly) uses the doi to manually identify the paper during the normal paper pipeline. In the normal paper pipeline, a cron job fetches PMIDs from PubMed. Note: the cron job doesn't look for doi's; it only looks for PMIDs. When a new Genetics or G3 paper shows up in the paper editor, Kimberly manually looks for the corresponding doi to make sure the paper has been entered already, and the PMID is then manually added.  Once the PMID is added, Kimberly approves the paper through the regular paper pipeline.
 +
*
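The ticketer notes above say that full doi expressions must be entered and that the last six digits serve as the paper identifier. A minimal sketch of that extraction follows. It assumes every GENETICS/G3 doi has the same shape as the example on this page; the function name and the pattern are illustrative and are not taken from the ticketer code.

```python
import re

# Assumed shape of a GENETICS/G3 doi, inferred from the single example
# "10.1534/genetics.113.157685" on this page (hypothetical pattern).
DOI_PATTERN = re.compile(r"^10\.\d{4}/[a-z0-9]+\.\d+\.(\d{6})$")


def paper_identifier(doi):
    """Return the last six digits of a full doi, or raise ValueError."""
    match = DOI_PATTERN.match(doi.strip())
    if match is None:
        raise ValueError(f"not a full doi expression: {doi!r}")
    return match.group(1)
```

Rejecting anything that is not a full doi mirrors the requirement that editors enter the complete expression rather than only the trailing digits.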
  
Some objects, such as genes and persons, are mined from servers/local databases rather than from a WB release so that they are as up-to-date as possible; even if these objects are not on the website, they will be recognized.  As stated above, objects that don't exist on any of the lists will be added to Arun's object lists through the author first-pass form, so in practice, curators may receive objects from the author first pass form from either the author, me, or both.   ''See elsewhere on this wiki for discussion about pre-WB approved object names.'' 
+
==Journal author first pass form (JFP)==
 +
*An alert is sent to the QCFast curator when an author submits data through the JFP. The QCFast curator must check the entered data to make sure it has been entered in a useful format, see following. Alerts (flags) are also sent to the appropriate WB curator to alert them to new objects.
  
'''After correcting the entity lists, we run the paper through again before a second in-house proof, then send it back to DJS.'''  <br>
+
*The JFP differs from the author first pass form sent out to authors of publications in other journals; although all the datatypes on the form are the same, the JFP highlights those data types that are linked and requests the authors to declare new objects that don't exist in WB at the time.
We are given 48 hours to markup and return the XML document from Genetics.  If it arrives on a Friday, we have the weekend.
 
  
'''Authors are sent the marked up doc for proofreading''' <br>
+
*Journal author data entered into linked classes of the form are added automatically to the corresponding class entity list.
Authors get back to DJS about errors; we are consulted about the problems and recommend an action (remove the link, or send a corrected URL).
 
  
==Linking/Proofing Issues== 
+
*Since the linking scripts perform exact string match (and formatting cue recognition), if data is not entered correctly into the JFP,
===Entities don't exist in WB yet===
+
**the object may not be recognized in the paper
New entities being described for the first time can still be linked to pages in WB, which will remain silent until the objects are created (and will show up on the public site about 2 months after the paper is received).  These new objects need to be added to Arun's lists via the journal author first pass form.  Authors are asked to fill out this form when they are notified of the acceptance of their paper.  If they submit the form right away, the objects are added to Arun's list before we receive the XML and the linking should identify all new objects for which we create links.  One problem is that authors sometimes enter information on the form in such a way that the scripts cannot read it.  '''A curator must visually check the data from the author form to make sure the information was entered properly.'''  I am alerted when an author submits a form, and the alert includes a summary of the information, which tells me whether things were entered properly; if they were not, I need to go back and fix the entries.  Bad entries are blocked from being read by Arun's scripts by placing two tildes, "~~", before the entry. 
+
**extraneous information will be added to the entity list and used to mark up objects in the paper<br>
  
====Gene names don't exist in WB====
+
*The QCFast curator needs to check the JFP entries for a given paper to make sure everything is entered correctly, or to silence strings not meant to be marked up. Silencing strings is achieved by placing two tildes, "~~", before any text not meant to be added to an entity list.<br>
Gene names need to be approved by our nomenclature curators, Jonathan Hodgkin and Mary Ann Tuli. In most cases, an author has already contacted Jonathan and Mary Ann, and the gene name not only has been approved but already exists on the ace server, so it has already been added to the gene list. Authors are asked to declare these new gene names before the linking scripts are run, but authors do not always follow through. One action is to link the name to the official WBGene sequence ID. This will require that the authors have explicitly stated what the gene sequence is. Linking the name to the gene should not affect what happens in acedb; the link lives within the online paper alone.   If an author is not explicit, the best option is for us to contact them; if we do not get a response within a reasonable time, no link will be created.  We are still working on a suitably quick SOP for these cases.
+
For example
 +
  "ZK1128.2, mett-10, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."    <br>
 +
  to<br>
 +
  "ZK1128.2, mett-10~~, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2." <br> 
 +
In the first case, every word would have been added to the gene entity list (as it was entered in the genesymbol data field); the "~~" hides everything that comes after it so that only the gene names are added (although in this case ZK1128.2 does not necessarily need to be added, as it should already be on the gene list).<br>
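The "~~" convention in the example above can be sketched in a few lines. This is an illustration only, not the linking scripts' actual parser: the function name is hypothetical, and it assumes that everything before the first "~~" is a comma-separated list of entity names.

```python
HIDE_MARKER = "~~"  # two tildes hide everything that follows in the field


def entities_from_field(raw):
    """Return the comma-separated entries that precede the first '~~'."""
    visible = raw.split(HIDE_MARKER, 1)[0]
    # Drop empty chunks and surrounding whitespace.
    return [item.strip() for item in visible.split(",") if item.strip()]
```

With the marker in place, only `ZK1128.2` and `mett-10` survive. Without it, the descriptive text after the gene names is kept as extra comma-separated chunks, which is the kind of extraneous entry a curator has to silence.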
  
For example, a paper had a couple of gene names, "cbr-dpy-1" and "cpy-4", both referring to the same briggsae sequence, but neither name existed in acedb. In this case, make a link to the WBGene page rather than creating a link with cbr-dpy-1 or cpy-4 as the public name, as the other links are made. 
+
*To see author-submitted data for all JFPs
 +
**access the JFP index [http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi here]
 +
**choose "textpresso" from the drop down menu in the upper left corner of the page
 +
**click "Show Data"
 +
**data are listed by paper and data type
  
===Formatting problems break recognition===
+
*Format for data entered through the form:
Entity recognition relies on exact string matches and XML formatting cues. If entities are not formatted correctly in the XML, then the entity may not be recognized and linked. So if you see an unlinked entity, check to see if the formatting has been altered; if this is the case, a message needs to get sent to Genetics and DJS to correct the formatting at the source, as we do not want to change the XML document. You can still finish checking the document for new objects and adding them to the lists in the meantime. 
+
**Data need to be comma-separated and in the standard format for each data type, all extraneous information hidden (see below).
 +
**Gene
 +
***public_name
 +
***CDS
 +
***unapproved gene names need to be cleared by genenames@wormbase
 +
**Transgene
 +
***integrated transgene nomenclature format only
 +
***'Ex' objects that exist in WormBase
 +
***entities within transgenes/construct expressions are linked,  e.g. in the expression 'sur-5::GFP', 'sur-5' is linked.
 +
**Antibody - no longer linked
 +
**Proteins - link will be to gene page
 +
**Alleles
 +
***currently only elegans alleles; priority on curating alleles of other species is very low
 +
**Strains (only those accepted by the CGC for deposit into the permanent collection)
 +
***currently only marking up known C. elegans strains, mainly available through the CGC; priority on curating other species is very low
  
* Case (a): <nowiki><i>Cbr-dpy-1</i>  instead of  Cbr-<i>dpy-1</i></nowiki> <br>
+
*Changes to the form need to go through Juancarlos
::Otherwise the <nowiki><i></nowiki> tag becomes part of the string and string matching fails on Cbr-dpy-1, even if we have it in the wormbase entity list.
 
  
* Case (b): <nowiki><i>unc-119</i> instead of <i>unc</i>-<i>119</i></nowiki>
+
==Journal Data Display==
 +
This site displays information about ticketed papers<br>
 +
http://tazendra.caltech.edu/~postgres/cgi-bin/author_fp_display.cgi?afp_jfp=jfp<br>
 +
columns include:
 +
*DOI - entered through [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi Journal_paper_ticketer]
 +
*WBPaper - auto generated by ticketer, links to [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/paper_display.cgi?data_number=00050589&action=Search+%21 paper display]
 +
*data link - links to jfp, 'link' signifies data was entered, 'no data' indicates author didn't submit any information
 +
*textpresso html - links to html view of paper from species specific [http://textpresso-dev.caltech.edu/gsa/worm/html/037184.html html directory]
 +
*proofs - links to paper proof uploaded manually to species [http://textpresso-dev.caltech.edu/gsa/worm/proofs/ proof directory] on textpresso-dev
 +
**to upload proofs
 +
''download proofs from DJS into a local directory 'proofs', copy to textpresso-dev''
 +
>scp -r ./proofs kyook@textpresso-dev.caltech.edu:/home/kyook
 +
>ssh kyook@textpresso-dev.caltech.edu
 +
>sudo cp ./proofs/* /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/
 +
''make all pdfs readable''
 +
>sudo chmod a+r -R /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/
  
* Case (c): Parentheses shouldn't be italicized since they are used as an indicator of string starts and stops for entity recognition. <br>
+
Proofs should be available on Journal Data Display
::<nowiki> <i>cbr-unc-119</i>(<i>st20000</i>) instead of <i>cbr-unc-119(st20000)</i></nowiki>
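Cases (a) and (b) above can be spotted mechanically when checking a paper by hand. The sketch below is a hypothetical QC helper, not part of the linking scripts. It reports entity names that are present once the italic tags are stripped but absent from the raw XML, which suggests that an italic tag splits the name. It assumes italics are marked with plain `<i>` and `</i>` tags, and it does not detect case (c).

```python
import re

ITALIC_TAG = re.compile(r"</?i>")


def split_by_italics(xml_text, entities):
    """Return entities that only match after <i>/</i> tags are removed."""
    plain = ITALIC_TAG.sub("", xml_text)
    # Found in the tag-free text but not in the raw text: a tag sits
    # inside the name, so exact string matching on the raw XML fails.
    return [name for name in entities if name in plain and name not in xml_text]
```

Any name this returns is a candidate for a formatting correction request to Genetics and DJS rather than an edit to the XML itself.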
 
  
===combined entity expressions===
+
*pdf link
* double mutants, double alleles
+
*temp pdf
* dpy-11/12
+
*go link
  
==Project issues==
+
=Linking steps=
===Ambiguous name lists===
+
[[GSA linking pipeline | Visual overviews of linking pipeline]]<br>
''We are still working out the details for this issue''<br>
+
[[Linking script pipeline | Linking scripts]]<br>
From a meeting on 2/18/10, we are moving forward on dealing with ambiguous entity names, that is, a name in a paper that may be used to describe multiple objects and which may result in the creation of a faulty link.  This is a problem both within a single-species paper and a problem we will need to address in order to expand the project to other species and to multi-species papers.
+
[[Markup excerpts]]<br>
*Intraspecies issue: This problem is pertinent to entity lists within a single species paper.  We have run into this problem with clone names and strain names being identical (for example) and not being able to resolve this discrepancy at the script level.  Our initial response to these ambiguous names has been to put these names on an exclusion list.
 
*Interspecies issue:  This is particularly pertinent when dealing with a paper that contains gene names from different species.  We are collecting gene names, including aliases, from SGD and Flybase, and will compare them with our own gene lists, including all synonyms.  We will expand the list comparison to mouse gene names.  Arun will be making a splash page on textpresso dev to house all these ambiguous names, and to use to direct readers if a proper URL cannot be resolved.
 
**General anatomy terms can be linked to the wrong species <br>
 
Problem: in [http://textpresso-dev.caltech.edu/gsa/worm/html/GEN115188fin_WB.html this] C. elegans paper, 'epithelium'  was linked to the elegans epithelium page <br>
 
"T cells migrate toward (Fong et al. 2002). Additionally, loss of GRK3, which is highly expressed in mouse olfactory epithelium (Schleicher et al. 1993), "<br>
 
Action: "epithelium" was added to the anatomy exclusion list. From this paper, we also added 'head', 'tail', 'neuron', and 'sensory neuron' to the exclusion list.
 
  
===Longevity of links===
+
Dartmouth Journal Services (DJS), the GSA journal publisher, uploads the file, as NLM-formatted XML, by ftp to the textpresso server.<br>
''We are still working out the details for this issue''<br>
+
*DJS requests 48 hours to receive back the linked file.<br>
One major concern is how long the links made to the article will be stable. While this project is a collaboration between GENETICS and WormBase, we can control the stability of the links for as long as our funding lasts. Once the project expands to other MODs, each MOD will be responsible for the stability of their own URLs, compounding URL stability issues for any paper that contains multi-mod URLs. <br>
+
*The QCFast Curator receives an e-mail alert that a paper has been uploaded and to expect the linked paper soon. <br>
 +
*The paper goes through the linking script and the create_entity_table script; URLs to the linked html as well as the entity list table are e-mailed to the curator. <br>
 +
*When there are no problems with the file and linking, the curator can expect the follow-up e-mail with the links no later than an hour after the initial e-mail. If there is a problem with the file, the process is stopped until the QCFast developer can fix the pipeline.<br>
 +
*n.b. there is a bug in the linking script that will stall the process on certain characters. Unfortunately, the bug is frequent enough to be noticeable, but not frequent enough to warrant a full debugging.
  
At WormBase we will be addressing the stability of our links by collecting the URLs we use/create and periodically running a link checker to make sure the links are still alive. The link checker will point out any links that have gone dead; however, it will not check if the page is relevant. We will be working out the specifics of this aspect of the project. <br>
+
The QCFast curator goes through the marked-up html view to make sure the links are correct and no objects are missed. <br>
 +
*The QCFast curator can add or remove links through the QCFast interface.
 +
*If there are multiple objects not linked, the QCFast curator can opt to add all missing objects to the JFP for that paper and rerun the linking script (WB interface only).
 +
*If links are not able to be made, e.g., due to XML formatting, the QCFast curator can download the file and edit the links manually. The name of the file must remain the same as the original file or uploading the manually edited file will result in an error.
  
All the linked entities in a paper, along with their link status at the time of sending the linked XML back to DJS, are available
+
The final edited linked file, as an html, is uploaded, by ftp, to the DJS server.
[http://textpresso-dev.caltech.edu/gsa/worm/entity_link_tables/ here] <br>
+
*A final e-mail is sent to the QCFast curator that confirms the upload.  
 +
*Uploading the file will also launch a script that creates a final entity link table.
 +
*This final e-mail should be received within an hour of the upload.
  
The script that forms the above can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.
+
See [[Entity classes and problems]] for linking issues and how to deal with them.
  
==Entity list sources and URLs==
+
=ftp to DJS=
to see the entity lists go to http://textpresso-dev.caltech.edu/gsa/worm/known_entities/<br>
+
05ftpAndEmailDjs.pl <br>
  
===Entity Source===
+
QCFast for all MODs has a button to FTP the edited file to DJS. This button launches the 05ftpAndEmailDjs.pl script, which:
* /home3/acedb/ws/acedb -> latest database build- this build will show up on the live WB site within 1-2 months
+
*creates a final entity table in /data1/Users/arunr/gsa/worm/entity_link_tables
* ace server at cshl  ->
+
*makes a log entry in /data1/Users/arunr/gsa/worm/logs
* postgres -> this information is as current as could be, but won't show up on the live site for at least 2 months.
+
*deposits the linked file in /data1/Users/arunr/gsa/worm/linked_xml
 +
*creates a record in /data1/Users/arunr/gsa/worm/done
 +
*e-mails a link to the final entity_link_table to the MOD curator
 +
*e-mails DJS an alert about the file on the ftp server
  
===URL constructors===
+
To re-ftp a file, the files created by the script AND the file deposited on the ftp server need to be removed. The DJS server is password protected by DJS, and an account and access to the server need to be assigned by them.
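Before re-ftp'ing, it can help to list what the script left behind for a paper. The sketch below is a dry run that only lists candidate files and removes nothing. It assumes, without confirmation from the script itself, that the files it creates carry the paper identifier somewhere in their file names. The `logs` directory is left out because the script is described as adding a log entry there, and the file on the DJS ftp server still has to be handled separately.

```python
from pathlib import Path

# Directories named in the script description above; "logs" is omitted
# because only a log entry (not necessarily a file) is written there.
ARTIFACT_DIRS = ("entity_link_tables", "linked_xml", "done")


def artifacts_for(paper_id, root="/data1/Users/arunr/gsa/worm"):
    """List files under the artifact directories whose name contains paper_id.

    Assumption (hypothetical): file names include the paper identifier.
    """
    base = Path(root)
    hits = []
    for name in ARTIFACT_DIRS:
        directory = base / name
        if directory.is_dir():
            hits.extend(sorted(p for p in directory.iterdir() if paper_id in p.name))
    return hits
```

Reviewing the returned paths by eye before deleting anything keeps the cleanup under the curator's control.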
http://www.wormbase.org/wiki/index.php/Linking_To_WormBase<br>
 
Currently we are forming URLs based on public names versus object class ID's.  These links are redirected to the correct page, which is the URL based on object ID. By using the public name for the linking instead of the ID, we can avoid problems that may result in links dying due to objects getting merged or killed.
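As an illustration of public-name URLs, the sketch below fills a name into URL patterns copied from the example object pages listed on this page. The dictionary keys and the function name are hypothetical labels chosen for this example; only the URL shapes come from the page.

```python
from urllib.parse import quote

# URL shapes copied from the example object pages on this page.
URL_PATTERNS = {
    "Gene": "http://www.wormbase.org/db/gene/gene?name={name};class=Gene",
    "Variation": "http://www.wormbase.org/db/gene/variation?name={name}",
    "Transgene": "http://www.wormbase.org/db/gene/transgene?name={name};class=Transgene",
    "Strain": "http://www.wormbase.org/db/gene/strain?name={name};class=Strain",
    "Sequence": "http://www.wormbase.org/db/seq/sequence?name={name};class=Sequence",
    "Anatomy": "http://www.wormbase.org/db/ontology/anatomy?name={name}",
}


def entity_url(entity_class, public_name):
    """Build a link from an object's public name rather than its class ID."""
    return URL_PATTERNS[entity_class].format(name=quote(public_name, safe=""))
```

Because the name, not the ID, is embedded in the link, the redirect on the WormBase side decides which object page is shown.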
 
  
===Entities===
+
=Proofreading=
Object -> ''source'' ; web page link <br>
+
*DJS sends WormBase the pdf formatted proof before sending the article back to the author.
<example object pages><br>
+
*DJS requests a 24 hour turnaround.
 +
*QCFast curator relays all suggested changes to DJS, as the curator cannot make any changes to the doc itself.
 +
*DJS sends proof to author.
 +
*If the author gets back to DJS about errors or suggested links, DJS consults the curator about the feedback and the curator recommends an action (remove the link or send a corrected URL).
  
'''Gene''' -> ''ace server (more current than dev)'' ; Gene Summary page URL uses public gene name, which is redirected to WBGeneID<br>
+
=Project issues=
<http://www.wormbase.org/db/gene/gene?name=mpk-1;class=Gene><br>
+
==Ambiguous names==
<http://www.wormbase.org/db/gene/gene?name=F43C1.1;class=Gene><br>
+
This problem refers to links being made to the wrong entity based on a name used to describe multiple objects within a species or between many species.
<br>
+
*Intraspecies issue: In WB some clone names and strain names are identical. This problem could not be resolved at the script level; these ambiguous names are put on an exclusion list instead.
 +
*Interspecies issue:  Gene names, including aliases, from SGD, Flybase, and MGI should be collected and compared with our own gene lists, including all synonyms.   Arun will be making a splash page on textpresso dev to house all these ambiguous names, to direct readers if a proper URL cannot be resolved (2/18/10 - last action taken)
 +
**Example: general anatomy terms can be linked to the wrong species <br>
 +
Problem: in [http://textpresso-dev.caltech.edu/gsa/worm/html/GEN115188fin_WB.html this] ''C. elegans'' paper, 'epithelium'  was linked to the elegans epithelium page <br>
 +
"T cells migrate toward (Fong et al. 2002). Additionally, loss of GRK3, which is highly expressed in mouse olfactory epithelium (Schleicher et al. 1993), "<br>  
 +
Action: "epithelium" was added to the anatomy exclusion list. From this paper, we also added, 'head', 'tail', 'neuron', 'sensory neuron' to the exclusion list as well.
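The effect of an exclusion list on the 'epithelium' case can be sketched as follows. This is a simplified illustration of exact string matching with an exclusion list, not the linking scripts' implementation; the function name and the word-boundary handling are assumptions.

```python
import re


def linkable_matches(text, entity_list, exclusion_list):
    """Return entity names found in text that are not on the exclusion list."""
    candidates = set(entity_list) - set(exclusion_list)
    found = []
    for name in sorted(candidates):
        # Exact, case-sensitive match that is not embedded in a longer word.
        if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", text):
            found.append(name)
    return found
```

With 'epithelium' on the exclusion list, the mouse olfactory epithelium sentence produces no anatomy link; without it, the term would be linked to the elegans page.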
  
'''Protein''' -> ''ace server'' ; Gene Summary page:<br>
+
==Longevity of links==
<http://www.wormbase.org/db/gene/gene?name=egl-4;class=Gene><br>
+
Things for consideration
<br>
+
*How can we ensure the links are maintained?
 +
*How can we ensure non-WB links are maintained?
 +
*Link checking can be used to make sure silent links have been dealt with.
  
'''Allele/SNP''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' ; Variation Report page<br>
+
At WormBase we will be addressing the stability of our links by collecting the URLs we use/create and periodically running a link checker to make sure the links are still alive.  The link checker will point out any links that have gone dead; however, it will not check if the page is relevant. We will be working out the specifics of this aspect of the project. <br>
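A periodic link checker of the kind described could look like the sketch below. It is a minimal example under stated assumptions: a link counts as dead when the request fails or returns an HTTP status of 400 or above, and the function names are illustrative. As noted above, this kind of check says nothing about whether the page is still relevant.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def dead_links(urls, fetch_status=http_status):
    """Return the URLs whose status is missing or is an HTTP error (>= 400)."""
    dead = []
    for url in urls:
        status = fetch_status(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead
```

The `fetch_status` parameter allows the checker to be exercised with a stand-in function, without touching the network.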
<http://www.wormbase.org/db/gene/variation?name=ga117> <br>
 
<http://www.wormbase.org/db/gene/variation?name=hw42941><br>
 
<br>
 
  
'''Transgene''' (inserted transgenes only) ->  ''acedb database on tazendra at /home3/acedb/ws/acedb'' ;  Transgene Summary page <br>
+
All the linked entities in a paper, along with their link status at the time of sending the linked XML back to DJS, are available
<http://www.wormbase.org/db/gene/transgene?name=sEx14536;class=Transgene><br>
+
[http://textpresso-dev.caltech.edu/gsa/worm/entity_link_tables/ here] <br>
<br>
 
  
'''Rearrangement''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' <br>
+
The script that forms the entity table can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.
<br>
 
  
'''Strain''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb''; Strain Report page<br>
+
==People==
<http://www.wormbase.org/db/gene/strain?name=MH37;class=Strain ><br>
+
'''Caltech''' Karen Yook, Hans-Michael Müller, Paul Sternberg. ''Alum'': Arun Rangarajan, Chris Grove, Daniela Raciti, James Done, Juancarlos Chan <br>
<br>
+
'''GSA:''' Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson, Virginia Ingerson<br>
 
+
'''DJS (Dartmouth Journal Services):''' Sharon Faelten, Diana Schaeffer, Michelle Kerns. ''Alum'': Lolly Otis, Stephen Haenel <br>
'''Clone''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb''; Sequence Summary page<br>
+
'''SGD''' Marek Skrzypek, Rob Nash<br>
<http://www.wormbase.org/db/seq/sequence?name=yk1106g06.3;class=Sequence><br>
+
'''Flybase''' Steven Marygold,  Aoife Larkin. ''Alum'': Raymund Stefancsik
<http://www.wormbase.org/db/seq/sequence?name=OSTR153G5_1;class=Sequence ><br>
 
<br>
 
  
'''Anatomy/cell''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' with corrections by Raymond ; Summary of anatomy ontology term<br>
+
==Alerts==
<http://www.wormbase.org/db/ontology/anatomy?name=HSN><br>
+
E-mail alerts are sent when
<http://www.wormbase.org/db/ontology/anatomy?name=Z1.ppp ><br>
+
* a paper is entered through the ticketer  Subject line: "GENETICS: WormBase data for accepted paper GENETICS/2015/185272"
<br>
+
* a jfp form becomes active for the author Subject line: "new paper ticket DOI created"
 +
* an article is entered into the markup pipeline Subject line "GSA auto-email: WB 179705 article received from GSA."
 +
* the linked article is ready for QC Subject line: "GSA auto-email: WB 179705 linked file available"
 +
* linked and QC'd article has been ftp'd to DJS server Subject line: "GSA auto-email: WB 179705 article FTPed"
 +
* a paper goes through any of the MOD markup pipelines.  
  
'''Author''' -> ''The links are formed using the URL pattern <br>
+
E-mail addresses are stored in the emails folder for each MOD instance.<br>
http://www.wormbase.org/db/misc/person?name=$url_encoded_name;paper=$wbpaper_id, <br>
+
Email folders contain files with respective addresses, use sudo to change/update the addresses
where $url_encoded_name is the full name of an author with the following modifications: <br>
 
i) periods after middle names removed and <br>
 
ii) spaces converted to %20, which is the URL-encoded equivalent of a space. <br>
 
  
Examples:<br>
+
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/emails
Name: Eve W. L. Chow <br>
+
*allele_tickets.txt -> worm-bug@sanger.ac.uk
Link: http://www.wormbase.org/db/misc/person?name=Eve%20W%20L%20Chow;paper=WBPaper00038245
+
*alleles.txt ->  worm-bug@sanger.ac.u karen@wormbase.org, mueller@caltech.ed
 +
*allele_developers.txt -> mueller@caltech.edu, karen@wormbase.org
 +
*developers.txt -> karen@wormbase.org, mueller@caltech.edu
 +
*curators.txt -> karen@wormbase.org, mueller@caltech.edu, pws@caltech.edu, vanauken@caltech.edu
 +
*final.txt -> karen@wormbase.org, mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, pws@caltech.edu, vanauken@caltech.edu
  
Name: I. Russel Lee <br>
+
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/fly/emails/
Link: http://www.wormbase.org/db/misc/person?name=I%20Russel%20Lee;paper=WBPaper00038245
+
*developers.txt -> karen@wormbase.org, mueller@caltech.edu
 +
*curators.txt -> al806@cam.ac.uk, sjm41@cam.ac.uk, karen@wormbase.org, mueller@caltech.edu
 +
*final.txt -> mueller@caltech.edu, sjm41@cam.ac.uk, al806@cam.ac.uk, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, karen@wormbase.org
  
(There may also be other special characters with accents, umlauts, etc. The best thing to do is to check WormBase and use the correct link if the author name already exists. Otherwise, the English equivalent of the special character can be determined visually, a new author link formed, and Cecilia notified.)
 
  
==People==
+
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/yeast/emails/
'''Caltech''': Arun Rangarajan, Karen Yook, Juancarlos Chan, Paul Sternberg, Hans-Michael Müller, Chris Grove, Daniela Raciti<br>
+
*curators.txt -> mueller@caltech.edu, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.org
'''GSA:''' Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson<br>
+
*developers.txt -> karen@wormbase.org, mueller@caltech.edu
'''DJS (Dartmouth Journal Services):''' Stephen Haenel, Sharon Faelten, Lolly Otis<br>
+
*final.txt -> mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.or
  
 +
email files are also on the sandbox:
 +
/data2/srv/textpresso-dev.caltech.edu/www/docroot/karen/gsa/
  
  
 +
--[[User:Kyook|Kyook]] ([[User talk:Kyook|talk]]) 19:53, 26 February 2016 (UTC)
 
[[Category:Curation]]
 
[[Category:Curation]]
 +
[[Category:GSA_markup]]

Latest revision as of 20:50, 10 January 2017

Caltech_documentation

Visual overviews of linking pipeline
Marked-up papers

GENETICS Editors duties

Retrieving/Assigning a WBPaperID

  • GSA Editors enter the approved paper doi into the WBPaperID ticketer form.
    • The ticketer captures the last six digits of the doi of the article, which are used as an identifier throughout the linking and QC process.
    • The ticketer assigns an official WBPaperID, which is needed to track the paper through WB curation paths.
  • Retrieve the link to a Journal First Pass form (JFP) from the ticketer. Note: this step is sometimes missed by the GSA editors, causing confusion when the paper comes through the normal WB paper pipeline.
  • Editors send a notice of approval to the author along with the link to the JFP.
    • GSA editors control the message to the authors - to make changes to the message, contact GSA editors. The message reads as follows:
Dear Dr. XXXX:

DOI: 10.1534/genetics.11X.XXXXXX
WBPaper: X

The Genetics Society of America is working with textpresso (www.textpresso.org) and WormBase (www.wormbase.org) to create links between genetic and genomic objects in your article to the appropriate page in WormBase. These links will be included in both the online full text and PDF versions of the published article.

If you want any genes, alleles, transgenes, CGC-destined strains, anatomy terms, etc., previously known or new/described for the first time in your article to be linked to WormBase please enter the names of these objects using the form at the following link:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_first_pass.cgi?action=Curate&paper=X&passwd=X.X


Because your submitted data will be processed automatically, please follow the examples carefully. If you have questions, or want to upload a file instead, contact Karen Yook, Wormbase Curator, at karen@wormbase.org.


Thank you.

XXXX XXXX
Editorial Assistant


Mappings between doi and WBPaperIDs can be found here.

WBPaperID ticketer

  • GSA editors need to enter the paper into the Journal_paper_ticketer to retrieve a link to the journal first pass form.
  • GSA editors enter the paper doi into the Journal_paper_ticketer; full doi expressions need to be used, e.g., 10.1534/genetics.113.157685; the link for each paper is accessible on the Journal Data Display, see below.
  • The ticketer enters the paper into postgres as a doi object only. All information about the paper (title, authors, abstract, PMID, etc.) is entered when the paper comes through the normal paper pipeline.
  • The ticketer sends an alert to the QCFast curator and WB Paper curator (Kimberly) that a WBPaperID was assigned.
  • The paper should show up in PubMed within a couple of days or weeks, depending on when the doi was entered and where the paper is in the processing stage.
  • The WBPaper curator (Kimberly) uses the doi to manually identify the paper during the normal paper pipeline. In the normal paper pipeline, a cron job fetches PMIDs from PubMed. Note: the cron job doesn't look for dois; it only looks for PMIDs. When a new Genetics or G3 paper shows up in the paper editor, Kimberly manually looks for the corresponding doi to make sure the paper has already been entered, and then manually adds the PMID. Once the PMID is added, Kimberly approves the paper through the regular paper pipeline.
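
The identifier step described above can be sketched in shell. This is only a minimal illustration, assuming every doi handled by the ticketer ends in six digits, as in the example 10.1534/genetics.113.157685. The function name paper_id_from_doi is hypothetical and is not part of the ticketer script.

```shell
# Hypothetical helper, not part of the ticketer: print the six-digit
# identifier taken from the end of a full doi.
paper_id_from_doi() {
    # sed prints the last six digits only if the doi actually ends in six digits
    id=$(printf '%s\n' "$1" | sed -n 's/.*\([0-9]\{6\}\)$/\1/p')
    if [ -n "$id" ]; then
        printf '%s\n' "$id"
    else
        printf 'doi does not end in six digits: %s\n' "$1" >&2
        return 1
    fi
}

paper_id_from_doi "10.1534/genetics.113.157685"
```

For the example doi this prints 157685, the number that appears in alert subject lines and file names further down the pipeline.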

Journal author first pass form (JFP)

  • An alert is sent to the QCFast curator when an author submits data through the JFP. The QCFast curator must check the entered data to make sure it has been entered in a useful format, see following. Alerts (flags) are also sent to the appropriate WB curator to alert them to new objects.
  • The JFP differs from the author first pass form sent out to authors of publications in other journals; although all the datatypes on the form are the same, the JFP highlights those data types that are linked and requests the authors to declare new objects that don't exist in WB at the time.
  • Journal author data entered into linked classes of the form are added automatically to the corresponding class entity list.
  • Since the linking scripts perform exact string match (and formatting cue recognition), if data is not entered correctly into the JFP,
    • the object may not be recognized in the paper
    • extraneous information will be added to the entity list and used to mark up objects in the paper
  • The QCFast curator needs to check the JFP entries for a given paper to make sure everything is entered correctly, or to silence strings not meant to be marked up. Strings are silenced by placing two tildes, "~~", before any text not meant to be added to an entity list.

For example

"ZK1128.2, mett-10, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."    
to
"ZK1128.2, mett-10~~, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."

In the first case every word would have been added to the gene entity list (as it was entered in the genesymbol data field). In the second case the "~~" hides everything that comes after it, so that only the gene names are added (although in this case ZK1128.2 does not necessarily need to be added, as it should already be on the gene list).
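
The effect of "~~" on a comma-separated entry can be sketched as follows. This is an illustration only: how the JFP scripts actually parse the field is not documented here, and the function name visible_entities is hypothetical. The sketch assumes that everything from the first "~~" onward is dropped and that the remaining text is split on commas.

```shell
# Hypothetical helper: list the entities an entry would contribute,
# assuming text from the first "~~" onward is hidden.
visible_entities() {
    printf '%s\n' "$1" |
        sed 's/~~.*$//' |                 # drop the silenced text
        tr ',' '\n' |                     # one comma-separated item per line
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'  # trim, drop empties
}

visible_entities 'ZK1128.2, mett-10~~, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2.'
```

With the silenced entry from the example, only ZK1128.2 and mett-10 are printed, one per line.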

  • To see author-submitted data for all JFPs
    • access the JFP index here
    • choose "textpresso" from the drop down menu in the upper left corner of the page
    • click "Show Data"
    • data are listed by paper and data type
  • Format for data entered through the form:
    • Data need to be comma-separated and in the standard format for each data type, with all extraneous information hidden (see below).
    • Gene
      • public_name
      • CDS
      • unapproved gene names need to be cleared by genenames@wormbase
    • Transgene
      • integrated transgene nomenclature format only
      • 'Ex' objects that exist in WormBase
      • entities within transgenes/construct expressions are linked, e.g. in the expression 'sur-5::GFP', 'sur-5' is linked.
    • Antibody - no longer linked
    • Proteins - link will be to gene page
    • Alleles
      • currently only elegans alleles; priority on curating alleles of other species is very low
    • Strains (only those accepted by the CGC for deposit into the permanent collection)
      • currently only marking up known C. elegans strains, mainly available through the CGC; priority on curating other species is very low
  • Changes to the form need to go through Juancarlos

Journal Data Display

This site displays information about ticketed papers
http://tazendra.caltech.edu/~postgres/cgi-bin/author_fp_display.cgi?afp_jfp=jfp
columns include:

download proofs from DJS into a local directory 'proofs', copy to textpresso-dev
>scp -r ./proofs kyook@textpresso-dev.caltech.edu:/home/kyook 
>ssh kyook@textpresso-dev.caltech.edu
>sudo cp ./proofs/* /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/
make all pdfs readable
>sudo chmod a+r -R /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/

Proofs should be available on Journal Data Display

  • pdf link
  • temp pdf
  • go link

Linking steps

Visual overviews of linking pipeline
Linking scripts
Markup excerpts

Dartmouth Journal Services (DJS), the GSA journal publisher, uploads the file, as an NLM-formatted XML, by ftp to the textpresso server.

  • DJS requests 48 hours to receive back the linked file.
  • The QCFast Curator receives an e-mail alert that a paper has been uploaded and to expect the linked paper soon.
  • The paper goes through the linking script and the create_entity_table script; URLs to the linked html and to the entity list table are then e-mailed to the curator.
  • When there are no problems with the file and linking, the curator can expect the follow-up e-mail with the links no later than an hour after the initial e-mail. If there is a problem with the file, the process is stopped until the QCFast developer can fix the pipeline.
  • n.b. there is a bug in the linking script that stalls the process on certain characters. Unfortunately, the bug is frequent enough to be noticeable, but not frequent enough to warrant a full debugging.

The QCFast curator goes through the marked-up html view to make sure the links are correct and no objects are missed.

  • The QCFast curator can add or remove links through the QCFast interface.
  • If there are multiple objects not linked, the QCFast curator can opt to add all missing objects to the JFP for that paper and rerun the linking script (WB interface only).
  • If links cannot be made, e.g., due to XML formatting, the QCFast curator can download the file and edit the links manually. The edited file must keep the same name as the original file; otherwise, uploading it will result in an error.

The final edited linked file, as an html, is uploaded, by ftp, to the DJS server.

  • A final e-mail is sent to the QCFast curator that confirms the upload.
  • Uploading the file will also launch a script that creates a final entity link table.
  • This final e-mail should be received within an hour of the upload.

See Entity classes and problems for linking issues and how to deal with them.

ftp to DJS

05ftpAndEmailDjs.pl

The QCFast interface for every MOD has a button to FTP the edited file to DJS. This button launches the 05ftpAndEmailDjs.pl script, which:

  • creates a final entity table in /data1/Users/arunr/gsa/worm/entity_link_tables
  • makes a log entry in /data1/Users/arunr/gsa/worm/logs
  • deposits the linked file in /data1/Users/arunr/gsa/worm/linked_xml
  • creates a record in /data1/Users/arunr/gsa/worm/done
  • e-mails a link to the final entity_link_table to the MOD curator
  • e-mails DJS an alert that the file is on the ftp server

To re-ftp a file, both the files created by the script and the file deposited on the ftp server need to be removed first. The DJS server is password protected by DJS; an account and access to the server must be assigned by them.

Proofreading

  • DJS sends WormBase the pdf formatted proof before sending the article back to the author.
  • DJS requests a 24 hour turnaround.
  • QCFast curator relays all suggested changes to DJS, as the curator cannot make any changes to the doc itself.
  • DJS sends proof to author.
  • If the author gets back to DJS about errors or suggested links, DJS consults the curator about the feedback and the curator recommends an action (remove the link or send a corrected URL).

Project issues

Ambiguous names

This problem refers to links being made to the wrong entity because a name is used to describe multiple objects within a species or across species.

  • Intraspecies issue: In WB some clone names and strain names are identical. This problem could not be resolved at the script level; these ambiguous names are put on an exclusion list instead.
  • Interspecies issue: Gene names, including aliases, from SGD, Flybase, and MGI should be collected and compared with our own gene lists, including all synonyms. Arun will be making a splash page on textpresso dev to house all these ambiguous names, to direct readers if a proper URL cannot be resolved (2/18/10 - last action taken)
    • Example: general anatomy terms can be linked to the wrong species

Problem: in this C. elegans paper, 'epithelium' was linked to the elegans epithelium page
"T cells migrate toward (Fong et al. 2002). Additionally, loss of GRK3, which is highly expressed in mouse olfactory epithelium (Schleicher et al. 1993), "
Action: "epithelium" was added to the anatomy exclusion list. From this paper, we also added 'head', 'tail', 'neuron', and 'sensory neuron' to the exclusion list.
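
How an exclusion list removes ambiguous names from an entity list can be sketched with standard tools. This is an assumption-laden illustration, not the linking script itself: it assumes both lists are plain text files with one name per line, and the function name filter_entities and the file names are hypothetical.

```shell
# Hypothetical helper: print the entity names that are NOT on the exclusion list.
#   $1: file with one candidate entity name per line
#   $2: file with one excluded name per line
filter_entities() {
    # -F: names are fixed strings, -x: whole-line match only, -v: keep non-matches
    grep -v -x -F -f "$2" "$1"
}

# Demonstration with terms taken from the examples on this page
demo_dir=$(mktemp -d)
printf '%s\n' 'HSN' 'epithelium' 'head' 'Z1.ppp' > "$demo_dir/anatomy_terms.txt"
printf '%s\n' 'epithelium' 'head' 'tail' 'neuron' 'sensory neuron' > "$demo_dir/anatomy_exclusions.txt"
filter_entities "$demo_dir/anatomy_terms.txt" "$demo_dir/anatomy_exclusions.txt"
rm -rf "$demo_dir"
```

The demonstration prints HSN and Z1.ppp. Whole-line, fixed-string matching mirrors the exact string match used for linking, so a name such as 'sensory neuron' is excluded without affecting other names that merely contain it.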

Longevity of links

Things for consideration

  • How can we ensure the links are maintained?
  • How can we ensure non-WB links are maintained?
  • Link checking can be used to make sure silent links have been dealt with.

At WormBase we will be addressing the stability of our links by collecting the URLs we use/create and periodically running a link checker to make sure the links are still alive. The link checker will point out any links that have gone dead; however, it will not check if the page is relevant. We will be working out the specifics of this aspect of the project.

All the linked entities in a paper, along with their link status at the time of sending the linked XML back to DJS, are available here

The script that forms the entity table can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.
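
A periodic link check of the kind described above could look roughly like the sketch below. It is not the entity table script: it assumes the collected URLs are stored one per line in a text file, it uses curl, and it treats any 2xx or 3xx response as alive. The function names are hypothetical. As noted above, a check like this only shows whether a link responds, not whether the page is still relevant.

```shell
# Classify an HTTP status code: 2xx and 3xx count as alive, anything else as dead.
classify_status() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) echo alive ;;
        *) echo dead ;;
    esac
}

# Request a URL and report its status; curl prints 000 when the request fails.
check_link() {
    code=$(curl -s -o /dev/null -L --max-time 20 -w '%{http_code}' "$1" || true)
    printf '%s\t%s\t%s\n' "$(classify_status "$code")" "$code" "$1"
}

# Check every non-empty line of a file that holds one URL per line.
check_links_in_file() {
    while IFS= read -r url; do
        if [ -n "$url" ]; then
            check_link "$url"
        fi
    done < "$1"
}

# Example call (file name is hypothetical):
#   check_links_in_file collected_urls.txt
```

Each output line holds the label, the status code, and the URL separated by tabs, so dead links can be pulled out with grep and passed on to the responsible curator.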

People

Caltech Karen Yook, Hans-Michael Müller, Paul Sternberg. Alum: Arun Rangarajan, Chris Grove, Daniela Raciti, James Done, Juancarlos Chan
GSA: Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson, Virginia Ingerson
DJS (Dartmouth Journal Services): Sharon Faelten, Diana Schaeffer, Michelle Kerns. Alum: Lolly Otis, Stephen Haenel
SGD Marek Skrzypek, Rob Nash
Flybase Steven Marygold, Aoife Larkin. Alum: Raymund Stefancsik

Alerts

E-mail alerts are sent when

  • a paper is entered through the ticketer Subject line: "GENETICS: WormBase data for accepted paper GENETICS/2015/185272"
  • a jfp form becomes active for the author Subject line: "new paper ticket DOI created"
  • an article is entered into the markup pipeline Subject line "GSA auto-email: WB 179705 article received from GSA."
  • the linked article is ready for QC Subject line: "GSA auto-email: WB 179705 linked file available"
  • linked and QC'd article has been ftp'd to DJS server Subject line: "GSA auto-email: WB 179705 article FTPed"
  • a paper goes through any of the MOD markup pipelines.

E-mail addresses are stored in the emails folder for each MOD instance.
Each emails folder contains files listing the respective addresses; use sudo to change or update the addresses.

/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/emails
  • allele_tickets.txt -> worm-bug@sanger.ac.uk
  • alleles.txt -> worm-bug@sanger.ac.u karen@wormbase.org, mueller@caltech.ed
  • allele_developers.txt -> mueller@caltech.edu, karen@wormbase.org
  • developers.txt -> karen@wormbase.org, mueller@caltech.edu
  • curators.txt -> karen@wormbase.org, mueller@caltech.edu, pws@caltech.edu, vanauken@caltech.edu
  • final.txt -> karen@wormbase.org, mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, pws@caltech.edu, vanauken@caltech.edu
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/fly/emails/
  • developers.txt -> karen@wormbase.org, mueller@caltech.edu
  • curators.txt -> al806@cam.ac.uk, sjm41@cam.ac.uk, karen@wormbase.org, mueller@caltech.edu
  • final.txt -> mueller@caltech.edu, sjm41@cam.ac.uk, al806@cam.ac.uk, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, karen@wormbase.org


/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/yeast/emails/
  • curators.txt -> mueller@caltech.edu, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.org
  • developers.txt -> karen@wormbase.org, mueller@caltech.edu
  • final.txt -> mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.or

email files are also on the sandbox:

/data2/srv/textpresso-dev.caltech.edu/www/docroot/karen/gsa/


--Kyook (talk) 19:53, 26 February 2016 (UTC)