Difference between revisions of "GSA Markup SOP"

From WormBaseWiki
Line 80: Line 80:
 
*This final e-mail should be received within an hour of the upload.
 
*This final e-mail should be received within an hour of the upload.
  
 +
See [[Entity classes and problems]] for linking issues and how to deal with them.
  
===Proofreading===
+
=Proofreading=
 
*DJS sends WormBase the pdf formatted proof before sending the article back to the author.  
 
*DJS sends WormBase the pdf formatted proof before sending the article back to the author.  
 
*DJS requests a 24 hour turnaround.
 
*DJS requests a 24 hour turnaround.
Line 110: Line 111:
  
 
The script that forms the entity table can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.
 
The script that forms the entity table can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.
 
===Entity Sources===
 
Entity lists are available [http://textpresso-dev.caltech.edu/gsa/worm/known_entities/ here]
 
 
01downloadModEntities.pl downloads entities<br>
 
From acedb on spica
 
*Anatomy_name
 
*Clone
 
*Rearrangement
 
*Strain
 
*Variation
 
From postgres on tazendra
 
*Genes
 
*Transgenes
 
 
02formSortedLexicon.pl forms the lexicon from the known_entities.
 
 
===URL constructors===
 
http://www.wormbase.org/wiki/index.php/Linking_To_WormBase
 
http://www.wormbase.org/db/get?name=X;class=Y
 
 
Currently we form URLs from public names rather than object IDs.  These links are redirected to the correct page, whose URL is based on the object ID.  Linking by public name instead of ID avoids links dying when objects are merged or killed.
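
The name-based pattern above can be sketched in Python. This is only an illustration of the <code>name=X;class=Y</code> form shown on this page, not the code used by the linking scripts; the function name <code>wormbase_url</code> is hypothetical.

```python
from urllib.parse import quote

# Base of the name-based URL pattern given above.
BASE_URL = "http://www.wormbase.org/db/get"


def wormbase_url(public_name: str, object_class: str) -> str:
    """Build a link of the form .../db/get?name=X;class=Y from a public name.

    Hypothetical helper: both values are percent-encoded so that any
    reserved characters in a name cannot break the query string.
    """
    name = quote(public_name, safe="")
    cls = quote(object_class, safe="")
    return f"{BASE_URL}?name={name};class={cls}"


print(wormbase_url("mpk-1", "Gene"))
print(wormbase_url("ga117", "Variation"))
```

Because the link carries only the public name and class, the redirect to the ID-based page is left to WormBase, as described above.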
 
 
===Entities===
 
Object -> ''source'' ; web page link <br>
 
<example object pages><br>
 
 
'''Gene''' -> ''ace server (more current than dev)'' ; Gene Summary page URL uses public gene name, which is redirected to WBGeneID<br>
 
<http://www.wormbase.org/db/get?name=mpk-1;class=Gene><br>
 
<http://www.wormbase.org/db/get?name=F43C1.1;class=Gene><br>
 
<br>
 
 
'''Protein''' -> ''ace server'' ; Gene Summary page:<br>
 
<http://www.wormbase.org/db/get?name=egl-4;class=Gene><br>
 
<br>
 
 
'''Allele/SNP''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' ; Variation Report page<br>
 
<http://www.wormbase.org/db/get?name=ga117;class=Variation> <br>
 
<http://www.wormbase.org/db/get?name=hw42941;class=Variation><br>
 
<br>
 
 
'''Transgene''' (inserted transgenes only) ->  ''acedb database on tazendra at /home3/acedb/ws/acedb'' ;  Transgene Summary page <br>
 
<http://www.wormbase.org/db/get?name=oxIs12;class=Transgene><br>
 
<br>
 
 
'''Rearrangement''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' <br>
 
<br>
 
 
'''Strain''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb''; Strain Report page<br>
 
<http://www.wormbase.org/db/get?name=MH37;class=Strain ><br>
 
<br>
 
 
'''Clone''' -> ''acedb database on tazendra at /home3/acedb/ws/acedb''; Sequence Summary page<br>
 
<http://www.wormbase.org/db/get?name=yk1106g06.3;class=Sequence><br>
 
<http://www.wormbase.org/db/get?name=OSTR153G5_1;class=Sequence ><br>
 
<br>
 
 
'''suppress for now''' Anatomy/cell -> ''acedb database on tazendra at /home3/acedb/ws/acedb'' with corrections by Raymond ; Summary of anatomy ontology term<br>
 
<http://www.wormbase.org/db/get?name=HSN;class=Anatomy_term><br>
 
<http://www.wormbase.org/db/get?name=Z1.ppp;class=Anatomy_term><br>
 
<br>
 
 
'''suppress for now''' Author -> ''The links are formed using the URL pattern <br>
 
http://www.wormbase.org/db/misc/person?name=$url_encoded_name;paper=$wbpaper_id, <br>
 
where $url_encoded_name is the full name of an author with the following modifications: <br>
 
i) periods after middle names removed and <br>
 
ii) spaces converted to %20, which is the URL-encoded equivalent of a space. <br>
 
 
Examples:<br>
 
Name: Eve W. L. Chow <br>
 
Link: http://www.wormbase.org/db/misc/person?name=Eve%20W%20L%20Chow;paper=WBPaper00038245
 
 
Name: I. Russel Lee <br>
 
Link: http://www.wormbase.org/db/misc/person?name=I%20Russel%20Lee;paper=WBPaper00038245
 
 
(There may also be other special characters with accents, umlauts, etc. If the author name already exists in WormBase, the best thing to do is to check WormBase and use the correct link. Otherwise, the English equivalent of the special character can be determined visually, a new author link formed, and Cecilia notified.)
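
The two modifications listed above can be sketched in Python as follows. This is a minimal illustration, not the script used in the pipeline; the function name <code>author_url</code> is hypothetical. Following the examples above, all periods after initials are removed, and accented characters are deliberately not handled because they need the manual check described above.

```python
def author_url(full_name: str, wbpaper_id: str) -> str:
    """Form an author link from a full name and a WBPaper ID.

    Hypothetical helper applying only the two stated rules:
    i)  remove the periods after initials,
    ii) replace spaces with %20.
    Accented or other special characters are NOT converted here.
    """
    without_periods = full_name.replace(".", "")
    # split() also collapses any accidental double spaces in the name
    encoded_name = "%20".join(without_periods.split())
    return (
        "http://www.wormbase.org/db/misc/person"
        f"?name={encoded_name};paper={wbpaper_id}"
    )


print(author_url("Eve W. L. Chow", "WBPaper00038245"))
print(author_url("I. Russel Lee", "WBPaper00038245"))
```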
 
 
--[[User:Kyook|kjy]] 22:04, 19 April 2012 (UTC)
 
  
 
==People==
 
==People==
'''Caltech''': Arun Rangarajan, Karen Yook, Juancarlos Chan, Paul Sternberg, Hans-Michael Müller, Chris Grove, Daniela Raciti, James Done<br>
+
'''Caltech''': ''Current'': Karen Yook, Hans-Michael Müller, James Done, Juancarlos Chan, Paul Sternberg. ''Alum'': Arun Rangarajan, Chris Grove, Daniela Raciti <br>
 
'''GSA:''' Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson<br>
 
'''GSA:''' Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson<br>
 
'''DJS (Dartmouth Journal Services):''' Stephen Haenel, Sharon Faelten, Lolly Otis<br>
 
'''DJS (Dartmouth Journal Services):''' Stephen Haenel, Sharon Faelten, Lolly Otis<br>
  
  
 
+
--[[User:Kyook|kjy]] 22:58, 23 July 2013 (UTC)
 
[[Category:Curation]]
 
[[Category:Curation]]
 
[[Category:GSA_markup]]
 
[[Category:GSA_markup]]

Revision as of 22:58, 23 July 2013

Caltech_documentation

Visual overviews of linking pipeline
Marked-up papers

GENETICS Editors duties

  • Retrieve WBPaperID: GSA Editors enter the approved paper's doi into the WBPaperID ticketer here to retrieve a WBPaperID and a link to a Journal First Pass form (JFP) for the paper. Note: the editors can inadvertently miss this step. The ticketer captures the last six digits of the article's doi, which are used as an identifier throughout the linking and QC process. Mappings between dois and WBPaperIDs can be found here.
  • GSA Editors send a notice of approval to the author along with the link to the JFP.
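
The six-digit identifier mentioned above could be derived along these lines. This is only a sketch: the ticketer's actual code is not shown on this page, the function name `doi_identifier` is hypothetical, and the doi in the example is made up for illustration.

```python
import re


def doi_identifier(doi: str) -> str:
    """Return the last six digits of a doi, used here as the paper identifier.

    Assumption: the doi ends in at least six consecutive digits. Anything
    else is rejected rather than guessed at.
    """
    match = re.search(r"(\d{6})$", doi.strip())
    if match is None:
        raise ValueError(f"doi does not end in six digits: {doi!r}")
    return match.group(1)


# Made-up doi, for illustration only
print(doi_identifier("10.1534/genetics.000.123456"))
```

Raising an error for an unexpected doi, rather than returning a partial identifier, keeps a malformed entry from silently propagating through the linking and QC steps.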

Ask Genetics editors if they can submit the paper using the WBPaperID or send more info on the paper

WBPaperID ticketer

  • The ticketer generates a link to a Journal First Pass form (JFP); the link for each paper is accessible here. (The GOid mappings were put in place for GO linking and are not currently used.)
  • The ticketer enters the paper into postgres as a doi object only. All information about the paper (title, authors, abstract, PMID, etc.) is entered when the paper comes through the normal paper pipeline.
  • The ticketer sends an alert to the QCFast curator and WBPaper curator (Kimberly) that a WBPaperID was assigned.
  • The paper should show up in PubMed within a couple of days or weeks, depending on when the doi was entered and where the paper is in the processing stage.
  • The WBPaper curator (Kimberly) uses the doi to manually identify the paper during the normal paper pipeline. In the normal paper pipeline, a cron job fetches PMIDs from PubMed. Note that the cron job doesn't look for dois; it only looks for PMIDs. When a new Genetics or G3 paper shows up in the paper editor, Kimberly manually looks for the corresponding doi to make sure the paper has already been entered, and then manually adds the PMID. Once the PMID is added, Kimberly approves the paper through the regular paper pipeline.

Journal author first pass form (JFP)

  • An alert is sent to the QCFast curator when an author submits data through the JFP. The QCFast curator must check that the entered data are in a usable format (see the format requirements below). Alerts (flags) are also sent to the appropriate WB curators to notify them of new objects.
  • The JFP differs from the author first pass form sent out to authors of publications in other journals; although all the datatypes on the form are the same, the JFP highlights those data types that are linked and requests the authors to declare new objects that don't exist in WB at the time.
  • Journal author data entered into linked classes of the form are added automatically to the corresponding class entity list.
  • Since the linking scripts perform exact string match (and formatting cue recognition), if data is not entered correctly into the JFP,
    • the object may not be recognized in the paper
    • extraneous information will be added to the entity list and used to mark up objects in the paper
  • The QCFast curator needs to check the JFP entries for a given paper to make sure everything is entered correctly, or to silence strings not meant to be marked up. Strings are silenced by placing two tildes, "~~", before any text not meant to be added to an entity list.

For example

"ZK1128.2, mett-10, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."    
to
"ZK1128.2, mett-10~~, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."

In the first case, every word would have been added to the gene entity list (as it was entered in the genesymbol data field). The "~~" hides everything that comes after it, so that only the gene names are added (although in this case ZK1128.2 does not necessarily need to be added, as it should already be on the gene list).
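
The effect of the "~~" marker on a comma-separated field can be sketched as below. This is an illustration under stated assumptions, not the pipeline's parser: the function name `entities_from_field` is hypothetical, and it assumes the marker hides everything up to the end of the field, as described above.

```python
def entities_from_field(field_value: str) -> list[str]:
    """Return the entity names a JFP field would contribute to an entity list.

    Assumptions: entries are comma-separated, and everything after the
    first "~~" is hidden from the entity list.
    """
    visible = field_value.split("~~", 1)[0]
    return [item.strip() for item in visible.split(",") if item.strip()]


silenced = (
    "ZK1128.2, mett-10~~, which is orthologous to the human "
    "methyltransferase-10 domain containing protein (METT10D), "
    "and corresponds to ZK1128.2."
)
print(entities_from_field(silenced))
```

With the marker in place only the two gene names survive; without it, the free-text fragments after each comma would also be treated as entries.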

  • To see author-submitted data for all JFPs
    • access the JFP index here
    • choose "textpresso" from the drop down menu in the upper left corner of the page
    • click "Show Data"
    • data are listed by paper and data type
  • Format for data entered through the form:
    • Data need to be comma-separated and need to be in the standard format for each data type and all extraneous information hidden (see below).
    • Gene
      • public_name
      • CDS
      • unapproved gene names need to be cleared by genenames@wormbase
    • Transgene
      • integrated transgene nomenclature format only
      • 'Ex' objects that exist in WormBase
      • entities within transgenes/construct expressions are linked, e.g. in the expression 'sur-5::GFP', 'sur-5' is linked.
    • Antibody - no longer linked
    • Proteins - link will be to gene page
    • Alleles
      • currently only C. elegans alleles; priority on curating alleles of other species is very low
    • Strains
      • currently only marking up known C. elegans strains, mainly available through the CGC; priority on curating other species is very low

Linking steps

Visual overviews of linking pipeline
Linking scripts
Entity classes and problems
Markup excerpts

Dartmouth Journal Services (DJS), the GSA journal publisher, uploads the file, as NLM-formatted XML, by ftp to the textpresso server.

  • DJS requests 48 hours to receive back the linked file.
  • The QCFast Curator receives an e-mail alert that a paper has been uploaded and to expect the linked paper soon.
  • The paper goes through the linking script and the create_entity_table script; URLs to the linked html and to the entity list table are e-mailed to the curator.
  • When there are no problems with the file and linking, the curator can expect the follow-up e-mail with the links no later than an hour after the initial e-mail. If there is a problem with the file, the process is stopped until the QCFast developer can fix the pipeline.
  • N.B. There is a bug in the linking script that stalls the process on certain characters. Unfortunately, the bug is frequent enough to be noticeable, but not frequent enough to warrant a full debugging.

The QCFast curator goes through the marked-up html view to make sure the links are correct and no objects are missed.

  • The QCFast curator can add or remove links through the QCFast interface.
  • If there are multiple objects not linked, the QCFast curator can opt to add all missing objects to the JFP for that paper and rerun the linking script (WB interface only).
  • If links cannot be made, e.g., due to XML formatting, the QCFast curator can download the file and edit the links manually. The manually edited file must keep the same name as the original file, or uploading it will result in an error.

The final edited linked file, as an html, is uploaded by ftp to the DJS servers.

  • A final e-mail is sent to the QCFast curator that confirms the upload.
  • Uploading the file will also launch a script that creates a final entity link table.
  • This final e-mail should be received within an hour of the upload.

See Entity classes and problems for linking issues and how to deal with them.

Proofreading

  • DJS sends WormBase the pdf formatted proof before sending the article back to the author.
  • DJS requests a 24 hour turnaround.
  • QCFast curator relays all suggested changes to DJS, as the curator cannot make any changes to the doc itself.
  • DJS sends proof to author.
  • If the author gets back to DJS about errors or suggested links, DJS consults the curator about the feedback and the curator recommends an action (remove the link or send a corrected URL).

Project issues

Ambiguous names

This problem refers to links being made to the wrong entity because the same name is used for multiple objects within a species or across species.

  • Intraspecies issue: In WB some clone names and strain names are identical. This problem could not be resolved at the script level; these ambiguous names are put on an exclusion list instead.
  • Interspecies issue: Gene names, including aliases, from SGD, Flybase, and MGI should be collected and compared with our own gene lists, including all synonyms. Arun will be making a splash page on textpresso dev to house all these ambiguous names, to direct readers if a proper URL cannot be resolved (2/18/10 - last action taken)
    • Example: general anatomy terms can be linked to the wrong species

Problem: in this C. elegans paper, 'epithelium' was linked to the elegans epithelium page
"T cells migrate toward (Fong et al. 2002). Additionally, loss of GRK3, which is highly expressed in mouse olfactory epithelium (Schleicher et al. 1993), "
Action: "epithelium" was added to the anatomy exclusion list. From this paper, we also added 'head', 'tail', 'neuron', and 'sensory neuron' to the exclusion list.
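
How an exclusion list interacts with exact string matching can be sketched as follows. The real linking scripts are not reproduced on this page, so this is only an illustration: the function name `find_entity_mentions`, the whole-word matching rule, and the test sentence are assumptions.

```python
import re

# Terms named in the action above
ANATOMY_EXCLUSIONS = {"epithelium", "head", "tail", "neuron", "sensory neuron"}


def find_entity_mentions(text: str, entities: set[str], excluded: set[str]) -> list[str]:
    """Return exact, whole-word matches of entity names, skipping excluded names.

    Hypothetical sketch: longer names are tried first so that
    'sensory neuron' would win over 'neuron' at the same position.
    """
    names = sorted((n for n in entities if n not in excluded), key=len, reverse=True)
    if not names:
        return []
    alternatives = "|".join(re.escape(n) for n in names)
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return [m.group(0) for m in pattern.finditer(text)]


sentence = "GRK3 is highly expressed in mouse olfactory epithelium; HSN is unaffected."
anatomy_terms = {"epithelium", "HSN"}
print(find_entity_mentions(sentence, anatomy_terms, ANATOMY_EXCLUSIONS))
```

An exclusion list trades recall for precision: a genuine C. elegans use of an excluded term is no longer linked automatically and would have to be added by hand during QC.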

Longevity of links

Things for consideration

  • How can we ensure the links are maintained?
  • How can we ensure non-WB links are maintained?
  • Link checking can be used to make sure silent links have been dealt with.

At WormBase we will be addressing the stability of our links by collecting the URLs we use/create and periodically running a link checker to make sure the links are still alive. The link checker will point out any links that have gone dead; however, it will not check if the page is relevant. We will be working out the specifics of this aspect of the project.

All the linked entities in a paper, along with their link status at the time of sending the linked XML back to DJS, are available here

The script that forms the entity table can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.
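
A periodic link check of the kind described above might look like the sketch below. It is not the project's link checker: the function names are hypothetical, and treating any HTTP status of 400 or above (or no response at all) as dead is an assumption. As noted above, such a check only shows that a page responds, not that it is still relevant.

```python
from typing import Callable, Iterable, Optional
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:  # server answered with an error status
        return err.code
    except (URLError, OSError, ValueError):  # no answer, or malformed URL
        return None


def dead_links(
    urls: Iterable[str],
    fetch_status: Callable[[str], Optional[int]] = http_status,
) -> list[str]:
    """Return the URLs whose status is missing or >= 400 (assumed 'dead')."""
    dead = []
    for url in urls:
        status = fetch_status(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead


# Offline demonstration with a stubbed status lookup instead of real requests
stub_statuses = {"http://example.org/alive": 200, "http://example.org/gone": 404}
print(dead_links(stub_statuses, fetch_status=stub_statuses.get))
```

Passing the status lookup in as a parameter lets the check be exercised without network access; in a scheduled run the default `http_status` would be used against the collected URLs.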

People

Caltech: Current: Karen Yook, Hans-Michael Müller, James Done, Juancarlos Chan, Paul Sternberg. Alum: Arun Rangarajan, Chris Grove, Daniela Raciti
GSA: Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson
DJS (Dartmouth Journal Services): Stephen Haenel, Sharon Faelten, Lolly Otis


--kjy 22:58, 23 July 2013 (UTC)