GSA Markup SOP

From WormBaseWiki

Caltech_documentation

GSA linking pipeline

GENETICS Editors duties

  • Retrieval of WBPaperID: GSA Editors enter the approved paper's doi into the WBPaperID ticketer here to retrieve a WBPaperID. Note: editors can inadvertently miss this step.
    • The ticketer captures the doi of the article; the last 6 digits are used as an identifier throughout the linking and QC process.
    • The doi is used to manually identify the paper during the normal PubMed paper pipeline.
    • The ticketer generates a link to a Journal First Pass form (JFP).
  • GSA Editors send authors the JFP link
  • GSA Editors send a notice of approval to the author along with the link to the JFP; the link for each paper is accessible here.
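
As a rough illustration of the six-digit identifier mentioned above, the snippet below pulls the last six digit characters out of a doi. This is only a sketch: the function name and the example doi are hypothetical, and it assumes "last 6 digits" means the final six digit characters with punctuation ignored. The ticketer's actual code is not shown on this page.

```python
import re

def doi_short_id(doi: str) -> str:
    """Return the last six digit characters of a doi (hypothetical helper)."""
    digits = re.findall(r"\d", doi)
    if len(digits) < 6:
        raise ValueError(f"doi has fewer than six digits: {doi!r}")
    return "".join(digits[-6:])

# Made-up doi, for illustration only.
print(doi_short_id("10.1534/genetics.000.123456"))  # prints 123456
```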

Ask Genetics editors if they can submit the paper using the WBPaperID or send more info on the paper

Postgres/WB actions

  • WBPaperID assignment enters the paper into postgres as a doi object only. All information about the paper (title, authors, abstract, PMID, etc.) is entered when the paper comes through the normal paper pipeline.
  • The doi and WBPaperID mappings can be accessed here. This form also has links to the JFP form (first 'link' after the WBPaperID) and the GO term to GOid mappings (second 'link'). The GOid mappings were put in place for GO linking and are not currently used.
  • Entering of the doi and WBPaperID assignment by the GENETICS Editor leads to an alert being sent to the QCFast curator and Kimberly.
  • The paper should show up in PubMed within a couple of days to weeks, depending on when the doi was entered and where the paper is in the processing stage.
  • In the normal paper pipeline, a cron job fetches PMIDs from PubMed. Note: the cron job doesn't look for doi's; it only looks for PMIDs. When a new Genetics or G3 paper shows up in the paper editor, Kimberly manually looks for the corresponding doi to make sure the paper has been entered already, and the PMID is then manually added. Once the PMID is added, Kimberly approves the paper through the regular paper pipeline.

Journal author first pass form (JFP)

  • An alert is sent to the QCFast curator when an author submits data through the JFP. The QCFast curator must check the entered data to make sure it has been entered in a useful format (see below). Alerts (flags) are also sent to the appropriate WB curator to alert them to new objects.
  • The JFP differs from the author first pass form sent out to authors of publications in other journals; although all the datatypes on the form are the same, the JFP highlights those data types that are linked and requests the authors to declare new objects that don't exist in WB at the time.
  • Journal author data entered into linked classes of the form are added automatically to the corresponding class entity list.
  • Since the linking scripts perform exact string match (and formatting cue recognition), if data is not entered correctly into the JFP,
    • the object may not be recognized in the paper
    • extraneous information will be added to the entity list and used to mark up objects in the paper
  • The QCFast curator needs to check the JFP entries for a given paper to make sure everything is entered correctly, or to silence strings not meant to be marked up. Strings are silenced by placing two tildes, "~~", before any text not meant to be added to an entity list.

For example

"ZK1128.2, mett-10, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."    
to
"ZK1128.2, mett-10~~, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."

In the first case every word would have been added to the gene entity list (as it was entered in the genesymbol data field). The "~~" hides everything that comes after it, so that only the gene names are added (although in this case ZK1128.2 does not necessarily need to be added, as it should already be on the gene list).
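
The effect of "~~" on a comma-separated JFP field can be sketched as below. This is an illustration under stated assumptions, not the JFP scripts themselves: the function name is hypothetical, and it assumes that everything after the first "~~" is discarded and that the remainder is split on commas.

```python
def visible_entries(field_value: str) -> list[str]:
    """Split a JFP field into entries, ignoring everything after the first '~~'."""
    kept = field_value.split("~~", 1)[0]
    # Drop empty pieces left behind by trailing commas or stray whitespace.
    return [entry.strip() for entry in kept.split(",") if entry.strip()]

raw = ("ZK1128.2, mett-10~~, which is orthologous to the human "
       "methyltransferase-10 domain containing protein (METT10D), "
       "and corresponds to ZK1128.2.")
print(visible_entries(raw))  # ['ZK1128.2', 'mett-10']
```

Without the "~~", the same function would also return the free-text fragments as entries, which is the kind of extraneous entity-list content the QCFast curator is checking for.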

  • To see author-submitted data for all JFPs:
    • access the JFP index here
    • choose "textpresso" from the drop-down menu in the upper left corner of the page
    • click "Show Data"
    • data are listed by paper and data type
  • Format for data entered through the form:
    • Data need to be comma-separated and in the standard format for each data type, with all extraneous information hidden (see below).
    • Gene
      • public_name
      • CDS
      • unapproved gene names need to be cleared by genenames@wormbase
    • Transgene
      • integrated transgene nomenclature format only
      • 'Ex' objects that exist in WormBase
      • entities within transgenes/construct expressions are linked, e.g. in the expression 'sur-5::GFP', 'sur-5' is linked.
    • Antibody - no longer linked
    • Proteins - link will be to gene page
    • Alleles
      • currently only C. elegans alleles; priority on curating alleles of other species is very low
    • Strains
      • currently only marking up known C. elegans strains, mainly available through the CGC; priority on curating other species is very low


Entity linking scripts

GSA linking pipeline Linking script pipeline

Dartmouth Journal Services (DJS), the GSA journal publisher, uploads the file, as NLM-formatted XML, by ftp to the textpresso server.

  • DJS requests 48 hours to receive back the linked file.
  • The QCFast Curator receives an e-mail alert that a paper has been uploaded and to expect the linked paper soon.
  • The paper goes through the linking script and the create_entity_table script; URLs to the linked html and to the entity list table are e-mailed to the curator.
  • When there are no problems with the file and linking, the curator can expect the follow-up e-mail with the links no later than an hour after the initial e-mail. If there is a problem with the file, the process is stopped until the QCFast developer can fix the pipeline.
  • N.B. There is a bug in the linking script that stalls the process on certain characters. Unfortunately, the bug is frequent enough to be noticeable, but not frequent enough to warrant a full debugging.

The QCFast curator goes through the marked-up html view to make sure the links are correct and no objects are missed.

  • If there are multiple objects not linked, the QCFast curator can add all missing objects to the JFP for that paper and rerun the linking script. Otherwise the QCFast curator can add and delete links through the QCFast interface itself.
  • If links cannot be made, e.g., due to XML formatting, the QCFast curator can download the file and edit the links manually. The name of the file must remain the same as the original file, or uploading the manually edited file will result in an error.

The final edited linked file, as an html, is uploaded by ftp to the DJS servers.

  • A final e-mail is sent to the QCFast curator that confirms the upload.
  • Uploading the file will also launch a script that creates a final entity link table.
  • This final e-mail should be received within an hour of the upload.

Proofreading

  • DJS sends WormBase the pdf formatted proof before sending the article back to the author.
  • DJS requests a 24 hour turnaround.
  • QCFast curator relays all suggested changes to DJS, as the curator cannot make any changes to the doc itself.
  • DJS sends proof to author.
  • If the author gets back to DJS about errors or suggested links, DJS consults the curator about the feedback and the curator recommends an action (remove the link or send a corrected URL).

Linking/Proofing Issues

Entities don't exist in WB yet

New entities being described for the first time can still be linked to pages in WB. These links remain silent until the objects are created in subsequent releases. These new objects need to be added to entity lists via the JFP form (see above). If authors submit the form as soon as they receive the link to it, the objects are added to the entity list before the XML is received and the linking scripts will link the new objects. One problem is that authors sometimes enter information on the form in a way that the scripts cannot read. A curator must visually check the data from the JFP to make sure the information was entered properly. The QCFast curator is alerted when an author submits a form, and the alert includes a summary of the information. Bad entries need to be manually blocked from being read by the JFP scripts by placing two tildes, "~~", before the entry.

Gene names don't exist in WB

Gene names need to be approved by WB nomenclature curators. In most cases, the gene name has not only been approved but already exists on the ace server, and so has already been added to the gene list. Authors should declare any new gene names before the linking scripts are run, but authors do not always follow through. One action is to link the name to the official WBGene sequence ID. This requires that the authors have explicitly stated what the gene sequence is. Linking the name to the gene should not affect what happens in acedb; the link lives within the online paper alone. If an author is not explicit, the best option is to contact the author; if there is no response within a reasonable time, no link will be created. This situation is not frequent.

For example, a paper had two gene names, "cbr-dpy-1" and "cpy-4", both referring to the same briggsae sequence, but neither name existed in acedb. In this case, link to the WBGene page rather than creating a link with cbr-dpy-1 or cpy-4 as the public name, which is how other links are made.

Formatting problems break recognition

Entity recognition relies on exact string matches and XML formatting cues. If entities are not formatted correctly in the XML, they may not be recognized and linked. QCFast does not allow links to be made that do not follow XML formatting restrictions; to link these entities, it is necessary to download the html, manually add the link, then upload the fixed html.

  • Case (a): use <i>Cbr-dpy-1</i>, not Cbr-<i>dpy-1</i>. Otherwise the <i> tag becomes part of the string and string matching fails on Cbr-dpy-1, even if it is in the WormBase entity list.
  • Case (b): use <i>unc-119</i>, not <i>unc</i>-<i>119</i>.
  • Case (c): parentheses should not be italicized, since they are used as indicators of string starts and stops for entity recognition. Use <i>cbr-unc-119</i>(<i>st20000</i>), not <i>cbr-unc-119(st20000)</i>.
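
To see why the three cases above matter, here is a minimal sketch of exact string matching on italicized spans. It is not the actual linking script: the function name and the tiny lexicon are assumptions for illustration, and the real scripts use additional formatting cues.

```python
import re

# Non-greedy match of the text inside each <i>...</i> element.
ITALIC_SPAN = re.compile(r"<i>(.*?)</i>")

def italic_entities(xml_fragment: str, lexicon: set[str]) -> list[str]:
    """Return italicized spans that exactly match a lexicon entry."""
    return [span for span in ITALIC_SPAN.findall(xml_fragment) if span in lexicon]

# Hypothetical lexicon built from the names used in the cases above.
lexicon = {"Cbr-dpy-1", "unc-119", "cbr-unc-119", "st20000"}

print(italic_entities("<i>Cbr-dpy-1</i>", lexicon))                    # ['Cbr-dpy-1']
print(italic_entities("Cbr-<i>dpy-1</i>", lexicon))                    # []
print(italic_entities("<i>unc</i>-<i>119</i>", lexicon))               # []
print(italic_entities("<i>cbr-unc-119</i>(<i>st20000</i>)", lexicon))  # ['cbr-unc-119', 'st20000']
print(italic_entities("<i>cbr-unc-119(st20000)</i>", lexicon))         # []
```

In each failing case the italicized span is either a fragment of the name or the name fused with extra characters, so no exact match is found.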

combined entity expressions

  • double mutants, double alleles
  • dpy-11/12

Project issues

Ambiguous name lists

We are still working out the details for this issue.
From a meeting on 2/18/10, we are moving forward on dealing with ambiguous entity names, that is, names in a paper that may be used to describe multiple objects and which may result in the creation of a faulty link. This is a problem both within a single-species paper and one we will need to address in order to expand the project to other species and to multi-species papers.

  • Intraspecies issue: This problem is pertinent to entity lists within a single species paper. We have run into this problem with clone names and strain names being identical (for example) and not being able to resolve this discrepancy at the script level. Our initial response to these ambiguous names has been to put these names on an exclusion list.
  • Interspecies issue: This is particularly pertinent when dealing with a paper that contains gene names from different species. We are collecting gene names, including aliases, from SGD and FlyBase, and will compare them with our own gene lists, including all synonyms. We will expand the list comparison to mouse gene names. Arun will be making a splash page on textpresso dev to house all these ambiguous names, and to direct readers to if a proper URL cannot be resolved.
    • General anatomy terms can be linked to the wrong species

Problem: in this C. elegans paper, 'epithelium' was linked to the elegans epithelium page
"T cells migrate toward (Fong et al. 2002). Additionally, loss of GRK3, which is highly expressed in mouse olfactory epithelium (Schleicher et al. 1993), "
Action: "epithelium" was added to the anatomy exclusion list. From this paper, we also added 'head', 'tail', 'neuron', and 'sensory neuron' to the exclusion list.
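
The exclusion-list approach described in this section can be sketched as follows. This is an illustration only: the function names and the shared name "XX1" are hypothetical, and the real comparison of entity lists may work differently. The anatomy terms are the ones listed above.

```python
from collections import defaultdict

def ambiguous_names(entity_lists):
    """Return {name: set of classes} for names present in more than one class."""
    classes_by_name = defaultdict(set)
    for entity_class, names in entity_lists.items():
        for name in names:
            classes_by_name[name].add(entity_class)
    return {name: classes for name, classes in classes_by_name.items()
            if len(classes) > 1}

def apply_exclusions(names, exclusion_list):
    """Drop excluded names so they are never used to create links."""
    return set(names) - set(exclusion_list)

# Hypothetical entity lists; "XX1" is a made-up name shared by two classes.
entity_lists = {
    "Clone": {"XX1", "yk1106g06.3"},
    "Strain": {"XX1", "MH37"},
}
print(ambiguous_names(entity_lists))

anatomy_exclusions = {"epithelium", "head", "tail", "neuron", "sensory neuron"}
print(apply_exclusions({"HSN", "epithelium", "tail"}, anatomy_exclusions))  # {'HSN'}
```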

Longevity of links

Things for consideration

  • How can we ensure the links are maintained?
  • How can we ensure non-WB links are maintained?
  • Link checking can be used to make sure silent links have been dealt with.

At WormBase we will be addressing the stability of our links by collecting the URLs we use/create and periodically running a link checker to make sure the links are still alive. The link checker will point out any links that have gone dead; however, it will not check if the page is relevant. We will be working out the specifics of this aspect of the project.
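
A periodic link check of the kind described above might look like the sketch below. The specifics are still to be worked out, so everything here is an assumption: the function names are hypothetical, HEAD requests are used to avoid downloading page bodies (some servers may not support them), and, as noted above, a live status says nothing about whether the page is relevant.

```python
import urllib.error
import urllib.request

def http_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code for url, or None if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        # urlopen follows redirects, so redirected links report their final status.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return None

def dead_links(urls, fetch_status=http_status):
    """Return the URLs whose status is missing or is an error (>= 400)."""
    dead = []
    for url in urls:
        status = fetch_status(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead

# Offline demonstration with a stand-in for http_status and made-up URLs.
fake_statuses = {"http://example.org/alive": 200, "http://example.org/gone": 404}
print(dead_links(fake_statuses, fetch_status=fake_statuses.get))
# ['http://example.org/gone']
```

Passing the status function as a parameter keeps the checker testable without network access; in a cron job the default `http_status` would be used.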

All the linked entities in a paper, along with their link status at the time of sending the linked XML back to DJS, are available here

The script that forms the entity table can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.

Entity Sources

Entity lists are available here

01downloadModEntities.pl downloads entities
From acedb on spica

  • Anatomy_name
  • Clone
  • Rearrangement
  • Strain
  • Variation

From postgres on tazendra

  • Genes
  • Transgenes

02formSortedLexicon.pl forms the lexicon from the known_entities.

URL constructors

http://www.wormbase.org/wiki/index.php/Linking_To_WormBase

http://www.wormbase.org/db/get?name=X;class=Y

Currently we are forming URLs based on public names versus object class ID's. These links are redirected to the correct page, which is the URL based on object ID. By using the public name for the linking instead of the ID, we can avoid problems that may result in links dying due to objects getting merged or killed.
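
A minimal sketch of building a link from the `name=X;class=Y` pattern above is shown here. The function name is hypothetical, and percent-encoding unusual characters in the name is an assumption on our part, not something the pattern itself specifies.

```python
from urllib.parse import quote

BASE_URL = "http://www.wormbase.org/db/get"

def wormbase_url(public_name: str, object_class: str) -> str:
    """Build a name-based WormBase link; the site redirects it to the ID-based page."""
    return (f"{BASE_URL}?name={quote(public_name, safe='')}"
            f";class={quote(object_class, safe='')}")

print(wormbase_url("mpk-1", "Gene"))
# http://www.wormbase.org/db/get?name=mpk-1;class=Gene
```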

Entities

Object -> source ; web page link
<example object pages>

Gene -> ace server (more current than dev) ; Gene Summary page URL uses public gene name, which is redirected to WBGeneID
<http://www.wormbase.org/db/get?name=mpk-1;class=Gene>
<http://www.wormbase.org/db/get?name=F43C1.1;class=Gene>

Protein -> ace server ; Gene Summary page:
<http://www.wormbase.org/db/get?name=egl-4;class=Gene>

Allele/SNP -> acedb database on tazendra at /home3/acedb/ws/acedb ; Variation Report page
<http://www.wormbase.org/db/get?name=ga117;class=Variation>
<http://www.wormbase.org/db/get?name=hw42941;class=Variation>

Transgene (inserted transgenes only) -> acedb database on tazendra at /home3/acedb/ws/acedb ; Transgene Summary page
<http://www.wormbase.org/db/get?name=oxIs12;class=Transgene>

Rearrangement -> acedb database on tazendra at /home3/acedb/ws/acedb

Strain -> acedb database on tazendra at /home3/acedb/ws/acedb; Strain Report page
<http://www.wormbase.org/db/get?name=MH37;class=Strain>

Clone -> acedb database on tazendra at /home3/acedb/ws/acedb; Sequence Summary page
<http://www.wormbase.org/db/get?name=yk1106g06.3;class=Sequence>
<http://www.wormbase.org/db/get?name=OSTR153G5_1;class=Sequence>

suppress for now Anatomy/cell -> acedb database on tazendra at /home3/acedb/ws/acedb with corrections by Raymond ; Summary of anatomy ontology term
<http://www.wormbase.org/db/get?name=HSN;class=Anatomy_term>
<http://www.wormbase.org/db/get?name=Z1.ppp;class=Anatomy_term>

suppress for now Author -> The links are formed using the URL pattern
http://www.wormbase.org/db/misc/person?name=$url_encoded_name;paper=$wbpaper_id,
where $url_encoded_name is the full name of an author with the following modifications:
i) periods after middle names removed and
ii) spaces converted to %20, which is the URL-encoded equivalent of a space.

Examples:
Name: Eve W. L. Chow
Link: http://www.wormbase.org/db/misc/person?name=Eve%20W%20L%20Chow;paper=WBPaper00038245

Name: I. Russel Lee
Link: http://www.wormbase.org/db/misc/person?name=I%20Russel%20Lee;paper=WBPaper00038245

(There may also be other special characters with accents, umlauts, etc. The best thing to do is to check WormBase and use the correct link if the author name already exists. Otherwise, the English equivalent of the special character can be determined visually, a new author link formed, and Cecilia notified.)
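
The author link rules can be sketched as below. The function name is hypothetical. Following the two examples above, the sketch removes every period in the name (including one after a first initial) and encodes spaces as %20. Names with accents or umlauts are not handled specially here and would still need the manual check described above.

```python
from urllib.parse import quote

def author_url(full_name: str, wbpaper_id: str) -> str:
    """Build an author link: periods removed, spaces encoded as %20."""
    # Remove periods, then collapse any repeated whitespace to single spaces.
    name = " ".join(full_name.replace(".", "").split())
    return ("http://www.wormbase.org/db/misc/person"
            f"?name={quote(name, safe='')};paper={wbpaper_id}")

print(author_url("Eve W. L. Chow", "WBPaper00038245"))
# http://www.wormbase.org/db/misc/person?name=Eve%20W%20L%20Chow;paper=WBPaper00038245
```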

--kjy 22:04, 19 April 2012 (UTC)

People

Caltech: Arun Rangarajan, Karen Yook, Juancarlos Chan, Paul Sternberg, Hans-Michael Müller, Chris Grove, Daniela Raciti, James Done
GSA: Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson
DJS (Dartmouth Journal Services): Stephen Haenel, Sharon Faelten, Lolly Otis