GSA Markup SOP

GSA linking pipeline

from Arun's 2010 SAB Talk

SOP for linking

We are given 48 hours to mark up and return the XML document from Genetics. If it arrives on a Friday, we will have the weekend.

At Caltech, the XML is scanned by Arun's script, which identifies and marks up all known objects using exact matches and formatting rules. All of the objects (genes, variations, transgenes, rearrangements) are mined from servers rather than from a WB release so that they are as up to date as possible; genes that have been approved but are not yet on the website will therefore be recognized. Arun and I then go through the markup to make sure the links are correct and no objects are missed. Missed objects are added to Arun's object lists through the author first-pass form, so in practice curators may receive objects from that form from the author, from me, or from both.
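
The exact-match step can be pictured with a minimal Python sketch. This is not Arun's script: the entity dictionary, the function name find_known_entities, and the whole-token rule are assumptions made for illustration only.

```python
import re

# Hypothetical entity list (name -> object class). The real lists are mined from servers.
KNOWN_ENTITIES = {"unc-119": "gene", "dpy-1": "gene", "st20000": "variation"}


def find_known_entities(text, known=KNOWN_ENTITIES):
    """Return (name, object class, start offset) for every exact, whole-token match."""
    hits = []
    for name, obj_class in known.items():
        # The lookarounds keep 'dpy-1' from matching inside a longer name such as 'dpy-10'.
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        for match in re.finditer(pattern, text):
            hits.append((name, obj_class, match.start()))
    # Report hits in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[2])
```

In this sketch, find_known_entities("unc-119(st20000) and dpy-10") reports unc-119 and st20000 but not dpy-1, because only exact whole-token matches count.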

After correcting the object lists, we run the paper through the script again, do a second in-house proof, and then send it back to DJS (Dartmouth Journal Services). Going through a paper with a lot of genes takes 20-30 minutes.

Issues

Entities don't exist in WB yet

New entities being described for the first time can still be linked to pages in WB; the links will remain silent until the objects are created (the objects will show up on the public site about 2 months after the paper is received). These new objects need to be added to Arun's lists via the journal author first-pass form. Authors are asked to fill out this form when they are notified that their paper has been accepted. If they submit the form right away, the objects are added to Arun's lists before we receive the XML, and the linking run should identify all of the new objects. One problem is that authors sometimes enter information on the form in a way that the scripts cannot read, so a curator must visually check the data from the author form. I am alerted when an author submits a form, and the alert includes a summary of the submitted information that shows whether things were entered properly; if they were not, I need to go back and fix the entries. Bad entries are blocked from being read by Arun's scripts by placing two tildes ("~~") before the entry.
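
The "~~" convention can be sketched in a few lines of Python. This is an illustration only: how the form entries are actually stored is not described here, so the one-entry-per-line input and the function name readable_entries are assumptions.

```python
def readable_entries(lines):
    """Yield form entries, skipping blanks and any entry blocked with a leading '~~'."""
    for line in lines:
        entry = line.strip()
        # A curator marks an unreadable entry by placing '~~' in front of it.
        if not entry or entry.startswith("~~"):
            continue
        yield entry
```

For example, given the entries "abc-1" and "~~abc 1 (new gene)", only "abc-1" would be passed on to the linking lists.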

Gene names don't exist in WB yet

Gene names need to be approved by our nomenclature curators, Jonathan Hodgkin and Mary Ann Tuli. In most cases, the author has already contacted Jonathan and Mary Ann, and the gene name has not only been approved but already exists on the ace server, so it is already in the gene list. Authors are asked to declare new gene names before the linking scripts are run, but they do not always follow through. One option is to link the name to the official WBGene sequence ID. This requires that the authors have explicitly stated what the gene sequence is. Linking the name to the gene should not affect what happens in acedb; the link lives in the online paper alone. If an author is not explicit, the best option is for us to contact them. If we do not get a response within a reasonable time, no link will be created. We are still working on a suitably quick SOP for these cases.

For example, a paper used two gene names, "cbr-dpy-1" and "cpy-4", both referring to the same briggsae sequence, but neither name existed in acedb. In this case, link to the WBGene page rather than creating a link that uses cbr-dpy-1 or cpy-4 as the public name, which is how the other links are made.

Formatting problems break recognition

Entity recognition relies on exact string matches and XML formatting cues. If entities are not formatted correctly in the XML, they may not be recognized and linked. If you see an unlinked entity, check whether the formatting has been altered. If it has, send a message to Genetics and DJS so that the formatting is corrected at the source; we do not want to change the XML document ourselves. You can still finish checking the document for new objects and adding them to the lists in the meantime.

  • Case (a): Use <i>Cbr-dpy-1</i> instead of Cbr-<i>dpy-1</i>. Otherwise the <i> tag becomes part of the string and string matching fails on Cbr-dpy-1, even if we have it in the WormBase entity list.
  • Case (b): Use <i>unc-119</i> instead of <i>unc</i>-<i>119</i>.
  • Case (c): Parentheses should not be italicized, since they are used as an indicator of string starts and stops for entity recognition. Use <i>cbr-unc-119</i>(<i>st20000</i>) instead of <i>cbr-unc-119(st20000)</i>.
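
A rough way to see why these cases fail is to treat the text of each italic span as the candidate string and compare it with the entity list. The Python sketch below makes that simplifying assumption; the real script's rules are not documented here, and the names italic_spans, unmatched_spans, and KNOWN are hypothetical.

```python
import re

# Non-greedy match of the text between an <i> tag and the next </i> tag.
ITALIC_SPAN = re.compile(r"<i>(.*?)</i>")

# Hypothetical entity list for illustration.
KNOWN = {"Cbr-dpy-1", "unc-119", "cbr-unc-119", "st20000"}


def italic_spans(xml_fragment):
    """Return the text of each <i>...</i> span, in document order."""
    return ITALIC_SPAN.findall(xml_fragment)


def unmatched_spans(xml_fragment, known_entities):
    """Return the italic spans that are not an exact match for any known entity."""
    return [span for span in italic_spans(xml_fragment) if span not in known_entities]
```

Under this assumption, <i>Cbr-dpy-1</i> and <i>cbr-unc-119</i>(<i>st20000</i>) leave nothing unmatched, while Cbr-<i>dpy-1</i>, <i>unc</i>-<i>119</i>, and <i>cbr-unc-119(st20000)</i> each yield spans that match no entity.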



Entity list sources

To see the actual entities, go to http://dev.textpresso.org/gsa/known_entities/

Genes -> ace server (more current than dev)
Variations -> wscurrent.obo -> latest dev WS
Transgenes -> wscurrent.obo -> latest dev WS
Rearrangements -> wscurrent.obo
Strains ->
Clones (cosmids) ->
Anatomy/cell
Person list -> on the fly through Postgres (more current than dev)


People

Caltech: Arun Rangarajan, Karen Yook, Juancarlos Chan, Paul Sternberg, Hans-Michael Müller
GSA: Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson
DJS (Dartmouth Journal Services): Stephen Haenel, Sharon Faelten, Lolly Otis