GSA Markup SOP

GSA markup

GSA linking pipeline
Entity classes and problems
Examples
- Marked up papers
- Markup excerpts

SOP for linking

Once a paper is approved for publication, Genetics enters its DOI into Juancarlos's WBPaperID ticketer to retrieve a WBPaperID, which we use to track the paper. At this point the paper exists in our local database (postgres) as a DOI object only; the rest of its information is entered when the paper comes through the normal PubMed pipeline. When Genetics enters the DOI and is assigned a WBPaperID, an alert is sent to Kimberly and me that a Genetics paper is coming to us soon. The paper should show up in PubMed within a couple of days, depending on how far along it is in the review and compilation process at Genetics and on when the DOI was entered. Juancarlos has written a script that identifies the DOI as it comes through PubMed, so that the paper is not assigned a second WBPaperID and duplicated in postgres.
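The duplicate check described above can be sketched as follows. This is only an illustration, not Juancarlos's script: the function names, the in-memory dictionary standing in for postgres, and the example DOI are all assumptions.

```python
def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def assign_paper_id(incoming_doi: str, known: dict, next_number: int):
    """Return (WBPaperID, next_number), reusing the ID if the DOI is already known.

    `known` is a hypothetical stand-in for the DOI objects stored in postgres.
    """
    key = normalize_doi(incoming_doi)
    if key in known:
        # The ticketer already issued an ID for this DOI: do not create a second one.
        return known[key], next_number
    paper_id = f"WBPaper{next_number:08d}"
    known[key] = paper_id
    return paper_id, next_number + 1


# Made-up DOI: first entered by the journal, later seen again via PubMed.
known_dois = {}
first_id, counter = assign_paper_id("10.9999/example.1", known_dois, 40000)
second_id, counter = assign_paper_id("DOI:10.9999/Example.1", known_dois, counter)
print(first_id, second_id)
```

Both calls return the same ID because the DOI is normalized before the lookup, so the paper is recorded once.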

Genetics sends a notice of approval to the author along with a link to a journal author first pass form. This form differs from the standard author first pass form that is sent to C. elegans authors who publish in other journals. Although all the data types on the two forms are the same, the journal author form highlights a select few that the authors should fill out for us, primarily the data types that we provide links to. The emphasis is on asking authors to tell us about every new object that is in their paper and does not yet exist in WB.

Journal author data entered into these select fields are added automatically to Arun's linking lists. Because the scripts rely on exact string matching (and recognition of formatting cues), objects that are not entered correctly will not be recognized, so a curator has to make sure that all the author data are in a format that Arun's script recognizes. I receive an e-mail alert when an author has entered data, and I check that the data have been entered in a usable format.

Ideally the author has entered the data, and it is sitting correctly in our database, before we receive the XML from DJS. Once the XML is received from DJS, Arun's script is run to recognize all objects on the entity lists and generate URLs to WB.

Arun and I go through the marked-up XML to make sure the links are correct and no objects are missed. It takes 20-40 minutes to manually go through a paper with a lot of genes.

Some objects, namely genes and persons, are mined from servers and local databases rather than from a WB release so that they are as up to date as possible; these objects will be recognized even if they are not yet on the website. As stated above, objects that are not on any of the lists are added to Arun's object lists through the author first pass form, so in practice curators may receive objects through the form from the author, from me, or from both. See elsewhere on this wiki for discussion of pre-WB approved object names.

After correcting the object lists, we run the paper through the script again, do a second in-house proof, and then send it back to DJS. We are given 48 hours to mark up and return the XML document from Genetics. If it arrives on a Friday, we have the weekend.

Issues

Entities don't exist in WB yet

New entities being described for the first time can still be linked to pages in WB. These pages will remain silent until the objects are created (they will show up on the public site about 2 months after the paper is received). The new objects need to be added to Arun's lists via the journal author first pass form. Authors are asked to fill out this form when they are notified of the acceptance of their paper. If they submit the form right away, the objects are added to Arun's lists before we receive the XML, and the linking should identify all new objects for which we create links. One problem is that authors sometimes enter information on the form in a way that the scripts cannot read. A curator must visually check the data from the author form to make sure the information was entered properly. I am alerted when an author submits a form, and the alert includes a summary of the information that tells me whether things were entered properly; if they were not, I need to go back and fix the entries. Bad entries are blocked from being read by Arun's scripts by placing two tildes ("~~") before the entry.
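As a rough illustration of the "~~" convention, a reader of the author-form entries could skip blocked lines like this. The function name and the sample entries are hypothetical; Arun's scripts may handle this differently.

```python
BLOCK_PREFIX = "~~"  # marker a curator places before an entry the scripts should ignore


def usable_entries(raw_entries):
    """Return cleaned entries, skipping blanks and entries blocked with '~~'."""
    kept = []
    for entry in raw_entries:
        text = entry.strip()
        if not text or text.startswith(BLOCK_PREFIX):
            continue
        kept.append(text)
    return kept


# Made-up form data: one entry is free text that a curator has blocked.
form_data = ["abc-1", "~~we describe a new gene, see Figure 2", "  xyz-3  ", ""]
print(usable_entries(form_data))
```

Only entries that survive this filter would be added to the linking lists, which is why the curator fixes or blocks malformed entries before the XML arrives.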

Gene names don't exist in WB yet

Gene names need to be approved by our nomenclature curators, Jonathan Hodgkin and Mary Ann Tuli. In most cases, the author has already contacted Jonathan and Mary Ann, and the gene name has not only been approved but already exists on the ace server and so has already been added to the gene list. Authors are asked to declare these new gene names before the linking scripts are run, but they do not always follow through. One option is to link the name to the WBGene ID of the corresponding sequence. This requires that the authors have explicitly stated what the gene sequence is. Linking the name to the gene should not affect what happens in acedb; the link lives within the online paper alone. If an author is not explicit, the best option is for us to contact them; if we do not get a response within a reasonable time, no link will be created. We are still working on a suitably quick SOP for these cases.

For example, a paper used two gene names, "cbr-dpy-1" and "cpy-4", both referring to the same briggsae sequence, but neither name existed in acedb. In this case, link to the WBGene page rather than creating a link that uses cbr-dpy-1 or cpy-4 as the public name, which is how the other links are made.

Formatting problems break recognition

Entity recognition relies on exact string matches and XML formatting cues. If entities are not formatted correctly in the XML, they may not be recognized and linked. If you see an unlinked entity, check whether the formatting has been altered; if it has, a message needs to be sent to Genetics and DJS so that the formatting is corrected at the source, as we do not want to change the XML document ourselves. In the meantime, you can still finish checking the document for new objects and adding them to the lists.

Case (a): Cbr-dpy-1 instead of Cbr-dpy-1

Otherwise the tag becomes part of the string and string matching fails on Cbr-dpy-1, even if we have it in the wormbase entity list.

Case (b): unc-119 instead of unc-119

Case (c): Parentheses shouldn't be italicized since they are used as an indicator of string starts and stops for entity recognition.

cbr-unc-119(st20000) instead of cbr-unc-119(st20000)
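To illustrate why a tag inside a name defeats exact matching, here is a minimal sketch. The `<i>` tags and the exact placement of the break are assumptions made for the example (the actual markup used by DJS is not shown on this page), and plain substring search stands in for whatever matching Arun's script performs.

```python
def find_entities(xml_fragment, entity_list):
    """Return the entities whose exact string occurs in the raw XML fragment."""
    return [name for name in entity_list if name in xml_fragment]


entities = ["Cbr-dpy-1"]

# Hypothetical markup: the whole name sits inside one formatting span.
intact = "<i>Cbr-dpy-1</i>"
# Hypothetical markup: the name is split across two spans, so the tags
# become part of the string between "Cbr-" and "dpy-1".
split = "<i>Cbr-</i><i>dpy-1</i>"

print(find_entities(intact, entities))
print(find_entities(split, entities))
```

The first fragment yields a match and the second does not, even though both render identically to a reader, which is why the formatting has to be corrected at the source.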

Entity list sources and example URL

To see the actual entities, go to http://dev.textpresso.org/gsa/known_entities/

Gene -> ace server (more current than dev) ; Gene Summary page URL uses public gene name, which is redirected to WBGeneID
<http://www.wormbase.org/db/gene/gene?name=mpk-1;class=Gene>
<http://www.wormbase.org/db/gene/gene?name=F43C1.1;class=Gene>

Protein name -> ace server ; Gene Summary page:
<http://www.wormbase.org/db/gene/gene?name=egl-4;class=Gene>

Allele/SNP -> wscurrent.obo/ latest dev WS ; Variation Report page
<http://www.wormbase.org/db/gene/variation?name=ga117>
<http://www.wormbase.org/db/gene/variation?name=hw42941>

Transgene -> objects that only exist in WB, wscurrent.obo/latest dev WS ; Transgene Summary page
<http://www.wormbase.org/db/gene/transgene?name=sEx14536;class=Transgene>

Rearrangement -> wscurrent.obo/latest dev WS

Strain name -> Limited to strains that are in WB ; Strain Report page
<http://www.wormbase.org/db/gene/strain?name=MH37;class=Strain>

Clones -> cosmid, cDNA clone, ORFeome clone ; Sequence Summary page
<http://www.wormbase.org/db/seq/sequence?name=yk1106g06.3;class=Sequence>
<http://www.wormbase.org/db/seq/sequence?name=OSTR153G5_1;class=Sequence>

Anatomy/cell/tissue/Cell lineage -> Raymond will provide list ; Summary of anatomy ontology term
<http://www.wormbase.org/db/ontology/anatomy?name=HSN>
<http://www.wormbase.org/db/ontology/anatomy?name=Z1.ppp>

Author -> on the fly through Postgres (more current than dev); Person Report page
<http://www.wormbase.org/db/misc/person?name=WBPerson625;class=Person>
If possible, we will attempt to use a query URL that links the person to an author page when there is no person page.
Example: <http://www.wormbase.org/lookup.cgi?paper=WBPaper00000003&author=Bob+Horvitz>
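The URL patterns listed above can be collected into a small lookup table. This is a sketch only: the patterns are copied from the examples on this page, but the dictionary, the function name, and the use of URL-quoting are assumptions rather than a description of Arun's script.

```python
from urllib.parse import quote

BASE = "http://www.wormbase.org"

# Path templates taken from the example URLs on this page.
URL_PATTERNS = {
    "Gene": "/db/gene/gene?name={name};class=Gene",
    "Variation": "/db/gene/variation?name={name}",
    "Transgene": "/db/gene/transgene?name={name};class=Transgene",
    "Strain": "/db/gene/strain?name={name};class=Strain",
    "Sequence": "/db/seq/sequence?name={name};class=Sequence",
    "Anatomy": "/db/ontology/anatomy?name={name}",
    "Person": "/db/misc/person?name={name};class=Person",
}


def wormbase_url(entity_class, name):
    """Build the WB page URL for an entity; raises KeyError for an unknown class."""
    template = URL_PATTERNS[entity_class]
    # Quoting is an assumption; it leaves names such as mpk-1 or F43C1.1 unchanged.
    return BASE + template.format(name=quote(name, safe=""))


print(wormbase_url("Gene", "mpk-1"))
print(wormbase_url("Variation", "ga117"))
```

Keeping the patterns in one table means that a change to a page URL on the WB side needs to be made in only one place.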

People

Caltech: Arun Rangarajan, Karen Yook, Juancarlos Chan, Paul Sternberg, Hans-Michael Müller
GSA: Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson
DJS (Dartmouth Journal Services): Stephen Haenel, Sharon Faelten, Lolly Otis
