Difference between revisions of "GSA Markup SOP"

From WormBaseWiki

Revision as of 15:45, 16 July 2013

Caltech_documentation

GSA markup


SOP for linking

Once approved for publication, Genetics enters the paper doi into Juancarlos's WBPaperID ticketer to retrieve a WBPaperID. http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_paper_ticket.cgi
(Where is the ticketer script?)

Kimberly: Does this step always happen? That's the part I wasn't clear on. It looks like sometimes this might not happen?

  • This ticketer captures the doi of the article; the last 6 digits are used as an identifier throughout the linking and QC process.
  • The doi is also used to identify the paper during the normal pubmed paper pipeline.
  • The assigned WBPaperID is used to track and map the journal first pass form through the first pass process.
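
As a rough illustration of the "last 6 digits" identifier, here is a minimal Python sketch. It assumes the identifier is the final six digit characters of the doi with any punctuation ignored; the function name `doi_suffix` and the example doi are hypothetical, and this is not Juancarlos's ticketer code.

```python
import re


def doi_suffix(doi: str, width: int = 6) -> str:
    """Return the last `width` digits of a doi, ignoring non-digit characters.

    Assumption: the tracking identifier is built from digits only.
    """
    digits = re.sub(r"\D", "", doi)
    if len(digits) < width:
        raise ValueError(f"doi {doi!r} has fewer than {width} digits")
    return digits[-width:]


if __name__ == "__main__":
    # Hypothetical doi, used only for illustration.
    print(doi_suffix("10.1534/genetics.111.128421"))
```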

At this time the paper has been entered into our local database (postgres) as a doi object only. There is no other information about the paper in postgres. Information about the paper (title, authors, abstract, PMID, etc.) comes in when the paper comes through the normal pubmed pipeline.

When Genetics enters the doi and a WBPaperID is assigned, an alert is sent to me and to Kimberly.
The paper should show up in Pubmed within a couple of days, depending on when the doi was entered and where the paper is in the processing stage. Juancarlos' script (is this part of the paper pipeline?) identifies the doi as it comes through pubmed so that the paper is not assigned a second WBPaperID, which would duplicate the paper in postgres.
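
The duplicate check described above could be sketched as follows. This is only an illustration under stated assumptions: an in-memory dict stands in for the postgres table, the function `assign_paper_id` is hypothetical, and new IDs are assumed to follow the `WBPaper` plus eight-digit pattern seen elsewhere on this page.

```python
def assign_paper_id(doi, registry, next_number):
    """Return (paper_id, next_number), reusing the existing ID for a known doi.

    `registry` maps normalized doi -> WBPaperID and stands in for postgres.
    """
    key = doi.strip().lower()  # dois are compared case-insensitively here
    if key in registry:
        # The doi was already ticketed, so no second WBPaperID is created.
        return registry[key], next_number
    paper_id = f"WBPaper{next_number:08d}"
    registry[key] = paper_id
    return paper_id, next_number + 1


if __name__ == "__main__":
    registry = {}
    # Hypothetical doi and starting number, for illustration only.
    first, counter = assign_paper_id("10.1534/genetics.111.128421", registry, 38245)
    again, counter = assign_paper_id("10.1534/GENETICS.111.128421", registry, counter)
    print(first, again, counter)
```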

Genetics sends a notice of approval to the author along with a link to a journal author first pass form.
A link to the journal author form for each paper is available, in line, at http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi. This form differs from the standard author first pass form that is sent to authors of elegans publications in other journals. Although the data types on both forms are the same, the journal author form highlights a select few data types that the authors should fill out for us, primarily those that we provide links to, with an emphasis on asking the authors to tell us about every new object that is in their paper and does not yet exist in WB.

Journal author data entered into these select fields are added automatically to Arun's entity lists.

ANYTHING IN THE PAPER THAT HAS A MATCH ON AN ENTITY LIST, AND HAS NOT BEEN PUT ON AN EXCLUSION LIST, WILL BE MARKED UP; therefore submitted data need to be manually checked by a curator.
Arun's scripts perform exact string matching (and formatting cue recognition). If data are not entered correctly, the object may not be recognized in the paper, and extraneous information will be added to the entity list and used to mark up objects in the paper. When an author submits data through the journal author first pass form, an e-mail alert is sent to me. I check the entered data to make sure they have been entered in a useful format.
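
To make the consequence of exact string matching concrete, here is a minimal Python sketch of matching an entity list against text while honoring an exclusion list. It is not Arun's script: the function `find_entities`, the boundary rule (a match may not touch a letter, digit, underscore, or hyphen), and the sample sentence are all assumptions made for illustration.

```python
import re


def find_entities(text, entity_list, exclusion_list=()):
    """Return (start, end, name) for exact, case-sensitive entity matches.

    Names on the exclusion list are never matched.
    """
    excluded = set(exclusion_list)
    # Longest names first so that a longer entity wins over its own prefix.
    names = sorted(
        (n for n in set(entity_list) if n and n not in excluded),
        key=len,
        reverse=True,
    )
    if not names:
        return []
    alternatives = "|".join(re.escape(n) for n in names)
    pattern = re.compile(r"(?<![\w-])(?:" + alternatives + r")(?![\w-])")
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sentence = "The mpk-1 head neurons express unc-119 but not unc-1190."
    print(find_entities(sentence, ["mpk-1", "unc-119", "head"], ["head"]))
```

Anything an author types into an entity field becomes one of the names in `entity_list`, which is why a stray phrase in the form can produce unwanted links.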

All author-submitted data can be accessed on tazendra:

http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi
choose "textpresso" from the drop down menu in the upper left corner of the page
click "Show Data"
data are listed by paper and data type


Data need to be comma-separated and in the standard format for each data type, with all extraneous information hidden (see below).

  • Gene (genesymbol)
    • public name, CDS
    • unapproved gene names need to be added to the entity list, but Arun needs to be told to link them to a known gene ID. Do we need a temporary entity list for this data type?
  • Transgene
    • in integrated transgene nomenclature format only
    • no unapproved 'Ex' objects
    • no genomic expressions; entities within transgene/construct expressions are omitted from linking at the script level, e.g. in the expression 'sur-5::GFP', 'sur-5' will not be linked, as 'sur-5::' is recognized by Arun's script and processed as an entity to omit from linking.
  • Antibody
    • only protein name (link will be to gene page)
  • Alleles (extvariation)
    • currently only elegans alleles; priority on curating alleles of other species is very low
  • Strains (newstrains)
    • currently only marking up known C. elegans strains, mainly available through the CGC; priority on curating other species is very low
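
The construct rule in the Transgene item above ('sur-5' is not linked inside 'sur-5::GFP') can be sketched in Python as below. This is an assumption-laden illustration rather than Arun's implementation: it treats any run of tokens joined by "::" as one construct expression, and the helper names `construct_spans` and `linkable` are hypothetical.

```python
import re

# Tokens joined by "::" are treated as a single construct expression.
CONSTRUCT = re.compile(r"[^\s,;():]+(?:::[^\s,;():]+)+")


def construct_spans(text):
    """Return (start, end) offsets of every construct expression in `text`."""
    return [m.span() for m in CONSTRUCT.finditer(text)]


def linkable(text, start, end):
    """True when the span [start, end) lies outside every construct expression."""
    return not any(s <= start and end <= e for s, e in construct_spans(text))


if __name__ == "__main__":
    sentence = "sur-5::GFP was crossed into sur-5 mutants"
    print(linkable(sentence, 0, 5))  # 'sur-5' inside the construct
    second = sentence.rindex("sur-5")
    print(linkable(sentence, second, second + 5))  # free-standing 'sur-5'
```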


Changes to the author data are made in the journal author first pass form, available through the paper-associated link on http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi.
To hide extraneous information, preface it with two tildes; that is, reorganize the data so that objects that need to be added to the entity lists are comma-separated and all other information is listed after "~~".

change
"ZK1128.2, mett-10, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."
to
"ZK1128.2, mett-10~~, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."

In the first case every word would have been added to the gene entity list (as it was entered in the genesymbol data field). The "~~" hides everything that comes after it so that only the gene names are added (although in this case ZK1128.2 does not necessarily need to be added, as it should already be on the gene list).
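
The effect of the "~~" marker can be sketched in a few lines of Python. This is an illustration under assumptions, not the production parser: the function name `visible_entities` is hypothetical, and the sketch splits the visible part on commas only, so the unmarked field yields phrases rather than single words.

```python
def visible_entities(field_value, hide_marker="~~"):
    """Return the comma-separated entries that appear before the hide marker."""
    visible = field_value.split(hide_marker, 1)[0]
    return [item.strip() for item in visible.split(",") if item.strip()]


if __name__ == "__main__":
    before = ("ZK1128.2, mett-10, which is orthologous to the human "
              "methyltransferase-10 domain containing protein (METT10D), "
              "and corresponds to ZK1128.2.")
    after = ("ZK1128.2, mett-10~~, which is orthologous to the human "
             "methyltransferase-10 domain containing protein (METT10D), "
             "and corresponds to ZK1128.2.")
    print(visible_entities(before))
    print(visible_entities(after))
```

With the marker in place only the two gene names survive; without it the descriptive text leaks into the entity list.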

Entity linking scripts are run

Linking script pipeline
The XML is sent via e-mail to Arun, and Arun runs his scripts for linking entities to WB pages.
Ideally the author has entered the data and it is sitting correctly in our database before we receive the XML from DJS, but usually authors do not submit any data. Karen therefore goes through the paper after the linking script is run and identifies any new objects.

Arun and I go through the marked-up XML to make sure the links are correct and no objects are missed.
All objects that were missed by the linking scripts are added to the journal author first pass form by me (which once submitted also alerts the appropriate curator to new objects). It takes 20-40 minutes to manually go through a paper with a lot of genes.

Some of the objects (genes and persons) are mined from servers/local databases rather than from a WB release so that they are as up-to-date as possible; even if these objects are not yet on the website, they will be recognized. As stated above, objects that don't exist on any of the lists will be added to Arun's object lists through the author first pass form, so in practice curators may receive objects from the author first pass form from the author, from me, or from both. See elsewhere on this wiki for discussion about pre-WB approved object names.

After correcting the entity lists, we run the paper through again before a second in-house proof, then send it back to DJS.
We are given 48 hours to mark up and return the XML document from Genetics. If it arrives on a Friday, we have the weekend.

Authors are sent the marked up doc for proofreading
Authors get back to DJS about errors; we are consulted about the problems and recommend an action (remove the link, or send a corrected URL).

Linking/Proofing Issues

Entities don't exist in WB yet

New entities being described for the first time can still be linked to pages in WB, which will remain silent until the objects are created (and will show up on the public site about 2 months after the paper is received). These new objects need to be added to Arun's lists via the journal author first pass form. Authors are asked to fill out this form when they are notified of the acceptance of their paper. If they submit the form right away, the objects are added to Arun's list before we receive the XML, and the linking should identify all new objects for which we create links. One problem is that authors sometimes enter information on the form in such a way that the scripts cannot read it. A curator must visually check the data from the author form to make sure the information was entered properly. I am alerted when an author submits a form, and the alert includes a summary of the information, which tells me whether things were entered properly; if they were not, I need to go back and fix the entries. Bad entries are blocked from being read by Arun's scripts by placing two tildes ("~~") before the entry.

Gene names don't exist in WB

Gene names need to be approved by our nomenclature curators, Jonathan Hodgkin and Mary Ann Tuli. In most cases, an author has already contacted Jonathan and Mary Ann, and the gene name has not only been approved but already exists on the ace server, so it has already been added to the gene list. Authors are asked to declare these new gene names before the linking scripts are run, but authors do not always follow through. One action is to link the name to the official WBGene sequence ID. This requires that the authors have explicitly stated what the gene sequence is. Linking the name to the gene should not affect what happens in acedb; the link lives within the online paper alone. If an author is not explicit, the best option is for us to contact them; if we do not get a response within a reasonable time, no link will be created. We are still working on a suitably quick SOP for these cases.

For example, a paper has a couple of gene names, "cbr-dpy-1" and "cpy-4", both referring to the same briggsae sequence, but neither name existed in acedb. In this case, make a link to the WBGene page rather than creating a link with cbr-dpy-1 or cpy-4 as the public name, as is done for the other links.

Formatting problems break recognition

Entity recognition relies on exact string matches and XML formatting cues. If entities are not formatted correctly in the XML, they may not be recognized and linked. So if you see an unlinked entity, check whether the formatting has been altered; if so, a message needs to be sent to Genetics and DJS to correct the formatting at the source, as we do not want to change the XML document. You can still finish checking the document for new objects and adding them to the lists in the meantime.

  • Case (a): <i>Cbr-dpy-1</i> instead of Cbr-<i>dpy-1</i>
Otherwise the <i> tag becomes part of the string and string matching fails on Cbr-dpy-1, even if we have it in the wormbase entity list.
  • Case (b): <i>unc-119</i> instead of <i>unc</i>-<i>119</i>
  • Case (c): Parentheses shouldn't be italicized since they are used as an indicator of string starts and stops for entity recognition.
<i>cbr-unc-119</i>(<i>st20000</i>) instead of <i>cbr-unc-119(st20000)</i>

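The three formatting cases above could be screened for with a small Python check like the one below. It is a rough sketch, not part of the linking pipeline: the patterns are assumptions (for example, that a species prefix is three letters followed by a hyphen), the name `formatting_problems` is hypothetical, and the check only reports problems, since the XML itself should be corrected at the source.

```python
import re

# Each pattern flags one of the formatting cases (a), (b), (c).
CHECKS = {
    "prefix outside italics": re.compile(r"\b[A-Za-z]{3}-<i>"),
    "gene name split across italic tags": re.compile(r"</i>-<i>"),
    "parenthesis inside italics": re.compile(r"<i>[^<]*[()][^<]*</i>"),
}


def formatting_problems(xml_fragment):
    """Return the labels of all checks that the fragment trips."""
    return [label for label, pattern in CHECKS.items()
            if pattern.search(xml_fragment)]


if __name__ == "__main__":
    for fragment in (
        "Cbr-<i>dpy-1</i>",
        "<i>unc</i>-<i>119</i>",
        "<i>cbr-unc-119(st20000)</i>",
        "<i>cbr-unc-119</i>(<i>st20000</i>)",
    ):
        print(fragment, formatting_problems(fragment))
```
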
combined entity expressions

  • double mutants, double alleles
  • dpy-11/12

Project issues

Ambiguous name lists

We are still working out the details for this issue
From a meeting on 2/18/10: we are moving forward on dealing with ambiguous entity names, that is, names in a paper that may be used to describe multiple objects and may result in the creation of a faulty link. This is a problem within a single-species paper as well as one we will need to address in order to expand the project to other species and to multi-species papers.

  • Intraspecies issue: This problem is pertinent to entity lists within a single species paper. We have run into this problem with clone names and strain names being identical (for example) and not being able to resolve this discrepancy at the script level. Our initial response to these ambiguous names has been to put these names on an exclusion list.
  • Interspecies issue: This is particularly pertinent when dealing with a paper that contains gene names from different species. We are collecting gene names, including aliases, from SGD and FlyBase, and will compare them with our own gene lists, including all synonyms. We will expand the list comparison to mouse gene names. Arun will be making a splash page on textpresso dev to house all these ambiguous names, and to direct readers to if a proper URL cannot be resolved.
    • General anatomy terms can be linked to the wrong species

Problem: in this C. elegans paper, 'epithelium' was linked to the elegans epithelium page
"T cells migrate toward (Fong et al. 2002). Additionally, loss of GRK3, which is highly expressed in mouse olfactory epithelium (Schleicher et al. 1993), "
Action: "epithelium" was added to the anatomy exclusion list. From this paper, we also added 'head', 'tail', 'neuron', and 'sensory neuron' to the exclusion list.
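
One way to find candidates for an ambiguous-name list is to compare the entity lists against each other and report every name that appears in more than one. The Python sketch below illustrates that comparison only; the function `ambiguous_names`, the list labels, and the shared name "XY123" are hypothetical, and real lists would come from the entity sources described on this page.

```python
from collections import defaultdict


def ambiguous_names(entity_lists):
    """Map each name found in more than one list to the sorted list labels."""
    owners = defaultdict(set)
    for list_name, names in entity_lists.items():
        for name in names:
            owners[name].add(list_name)
    return {name: sorted(found_in)
            for name, found_in in owners.items()
            if len(found_in) > 1}


if __name__ == "__main__":
    lists = {
        "clone": {"XY123", "yk1106g06.3"},   # "XY123" is a made-up name
        "strain": {"XY123", "MH37"},
    }
    print(ambiguous_names(lists))
```

The same comparison applies to the interspecies case by passing one gene list per organism instead of one list per object type.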

Longevity of links

We are still working out the details for this issue
One major concern is how long the links made to the article will be stable. While this project is a collaboration with GENETICS and WormBase, we can control the stability of the links for as long as our funding lasts. Once the project expands to other MODs, each MOD will be responsible for the stability of their own URLs, compounding URL stability issues for any paper that contains multi-mod URLs.

At WormBase we will be addressing the stability of our links by collecting the URLs we use/create and periodically running a link checker to make sure the links are still alive. The link checker will point out any links that have gone dead; however, it will not check if the page is relevant. We will be working out the specifics of this aspect of the project.

All the linked entities in a paper, along with their link status at the time of sending the linked XML back to DJS, are available here

The script that forms the above can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.
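
A periodic link check of the kind described here could look like the following Python sketch. It is not the script referred to above: the function names are hypothetical, a link is counted as dead whenever the final HTTP status is not 200, and, as noted earlier, a live status says nothing about whether the page is still relevant. The status lookup is passed in as a parameter so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the final HTTP status code for `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # server answered with an error status
    except OSError:
        return None              # DNS failure, refused connection, timeout


def dead_links(urls, get_status=http_status):
    """Return the URLs whose status is anything other than 200."""
    return [url for url in urls if get_status(url) != 200]


if __name__ == "__main__":
    # Canned statuses stand in for real requests in this demonstration.
    canned = {"http://example.org/ok": 200, "http://example.org/gone": 404}
    print(dead_links(canned, get_status=canned.get))
```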

Entity list sources and URLs

to see the entity lists go to http://textpresso-dev.caltech.edu/gsa/worm/known_entities/

Entity Source

  • /home3/acedb/ws/acedb -> latest database build- this build will show up on the live WB site within 1-2 months
  • ace server at cshl ->
  • postgres -> this information is as current as could be, but won't show up on the live site for at least 2 months.

URL constructors

http://www.wormbase.org/wiki/index.php/Linking_To_WormBase

http://www.wormbase.org/db/get?name=X;class=Y

Currently we are forming URLs based on public names versus object class ID's. These links are redirected to the correct page, which is the URL based on object ID. By using the public name for the linking instead of the ID, we can avoid problems that may result in links dying due to objects getting merged or killed.
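
The name-based URL pattern above can be assembled with a short Python helper. This is a minimal sketch: the function name `wormbase_url` is hypothetical, and percent-encoding the name and class is an added precaution that leaves ordinary public names such as mpk-1 or F43C1.1 unchanged.

```python
from urllib.parse import quote

BASE = "http://www.wormbase.org/db/get"


def wormbase_url(name, object_class):
    """Build a name-based URL of the form .../db/get?name=X;class=Y."""
    return (f"{BASE}?name={quote(name, safe='')}"
            f";class={quote(object_class, safe='')}")


if __name__ == "__main__":
    print(wormbase_url("mpk-1", "Gene"))
    print(wormbase_url("ga117", "Variation"))
```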

Entities

Object -> source ; web page link
<example object pages>

Gene -> ace server (more current than dev) ; Gene Summary page URL uses public gene name, which is redirected to WBGeneID
<http://www.wormbase.org/db/get?name=mpk-1;class=Gene>
<http://www.wormbase.org/db/get?name=F43C1.1;class=Gene>

Protein -> ace server ; Gene Summary page:
<http://www.wormbase.org/db/get?name=egl-4;class=Gene>

Allele/SNP -> acedb database on tazendra at /home3/acedb/ws/acedb ; Variation Report page
<http://www.wormbase.org/db/get?name=ga117;class=Variation>
<http://www.wormbase.org/db/get?name=hw42941;class=Variation>

Transgene (inserted transgenes only) -> acedb database on tazendra at /home3/acedb/ws/acedb ; Transgene Summary page
<http://www.wormbase.org/db/get?name=oxIs12;class=Transgene>

Rearrangement -> acedb database on tazendra at /home3/acedb/ws/acedb

Strain -> acedb database on tazendra at /home3/acedb/ws/acedb; Strain Report page
<http://www.wormbase.org/db/get?name=MH37;class=Strain>

Clone -> acedb database on tazendra at /home3/acedb/ws/acedb; Sequence Summary page
<http://www.wormbase.org/db/get?name=yk1106g06.3;class=Sequence>
<http://www.wormbase.org/db/get?name=OSTR153G5_1;class=Sequence>

suppress for now Anatomy/cell -> acedb database on tazendra at /home3/acedb/ws/acedb with corrections by Raymond ; Summary of anatomy ontology term
<http://www.wormbase.org/db/get?name=HSN;class=Anatomy_term>
<http://www.wormbase.org/db/get?name=Z1.ppp;class=Anatomy_term>

suppress for now Author -> The links are formed using the URL pattern
http://www.wormbase.org/db/misc/person?name=$url_encoded_name;paper=$wbpaper_id,
where $url_encoded_name is the full name of an author with the following modifications:
i) periods after middle names removed and
ii) spaces converted to %20, which is the URL-encoded equivalent of a space.

Examples:
Name: Eve W. L. Chow
Link: http://www.wormbase.org/db/misc/person?name=Eve%20W%20L%20Chow;paper=WBPaper00038245

Name: I. Russel Lee
Link: http://www.wormbase.org/db/misc/person?name=I%20Russel%20Lee;paper=WBPaper00038245

(There may also be special characters with accents, umlauts, etc. The best thing to do is to check WormBase and use the correct link if the author name already exists. Otherwise, the English equivalent of the special character can be determined visually, a new author link formed, and Cecilia notified.)
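
The two name modifications and the URL pattern for author links can be sketched in Python as follows. The function name `author_url` is hypothetical, and the sketch removes every period in the name, which is what both examples above require. Names with accented characters are deliberately left out of scope, since those are handled by hand as described above.

```python
from urllib.parse import quote


def author_url(full_name, wbpaper_id):
    """Build an author link: periods removed, spaces encoded as %20."""
    cleaned = " ".join(full_name.replace(".", "").split())
    return ("http://www.wormbase.org/db/misc/person"
            f"?name={quote(cleaned, safe='')};paper={wbpaper_id}")


if __name__ == "__main__":
    print(author_url("Eve W. L. Chow", "WBPaper00038245"))
    print(author_url("I. Russel Lee", "WBPaper00038245"))
```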

--kjy 22:04, 19 April 2012 (UTC)

People

Caltech: Arun Rangarajan, Karen Yook, Juancarlos Chan, Paul Sternberg, Hans-Michael Müller, Chris Grove, Daniela Raciti
GSA: Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson
DJS (Dartmouth Journal Services): Stephen Haenel, Sharon Faelten, Lolly Otis