GSA Markup SOP
Visual overviews of linking pipeline
Marked-up papers
GENETICS Editors' duties
Retrieving/Assigning a WBPaperID
- GSA editors enter the approved paper's DOI into the WBPaperID ticketer form.
- The ticketer captures the last six digits of the article's DOI, which are used as an identifier throughout the linking and QC process.
- The ticketer assigns an official WBPaperID, which is needed to track the paper through WB curation paths.
- Retrieve the link to a Journal First Pass form (JFP) from the ticketer. Note: GSA editors sometimes miss this step, which causes confusion when the paper comes through the normal WB paper pipeline.
- Editors send a notice of approval to the author along with the link to the JFP.
- GSA editors control the message to the authors; to request changes to the message, contact the GSA editors. The message reads as follows:
Dear Dr. XXXX:
DOI: 10.1534/genetics.11X.XXXXXX
WBPaper: X
The Genetics Society of America is working with textpresso (www.textpresso.org) and WormBase (www.wormbase.org) to create links between genetic and genomic objects in your article to the appropriate page in WormBase. These links will be included in both the online full text and PDF versions of the published article.
If you want any genes, alleles, transgenes, CGC-destined strains, anatomy terms, etc., previously known or new/described for the first time in your article to be linked to WormBase please enter the names of these objects using the form at the following link:
http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_first_pass.cgi?action=Curate&paper=X&passwd=X.X
Because your submitted data will be processed automatically, please follow the examples carefully. If you have questions, or want to upload a file instead, contact Karen Yook, Wormbase Curator, at karen@wormbase.org.
Thank you.
XXXX XXXX
Editorial Assistant
Mappings between DOIs and WBPaperIDs can be found here.
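The DOI handling described above can be illustrated with a minimal sketch. It is not the ticketer's actual code: the function name `short_id_from_doi` and the regular expression are assumptions, based only on the DOI shape shown in this SOP (for example 10.1534/genetics.113.157685). The `g3` alternative in the pattern is also an assumption about how G3 DOIs are written.

```python
import re

# Assumed DOI shape: 10.1534/<journal>.<number>.<six digits>
# Only the "genetics" form appears in this SOP; "g3" is an assumption.
DOI_PATTERN = re.compile(r"^10\.1534/(genetics|g3)\.\d+\.(\d{6})$")


def short_id_from_doi(doi):
    """Return the last six digits of a full DOI, or raise ValueError."""
    match = DOI_PATTERN.match(doi.strip())
    if match is None:
        # Partial DOIs (e.g. just the six digits) are rejected,
        # because the ticketer expects the full DOI expression.
        raise ValueError(f"not a full DOI: {doi!r}")
    return match.group(2)


print(short_id_from_doi("10.1534/genetics.113.157685"))
```

Rejecting partial DOIs up front mirrors the requirement that editors enter the full DOI expression rather than a fragment.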
WBPaperID ticketer
- GSA editors need to enter the paper into the Journal_paper_ticketer to retrieve a link to the Journal First Pass form.
- The full DOI must be entered into the Journal_paper_ticketer, e.g., 10.1534/genetics.113.157685. The link for each paper is accessible on the Journal Data Display; see below.
- The ticketer enters the paper into postgres as a doi object only. All information about the paper (title, authors, abstract, PMID, etc.) is entered when the paper comes through the normal paper pipeline.
- The ticketer sends an alert to the QCFast curator and WB Paper curator (Kimberly) that a WBPaperID was assigned.
- The paper should show up in PubMed within a few days to a few weeks, depending on when the DOI was entered and where the paper is in the processing stage.
- The WBPaper curator (Kimberly) uses the DOI to manually identify the paper during the normal paper pipeline. In the normal paper pipeline, a cron job fetches PMIDs from PubMed. Note that the cron job does not look for DOIs; it looks only for PMIDs. When a new Genetics or G3 paper shows up in the paper editor, Kimberly manually looks for the corresponding DOI to confirm that the paper has already been entered, and then manually adds the PMID. Once the PMID is added, Kimberly approves the paper through the regular paper pipeline.
Journal author first pass form (JFP)
- An alert is sent to the QCFast curator when an author submits data through the JFP. The QCFast curator must check that the data have been entered in a usable format (see below). Alerts (flags) are also sent to the appropriate WB curators to notify them of new objects.
- The JFP differs from the author first pass form sent to authors of publications in other journals. Although all the data types on the form are the same, the JFP highlights the data types that are linked and asks authors to declare new objects that do not yet exist in WB.
- Journal author data entered into linked classes of the form are added automatically to the corresponding class entity list.
- Because the linking scripts perform exact string matching (and formatting cue recognition), data entered incorrectly into the JFP cause two problems:
- the object may not be recognized in the paper
- extraneous information will be added to the entity list and used to mark up objects in the paper
- The QCFast curator needs to check the JFP entries for a given paper to make sure everything is entered correctly, or to silence strings not meant to be marked up. Strings are silenced by placing two tildes, "~~", before any text not meant to be added to an entity list.
For example
"ZK1128.2, mett-10, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."
to
"ZK1128.2, mett-10~~, which is orthologous to the human methyltransferase-10 domain containing protein (METT10D), and corresponds to ZK1128.2."
In the first case, every word would have been added to the gene entity list, because the text was entered in the genesymbol data field. The "~~" hides everything that comes after it, so only the gene names are added (although in this case ZK1128.2 does not necessarily need to be added, as it should already be on the gene list).
- To see author-submitted data for all JFPs:
- access the JFP index here
- choose "textpresso" from the drop-down menu in the upper left corner of the page
- click "Show Data"
- data are listed by paper and data type
- Format for data entered through the form:
- Data need to be comma-separated and in the standard format for each data type, with all extraneous information hidden (see below).
- Gene
- public_name
- CDS
- unapproved gene names need to be cleared by genenames@wormbase
- Transgene
- integrated transgene nomenclature format only
- 'Ex' objects that exist in WormBase
- entities within transgenes/construct expressions are linked, e.g. in the expression 'sur-5::GFP', 'sur-5' is linked.
- Antibody - no longer linked
- Proteins - link will be to gene page
- Alleles
- currently only C. elegans alleles; priority on curating alleles of other species is very low
- Strains (only those accepted by the CGC for deposit into the permanent collection)
- currently only marking up known C. elegans strains, mainly available through the CGC; priority on curating other species is very low
- Changes to the form need to go through Juancarlos
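The "~~" silencing convention and the comma-separated format can be illustrated with a short sketch. This is not the pipeline's actual parser: the function name `visible_entities` is hypothetical, and the sketch assumes that everything after the first "~~" in a field is hidden and that the remaining text is split on commas.

```python
SILENCE_MARKER = "~~"


def visible_entities(field_text):
    """Split one JFP field into entity names, ignoring silenced text."""
    # Assumption: everything after the first "~~" is hidden.
    visible = field_text.split(SILENCE_MARKER, 1)[0]
    # Entries are comma-separated; strip whitespace and drop empties.
    return [item.strip() for item in visible.split(",") if item.strip()]


entry = (
    "ZK1128.2, mett-10~~, which is orthologous to the human "
    "methyltransferase-10 domain containing protein (METT10D), "
    "and corresponds to ZK1128.2."
)
print(visible_entities(entry))
```

Without the marker, the same function would also return the trailing free text as entity names, which is the kind of extraneous entry the QCFast curator is checking for.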
Journal Data Display
This site displays information about ticketed papers
http://tazendra.caltech.edu/~postgres/cgi-bin/author_fp_display.cgi?afp_jfp=jfp
columns include:
- DOI - entered through Journal_paper_ticketer
- WBPaper - auto-generated by the ticketer; links to the paper display, e.g. http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/paper_display.cgi?data_number=00050589&action=Search+%21
- data link - links to the JFP; 'link' signifies that data were entered, 'no data' indicates that the author didn't submit any information
- textpresso html - links to the HTML view of the paper in the species-specific html directory
- proofs - links to paper proof uploaded manually to species proof directory on textpresso-dev
- to upload proofs:
Download proofs from DJS into a local directory 'proofs', then copy them to textpresso-dev:
>scp -r ./proofs kyook@textpresso-dev.caltech.edu:/home/kyook
>ssh kyook@textpresso-dev.caltech.edu
>sudo cp ./proofs/* /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/
Make all PDFs readable:
>sudo chmod a+r -R /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/proofs/
Proofs should then be available on the Journal Data Display.
- pdf link
- temp pdf
- go link
Linking steps
Visual overviews of linking pipeline
Linking scripts
Markup excerpts
Dartmouth Journal Services (DJS), the GSA journal publisher, uploads the file as NLM-formatted XML, by FTP, to the textpresso server.
- DJS requests 48 hours to receive back the linked file.
- The QCFast Curator receives an e-mail alert that a paper has been uploaded and to expect the linked paper soon.
- The paper goes through the linking script and the create_entity_table script; URLs to the linked HTML and to the entity list table are e-mailed to the curator.
- When there are no problems with the file and linking, the curator can expect the follow-up e-mail with the links no later than an hour after the initial e-mail. If there is a problem with the file, the process stops until the QCFast developer can fix the pipeline.
- N.B. There is a bug in the linking script that stalls the process on certain characters. The bug occurs often enough to be noticeable, but not often enough to warrant a full debugging effort.
The QCFast curator goes through the marked-up html view to make sure the links are correct and no objects are missed.
- The QCFast curator can add or remove links through the QCFast interface.
- If there are multiple objects not linked, the QCFast curator can opt to add all missing objects to the JFP for that paper and rerun the linking script (WB interface only).
- If links cannot be made, e.g., due to XML formatting, the QCFast curator can download the file and edit the links manually. The file name must remain the same as that of the original file; otherwise, uploading the manually edited file will result in an error.
The final edited linked file is uploaded as HTML, by FTP, to the DJS server.
- A final e-mail is sent to the QCFast curator that confirms the upload.
- Uploading the file will also launch a script that creates a final entity link table.
- This final e-mail should be received within an hour of the upload.
See Entity classes and problems for linking issues and how to deal with them.
ftp to DJS
05ftpAndEmailDjs.pl
The QCFast interface for every MOD has a button to FTP the edited file to DJS. This button launches the 05ftpAndEmailDjs.pl script, which:
- creates a final entity table in /data1/Users/arunr/gsa/worm/entity_link_tables
- makes a log entry in /data1/Users/arunr/gsa/worm/logs
- deposits the linked file in /data1/Users/arunr/gsa/worm/linked_xml
- creates a record in /data1/Users/arunr/gsa/worm/done
- e-mails a link to the final entity_link_table to the MOD curator
- e-mails DJS an alert that the file is on the FTP server
To re-FTP a file, both the files created by the script and the file deposited on the FTP server need to be removed first. The DJS server is password protected by DJS; an account and access to the server must be assigned by them.
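Before re-FTPing, it can help to list what the script left behind for a given paper. The sketch below only lists candidate files and deletes nothing. It is hypothetical: the function name `files_to_clear` is invented here, and it assumes that each output is a separate file whose name contains the paper's six-digit identifier. If a log entry is instead a line inside a shared log file, this sketch will not find it. The file on the DJS FTP server is outside its scope.

```python
from pathlib import Path

# Output directories named in this SOP, relative to the MOD base
# directory (e.g. /data1/Users/arunr/gsa/worm).
OUTPUT_DIRS = ("entity_link_tables", "logs", "linked_xml", "done")


def files_to_clear(base_dir, paper_id):
    """List files under the output directories whose names contain paper_id.

    Assumption: file names include the six-digit paper identifier.
    This is a dry run; nothing is removed.
    """
    base = Path(base_dir)
    matches = []
    for name in OUTPUT_DIRS:
        directory = base / name
        if directory.is_dir():
            matches.extend(
                sorted(p for p in directory.iterdir() if paper_id in p.name)
            )
    return matches
```

For example, `files_to_clear("/data1/Users/arunr/gsa/worm", "179705")` would return the matching paths for review before anything is removed by hand.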
Proofreading
- DJS sends WormBase the PDF-formatted proof before sending the article back to the author.
- DJS requests a 24-hour turnaround.
- The QCFast curator relays all suggested changes to DJS, as the curator cannot make any changes to the document itself.
- DJS sends the proof to the author.
- If the author gets back to DJS about errors or suggested links, DJS consults the curator about the feedback and the curator recommends an action (remove the link or send a corrected URL).
Project issues
Ambiguous names
This problem refers to links being made to the wrong entity because a name is used for multiple objects within a species or across species.
- Intraspecies issue: In WB some clone names and strain names are identical. This problem could not be resolved at the script level; these ambiguous names are put on an exclusion list instead.
- Interspecies issue: Gene names, including aliases, from SGD, Flybase, and MGI should be collected and compared with our own gene lists, including all synonyms. Arun will be making a splash page on textpresso dev to house all these ambiguous names, to direct readers if a proper URL cannot be resolved (2/18/10 - last action taken)
- Example: general anatomy terms can be linked to the wrong species
Problem: in this C. elegans paper, 'epithelium' in a passage about mouse tissue was linked to the C. elegans epithelium page:
"T cells migrate toward (Fong et al. 2002). Additionally, loss of GRK3, which is highly expressed in mouse olfactory epithelium (Schleicher et al. 1993), "
Action: "epithelium" was added to the anatomy exclusion list. From this paper, we also added 'head', 'tail', 'neuron', and 'sensory neuron' to the exclusion list.
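How an exclusion list interacts with exact string matching can be sketched as follows. This is an illustration only, not the linking script: the function name `find_linkable_terms` and the word-boundary rule are assumptions, and the real scripts also use formatting cues that are not modeled here.

```python
import re


def find_linkable_terms(text, entity_names, exclusion_list):
    """Return entity names found verbatim in text, skipping excluded names."""
    excluded = set(exclusion_list)
    found = []
    for name in entity_names:
        if name in excluded:
            continue  # ambiguous names are never linked
        # Exact, case-sensitive match that is not part of a longer token.
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        if re.search(pattern, text):
            found.append(name)
    return found


sentence = (
    "loss of GRK3, which is highly expressed in mouse "
    "olfactory epithelium (Schleicher et al. 1993)"
)
anatomy_terms = ["epithelium", "pharynx", "neuron"]
exclusions = ["epithelium", "head", "tail", "neuron", "sensory neuron"]
print(find_linkable_terms(sentence, anatomy_terms, exclusions))
```

With the exclusion list applied, 'epithelium' in the mouse passage is no longer a link candidate; with an empty exclusion list it would be.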
Longevity of links
Things for consideration
- How can we ensure the links are maintained?
- How can we ensure non-WB links are maintained?
- Link checking can be used to make sure silent links have been dealt with.
At WormBase we will be addressing the stability of our links by collecting the URLs we use/create and periodically running a link checker to make sure the links are still alive. The link checker will point out any links that have gone dead; however, it will not check if the page is relevant. We will be working out the specifics of this aspect of the project.
All the linked entities in a paper, along with their link status at the time of sending the linked XML back to DJS, are available here.
The script that forms the entity table can be run periodically for each file and the statuses can be checked. If the links are 'silent' for a paper that was published a long time ago, then the responsible curator needs to be alerted.
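A periodic link check of this kind could look like the sketch below. It is a minimal illustration using only the Python standard library, not the project's link checker; the function names `http_status` and `find_dead_links` are hypothetical. As noted above, a check like this reports only whether a URL still responds, not whether the page is still relevant.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # server answered with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None  # unreachable host, timeout, or malformed URL


def find_dead_links(urls, fetch_status=http_status):
    """Return the URLs whose status is missing or is an error (>= 400)."""
    dead = []
    for url in urls:
        status = fetch_status(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead
```

The `fetch_status` parameter lets the check be exercised against a stored table of statuses instead of the live network, for example `find_dead_links(urls, saved_statuses.get)`.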
People
Caltech: Karen Yook, Hans-Michael Müller, Paul Sternberg. Alum: Arun Rangarajan, Chris Grove, Daniela Raciti, James Done, Juancarlos Chan
GSA: Tim Schedl, Tracey DePellegrin-Connelly, Ruth Isaacson, Virginia Ingerson
DJS (Dartmouth Journal Services): Sharon Faelten, Diana Schaeffer, Michelle Kerns. Alum: Lolly Otis, Stephen Haenel
SGD: Marek Skrzypek, Rob Nash
FlyBase: Steven Marygold, Aoife Larkin. Alum: Raymund Stefancsik
Alerts
E-mail alerts are sent when
- a paper is entered through the ticketer. Subject line: "GENETICS: WormBase data for accepted paper GENETICS/2015/185272"
- a JFP form becomes active for the author. Subject line: "new paper ticket DOI created"
- an article is entered into the markup pipeline. Subject line: "GSA auto-email: WB 179705 article received from GSA."
- the linked article is ready for QC. Subject line: "GSA auto-email: WB 179705 linked file available"
- the linked and QC'd article has been FTP'd to the DJS server. Subject line: "GSA auto-email: WB 179705 article FTPed"
- a paper goes through any of the MOD markup pipelines.
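The three "GSA auto-email" subject lines share a recognizable shape, so a curator could sort them by pipeline stage with a small script. The sketch below is an assumption built only on the example subjects listed above; the function name `parse_alert_subject` and the stage labels are hypothetical, and other alert subjects are deliberately left unmatched.

```python
import re

# Assumed shape: "GSA auto-email: <MOD> <six digits> <message>[.]"
SUBJECT_PATTERN = re.compile(r"^GSA auto-email: (\w+) (\d{6}) (.+?)\.?$")

# Stage labels are hypothetical names chosen for this sketch.
STAGES = {
    "article received from GSA": "received",
    "linked file available": "ready_for_qc",
    "article FTPed": "ftped",
}


def parse_alert_subject(subject):
    """Return (mod, paper_id, stage) for a pipeline alert, else None."""
    match = SUBJECT_PATTERN.match(subject.strip())
    if match is None:
        return None
    mod, paper_id, message = match.groups()
    stage = STAGES.get(message)
    if stage is None:
        return None
    return mod, paper_id, stage


print(parse_alert_subject("GSA auto-email: WB 179705 linked file available"))
```

Returning `None` for anything unrecognized keeps ticketer and JFP alerts, whose subjects follow a different shape, out of the stage tracking.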
E-mail addresses are stored in the emails folder for each MOD instance.
Each emails folder contains files listing the respective addresses; use sudo to change or update the addresses.
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/emails
- allele_tickets.txt -> worm-bug@sanger.ac.uk
- alleles.txt -> worm-bug@sanger.ac.uk, karen@wormbase.org, mueller@caltech.edu
- allele_developers.txt -> mueller@caltech.edu, karen@wormbase.org
- developers.txt -> karen@wormbase.org, mueller@caltech.edu
- curators.txt -> karen@wormbase.org, mueller@caltech.edu, pws@caltech.edu, vanauken@caltech.edu
- final.txt -> karen@wormbase.org, mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, pws@caltech.edu, vanauken@caltech.edu
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/fly/emails/
- developers.txt -> karen@wormbase.org, mueller@caltech.edu
- curators.txt -> al806@cam.ac.uk, sjm41@cam.ac.uk, karen@wormbase.org, mueller@caltech.edu
- final.txt -> mueller@caltech.edu, sjm41@cam.ac.uk, al806@cam.ac.uk, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, karen@wormbase.org
/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/yeast/emails/
- curators.txt -> mueller@caltech.edu, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.org
- developers.txt -> karen@wormbase.org, mueller@caltech.edu
- final.txt -> mueller@caltech.edu, Genetics_Specialist.djs@sheridan.com, GGG_Specialist.djs@sheridan.com, virginia.ingerson@sheridan.com, marek.skrzypek@stanford.edu, rnash@stanford.edu, karen@wormbase.org
E-mail files are also on the sandbox:
/data2/srv/textpresso-dev.caltech.edu/www/docroot/karen/gsa/