Linking script pipeline


GSA Markup SOP

Linking entities in Genetics articles to WormBase resource pages

Linking of entities (or synonymously, objects) in Genetics articles to WormBase happens as a two-step process.

  • First, a lexicon of entities is formed using the WormBase database and author-submitted data.
  • Second, entities are recognized in the articles and linked to their corresponding WormBase resource pages.

Comments about the DJS scope document are at the bottom of the page.

Step 1: Creation of the lexicon

This section explains how the lexicon used for linking Genetics articles is created. This is the first step of the two-step process involved in linking Genetics articles. The script that forms the lexicon is located on http://dev.textpresso.org at

/home/arun/gsa/scripts/01sortedLexicon.pl.
This script also uses the PERL modules available at /home/arun/gsa/scripts/perlmodules/. The script gets objects both from WormBase and from the author-submitted new objects form.

WormBase objects

First, the WormBase objects are formed from the WormBase databases on tazendra (http://tazendra.caltech.edu).
The entity lists Cell, Clone, Rearrangement, Strain, Transgene, and Variation are all formed by querying the latest acedb WormBase release on tazendra, which is updated monthly. The script is located on tazendra at

/home/acedb/arun/wb_objects/01objects.pl.

This script outputs a file for each of these entity classes on tazendra in the directory

/home/acedb/arun/wb_objects/known_objects/.  

The entity list for Gene is formed by querying the postgres database tables on tazendra, which are updated monthly except where noted. (This was done because the acedb database has only WBGeneIDs and not the actual gene names.)
The script is

/home/acedb/arun/wb_objects/02gene.pl

The script queries postgres for the following tables:

  • gin_genesequencelab
  • gin_locus (this information is updated every day at 2pm PST from Sanger's nameserver)
  • gin_seqname
  • gin_sequence

and outputs a list of all gene names in file

/home/acedb/arun/wb_objects/known_objects/Gene.

The entity list for Person is formed using person_obo.cgi (designed by Juancarlos Chan), which is updated on the fly. This CGI is located at http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/person_obo.cgi.
The script used for forming the file for Person class is located at

/home/acedb/arun/wb_objects/03person.pl

and it outputs the file

/home/acedb/arun/wb_objects/known_objects/Person 

which has the following format:
<Name>\t<WBPersonID>
(\t stands for a tab)
(The WBPersonID is needed for forming the links to WormBase.)
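
For illustration, a single line of this file would look like the following (the name and ID shown are hypothetical):

  Jane Doe\tWBPerson000000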

All the object files are regenerated daily by running the above three scripts as cron jobs. A separate daily scp cron job then transfers the object files to dev.textpresso.org, to the following location:

/home/arun/gsa/ace/known_objects/.

All the object files can be viewed at http://dev.textpresso.org/gsa/known_objects/

Author-submitted objects

Author-submitted objects are extracted on http://dev.textpresso.org by the script

/home/arun/gsa/scripts/01sortedLexicon.pl

from the following URL: http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi?action=Show+Data&type=textpresso

The rules used by the script are the following:

  • anything in the data column after ~~ is removed.
  • anything in the data column inside square brackets, along with the square brackets themselves, is removed.
  • the data objects are assumed to be separated by commas.

The mapping of names is as follows:

genesymbol -> Gene
extvariation -> Variation
newstrains -> Strain
newbalancers -> Rearrangement
transgene -> Transgene
newsnp -> Variation

The script 01sortedLexicon.pl first extracts all the author-submitted data objects and stores them in a two-level hash table (named lexicon). The first key is the entity name and the second key is the entity class name. (The hash value is always set to 1 and is not used.)
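
The following is a minimal sketch of these cleanup and storage rules (the variable and subroutine names are illustrative, not the ones used in 01sortedLexicon.pl):

  # Minimal sketch of the author-data cleanup rules and the lexicon hash.
  # %type_map and the other names are illustrative only.
  my %type_map = (
      genesymbol   => 'Gene',
      extvariation => 'Variation',
      newstrains   => 'Strain',
      newbalancers => 'Rearrangement',
      transgene    => 'Transgene',
      newsnp       => 'Variation',
  );

  my %lexicon;    # first key: entity name; second key: entity class; value always 1

  sub add_author_data {
      my ($type, $data) = @_;
      my $class = $type_map{$type} or return;
      $data =~ s/~~.*//s;          # drop everything from ~~ onward
      $data =~ s/\[[^\]]*\]//g;    # drop bracketed text, brackets included
      for my $obj (split /\s*,\s*/, $data) {   # objects are comma-separated
          $obj =~ s/^\s+|\s+$//g;
          next unless length $obj;
          $lexicon{$obj}{$class} = 1;
      }
  }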

Then the script reads the WormBase entity files located at

/home/arun/gsa/ace/known_objects/ 

which can be viewed at http://dev.textpresso.org/gsa/known_objects/
and loads them into the lexicon hash.

For Variation objects, suffixed variants are added to the lexicon along with the base Variation names. The following suffixes are used: ts, sd, gf, cs, lf, mx

For forming Protein objects, we simply capitalize the entities in the Gene class.
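
A rough sketch of how these derived entries could be added to the lexicon hash (@variations and @genes stand in for the entity lists read from the known_objects files; the names are illustrative):

  # Variation objects: add suffixed variants along with the base names.
  my @suffixes = qw(ts sd gf cs lf mx);
  for my $var (@variations) {
      $lexicon{$var}{'Variation'} = 1;
      $lexicon{ $var . $_ }{'Variation'} = 1 for @suffixes;
  }

  # Protein objects: simply the capitalized Gene names.
  for my $gene (@genes) {
      $lexicon{ uc $gene }{'Protein'} = 1;
  }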

For Person objects, the WBPersonID information is ignored and only the names are stored. (The mapping information from name to WBPersonID is not needed for recognition of names. It is needed only when forming the links.)

Exclusions

Exclusions to all classes can be provided in files located on dev.textpresso.org at

/home/arun/gsa/scripts/exclusions/

(current exclusions can be viewed at http://dev.textpresso.org/gsa/exclusions/)

Anything that is in the exclusion list is not added to the lexicon hash.

Stopwords

Any of the 192 stopwords (listed at http://dev.textpresso.org/gsa/stopwords) that appears in the lexicon is removed.

The lexicon file

The script 01sortedLexicon.pl outputs the file:

/home/arun/gsa/scripts/lexicon

(warning: this file has ~2.6 million entries and may take a very long time to load fully in a web browser; it can be viewed at http://dev.textpresso.org/gsa/lexicon)

The lexicon entries are sorted in descending order of the entity lengths, so the longest strings are at the beginning of the list. This sorting is necessary because when performing linking, we want to link the longest possible entity first and not link any shorter entities within that long entity. (For example, consider the transgene eIs[unc-31::lacZ]. If we link it to its corresponding transgene page, then we do not want to link its sub-string unc-31 to the unc-31 gene page again.)
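
In Perl, this ordering can be produced with a sort on string length, roughly as follows (assuming the lexicon hash sketched above):

  # Entity names sorted longest-first, so longer matches are linked before any
  # shorter entities contained inside them.
  my @sorted_entries = sort { length($b) <=> length($a) } keys %lexicon;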

All non-alphanumeric characters (other than underscore) are mapped to their corresponding 'textpresso-coded' equivalent symbols. These look like _PRD_ (code for period), _CMM_ (code for comma), _OSB_ (code for open square bracket), etc. This step is necessary because special characters have special meaning in regular expression pattern matching. Once linking happens in the next step, all the 'textpresso-coded' symbols are restored to their original characters. No information is lost, and regular expression matching can be performed without any problems.
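
A minimal sketch of the encoding and decoding, showing only three of the codes (the script defines the full mapping; the subroutine names here are illustrative):

  # Partial, illustrative mapping of special characters to textpresso codes.
  my %encode = ( '.' => '_PRD_', ',' => '_CMM_', '[' => '_OSB_' );
  my %decode = reverse %encode;

  sub tp_encode {
      my ($text) = @_;
      $text =~ s/([.,\[])/$encode{$1}/g;   # only the characters in %encode
      return $text;
  }

  sub tp_decode {
      my ($text) = @_;
      $text =~ s/(_PRD_|_CMM_|_OSB_)/$decode{$1}/g;
      return $text;
  }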

Step 2: Entity recognition and linking

The recognition of entities in an input XML file (from DJS, sent to WormBase via email) and the linking of the recognized entities to their corresponding WormBase pages are done by the script:

/home/arun/gsa/scripts/02link.pl

located on http://dev.textpresso.org. This script also uses the PERL modules available at

/home/arun/gsa/scripts/perlmodules/.

The script takes one command-line argument, which is the source XML file that is sent by DJS.

This script first reads in the lexicon formed in Step 1 and stores it in a hash. Along with storing the entries in the hash, it also stores the entities in an array called @sorted_entries. This array has the entries sorted in descending order of entity length, as explained in Step 1.

The script then reads the input XML file line by line and creates two string variables. The first one, $xml_file, stores all the contents of the input XML file with the special characters textpresso-encoded. This is the string in which the links are created. The second one, $tokenized_file, has only the lines that need to be linked, with special characters textpresso-encoded and with all the XML tags and XML character encodings (like the entity for ¢) removed. This is the string used to recognize the entities.

Entities in some lines are excluded from being linked (a policy set by Genetics and DJS). The lines excluded from linking are the ones in the XML file that match any of the following PERL regular expressions:

<Affiliations
<Correspondence
<Footnote
<Article_Title
<\S+_Runhead
<Bib_Reference
<COMMENT
<H1
<H2
<H3
<H4
<Table
<Figure
<Article_Subtitle
<Abbreviations
<Keywords
<title>
<entry.*rowsep=\"\d+\"/
<entry.*colsep=\"\d+\"/

(The last two regular expressions are for table entries.)

All the text inside all other tags (like <Abstract>, <Para_Text>, <Flush_Left>, <Ack>, etc.) is analyzed for the presence of entities, and the identified entities are linked.
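
A simplified sketch of how these two strings could be built (tp_encode() is the encoding helper sketched in Step 1; @exclude_patterns stands for the tag patterns listed above; the names are illustrative):

  # Illustrative sketch: build the two working strings from the source XML.
  my $source_xml = $ARGV[0];                    # the XML file sent by DJS
  my ($xml_file, $tokenized_file) = ('', '');
  open my $fh, '<', $source_xml or die "Cannot open $source_xml: $!";
  while (my $line = <$fh>) {
      $xml_file .= tp_encode($line);            # copy in which the links are created
      next if grep { $line =~ /$_/ } @exclude_patterns;   # lines excluded from linking
      (my $plain = $line) =~ s/<[^>]+>//g;      # drop the XML tags
      $plain =~ s/&[#\w]+;//g;                  # drop the XML character encodings
      $tokenized_file .= tp_encode($plain);     # copy used for entity recognition
  }
  close $fh;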

These two strings, along with the lexicon hash and sorted_entries array are passed to a sub-routine in the PERL module WormbaseLinkTasks.pm in /home/arun/gsa/scripts/perlmodules/. This sub-routine is the one that does the actual entity recognition and the linking.

Sub-routine for recognition and linking: findAndLinkObjects

This sub-routine checks the $tokenized_file for matches of entities present in @sorted_entries. Recall that @sorted_entries has the longest strings first, so these are matched first. Then the links are created in $xml_file in one shot rather than line by line (using the g regular expression modifier), which speeds up the linking process. (Unwanted links are removed at the end.) Also, before the links are formed in $xml_file, the textpresso-encoded entity is mapped back to its original characters. This helps avoid re-linking of sub-strings inside already-matched entities. (The examples in the comments inside the script illustrate how this works.)
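
A much-simplified sketch of this matching and substitution, ignoring delimiter checks and the other special cases handled by the real sub-routine (wrap_in_link() is a hypothetical stand-in for the code that wraps an entity in its class-specific WormBase link markup):

  # %lexicon, @sorted_entries, $xml_file and $tokenized_file as built above;
  # wrap_in_link() is hypothetical and stands in for the link-markup code.
  for my $entry (@sorted_entries) {                   # longest entity names first
      next if index($tokenized_file, $entry) < 0;     # recognize in the tokenized copy
      my ($class) = keys %{ $lexicon{$entry} };       # class of this entity
      my $linked  = wrap_in_link($class, tp_decode($entry));
      $xml_file =~ s/\Q$entry\E/$linked/g;            # link every occurrence in one shot
  }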

We use pattern matching for only one class of objects, the cis-double mutant special cases (like zu405te33). In this case, there are no delimiters at the end of the first entity zu405 to mark its end, so we directly search $tokenized_file for the PERL regular expression

($left_del)([a-z]{1,3}\d+)([a-z]{1,3}\d+)($right_del)

where $left_del is the left delimiter (one of ' |\_|\-|^|\n') and $right_del is the right delimiter (one of ' |\_|\-|$|\n'). Notice that because of the textpresso-encoding, an underscore essentially captures all the other delimiters.
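
For illustration, the search for this special case looks roughly like the following (link creation omitted):

  # Cis-double mutant names such as zu405te33: two allele-style names run
  # together, so they are found with an explicit pattern instead of the lexicon.
  my $left_del  = ' |\_|\-|^|\n';     # left delimiters, as listed above
  my $right_del = ' |\_|\-|$|\n';     # right delimiters, as listed above
  while ($tokenized_file =~ /($left_del)([a-z]{1,3}\d+)([a-z]{1,3}\d+)($right_del)/g) {
      my ($first_allele, $second_allele) = ($2, $3);
      # ... the two allele names would then be handled by the linking code ...
  }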

After all the above linking steps are completed, we replace all the textpresso-encoded special characters with the original characters.

Finally, any links created in the excluded lines listed above are removed, and the result is returned as the linked XML file. Also, any WormBase persons identified outside the <Authors> tag are delinked.

Based on input from Tim Schedl and Karen Yook, the script does not link only the gene portion of a transgene expression. For example, the script removes links to eor-1p and EOR-1 inside eor-1p::EOR-1::GFP; it links eor-1p::EOR-1::GFP only if it can identify this entity as a whole. (Note: as of now, WB curators do not want these expressions linked, so such expressions are not available for URL creation. This example only illustrates what the script does.)

Output

The script forms two output files, one in directory

/home/arun/gsa/scripts/linked_xml/

and one in directory

/home/arun/gsa/scripts/html/.

The file in the html directory is identical to the one in the linked_xml directory, but it has the extension .html instead of .xml. This makes it possible to view the linked file in a web browser, which is not possible with the XML file extension. The linked HTML files can be viewed online at http://dev.textpresso.org/gsa/html/.

(If one wants to save these files in XML format, one can right-click in the browser window and save the page source. This preserves the links.)

The linked HTML file is then sent to Karen Yook, who does the quality check. Any feedback from Karen that is general and should be automated is incorporated into the linking script. If Karen finds any new objects in the article that were not declared by the authors, she adds them through the same form authors use to submit the data (http://tazendra.caltech.edu/~postgres/cgi-bin/journal/journal_all.cgi). The Step 1 script is run again by Arun to include this data in the lexicon before running the linking script again. All other special cases are handled on a per-article basis, usually through e-mail.

Once a linked XML file is completed, the file is transferred to DJS via ftp and an email is sent to DJS notifying them that the linked file has been deposited on their ftp server. (DJS gets back to WormBase if they/authors find anything wrong or for any corrections/edits.)

Requests for DJS/copy editor/authors

The following two quality-control steps will help WormBase link more objects in articles and also help form correct links.

Case (1): Put the complete entity string inside the italic tags. Example: instead of italicizing only the dpy-1 portion of Cbr-dpy-1, the source XML should have the whole string Cbr-dpy-1 inside one pair of italic tags. Otherwise the tag inside Cbr-dpy-1 becomes part of the string, and string matching fails for Cbr-dpy-1. Unfortunately, since dpy-1 is present in our list, only that part then gets linked to the dpy-1 page.

Case (2): There should be no closing and opening italic tags within an entity. Example: entries where the italics are closed and reopened in the middle of a name such as unc-119 should be changed so that the whole of unc-119 sits inside a single pair of italic tags.

Genetics and DJS should decide at which step in the pipeline these need to be taken care of.

(The previously reported case (3), asking not to block-italicize multiple objects, is handled by the linking script, so it is no longer required.)

Some points we would like to clarify in the DJS scope document

The scope document still reads as though DJS is using the web service to do the linking, although a note on page 3 says the web service is not operational. For clarity, DJS may want to update this document. (WormBase feels that fully automated linking may not be feasible because of unforeseen complications in the linking process; from now on, WormBase will perform a manual quality check on all articles.) Steps 3 and 4 on page 5 under "Article Production Workflow" also need to be updated.