Entity classes and problems


Revision as of 23:19, 23 July 2013


Entities marked up

All WB entities can be viewed at http://dev.textpresso.org/gsa/worm/known_entities/

Entity Sources

Entity lists are available at http://textpresso-dev.caltech.edu/gsa/worm/known_entities/

01downloadModEntities.pl downloads entities
From acedb on spica

  • Anatomy_name
  • Clone
  • Rearrangement
  • Strain
  • Variation

From postgres on tazendra

  • Genes
  • Transgenes

02formSortedLexicon.pl forms the lexicon from the known_entities.

URL constructor

Based on the WormBase linking protocol (http://www.wormbase.org/wiki/index.php/Linking_To_WormBase), the URL constructor used in the linking script is

http://www.wormbase.org/db/get?name=X;class=Y

Currently we form URLs from public names rather than object IDs. These links are redirected to the correct page, whose URL is based on the object ID. Using the public name instead of the ID avoids links dying when objects are merged or killed. In some cases, such as Phenotype, the URL is generated from a mapped WBPhenotypeID.
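
As an illustration only, the constructor pattern above could be sketched as follows. The function name build_entity_url and the percent-encoding step are assumptions for this example; the actual linking script is not shown on this page.

```python
from urllib.parse import quote

# Base of the pattern documented above: .../db/get?name=X;class=Y
BASE_URL = "http://www.wormbase.org/db/get"


def build_entity_url(name: str, entity_class: str) -> str:
    """Return a WormBase link built from a public name and an object class.

    Hypothetical helper: percent-encoding the two values is an assumption
    added for safety; it leaves names such as mpk-1 or F43C1.1 unchanged.
    """
    return (
        f"{BASE_URL}?name={quote(name, safe='')}"
        f";class={quote(entity_class, safe='')}"
    )


if __name__ == "__main__":
    print(build_entity_url("mpk-1", "Gene"))
    print(build_entity_url("yk1106g06.3", "Sequence"))
```

Because the link carries the public name, the redirect to the ID-based page happens on the WormBase side, which is what keeps the link alive when the underlying object is merged.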

Entity classes

Object -> source ; web page link
<example object page>

Gene -> postgres on tazendra ; Gene Summary page URL uses public gene name, which is redirected to WBGeneID
<http://www.wormbase.org/db/get?name=mpk-1;class=Gene>
<http://www.wormbase.org/db/get?name=F43C1.1;class=Gene>


Protein -> postgres on tazendra ; Gene Summary page
<http://www.wormbase.org/db/get?name=egl-4;class=Gene>

Allele/SNP -> acedb on spica ; Variation Report page
<http://www.wormbase.org/db/get?name=ga117;class=Variation>
<http://www.wormbase.org/db/get?name=hw42941;class=Variation>

Transgene -> postgres on tazendra ; Transgene Summary page
<http://www.wormbase.org/db/get?name=oxIs12;class=Transgene>

Rearrangement -> acedb on spica

Strain -> acedb on spica; Strain Report page
<http://www.wormbase.org/db/get?name=MH37;class=Strain>

Clone -> acedb on spica; Sequence Summary page
<http://www.wormbase.org/db/get?name=yk1106g06.3;class=Sequence>
<http://www.wormbase.org/db/get?name=OSTR153G5_1;class=Sequence>

Phenotype ->


Classes to add

  • Molecules/Drugs
  • Human Diseases

Suppressed classes

Anatomy/cell -> acedb on spica with corrections by Raymond ; Summary of anatomy ontology term
<http://www.wormbase.org/db/get?name=HSN;class=Anatomy_term>
<http://www.wormbase.org/db/get?name=Z1.ppp;class=Anatomy_term>

Author -> The links are formed using the URL pattern
http://www.wormbase.org/db/misc/person?name=$url_encoded_name;paper=$wbpaper_id,
where $url_encoded_name is the full name of an author with the following modifications:
i) periods after initials removed and
ii) spaces converted to %20, which is the URL-encoded equivalent of a space.

Examples:
Name: Eve W. L. Chow
Link: http://www.wormbase.org/db/misc/person?name=Eve%20W%20L%20Chow;paper=WBPaper00038245

Name: I. Russel Lee
Link: http://www.wormbase.org/db/misc/person?name=I%20Russel%20Lee;paper=WBPaper00038245

(There may also be other special characters with accents, umlauts, etc. The best thing to do is to check WormBase and use the correct link if the author name already exists. Otherwise, the English equivalent of the special character can be determined visually, a new author link formed, and Cecilia notified.)
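
The two modifications to the author name can be sketched as below. This is a minimal example, not the pipeline's code: build_author_url is a hypothetical name, and removing every period in the name (not only those after initials) is a simplifying assumption that happens to reproduce both examples above.

```python
from urllib.parse import quote

# Pattern documented above: .../db/misc/person?name=...;paper=...
PERSON_URL = "http://www.wormbase.org/db/misc/person"


def build_author_url(full_name: str, wbpaper_id: str) -> str:
    """Build an author link from a printed author name and a WBPaper ID."""
    # i) drop periods (assumption: all periods, not only those after initials)
    without_periods = full_name.replace(".", "")
    # collapse any repeated whitespace left behind
    normalized = " ".join(without_periods.split())
    # ii) percent-encode, which turns each space into %20
    encoded = quote(normalized, safe="")
    return f"{PERSON_URL}?name={encoded};paper={wbpaper_id}"


if __name__ == "__main__":
    print(build_author_url("Eve W. L. Chow", "WBPaper00038245"))
    print(build_author_url("I. Russel Lee", "WBPaper00038245"))
```

Names with accents or umlauts would be percent-encoded as UTF-8 by this sketch, which may not match the name stored in WormBase, so the manual check described above still applies.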


Problematic classes

  • Person - based on WBPersonID. This is a problematic class since names in publications are not consistent, and a single person could have the same name or initials as someone else, or have multiple aliases. However, we will limit the linking to the authors on the paper only and work closely with our person curator to make sure these links are made properly.
  • Non-elegans genes - in general, as long as the gene name is distinct from elegans gene names and exists in the database, this should not be a problem. One error we ran into is that some names are similar, and distinguishing between the elegans gene and the non-elegans gene required a correctly formatted species prefix, e.g. "Cbr-". Formatting errors are not easily dealt with by automated scripts.
  • Anatomy terms -
    • Many cells are named with single letters, e.g. "E", "F", "B", etc., which results in many erroneous links
    • Ambiguities: some cells are also names for common reagents or paper features, e.g., EMS is a cell but also a common mutagen. These names are put on an exclusion list, and if the cell exists in the paper, they are manually linked to the correct page.
    • Sex-specific anatomy terms such as "Z4.appaap nucleus hermaphrodite" and "Z4.appaap nucleus male" will probably not be linked, since in general such terms are not used in normal writing. The best that we can do is to link such terms to the non-sex-specific anatomy page "Z4.appaap nucleus", where the user can choose the sex-specific term.
  • Transgene expressions - currently only the gene elements in the transgene genotype are marked up and linked to the gene page; for example, in the expression eor-1::GFP, only eor-1 is marked up. This is fundamentally wrong, as the entire expression should be linked to the transgene page. There are a couple of issues to deal with for this class.
    • Not all transgenes are captured by our database. Specifically, Ex arrays and plasmids/cosmids used for injections are not captured since expression patterns for these elements are fleeting, unlike integrated arrays, so are not stable objects to curate.
    • In cases where people name the elements using standard nomenclature, creating a silent link is easy, but more often than not, people do not name these elements and thus a name has to be created for them. Without the name, a link cannot be made.

Linking issues

Entities don't exist in WB

  • Gene names need to be approved by WB nomenclature curators. Authors should declare any new gene names before the linking scripts are run, but authors do not always follow through. One action is to link the name to the official WBGene sequence ID. This requires that the authors have explicitly stated what the gene sequence is. Linking the name to the gene should not affect what happens in acedb; the link lives within the online paper alone. If an author is not explicit, the best option is to contact the author; if there is no response within a reasonable time, no link will be created. This situation is not frequent.
  • Variations can be verified through the variation nameserver and a silent link made to the object. Alert the variation curator that the object is being created through the GSA JFP for that paper. Details for the variation can be entered through the form or through the Phenotype OA.
  • Other entities - check with the corresponding curator about the feasibility of creating a silent link.

Formatting problems break recognition

Entity recognition relies on exact string matches and XML formatting cues. If entities are not formatted correctly in the XML, they may not be recognized and linked. QCFast does not allow links to be made that do not follow XML formatting restrictions; to link these entities, it is necessary to download the HTML, manually add the link, then upload the fixed HTML.

  • Case (a): use <i>Cbr-dpy-1</i>, not Cbr-<i>dpy-1</i>
Otherwise the <i> tag falls inside the name and string matching fails on Cbr-dpy-1, even if it is in the WormBase entity list.
  • Case (b): use <i>unc-119</i>, not <i>unc</i>-<i>119</i>
  • Case (c): Parentheses should not be italicized, since they are used as indicators of where strings start and stop for entity recognition.
Use <i>cbr-unc-119</i>(<i>st20000</i>), not <i>cbr-unc-119(st20000)</i>
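
A simplified sketch of why these three cases matter is given below. It assumes that recognition amounts to an exact match between the full text of one italic span and a lexicon entry; the real scripts are not shown here and may use further cues. The names KNOWN_ENTITIES and recognized_entities, and the small sample lexicon, are hypothetical.

```python
import re
from typing import List

# Hypothetical sample of the lexicon; the real lists are far larger.
KNOWN_ENTITIES = {"Cbr-dpy-1", "unc-119", "cbr-unc-119", "st20000"}

# One italic span as it appears in the XML/HTML.
ITALIC_SPAN = re.compile(r"<i>(.*?)</i>")


def recognized_entities(fragment: str) -> List[str]:
    """Return the italic spans whose whole text exactly matches a known entity."""
    return [span for span in ITALIC_SPAN.findall(fragment) if span in KNOWN_ENTITIES]


if __name__ == "__main__":
    for fragment in (
        "<i>Cbr-dpy-1</i>",                    # case (a), correct
        "Cbr-<i>dpy-1</i>",                    # case (a), prefix outside the tag
        "<i>unc</i>-<i>119</i>",               # case (b), name split across tags
        "<i>cbr-unc-119</i>(<i>st20000</i>)",  # case (c), correct
        "<i>cbr-unc-119(st20000)</i>",         # case (c), parentheses italicized
    ):
        print(fragment, "->", recognized_entities(fragment))
```

Under this assumption only the correctly formatted fragments yield matches; the others produce an empty list and would need the manual HTML fix described above.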

Combined entity expressions

  • double mutants, double alleles
  • dpy-11/12

These cases also need to be dealt with by manually making changes to the XML.


Dealing with ambiguities

General ambiguities

    • We've come across entities that get mistaken for entities of a different class; this was noticed for a couple of strain names that are identical to clone names.
    • Jargon terminology: EMS (embryonic cell) vs EMS (mutagen)

In these cases the term is added to a list of terms excluded from linking. Ideally, context-specific rules would be used to make these terms less ambiguous. For now, we manually identify and link these terms when necessary.
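
The exclusion step can be pictured with the short sketch below. The names EXCLUSION_LIST and split_candidates are hypothetical, and the single excluded term is taken from the EMS example above; how the pipeline actually stores and applies its exclusion list is not documented on this page.

```python
from typing import Iterable, List, Tuple

# Hypothetical exclusion list; EMS is both an embryonic cell and a mutagen.
EXCLUSION_LIST = {"EMS"}


def split_candidates(candidates: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate terms that can be linked automatically from excluded terms.

    Excluded terms are returned separately so a curator can link them by hand.
    """
    auto_link: List[str] = []
    manual_review: List[str] = []
    for term in candidates:
        if term in EXCLUSION_LIST:
            manual_review.append(term)
        else:
            auto_link.append(term)
    return auto_link, manual_review


if __name__ == "__main__":
    print(split_candidates(["HSN", "EMS", "mpk-1"]))
```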

Author names

Person names are difficult to recognize and map to a WBPersonID in an automated fashion because authors list their names in many different ways; in addition, more than one person may share the same name or initials. We will deal with this by designing a URL that first queries the WBPerson class for a person name matching the extracted author name; if none is found, the URL will point to an author page. Since all authors will get a page eventually, that link is guaranteed to be alive at some point. Q. How often do authors get linked with WBPersonIDs?

Gene names

We anticipate many cases where a gene name is used across different species, making identification of that gene problematic even in a single-species paper (e.g., MEX-1 exists as a name for a mammalian gene as well as for an elegans protein). We will start to deal with this by identifying all similar gene names, including synonyms, between the different organisms (SGD, FlyBase, mouse).

See also

More problematic cases: http://wiki.wormbase.org/index.php/Markup_excerpts
Marked up papers: http://textpresso-dev.caltech.edu/gsa/worm/html/