Difference between revisions of "Entity classes and problems"

From WormBaseWiki

Revision as of 13:41, 1 March 2010

Caltech_documentation

GSA markup

Classes of entities that are being marked up

To see the actual entities, go to http://dev.textpresso.org/gsa/known_entities/

Text in italics signifies possible changes to the entity lists. The classes Anatomy_name and Person are just now being marked up.

  • Stable classes = entity lists do not change much
    • Clone (cosmids) that exist in WB already
    • Rearrangement
    • Anatomy_name (problem terms “set”, “cell”, “ray”, “CAN”, “D”; needs rules) -> link here for further examples => With Raymond's help, we will start with a small, highly defined list of terms that has been developed for linking with WormAtlas
  • Constant growth classes = new entities should eventually be in WB
    • Variations => includes all variations that follow elegans nomenclature rules, which applies to other species as well
    • Person (authors only) => working on getting a query URL to direct person name to either WBPerson or Author page
    • Transgene “Is” and "Ex" expressions that exist in WB already, as well as all new "Is" constructs that follow proper nomenclature rules (will not link new "Ex")
    • Genomic expressions will not be linked => see here for further examples
    • Addgene plasmids? Chris will check on developing this class of objects
    • Chemical/Drugs/Small Molecules
  • Monitored growth classes = entity names need curator approval
    • Gene => manually map a published but unapproved gene name to the reported CDS page. The link will only be available through the paper, so if the author's gene name was not approved, a search for that name will either produce no hit in WB or go to whatever page is associated with a previously approved use of that name.
    • Strains => only the strains that exist in WB (i.e., the CGC). We will want Strains for other species eventually. Need to talk to MaryAnn and the CGC about developing this class.

Classes we are working on linking

  • Person - based on WBPersonID. This is a problematic class since publication names are not consistent, and a single person could share a name or initials with someone else, or have multiple aliases. However, we will limit the linking to the authors on the paper only and work closely with our person curator to make sure these links are made properly.
  • Non-elegans genes - in general, as long as the gene name is distinct from elegans gene names and exists in the database, this should not be a problem. One error we ran into is that some names are similar, and distinguishing between the elegans gene and the non-elegans gene required a correctly formatted species prefix, e.g. "Cbr-". At the moment, formatting errors are not easily dealt with by automated scripts.
  • Anatomy terms - this class was problematic due to the linking of the terms to common reagents or paper features, e.g. EMS is a cell but also a common mutagen. For now, as we will be monitoring papers manually, we can chart how often this happens and fix these by hand. We may need to implement context specificity for some of these terms.
  • Transgene expressions - currently only the gene elements in the transgene genotype are marked up and linked to the gene page. For example, in the expression eor-1::GFP, only eor-1 is marked up. This is fundamentally wrong, as the entire expression should be linked to the transgene page. There are a couple of issues to deal with for this class.
    • First, not all transgenes are captured by our database. Specifically, Ex arrays and plasmids/cosmids used for injections are not captured. This was an ideological decision based on the fact that expression patterns for these elements are fleeting, unlike integrated arrays, so they are not stable data elements to curate.
    • Second, in cases where people name the elements using standardized nomenclature, creating a silent link is easy, but more often than not, people do not name these elements and thus a name has to be created for them. Without the name, a link cannot be made.

When entities do not exist in WB already

Currently, if an entity does not exist in WB, a silent link will be created based on the term in the paper.

  • If the object is an allele, strain, or rearrangement, there should not be a problem as these are lab specific objects that would not conflict with the general public, as long as lab designations were used.
  • In the case of a gene, there is the potential problem that the authors are using a gene name that has not been approved, and may conflict with other labs or objects in WB. We will need to monitor these objects carefully and create a robust feedback system with our nomenclature curators.
    • At present, authors have been asked to declare new genes or gene names so that we are alerted to these objects.
    • In practice, most of the time, authors have already contacted our nomenclature curators in private and these gene names have been added to the database awaiting publication of the paper.
    • However, there are occasions when authors have not contacted our nomenclature curators. In practice, our pipeline has been to alert our curators when we come across such a gene and let them handle the approval and creation of the gene in our database. The GSA linking pipeline conflicts with this practice because it creates a link whether or not our curators approve the gene name. There was one test of this early in the project, which did not cause a problem. We have now come across the case of a briggsae gene being named but not yet existing in WormBase. The immediate solution is to link the gene to the sequence page, as this was included in the publication, but this is a manual task and does not fit our long-term goal of automating as much of the process as possible.

Dealing with ambiguities

General ambiguities

    • We've come across entities that get mistaken for entities of a different class. This was noticed for a couple of strain names that are identical to clone names.
    • Jargon terminology: EMS (embryonic cell) vs EMS (mutagen)

In these cases the term is added to a list of terms excluded from linking. Ideally, context-specific rules would make these terms less ambiguous. For now, we can manually identify and link these terms when necessary.
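
The exclusion step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the shape of the entity lists, and the strain/clone name "XX1" are all hypothetical. It first finds terms that appear in more than one class list, then holds those terms (plus any jargon terms added by hand, such as EMS) back for manual linking.

```python
def ambiguous_terms(entity_lists):
    """Return terms that appear in more than one class list, with their classes."""
    classes_by_term = {}
    for cls, names in entity_lists.items():
        for name in names:
            classes_by_term.setdefault(name, set()).add(cls)
    return {t: sorted(c) for t, c in classes_by_term.items() if len(c) > 1}


def split_for_linking(candidates, entity_lists, extra_excluded=()):
    """Split candidate terms into (auto, manual).

    auto   -> known terms that are safe to link automatically
    manual -> known terms on the exclusion list, left for a curator
    Terms not found in any entity list are dropped.
    """
    known = {name for names in entity_lists.values() for name in names}
    excluded = set(ambiguous_terms(entity_lists)) | set(extra_excluded)
    auto, manual = [], []
    for term in candidates:
        if term not in known:
            continue
        (manual if term in excluded else auto).append(term)
    return auto, manual


# Toy entity lists; "XX1" is a made-up name used as both a strain and a clone.
entity_lists = {
    "Strain": ["XX1"],
    "Clone": ["XX1"],
    "Anatomy_name": ["EMS"],
    "Gene": ["unc-119"],
}
auto, manual = split_for_linking(
    ["unc-119", "EMS", "XX1", "foo"], entity_lists, extra_excluded=["EMS"]
)
```

Class collisions can be detected from the lists themselves, whereas jargon collisions such as EMS have to be supplied by a curator, which is why the sketch takes them as a separate argument.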

Author names

Person names are difficult to recognize and map to a WBPersonID in an automated fashion because authors list their names in many different ways. In addition, there are cases where more than one person shares the same name or initials. We will deal with this by designing a URL that first queries the WBPerson class for a person name matching the extracted author name; if none is found, the URL will point to an author page. Since all authors will get a page eventually, that link is guaranteed to be alive at some point. Open question: how often do authors get linked to WBPersonIDs?
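
A rough sketch of the intended lookup order is given below. The base URL, the path layout, the name normalization, and the WBPerson IDs are placeholders, not real WormBase addresses or records. The handling of a name that matches several people (falling back to the author page rather than guessing) is also an assumption; the text above only specifies the found and not-found cases.

```python
from urllib.parse import quote

BASE = "https://example.org"  # placeholder, not a real WormBase URL


def normalize(name):
    """Lower-case the name, drop periods, and collapse whitespace."""
    return " ".join(name.replace(".", " ").lower().split())


def author_link(name, person_index):
    """Return a person-page URL for a unique WBPerson match, else an author-page URL.

    person_index maps a normalized name to a list of WBPerson IDs.
    """
    ids = person_index.get(normalize(name), [])
    if len(ids) == 1:
        return f"{BASE}/person/{ids[0]}"
    # No match, or several people share the name (assumed fallback).
    return f"{BASE}/author/{quote(name)}"


# Hypothetical index with made-up IDs.
index = {
    "j smith": ["WBPerson0000001"],
    "a lee": ["WBPerson0000002", "WBPerson0000003"],
}
unique = author_link("J. Smith", index)
shared = author_link("A. Lee", index)
unknown = author_link("B. Jones", index)
```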

Gene names

We anticipate many cases where the same gene name is used across different species, making the identification of that gene problematic even in a single-species paper (e.g., MEX-1 exists as a name for a mammalian gene as well as for an elegans protein). We will start to deal with this by identifying all similar gene names, including synonyms, shared between the different organisms (SGD, FlyBase, mouse).
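
One simple way to find such overlaps, assuming each organism's gene names and synonyms are available as a plain list, is a case-insensitive comparison. The function name and the toy lists below are illustrative only; real lists would come from the respective databases.

```python
def shared_gene_names(name_lists):
    """Map each case-folded name to the organisms using it, for names used by two or more."""
    organisms_by_name = {}
    for organism, names in name_lists.items():
        for name in names:
            organisms_by_name.setdefault(name.casefold(), set()).add(organism)
    return {n: sorted(o) for n, o in organisms_by_name.items() if len(o) > 1}


# Toy input; only the MEX-1 overlap is taken from the text above.
name_lists = {
    "elegans": ["mex-1", "unc-119"],
    "mammal": ["MEX-1"],
}
overlaps = shared_gene_names(name_lists)
```

Names flagged this way could then be handled like other ambiguous terms, for example by requiring a species prefix before a link is made.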

Author responsibilities

We have relied on authors to declare WB-naive objects (objects not yet in WB) to us so that we can make sure they are properly linked. However, authors do not always get back to us in time, and when they do, they do not always tell us what we need to know, or they reply in a form that prevents us from automatically adding their responses to our pipeline. Manual oversight of author input will need to be provided on an ongoing basis.

Problems

Proper formatting in the XML that is given to us is critical to the smooth running of Arun's scripts. That is, if any part of "cbr-unc-119" is not italicized properly, the scripts may not recognize the entity as a gene or, worse, may recognize only part of the entity. In that case, the link could go to the elegans unc-119 page rather than the briggsae unc-119 page. Parentheses also need to be formatted correctly: in the expression cyp-4(sy5022), cyp-4 and sy5022 need to be italicized independently for them to be recognized as entities to be linked to independent pages. If the whole expression is italicized, it won't be recognized as a phrase to be linked, so no link will be created.
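
The failure modes above can be reproduced with a small sketch. This is not Arun's script; it assumes italics are marked with an `<i>` element and uses simplified, hypothetical patterns for gene and variation names. Each italic span is classified on its own, so a prefix left outside the italics is lost, and a fully italicized expression matches nothing.

```python
import re
import xml.etree.ElementTree as ET

# Simplified patterns (assumptions): optional 3-letter species prefix, then a gene name.
GENE = re.compile(r"^(?:(?P<prefix>[A-Za-z]{3})-)?(?P<name>[a-z]{3,4}-\d+)$")
VARIATION = re.compile(r"^[a-z]{1,3}\d+$")


def italic_spans(xml_fragment):
    """Return the text of every <i> element in the fragment."""
    root = ET.fromstring(xml_fragment)
    return [(el.text or "").strip() for el in root.iter("i")]


def classify(span):
    """Classify one italic span as a gene, a variation, or unrecognized (None)."""
    m = GENE.match(span)
    if m:
        prefix = (m.group("prefix") or "").lower()
        species = "briggsae" if prefix == "cbr" else "elegans"
        return ("Gene", species, m.group("name"))
    if VARIATION.match(span):
        return ("Variation", span)
    return None


def entities(xml_fragment):
    return [classify(s) for s in italic_spans(xml_fragment)]


good_prefix = entities("<p><i>cbr-unc-119</i></p>")        # whole name italicized
bad_prefix = entities("<p>cbr-<i>unc-119</i></p>")         # prefix outside italics
good_allele = entities("<p><i>cyp-4</i>(<i>sy5022</i>)</p>")
bad_allele = entities("<p><i>cyp-4(sy5022)</i></p>")       # whole expression italicized
```

With the prefix outside the italics, the span is read as an elegans gene, which is the wrong-page case described above; with the whole expression italicized, nothing is recognized and no link would be made.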


It is our hope that as object linking to this MOD becomes familiar to readers, authors will see the value of declaring approved new objects to us before the linking scripts are run. Until then, we will continue to rely on curator moderation rather than on author input and full automation.