Mark up policy
Objects to be marked up
This list includes all the objects that were requested for markup at the very beginning of the project (from 9 April 2009).
Follow-up actions are indicated, and references to notes and e-mails are posted below.
Object is italicized
→ = linking to
Gene name. mpk-1 → Gene Summary page:
<http://www.wormbase.org/db/gene/gene?name=mpk-1;class=Gene>
Anonymous gene (physical map location). F43C1.1 → Gene Summary page:
<http://www.wormbase.org/db/gene/gene?name=F43C1.1;class=Gene>
Protein name. EGL-4 → Gene Summary page:
<http://www.wormbase.org/db/gene/gene?name=egl-4;class=Gene>
Allele. ga117 → Variation Report page (molecular basis of mutation):
<http://www.wormbase.org/db/gene/variation?name=ga117>
SNP. hw42941 → Variation Report page (molecular basis of SNP):
<http://www.wormbase.org/db/gene/variation?name=hw42941>
Strain name. MH37 → Strain Report page: <http://www.wormbase.org/db/gene/strain?name=MH37;class=Strain> (Will be limited to strains that are at the CGC stock center, and thus in WormBase?) Not currently marked up
cDNA clone. yk1106g06.3 → Sequence Summary page:
<http://www.wormbase.org/db/seq/sequence?name=yk1106g06.3;class=Sequence>
Now looking to clone class? (see below and e-mail of 1 July 2009 and 29 July)
cDNA ORFeome clone. OSTR153G5 → Sequence Summary page: <http://www.wormbase.org/db/seq/sequence?name=OSTR153G5_1;class=Sequence> Now looking to clone class? (see below and e-mail of 1 July 2009 and 29 July)
Transgene. sEx14536 → Transgene Summary page:
<http://www.wormbase.org/db/gene/transgene?name=sEx14536;class=Transgene>
Balancer chromosomes. hT2 → Rearrangement: <http://caltech.wormbase.org/db/gene/rearrange?name=hT2;class=Rearrangement> Marked rearrangements are linked to their constituent balancer and linked objects. We are currently working on model changes in WB that will let people view the marked rearrangements from the balancer page, so they will be able to reach information about the marked balancer, although not directly.
Cell/tissue. HSN (neuron) → Summary of anatomy ontology term: <http://www.wormbase.org/db/ontology/anatomy?name=HSN> Not currently marked up, see comments below
Cell (lineage pedigree). Z1.ppp → Summary of anatomy ontology term: <http://www.wormbase.org/db/ontology/anatomy?name=Z1.ppp> Not currently marked up, see comments below
Senior author (Not all authors). Paul Sternberg → Person Report page: <http://www.wormbase.org/db/misc/person?name=WBPerson625;class=Person> Not currently marked up although we are collecting e-mail addresses, see comments below
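The object-to-page mapping in the list above can be sketched as a small lookup table. This is only an illustration: the URL patterns are copied from the list, but the dictionary, the class keys, and the build_link helper are hypothetical names, and lowercasing a protein name to get its gene name is an assumption based on the single EGL-4 example.

```python
from urllib.parse import quote

# URL patterns copied from the object list; the keys and this table itself
# are hypothetical and not part of the actual linking pipeline.
URL_TEMPLATES = {
    "Gene": "http://www.wormbase.org/db/gene/gene?name={name};class=Gene",
    "Variation": "http://www.wormbase.org/db/gene/variation?name={name}",
    "Strain": "http://www.wormbase.org/db/gene/strain?name={name};class=Strain",
    "Sequence": "http://www.wormbase.org/db/seq/sequence?name={name};class=Sequence",
    "Transgene": "http://www.wormbase.org/db/gene/transgene?name={name};class=Transgene",
    "Rearrangement": "http://caltech.wormbase.org/db/gene/rearrange?name={name};class=Rearrangement",
    "Anatomy": "http://www.wormbase.org/db/ontology/anatomy?name={name}",
    "Person": "http://www.wormbase.org/db/misc/person?name={name};class=Person",
}


def build_link(object_class, name):
    """Return the report-page URL for one marked-up object."""
    if object_class == "Protein":
        # Assumption: a protein links to the Gene Summary page of the
        # lowercase gene name, as in the EGL-4 -> egl-4 example.
        object_class, name = "Gene", name.lower()
    template = URL_TEMPLATES[object_class]
    # Percent-encode anything that is not safe inside a query value.
    return template.format(name=quote(name, safe=""))


print(build_link("Gene", "mpk-1"))
print(build_link("Protein", "EGL-4"))
```

Keeping the patterns in one table means that a class can be switched off (as "Cell" and "Person" currently are) by removing its entry rather than by editing the matching code.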
Sequence data
from 1 July 2009
Hello Tim, Steve,

Thanks for your emails. Please see my comments below.

gen103614FIN

1) As Steve mentioned, just before ABSTRACT is the link DC1, which erroneously links to Clone Report for DC1. DC1 looks like it is within a URL. Is there a way that any object found within a URL can be ignored? (Also happened for gen104885FIN.)

I followed what Steve suggested in his email. I removed lines that have the following tags from getting linked:

Affiliations Correspondence Footnote Article_Title _Runhead Bib_Reference COMMENT

Also I found that entries within tables have tags like <entry rowsep="1" so I have done pattern matching using regular expressions to remove lines like these from getting linked. The PERL regex I used is <entry .* rowsep=\"\d+\" Steve: what other tagged items do you think I can exclude from linking?

2) cdf-2 and CDF-2 are not linked - should be "silent" links that come alive when the curators make the gene/protein pages.

cdf-2 and CDF-2 are not in wormbase. These are new objects for wormbase. I thought we were going to ask the authors to provide us with a list of objects from which such new objects could be captured. Whether these objects are already in wormbase or not does not matter. Are you going to ask the authors to fill out the form now or how do we proceed with papers that have already been accepted? If you want me to do regular expression matching for different objects I can do it as well. My only concern is that every instance of the regular expression should indeed be a true positive and all true positives are captured by the regular expression (one-to-one and onto) otherwise we will run into false positives and false negatives. Tim: let me know how you want to handle this.

3) M9, which refers to a buffer, is inappropriately linked to http://www.wormbase.org/db/seq/clone?name=M9;class=Clone

If this happens only in very few cases, then I can have a manual exclusion list.
The list needs to be prepared by an expert (like Tim) for each object of interest. If it is difficult to come up with this list a priori, then someone needs to go through the linked articles and inform me. For now I have put M9 in the Clone exclusion list, so no occurrence of M9 will ever get linked to the Clone page. So if M9 is indeed a Clone in some context, then it won't get linked. (More generally, this is a difficult problem to address with the current string matching approach the program is using. We need to implement a word-sense disambiguation NLP approach to decide whether the occurrence is a true positive or not. Also we need a list of all such objects that need to be disambiguated. As the number of objects increases, the disambiguation problem could become intractable.)

4) III in "the SuperScript III First-Strand" is inappropriately linked as (http://www.wormbase.org/db/seq/sequence?name=III;class=Sequence). III is also a problem in gen104885FIN, E).

I spoke to Karen Yook here and she says whatever GENETICS needs is in wormbase's Clone class and there is no need to link what is in wormbase Sequence list. So I am removing what is in wormbase Sequence from getting linked. Hence this issue is solved.

5) In the Analysis of CDF-2::GFP section of the Materials and Methods, the following are not linked: plasmids pDG222 and pDP15, and extrachromosomal arrays amEx1032, amEx1033 and amEx1201 and integrated transgenes amIs2, amIs4 and amIs5. What are the plans for the extrachromosomal arrays and integrated transgenes? Silent links? The plasmid names are more problematic as the DG and DP do not represent unique lab identifiers but non-unique initials of lab members.

These are also new objects and not in wormbase. Same as in (2) above.

6) WormBase problem with n2527 <http://www.wormbase.org/db/gene/variation?name=n2527> which gave an internal error or misconfiguration message. Similarly for n1046 -- is this related to Steve's point about improper characters?
No, I think the wormbase link is just down. Steve: I do not see the carriage returns on my linux machine. Are you using a windows machine? Can you convert the files to windows format and try?

======================================

gen104885FIN

A) WormBase problem with or198 <http://www.wormbase.org/db/gene/variation?name=or198;class=Variation> which gave an internal error or misconfiguration message. Similarly for or191, or195 and or213.

Looks like wormbase was down when you checked. They seem to be moving the server from CSHL to Toronto, so this could be causing the errors.

B) In Materials and Method, Molecular biology: pSO26 and pAA64 are not linked. Same issues as above in 5)

New objects.

C) In Materials and Method, Molecular biology: WRM0633dC is not linked, while other fosmids in the same sentence are linked.

New object.

D) M9 buffer linking problem as in 3) above.

See above.

E) In the Discussion "III", in chromosome III, is linked as above in 4).

Resolved.

F) In the Literature Cited section, R45 is linked to a clone report http://www.wormbase.org/db/seq/clone?name=R45;class=Clone. However, Current Biology uses the R followed by number to designate page number, so this may be a problem that will reoccur often. Similarly for R93-R95, further down in the Literature Cited. Similarly, e128 is linked to the allele variation page for dpy-10, but PLoS journals use e followed by a number to indicate page. Similarly, e36 is linked as an allele variation, but instead is a page number for Nucleic Acids Res. Possibly exclude the Literature Cited section from all linking, or at least linking clones and alleles? Would this require that GENETICS and/or Dartmouth Press put an identifier at the beginning of Literature Cited so that your pipeline can recognize it?

Tim: this information is already in the XML and I have modified the program to exclude Bib_References tagged items. So none of these objects are linked now.
I am now testing the links for the test set that Steve sent me. Will keep you updated with what I find.

Regards,
Arun.

Thanks,
Tim
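The filtering steps discussed in this e-mail (skipping lines with excluded tags, skipping table entries, ignoring names inside URLs, and keeping a manual exclusion list per object class) can be sketched roughly as follows. This is a minimal sketch in Python, not the Perl program that does the linking. The tag names and the M9 exclusion come from the e-mail; the function name, the way tags are assumed to appear in the XML, the token pattern, the sample lexicon, and the example.org URL are all hypothetical. The table-entry pattern is a more lenient variant written for this sketch, not the regex quoted above.

```python
import re

# Tag names as listed in the e-mail; how they appear in the XML is assumed.
EXCLUDED_TAGS = ("Affiliations", "Correspondence", "Footnote", "Article_Title",
                 "_Runhead", "Bib_Reference", "COMMENT")
# An opening tag whose name contains one of the excluded tag names.
EXCLUDED_TAG = re.compile(r"<\w*(?:%s)" % "|".join(re.escape(t) for t in EXCLUDED_TAGS))
# Table entries such as <entry rowsep="1" (lenient variant, an assumption).
TABLE_ENTRY = re.compile(r'<entry\b[^>]*\browsep="\d+"')
# Anything that looks like a URL is blanked out before matching.
URL = re.compile(r"https?://\S+")
# Candidate object names: letters, digits, underscore, dot, hyphen.
TOKEN = re.compile(r"[A-Za-z0-9_.\-]+")


def linkable_mentions(line, known, exclusions):
    """Return (object_class, name) pairs in `line` that should be linked."""
    if EXCLUDED_TAG.search(line) or TABLE_ENTRY.search(line):
        return []                      # whole line is exempt from linking
    text = URL.sub(" ", line)          # names inside URLs are ignored
    mentions = []
    for token in TOKEN.findall(text):
        token = token.strip(".-")      # drop sentence punctuation
        for object_class, names in known.items():
            if token in names and token not in exclusions.get(object_class, ()):
                mentions.append((object_class, token))
    return mentions


# Hypothetical sample lexicon; the real lists are the known-object lists.
known = {"Clone": {"M9", "DC1", "R45"}, "Variation": {"e128"}}
exclusions = {"Clone": {"M9"}}         # M9 is a buffer in most papers

print(linkable_mentions("Worms were washed in M9 buffer.", known, exclusions))
print(linkable_mentions("See http://example.org/DC1/ and clone DC1.", known, exclusions))
```

As the e-mail notes, an exclusion list is a blunt instrument: once M9 is excluded, a genuine mention of the clone M9 is never linked, and this sketch shares that limitation.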
e-mail 29 June 2009
Hello Tim, The objects you have mentioned below to be reasonable are the ones that the program is linking currently. (Alleles are included in Variations.) If you want to see the list of entities in "Sequence", please browse through the list at http://dev.textpresso.org/gsa/known_objects/ Other object lists are also in the same page. <comment on author markup, see below> Regards, -Arun.
Cell data
from e-mail 29 June 2009
Hi Arun et al,

As we are about to go live on the linking, I just want to be clear on what objects will be linked at this time. The following seem reasonable given the level of testing:

Gene, Allele, Protein, Rearrangement, Strain, Transgene, Variation, Clone

I have excluded "Cell" for the time being as it was giving some erroneous flags. I am not sure about the status of links with "Sequence" - is this the sequence of the mutant allele lesion? - if so it should be included. Also, for "Person", I am not sure I saw this. Any thoughts?

Tim
with follow up e-mail 29 June 2009
Hello Tim, The objects you have mentioned below to be reasonable are the ones that the program is linking currently. (Alleles are included in Variations.) If you want to see the list of entities in "Sequence", please browse through the list at http://dev.textpresso.org/gsa/known_objects/ Other object lists are also in the same page. <comment on author markup, see below> Regards, -Arun.
Author data capture
From e-mail of 8 May 2009
Hi,

Does it make sense to have something about new WBPersons on the First Pass form? This would seem to be simpler on the GENETICS side.

Tim

Please feel free to contact me if you need me to assign a new WBPerson_id. Just send me the data (names, institution and email) so I can check whether the person already exists, needs to be created, or needs a new aka added.

Cecilia

On Fri, May 8, 2009 at 11:31 AM, <azurebrd@its.caltech.edu> wrote:

3) What is the current thought on linking to people? Is the plan to just do it for the article title part? Currently, there are not links in the citations, but that might be a good idea. However, in the text there are some links, e.g. Isao Katsura in Gen93773FIN.html; I am not sure these links are needed. Note that not all authors are linked, for example Andrew Z. Fire is not linked, while the others are, in title for gen89433FIN.html. Similarly for Paul E Maines in gen96016FIN.html.

I haven't spoken to anyone else about associating body text to people, but we have someone at Caltech that works on associating a paper's authors to people (which is more complicated than it should be due to multiple people having the same or similar names after abbreviations, and last names of people changing). It would be ideal if people would register with Cecilia (email above) to get a WormBase Person ID to add to the XML (unless they've published in the field before, in which case they'd need to look it up on WormBase), but this may be more work than we can expect of them. Either way, using a script to associate names to people would be possible based on existing people names, but it might not be practical for making perfect connections (unless it's okay to link to multiple people and put a disclaimer that links to people could refer to the wrong person because the name refers to someone that's new to the field).

After Cecilia has verified an author-person connection for the author section, a script could match the body to the also-known-as data for that WormBase Person though.

Juancarlos
Follow up response from Juancarlos:
Yeah, that'd be great. We have a section that asks for Author information, specifically their e-mail addresses, but it'd be good if they would instead list the author names along with the WBPerson ID and email address. We can add a link to http://wormbase.org/ where they can select "Author / Person" to search for someone by name, and get back their WBPerson ID and email address (if we have it). (Or, if that's too cumbersome, we could talk to Todd or Norie about having a page specifically for that search, so they don't have to look at the front page / select "Author / Person".)

Cecilia, if you haven't seen the information we're requesting, you can look at the form here: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/journal/journal_first_pass.cgi?action=Curate&paper=00000001&passwd=1241223658.1418967

The Author field near the bottom would be something you could query to get contact info for people you need to associate.

thank you,
Juancarlos
Snippet from Arun from 29 June 2009
... I have disabled "Person" for now, since the string match is an exact match. This led to some persons being missed because there was an extra period after an initial (or a period or a middle initial was missing). So I decided not to link any name. Let me know if we can just link whoever we can (and run into the risk of not linking some names) OR we just keep it as it is. Regards, -Arun.
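One way to reduce the misses caused by exact string matching is to normalize names before comparing them. The sketch below is only an illustration of that idea, not the linking program's behavior: the function names are hypothetical, the second person ID is a made-up placeholder, and the choice to also compare a first-name/last-name form (ignoring middle initials) is an assumption that trades missed links for a higher risk of linking the wrong person.

```python
def normalize_name(name):
    """Lowercase, turn periods into spaces, and collapse whitespace."""
    return " ".join(name.replace(".", " ").lower().split())


def name_keys(name):
    """Keys used for comparison: the full normalized name, plus a
    first/last-only form when the name has middle parts."""
    parts = normalize_name(name).split()
    keys = {" ".join(parts)}
    if len(parts) > 2:
        keys.add(parts[0] + " " + parts[-1])
    return keys


def match_person(mention, people):
    """Return the sorted IDs of all people with a name (or aka) that shares
    a key with `mention`. More than one ID means the mention is ambiguous."""
    wanted = name_keys(mention)
    hits = set()
    for person_id, names in people.items():
        if any(wanted & name_keys(n) for n in names):
            hits.add(person_id)
    return sorted(hits)


# WBPerson625 is from the list above; the second ID is a placeholder.
people = {
    "WBPerson625": ["Paul Sternberg"],
    "WBPerson-hypothetical-1": ["Andrew Z. Fire"],
}

print(match_person("Andrew Z Fire", people))      # missing period after initial
print(match_person("Paul W. Sternberg", people))  # extra middle initial
```

Returning every candidate rather than the first hit keeps the ambiguity visible, so a caller could link only when exactly one person matches and leave the rest for manual checking.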