Objects to be marked up

This list includes all the objects that were requested to be marked up at the very beginning of the project (from 9 April 2009).
Follow-up actions have been indicated, and references to notes and e-mails are posted below.

Object is italicized
→ = linking to

Gene name. mpk-1 → Gene Summary page: <;class=Gene>

Anonymous gene (physical map location). F43C1.1 → Gene Summary page: <;class=Gene>

Protein name. EGL-4 → Gene Summary page: <;class=Gene>

Allele. ga117 → Variation Report page (molecular basis of mutation): <>

SNP. hw42941 → Variation Report page (molecular basis of SNP): <>

Strain name. MH37 → Strain Report page: <;class=Strain > (Will be limited to strains that are at the CGC stock center, and thus in WormBase?) Not currently marked up

cDNA clone. yk1106g06.3 → Sequence Summary page: <;class=Sequence> Now looking to clone class? (see below and e-mail of 1 July 2009 and 29 July)

cDNA ORFeome clone. OSTR153G5 → Sequence Summary page: <;class=Sequence> Now looking to clone class? (see below and e-mail of 1 July 2009 and 29 July)

Transgene. sEX14536 → Transgene Summary page: <;class=Transgene>

Balancer chromosomes. hT2 → Rearrangement: <;class=Rearrangement> Marked rearrangements are linked to their constituent balancer and linked objects. We are currently working on model changes in WB so that people can view the marked rearrangements from the balancer page; they will then be able to reach information about the marked balancer, although not directly.

Cell/tissue. HSN (neuron) → Summary of anatomy ontology term: <> Not currently marked up, see comments below

Cell (lineage pedigree). Z1.ppp → Summary of anatomy ontology term: < > Not currently marked up, see comments below

Senior author (Not all authors). Paul Sternberg → Person Report page: <;class=Person> Not currently marked up, although we are collecting e-mail addresses; see comments below

Sequence data

from 1 July 2009

Hello Tim, Steve,

Thanks for your emails.

Please see my comments below.


1)  As Steve mentioned, just before ABSTRACT is the link DC1, which
erroneously links to Clone Report for DC1.
DC1 looks like it is within a URL.  Is there a way that any object found
within a URL can be ignored?  (Also happened for gen104885FIN.)

I followed what Steve suggested in his email. I removed lines that have
the following tags from getting linked:


Also I found that entries within tables have tags like

<entry rowsep="1"

so I have used regular-expression pattern matching to keep lines
like these from getting linked. The Perl regex I used is

<entry .* rowsep=\"\d+\"

Steve: what other tagged items do you think I can exclude from linking?
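
The line-filtering step described in 1) can be sketched as follows. This is a minimal illustration in Python rather than the Perl actually used, and the sample lines and the function name `linkable_lines` are hypothetical. Note that the pattern as quoted above requires a second space (with optional text) between `<entry` and `rowsep`, so it would not match the bare `<entry rowsep="1"` example; the sketch assumes a more permissive variant that accepts the attribute anywhere inside the tag.

```python
import re

# Permissive variant of the quoted pattern (an assumption, not the original):
# any <entry ...> tag carrying a rowsep="<digits>" attribute, in any position.
ENTRY_TAG = re.compile(r'<entry\b[^>]*\browsep="\d+"')

def linkable_lines(lines):
    """Return only the lines that should be passed on to the linker."""
    return [line for line in lines if not ENTRY_TAG.search(line)]

# Hypothetical sample input.
sample = [
    '<entry rowsep="1">M9</entry>',
    '<entry colname="c1" rowsep="0">DC1</entry>',
    '<p>mpk-1 was tested.</p>',
]
kept = linkable_lines(sample)
```

Only the `<p>` line survives the filter, so table cells are never offered to the linker.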

2)  cdf-2 and CDF-2 are not linked - should be "silent" links that come
alive when the curators make the gene/protein pages.

cdf-2 and CDF-2 are not in wormbase. These are new objects for wormbase. I
thought we were going to ask the authors to provide us with a list of
objects from which such new objects could be captured. Whether these
objects are already in wormbase or not does not matter.

Are you going to ask the authors to fill out the form now or how do we
proceed with papers that have already been accepted?

If you want me to do regular expression matching for different objects I
can do it as well. My only concern is that every instance of the regular
expression should indeed be a true positive and all true positives are
captured by the regular expression (one-to-one and onto) otherwise we will
run into false positives and false negatives.

Tim: let me know how you want to handle this.
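
The concern about false positives and false negatives can be checked before a regular expression is put into use, by running it against a small hand-labelled list. The sketch below is only an illustration: the naming pattern (a 1-3 letter lab prefix, then "Ex" or "Is", then a number) is an assumption, and the labelled strings are taken from names mentioned elsewhere in this thread.

```python
import re

# Assumed naming convention (not stated in the thread): a 1-3 letter lab
# prefix, then "Ex" or "Is", then a number.
TRANSGENE = re.compile(r'[a-z]{1,3}(?:Ex|Is)\d+')

def evaluate(pattern, labelled):
    """labelled maps each candidate string to True if it really is the object.

    Returns (false_positives, false_negatives) as sorted lists.
    """
    false_pos = sorted(s for s, truth in labelled.items()
                       if pattern.fullmatch(s) and not truth)
    false_neg = sorted(s for s, truth in labelled.items()
                       if truth and not pattern.fullmatch(s))
    return false_pos, false_neg

# Hand-labelled examples; True means "is an array or integrated transgene".
labelled = {
    "amEx1032": True, "amEx1033": True, "amEx1201": True,
    "amIs2": True, "amIs4": True, "amIs5": True,
    "M9": False, "III": False, "pDG222": False, "R45": False,
}
false_pos, false_neg = evaluate(TRANSGENE, labelled)
```

An empty result on a small list like this does not prove that the pattern is safe; an expert would still need to extend the labelled list with known edge cases.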

3)  M9, which refers to a buffer, is inappropriately linked to a clone report (;class=Clone)

If this happens only in very few cases, then I can have a manual exclusion
list. The list needs to be prepared by an expert (like Tim) for each
object of interest. If it is difficult to come up with this list a priori,
then someone needs to go through the linked articles and inform me.

For now I have put M9 in the Clone exclusion list, so no occurrence of M9
will ever get linked to the Clone page. So if M9 is indeed a Clone in some
context, then it won't get linked.
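
A per-class manual exclusion list of this kind might look like the sketch below. The object lists, the tokenizer, and the function name `find_links` are hypothetical stand-ins; the real lists come from WormBase and the real program may tokenize differently.

```python
import re

# Hypothetical object lists keyed by class; the real lists come from WormBase.
OBJECTS = {
    "Clone": {"M9", "DC1"},
    "Gene": {"mpk-1"},
}
# Manual exclusion list per class, prepared by an expert.
EXCLUDE = {"Clone": {"M9"}}

def find_links(text):
    """Return (token, class) pairs for exact matches that are not excluded."""
    links = []
    for token in re.findall(r"[A-Za-z0-9][A-Za-z0-9.\-]*", text):
        token = token.rstrip(".-")  # drop sentence punctuation stuck to the token
        for cls, names in OBJECTS.items():
            if token in names and token not in EXCLUDE.get(cls, set()):
                links.append((token, cls))
    return links

links = find_links("Worms were washed in M9 buffer; mpk-1 and DC1 were tested.")
```

As noted above, the exclusion is unconditional: a genuine mention of clone M9 would also be skipped.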

(More generally, this is a difficult problem to address with the current
string matching approach the program is using. We need to implement a
word-sense disambiguation NLP approach to decide whether the occurrence is
a true positive or not. Also we need a list of all such objects that need
to be disambiguated. As the number of objects increases, the
disambiguation problem could become intractable.)

4) III in "the SuperScript III First-Strand" is inappropriately
linked as
(;class=Sequence). III
is also a problem in gen104885FIN, E).

I spoke to Karen Yook here and she says whatever GENETICS needs is in
wormbase's Clone class and there is no need to link what is in wormbase
Sequence list. So I am removing what is in wormbase Sequence from getting
linked. Hence this issue is solved.

5)  In the Analysis of CDF-2::GFP section of the Materials and
Methods, the following are not linked: plasmids pDG222 and pDP15, and
extrachromosomal arrays amEx1032, amEx1033 and amEx1201 and
integrated transgenes amIs2, amIs4 and amIs5.

What are the plans for the extrachromosomal arrays and integrated
transgenes? Silent links?
The plasmid names are more problematic as the DG and DP do not
represent unique lab identifiers but non-unique initials of lab

These are also new objects and not in wormbase. Same as in (2) above.

6) WormBase problem with n2527
<> which gave an
internal error or misconfiguration message.  Similarly for n1046 -- is
this related to Steve's point about improper characters?

No, I think the wormbase link is just down.

Steve: I do not see the carriage returns on my linux machine. Are you
using a windows machine? Can you convert the files to windows format and



A) WormBase problem with or198
which gave an internal error or misconfiguration message.  Similarly for
or191, or195 and or213.

Looks like wormbase was down when you checked. They seem to be moving the
server from CSHL to Toronto, so this could be causing the errors.

B) In Materials and Methods, Molecular biology: pSO26 and pAA64 are not
linked.  Same issues as above in 5)

New objects.

C) In Materials and Methods, Molecular biology: WRM0633dC is not
linked, while other fosmids in the same sentence are linked. 

New object.

D) M9 buffer linking problem as in 3) above.

See above.

E) In the Discussion "III", in chromosome III, is linked as above in 4).


F) In the Literature Cited section, R45 is linked to a clone report (;class=Clone).  However,
Current Biology uses the R followed by number to designate page
number, so this may be a problem that will reoccur often.  Similarly for
R93-R95, further down in the Literature Cited.
Similarly, e128 is linked to the allele variation page for dyp-10, but
PLoS journals use e followed by a number to indicate page.
Similarly, e36 is linked as an allele variation, but instead is a page
number for Nucleic Acids Res.

Possibly exclude the Literature Cited section from all linking, or at
least linking clones and alleles?  Would this require that GENETICS and/or
Dartmouth Press put an identifier at the beginning of
Literature Cited so that your pipeline can recognize it?

Tim: this information is already in the XML and I have modified the
program to exclude Bib_References tagged items. So none of these objects
are linked now.
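
Excluding a tagged section from linking can be sketched as a tree walk that skips the unwanted subtree. Only the tag name Bib_References comes from the message above; the surrounding element names, the sample document, and the function name `linkable_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tags whose whole subtree is ignored by the linker.
SKIP_TAGS = {"Bib_References"}

def linkable_text(elem):
    """Yield text chunks under elem, skipping any subtree tagged in SKIP_TAGS."""
    if elem.tag in SKIP_TAGS:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from linkable_text(child)
        if child.tail:
            yield child.tail  # text after a child still belongs to the parent

# Hypothetical article structure and content.
doc = ET.fromstring(
    "<Article><Body>The allele e128 was used.</Body>"
    "<Bib_References>Some Journal 4: e128. Another Journal 12: R45.</Bib_References>"
    "</Article>"
)
chunks = [t for t in linkable_text(doc) if t.strip()]
```

Only the body text is returned, so page numbers such as e128 or R45 in the reference list are never seen by the matcher.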

I am now testing the links for the test set that Steve sent me. Will keep
you updated with what I find.




e-mail 29 June 2009

Hello Tim,

The objects you have mentioned below to be reasonable are the ones that
the program is linking currently. (Alleles are included in Variations.)

If you want to see the list of entities in "Sequence", please browse
through the list at

Other object lists are also in the same page.

Cell data

from e-mail 29 June 2009

Hi Arun et al,

As we are about to go live on the linking, I just want to be clear on what objects will be linked at this time.

The following seem reasonable given the level of testing.


I have excluded "Cell" for the time being as it was giving some erroneous flags. I am not sure about the status of links with "Sequence": is this the sequence of the mutant allele lesion? If so, it should be included. Also, for "Person", I am not sure I saw this.

Any thoughts?


with follow up e-mail 29 June 2009

Hello Tim,

The objects you have mentioned below to be reasonable are the ones that
the program is linking currently. (Alleles are included in Variations.)

If you want to see the list of entities in "Sequence", please browse
through the list at

Other object lists are also in the same page.

Author data capture

From e-mail of 8 May 2009


Does it make sense to have something about new WBPersons on the First Pass form?  This would seem to be simpler on the GENETICS side.


Please feel free to contact me if you need me to assign a new WBPerson_id;
just send me the data (names, institution and email) so I can check whether it
already exists, needs to be created, or needs a new aka added.


On Fri, May 8, 2009 at 11:31 AM,  <> wrote:
What is the current thought on linking to people?  Is the plan to
just do it for the article title part? Currently, there are not links
in the citations, but that might be a good idea.  However, in the
text there are some links, e.g. Isao Katsura in Gen93773FIN.html; I
am not sure these links are needed.
Note that not all authors are linked, for example Andrew Z. Fire is
not linked, while the others are, in title for gen89433FIN.html.
Similarly for Paul E Maines in gen96016FIN.html.

I haven't spoken to anyone else about associating body text to people,
but we have someone at Caltech that works on associating a paper's
authors to people (which is more complicated than it should be due to
multiple people having the same or similar names after abbreviations,
and last names of people changing).  It would be ideal if people would
register with Cecilia (email above) to get a WormBase Person ID to add
to the XML (unless they've published in the field before, in which
case they'd need to look it up on WormBase), but this may be more work
than we can expect of them.

Either way, using a script to associate names to people would be
possible based on existing people names, but it might not be practical
for making perfect connections (unless it's okay to link to multiple
people and put a disclaimer that links to people could refer to the
wrong person because the name refers to someone that's new to the
field).  After Cecilia has verified an author-person connection for
the author section, a script could match the body to the also-known-as
data for that WormBase Person though.


Follow up response from Juancarlos:

Yeah, that'd be great.

We have a section that asks for Author information, specifically their
e-mail addresses, but it'd be good if they would instead list the
author names along with the WBPerson ID and email address.  We can add a
link to where they can select "Author / Person"
to search for someone by name, and get back their WBPerson ID and
email address (if we have it).  (Or, if that's too cumbersome, we could
talk to Todd or Norie about having a page specifically for that
search, so they don't have to look at the front page / select
"Author / Person".)

Cecilia, if you haven't seen the information we're requesting, you can
look at the form here :

And the Author field near the bottom would be something you could
query to get contact info for people you need to associate.

thank you,

Snippet from Arun, 29 June 2009

I have disabled "Person" for now, since the string match is an exact
match. This led to some persons being missed because there was an extra
period after an initial (or a period or a middle initial was missing). So
I decided not to link any name. Let me know if we can just link whoever we
can (and run the risk of not linking some names) OR we just keep it
as it is.
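
One possible way to make the name match tolerant of periods and middle initials is to compare normalized keys instead of raw strings. This is a sketch under stated assumptions, not what the program currently does: the person IDs are made-up placeholders, and the helper names `name_key`, `loose_key` and `match_person` are hypothetical.

```python
def name_key(name):
    """Lower-case, drop periods, and collapse whitespace."""
    return " ".join(name.replace(".", " ").lower().split())

def loose_key(name):
    """Like name_key, but also drop single-letter middle initials."""
    parts = name_key(name).split()
    if len(parts) > 2:
        middle = [p for p in parts[1:-1] if len(p) > 1]
        parts = [parts[0]] + middle + [parts[-1]]
    return " ".join(parts)

def match_person(name, known):
    """Return every known ID whose loose key equals that of name."""
    target = loose_key(name)
    return sorted(pid for pid, full in known.items() if loose_key(full) == target)

# Placeholder IDs, not real WBPerson identifiers.
KNOWN = {"PERSON_A": "Andrew Z. Fire", "PERSON_B": "Paul Sternberg"}
matches = match_person("Andrew Z Fire", KNOWN)
```

The function returns a list because loosening the match increases the chance that two different people share a key; in that case a curator would still have to choose.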