Attaching Genes to Papers
Gene Associations Based Upon Abstracts
When papers are added to postgres using the Enter New Papers function of the Paper Editor, the corresponding abstracts are scanned, via a script, for matches to loci, sequence names, and synonyms.
Postgres tables used for this are gin_locus, gin_seqname, and gin_synonyms.
To view the contents of these tables, perform the following type of query using the referenceform.cgi:
SELECT * FROM gin_locus;
The script is called by paper_editor.cgi when new PMID papers are entered from XML; it calls the subroutine &processXmlIds in /home/postgres/work/pgpopulation/pap_papers/new_papers/pap_match.pm. For a new entry, it looks at the abstract field: it splits the abstract into words on spaces, removes , ( ) ; from each word, and then tries to match the word against the list from the three tables above. If a word matches, it adds the entry to the evidence with 'Inferred_automatically\t\"Abstract read $word\"'. -- J
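As a rough illustration of the matching J describes, here is a minimal Python sketch. It is not the Perl code in pap_match.pm; the function name `match_abstract_words` and the `gene_names` set (standing in for the names loaded from the three tables) are assumptions made for this example.

```python
def match_abstract_words(abstract, gene_names):
    """Return (word, evidence) pairs for abstract words found in gene_names.

    gene_names is a hypothetical in-memory set standing in for the names
    from gin_locus, gin_seqname, and gin_synonyms.
    """
    matches = []
    for raw_word in abstract.split(" "):      # split the abstract on spaces
        word = raw_word
        for ch in ",();":                     # remove , ( ) ; from the word
            word = word.replace(ch, "")
        if word in gene_names:
            evidence = 'Inferred_automatically\t"Abstract read %s"' % word
            matches.append((word, evidence))
    return matches
```

Under this sketch, a word such as daf-2(e1368) is reduced to daf-2e1368 and a sentence-final BLI-4). is reduced to BLI-4. with its period, so neither matches.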
Updating the Script
The script that associates genes based upon abstracts does miss some genes because of the way they're expressed in the abstract.
An idea on what to change:
1) With the exception of a dash (-), split text from punctuation and then look for matches to the approved gene list.
This wouldn't quite work, since seqnames often have periods in them (and some loci / synonyms ?). Further, running the query below for characters that are not a letter, digit, or period, we get some underscores. Are these correct ? Yes, I checked these gene names in WB and they are correct. If so, we can't split on everything that is not a dash or a space. We'd at least have to exempt periods and underscores as well, but is there more to the rules, or do we not care so much about these exceptions ? I think, to be on the safe side, we should not split on periods or underscores, but see below for an additional idea about periods.
SELECT * FROM gin_seqname WHERE gin_seqname ~ '[^A-Z0-9.a-z]';
joinkey | gin_seqname | gin_timestamp
00201656 | CE7X_3.3 | 2010-12-08 05:24:28.383862-08
00008349 | CE7X_3.1 | 2010-12-08 03:10:46.802746-08
00008350 | CE7X_3.2 | 2010-12-08 03:10:46.848005-08
00017153 | E_BE45912.2 | 2010-12-08 03:18:55.463388-08
00043233 | E_BE45912.1 | 2010-12-08 03:44:16.195364-08
Here are some loci that have some allele format (?) Yes, these must be gene names from other species that have a phenotype but are not cloned, and therefore the name isn't mapped to a corresponding elegans gene.
SELECT * FROM gin_locus WHERE gin_locus ~ '[^A-Z0-9.a-z\-]';
00000791 | Cre-unc(bd201) | 2010-12-08 03:01:55.622337-08
00000313 | Cbr-dpy(s1281) | 2010-12-08 03:01:15.792311-08
00000314 | Cbr-dpy(bd101) | 2010-12-08 03:01:15.852879-08
00000346 | Cbr-him(bd102) | 2010-12-08 03:01:16.602787-08
00000347 | Cbr-him(bd103) | 2010-12-08 03:01:16.682897-08
00000348 | Cbr-him(s1290) | 2010-12-08 03:01:16.762942-08
00000349 | Cbr-him(bd104) | 2010-12-08 03:01:16.84322-08
00000350 | Cbr-unc(s1270) | 2010-12-08 03:01:16.903372-08
00000351 | Cbr-unc(s1275) | 2010-12-08 03:01:16.982889-08
If you then look at the gin_synonyms list you'll see a ton of entries with underscore, and the following query has 67 results : SELECT * FROM gin_synonyms WHERE gin_synonyms ~ '[^A-Z0-9.a-z_\-]';
Okay, got it.
So we can change the code to work like you suggest, but then we'd certainly not catch any of these (definitely not the huge amount with periods in them). Let me know. I didn't think I'd made a suggestion, but I like most of what you said below -- J I think you initially wrote that statement to me :-). --K lol, ok, more signing of stuff =) -- J
For example, fem-1(hc17ts) would become fem-1 ( hc17ts )
DAF-18/PTEN would become DAF-18 / PTEN
Papers where gene associations have been missed for testing the script:
WBPaper00035164 - missed BLI-4 - because in the abstract it is written as ....(such as BLI-4).
WBPaper00035449 - missed gas-1, isp-1, daf-2, sod-2 - because the gene names are followed by alleles, e.g. daf-2(e1368)
WBPaper00035474 - missed fem-1, fem-3 - because the gene names are followed by alleles and, in one case, preceded by a left parenthesis
WBPaper00035490 - missed daf-16 - DAF-16/FOXO or daf-16-dependent or daf-16-independent
WBPaper00035559 - missed NMY-2, GPR-1/2, LIN-5 - (NMY-2) GPR-1/2 LIN-5. suggestions below would then capture GPR-1 but not give any indication that there's a GPR-2, is that what 1/2 is supposed to mean ? Yes, from the suggestion below, we would only be able to get GPR-1 and not GPR-2. We could try to come up with a rule for getting GPR-2 in cases like this, but I worry that it's then getting too complicated. GPR-1/2 is one of the most common cases of people writing gene names this way. Could we make a rule that said if the paper is associated with GPR-1, then also automatically associate it with GPR-2? I agree that it's too complicated to try to deal with it in general terms. We could try to write an exception for GPR-1/2, but is this likely to show up over and over, or more a case of different genes being referred to as gin-1/2/3/n once in infrequent abstracts ? -- J Yeah, I actually did a small-scale Textpresso search yesterday to try to get a sense of this. GPR-1/2 is fairly common, but others that I have seen in papers, like PLK-1/2 and MEX-5/6, are not as common. I think making an exception for GPR-1/2 would be worth it, but I don't think we need to come up with a general rule for this. At least not now. --K
WBPaper00037741 - missed DAF-18, PHA-4, HSF-1, SIR-2.1, AAK-2 - DAF-18/PTEN AMPK/AAK-2. Step 5 would fail to match AAK-2. but hopefully match AAK-2 ? That's right. Same with LIN-5. in the example directly above. Oh dear, I don't know what I meant by matching AAK-2 but hopefully matching AAK-2 maybe that by step 4 it would fail, but step 5 would catch it ? -- J Yes, that would seem right. --K.
WBPaper00037686 - missed bus-2, bus-4, bus-12, srf-3, bus-8, bus-17 - (bus-2, bus-4 and bus-12, together with the previously cloned srf-3, bus-8 and bus-17)
WBPaper00037794 - missed MIG-14 - MIG-14/Wls
From looking at the above query results and the missing gene/protein associations again, here is a revised idea. I think the benefits of capturing more C. elegans genes by splitting on parentheses and forward slashes outweigh the cost of possibly missing the gene names with parentheses or synonyms with a forward slash. The issue of splitting on periods is important, though, because if we did that, we would miss sequence names. This isn't a huge problem for published paper abstracts, but would be a bigger problem for meeting abstracts. Step number 5 below is a possible workaround for this, but I don't know how costly or slow that would be. The idea here is to try to pick up gene or protein names that only occur once in an abstract at the end of a sentence (see LIN-5 example above from WBPaper00035559).
1) Split on space (same as before) good OK
2) Split before and after left and right parentheses ( ) good OK
3) Split before and after forward slash / good OK
4) Remove comma and semi-colon , ; just strip them out as if they'd never been there ? ok Yes
5) If not too slow, then split before and after periods to see if there are any additional matches, not found via the first round to pick up additional NEW gene names. I'm not entirely certain about this step. I'm not sure what you mean by NEW gene names, if you mean new as in not already found mentioned earlier in the paragraph, then ok, but if you mean NEW as in not in the dictionary from gin_locus/seqname/synonyms then it's not going to match because it's not matching -- J By NEW, I mean those not already found by the first iteration of the script. Yes, they would have to be in the gene dictionary to match. --K OK -- J
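The five steps above can be sketched in Python as follows. This is only an illustration of the agreed rules, not the script that was actually written; the names `tokenize`, `find_genes`, and `gene_names` (a set standing in for the gene dictionary from gin_locus/seqname/synonyms) are assumptions. Step 5 is implemented here as a second pass over tokens that did not match in the first pass, which is one reading of "not found via the first round".

```python
import re


def tokenize(abstract):
    """Steps 1-4: split on whitespace, drop , and ;, split around ( ) and /."""
    tokens = []
    for chunk in abstract.split():                 # step 1: split on space
        chunk = re.sub(r"[,;]", "", chunk)         # step 4: strip , and ;
        # steps 2 and 3: split before and after ( ) and /
        tokens.extend(part for part in re.split(r"[()/]", chunk) if part)
    return tokens


def find_genes(abstract, gene_names):
    """Return matched names in order of first appearance."""
    tokens = tokenize(abstract)
    found = []
    # First pass: exact matches, so names with periods (SIR-2.1) survive.
    for token in tokens:
        if token in gene_names and token not in found:
            found.append(token)
    # Step 5: second pass, splitting unmatched tokens on periods to pick up
    # names that only occur at the end of a sentence (LIN-5.).
    for token in tokens:
        if token in gene_names or "." not in token:
            continue
        for part in token.split("."):
            if part in gene_names and part not in found:
                found.append(part)
    return found
```

With this sketch, fem-1(hc17ts) yields fem-1, DAF-18/PTEN yields DAF-18, and GPR-1/2 yields only GPR-1. A sequence-style name that ends a sentence, such as SIR-2.1. with a trailing period, would still be missed, because the second pass splits it into SIR-2 and 1.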
This will update the script we run on incoming papers, but we will need to decide what to do about previously entered papers. About a year ago, Juancarlos wrote a script to retroactively associate proteins with papers; perhaps we could similarly modify that script to run on previous papers? Sure, we could, we'd run it to see what new matches would come up, you'd look at them and then we could read them into postgres -- J Okay, that sounds good.
I don't remember what script that was, so I wrote something from scratch to match according to the 5 parameters above on the papers sampled above. It's at :
The output is in the same directory, file "sample", and adding an exception for GPR-1/2 in sample2. -- J
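How the GPR-1/2 exception was coded is not recorded here. One simple possibility, shown as a hedged Python sketch, is a small lookup of written forms that should expand to several genes. The names `SLASH_EXCEPTIONS` and `expand_slash_exceptions` are hypothetical, and only GPR-1/2 is listed because that is the only exception agreed on above.

```python
# Hypothetical table of written forms that stand for more than one gene.
SLASH_EXCEPTIONS = {"GPR-1/2": ("GPR-1", "GPR-2")}


def expand_slash_exceptions(abstract, exceptions=SLASH_EXCEPTIONS):
    """Return the extra genes implied by any listed form found in the abstract."""
    extra = []
    for written_form, genes in exceptions.items():
        if written_form in abstract:          # plain, case-sensitive substring test
            for gene in genes:
                if gene not in extra:
                    extra.append(gene)
    return extra
```

Forms that are not listed, such as PLK-1/2 or MEX-5/6, are left alone, which matches the decision not to add a general rule for now.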
Okay, I will take a look at the outputs. If all looks okay, let's try a small pilot on ~100 papers, WBPaper00036200 - WBPaper00036300. I know there are some culprits within that set, and even though it's kind of a lot to look through, I want to make sure we aren't going to get a lot of false positives. If those results look okay, then I would be fine with incorporating the new script into the incoming paper pipeline and then also running the script on all of the older papers to make new associations. --K
I checked the script outputs in sample and sample2 - both look perfect. Let's next try a small pilot by running the script (and adding the exception for GPR-1/2) on papers WBPaper00036200 through WBPaper00036300. I'll check the output again, and if all looks good, we can run the script on the entire corpus and populate postgres with the new paper-gene connections. Thanks, --Kimberly
Done in sample3, not so many results compared to sample2 -- J
Exceptions and Problems Still not Fixed
*One problem still not fixed is cases when a series of genes is listed in the abstract with the format, e.g. elpc-1-4, which is meant to indicate elpc-1, elpc-2, elpc-3, and elpc-4. I'm not sure if it'd always be safe to make n number of associations based upon the numbers after the gene name. --K.
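If we ever want to look at how often such ranges occur, a sketch like the following could list candidate expansions for a curator to review rather than making associations automatically. It is an assumption-laden illustration: the function name `expand_gene_range`, the `max_span` limit, and the pattern (letters, a dash, a start number, a dash, an end number) are all made up for this example.

```python
import re

# Letters, dash, start number, dash, end number, e.g. elpc-1-4.
RANGE_PATTERN = re.compile(r"^([A-Za-z]+)-(\d+)-(\d+)$")


def expand_gene_range(token, max_span=10):
    """Return candidate gene names for a token like elpc-1-4, else []."""
    match = RANGE_PATTERN.match(token)
    if not match:
        return []
    prefix, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    # Only ascending, reasonably small ranges are treated as candidates.
    if start >= end or end - start > max_span:
        return []
    return ["%s-%d" % (prefix, n) for n in range(start, end + 1)]
```

The candidates would still need to be checked against the gene dictionary before anything is associated.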
*One exception I found while looking at the script output was that whenever the gene slo-1 is mentioned in an abstract, the gene sle-1 is also being associated based upon slo-1 being an Other_name for sle-1. To fix this, we'd either have to make an exception for this case or not have the script look at Other_name when associating genes with papers based upon abstracts. I'm not sure how often looking at Other_name yields a correct association, so can't make a decision about this right now. --K.
We need to consider/evaluate whether matching on Other_name is really a good idea. I just deleted 31 paper gene-associations based upon matches to globin and ORF1 as Other_names. We'd need to run a test to see what we'd miss by not including Other_names and then decide if we should stop doing that. --K 06/16/2011
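One way to run such a test is to list, per paper, the genes that would be associated only through an Other_name match. The Python sketch below assumes hypothetical in-memory inputs: `tokens_by_paper` (abstract tokens per paper), `primary_names` (names that identify a gene directly), and `synonym_to_gene` (each Other_name mapped to a single gene, which is a simplification).

```python
def synonym_only_matches(tokens_by_paper, primary_names, synonym_to_gene):
    """Map each paper to the genes it would get only via an Other_name match."""
    report = {}
    for paper, tokens in tokens_by_paper.items():
        # Genes matched directly by their own name.
        direct = {t for t in tokens if t in primary_names}
        # Genes reached because a token is listed as their Other_name.
        via_synonym = {synonym_to_gene[t] for t in tokens if t in synonym_to_gene}
        only_synonym = sorted(via_synonym - direct)
        if only_synonym:
            report[paper] = only_synonym
    return report
```

For the slo-1 case above, a paper mentioning only slo-1 would show sle-1 in the report, which is the kind of association we would lose (or avoid) by dropping Other_name matching.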
Gene Associations Based on Curated Data
Data types for which curation is stored in postgres:
I think Molecule and Phenotype are also in the OA, but I'm not sure what you consider gene associations (purposely not mentioning Paper and Person data). -- J
I'm starting to brainstorm about how we could fill in the holes for paper-gene association by using information about what genes have curated data from each paper. For example, if we missed a paper-gene association because the gene wasn't mentioned in the abstract, but the gene was curated for interaction data, could we then make the association based upon the fact that there is interaction curation for that gene using the paper as evidence? This whole idea still needs some fleshing out. --K
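As a starting point for that brainstorm, the comparison could be as simple as a set difference between curated paper-gene pairs and existing associations. The Python sketch below assumes hypothetical inputs: `curated` as (paper, gene, datatype) tuples pulled from the curation tables, and `existing` as (paper, gene) pairs already in postgres. The identifiers in any test data are placeholders, not real records.

```python
def missing_associations(curated, existing):
    """Map (paper, gene) pairs lacking an association to their curated datatypes."""
    existing_pairs = set(existing)
    missing = {}
    for paper, gene, datatype in curated:
        if (paper, gene) in existing_pairs:
            continue                      # association already present
        missing.setdefault((paper, gene), set()).add(datatype)
    # Sort datatypes so the report is stable between runs.
    return {pair: sorted(types) for pair, types in missing.items()}
```

The output keeps the datatype so that the curated data (for example, interaction) could be recorded as the evidence for the new association.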