Attaching Genes to Papers

From WormBaseWiki

Gene Associations Based Upon Abstracts

When papers are added to postgres using the Enter New Papers function of the Paper Editor, the corresponding abstracts are scanned, via a script, for matches to loci, sequence names, and synonyms.

Postgres tables used for this are:

gin_locus

gin_seqname

gin_synonyms

To view the contents of these tables, perform the following type of query using the referenceform.cgi:

SELECT * FROM gin_locus;


The script is called by paper_editor.cgi when entering new PMID papers from XML; it calls the &processXmlIds subroutine in /home/postgres/work/pgpopulation/pap_papers/new_papers/pap_match.pm. If the paper is a new entry, the subroutine looks at the abstract field and splits the abstract into words on spaces. For each word it removes , ( ) ; and then tries to match the word against the list from the three tables above. If the word matches, it adds the entry with 'Inferred_automatically\t\"Abstract read $word\"' to the evidence. -- J
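The behaviour described above can be sketched in a few lines. This is an illustrative Python sketch only, not the Perl code in pap_match.pm; the function names and the KNOWN set (a stand-in for the contents of gin_locus, gin_seqname and gin_synonyms) are assumptions made for the example.

```python
# Illustrative sketch only; the real matcher is the Perl subroutine
# &processXmlIds in pap_match.pm. All names below are hypothetical.

STRIP_CHARS = ",();"  # the characters the description says are removed


def current_words(abstract):
    """Split on spaces, then delete , ( ) ; from each word."""
    words = []
    for word in abstract.split(" "):
        cleaned = "".join(ch for ch in word if ch not in STRIP_CHARS)
        if cleaned:  # skip empty strings left by repeated spaces
            words.append(cleaned)
    return words


def current_matches(abstract, known_names):
    """Return the words that exactly match a known gene name."""
    return [w for w in current_words(abstract) if w in known_names]


# Hypothetical stand-in for the three gin_ tables.
KNOWN = {"daf-2", "BLI-4", "DAF-18", "unc-22"}

print(current_words("daf-2(e1368) and DAF-18/PTEN (such as BLI-4)."))
# ['daf-2e1368', 'and', 'DAF-18/PTEN', 'such', 'as', 'BLI-4.']
```

Under these assumptions, deleting the parentheses fuses a gene name with its allele, and slashes and trailing periods stay attached to the word, so none of the three names in the sample sentence matches.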

Updating the script:

The script that associates genes based upon abstracts does miss some genes because of the way they're expressed in the abstract.

An idea on what to change:

1) With the exception of a dash (-), split text from punctuation and then look for matches to the approved gene list.

This wouldn't quite work, since seqnames often have periods in them (and some loci / synonyms?). Further, querying for any character that is not a letter, digit, or period turns up some underscores. Are these correct? Yes, I checked these gene names in WB and they are correct. If so, we can't split on everything other than dash and space. We'd at least have to exempt period and underscore too, but is there more to the rules, or do we not care so much about these exceptions? I think, to be on the safe side, we should not split on periods or underscores, but see below for an additional idea about periods. The query and its results:

SELECT * FROM gin_seqname WHERE gin_seqname ~ '[^A-Z0-9.a-z]';

 joinkey  | gin_seqname |         gin_timestamp
----------+-------------+-------------------------------
 00201656 | CE7X_3.3    | 2010-12-08 05:24:28.383862-08
 00008349 | CE7X_3.1    | 2010-12-08 03:10:46.802746-08
 00008350 | CE7X_3.2    | 2010-12-08 03:10:46.848005-08
 00017153 | E_BE45912.2 | 2010-12-08 03:18:55.463388-08
 00043233 | E_BE45912.1 | 2010-12-08 03:44:16.195364-08
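To make the concern concrete, here is a small Python sketch (an illustration, not part of the actual script). It mirrors the character class used in the SQL query and shows what happens to two of the sequence names listed above if every character other than a letter, digit, or dash is treated as a separator. The function names are assumptions.

```python
import re


def has_unexpected_chars(name):
    """Python counterpart of: gin_seqname ~ '[^A-Z0-9.a-z]'."""
    return re.search(r"[^A-Z0-9.a-z]", name) is not None


def split_on_non_dash_punctuation(text):
    """Treat anything that is not a letter, digit, or dash as a separator."""
    return [piece for piece in re.split(r"[^A-Za-z0-9-]+", text) if piece]


for seqname in ["CE7X_3.3", "E_BE45912.2"]:
    print(seqname, has_unexpected_chars(seqname),
          split_on_non_dash_punctuation(seqname))
# CE7X_3.3 True ['CE7X', '3', '3']
# E_BE45912.2 True ['E', 'BE45912', '2']
```

Once the name is broken into pieces like these, no piece matches the full entry in gin_seqname, which is why periods and underscores would have to be exempted.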

Here are some loci that have an allele-like format (?) Yes, these must be gene names from other species that have a phenotype but are not cloned, and therefore the name isn't mapped to a corresponding C. elegans gene.

SELECT * FROM gin_locus WHERE gin_locus ~ '[^A-Z0-9.a-z\-]';

00000791 | Cre-unc(bd201) | 2010-12-08 03:01:55.622337-08
00000313 | Cbr-dpy(s1281) | 2010-12-08 03:01:15.792311-08
00000314 | Cbr-dpy(bd101) | 2010-12-08 03:01:15.852879-08
00000346 | Cbr-him(bd102) | 2010-12-08 03:01:16.602787-08
00000347 | Cbr-him(bd103) | 2010-12-08 03:01:16.682897-08
00000348 | Cbr-him(s1290) | 2010-12-08 03:01:16.762942-08
00000349 | Cbr-him(bd104) | 2010-12-08 03:01:16.84322-08
00000350 | Cbr-unc(s1270) | 2010-12-08 03:01:16.903372-08
00000351 | Cbr-unc(s1275) | 2010-12-08 03:01:16.982889-08

If you then look at the gin_synonyms list you'll see a ton of entries with underscore, and the following query has 67 results : SELECT * FROM gin_synonyms WHERE gin_synonyms ~ '[^A-Z0-9.a-z_\-]';

Okay, got it.

So we can change the code to work like you suggest, but then we'd certainly not catch any of these (definitely not the huge amount with periods in them). Let me know.

For example, fem-1(hc17ts) would become fem-1 ( hc17ts )

DAF-18/PTEN would become DAF-18 / PTEN


Papers where gene associations have been missed for testing the script:

WBPaper00035164 - missed BLI-4 - because in the abstract it is written as ....(such as BLI-4).

WBPaper00035449 - missed gas-1, isp-1, daf-2, sod-2 - because the gene names are followed by alleles, e.g. daf-2(e1368)

WBPaper00035474 - missed fem-1, fem-3 - because the gene names are followed by alleles and, in one case, preceded by a left parenthesis

WBPaper00035490 - missed daf-16 - DAF-16/FOXO or daf-16-dependent or daf-16-independent

WBPaper00035559 - missed NMY-2, GPR-1/2, LIN-5 - (NMY-2) GPR-1/2 LIN-5.

WBPaper00037741 - missed DAF-18, PHA-4, HSF-1, SIR-2.1, AAK-2 - DAF-18/PTEN AMPK/AAK-2.

WBPaper00037686 - missed bus-2, bus-4, bus-12, srf-3, bus-8, bus-17 - (bus-2, bus-4 and bus-12, together with the previously cloned srf-3, bus-8 and bus-17)

WBPaper00037794 - missed MIG-14 - MIG-14/Wls


From looking at the above query results and the missing gene/protein associations again, here is a revised idea. I think the benefit of capturing more C. elegans genes by splitting on parentheses and forward slashes outweighs the cost of possibly missing the gene names with parentheses or synonyms with a forward slash. The issue of splitting on periods is important, though, because if we did that, we would miss sequence names. This isn't a huge problem for published paper abstracts, but would be a bigger problem for meeting abstracts. Step number 5 below is a possible workaround for this, but I don't know how costly or slow that would be. The idea here is to try to pick up gene or protein names that only occur once in an abstract, at the end of a sentence (see the LIN-5 example above from WBPaper00035559).

Possible revision:

1) Split on space (same as before)

2) Split before and after left and right parentheses ( )

3) Split before and after forward slash /

4) Remove comma and semi-colon , ;

5) If not too slow, then split before and after periods to see if there are any additional matches, not found via the first round and thus pick up NEW gene names. I'm not entirely certain about this step.
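The five steps above could be sketched as follows. This is a hedged Python illustration, not the Perl script itself; the function names and the KNOWN set (standing in for the three gin_ tables) are hypothetical, and the sketch splits on any whitespace rather than on single spaces only.

```python
import re


def first_round_tokens(abstract):
    """Steps 1-4: split on whitespace, isolate ( ) and /, drop , and ;."""
    tokens = []
    for word in abstract.split():                      # step 1
        for piece in re.split(r"([()/])", word):       # steps 2 and 3
            piece = piece.replace(",", "").replace(";", "")  # step 4
            if piece:
                tokens.append(piece)
    return tokens


def match_names(abstract, known_names):
    """Return matched names in order of first appearance, without repeats."""
    tokens = first_round_tokens(abstract)
    found = []
    for token in tokens:
        if token in known_names and token not in found:
            found.append(token)
    # Step 5: only tokens that did not match are split on periods, so a
    # name such as SIR-2.1 that already matched is left intact.
    for token in tokens:
        if token in known_names or "." not in token:
            continue
        for piece in token.split("."):
            if piece in known_names and piece not in found:
                found.append(piece)
    return found


# Hypothetical stand-in for gin_locus, gin_seqname and gin_synonyms.
KNOWN = {"fem-1", "DAF-18", "SIR-2.1", "NMY-2", "GPR-1", "LIN-5"}

text = "fem-1(hc17ts), DAF-18/PTEN and SIR-2.1 act with (NMY-2) GPR-1/2 LIN-5."
print(first_round_tokens("fem-1(hc17ts)"))  # ['fem-1', '(', 'hc17ts', ')']
print(match_names(text, KNOWN))
# ['fem-1', 'DAF-18', 'SIR-2.1', 'NMY-2', 'GPR-1', 'LIN-5']
```

In this sketch the second round touches only unmatched tokens that contain a period, which should keep the extra cost small; whether that holds for the real script and real abstracts would need to be measured.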


This will update the script we run on incoming papers, but we will need to decide what to do about previously entered papers. About a year ago, Juancarlos wrote a script to retroactively associate proteins with papers; perhaps we could similarly modify that script to run on previous papers?

Sure, we could. We'd run it to see what new matches come up, you'd look at them, and then we could read them into postgres. -- J

Okay, that sounds good.
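The review step in that plan amounts to listing, per paper, the names a re-run finds that are not already attached. A minimal Python sketch of that comparison, assuming both inputs are available as dictionaries keyed by paper ID (the function name and the sample IDs and names are made up for illustration):

```python
def new_associations(existing, rematched):
    """Per paper, list names found by the re-run that are not yet attached."""
    report = {}
    for paper_id, names in rematched.items():
        additions = sorted(set(names) - set(existing.get(paper_id, ())))
        if additions:
            report[paper_id] = additions
    return report


# Made-up sample data, only to show the shape of the inputs.
existing = {"paperA": {"daf-2"}, "paperB": {"fem-1", "fem-3"}}
rematched = {"paperA": ["daf-2", "daf-16"], "paperB": ["fem-1", "fem-3"]}

print(new_associations(existing, rematched))  # {'paperA': ['daf-16']}
```

Only the papers with additions appear in the report, which would keep the list a curator has to check as short as possible before anything is read into postgres.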

Gene Associations Based on Curated Data

Data types for which curation is stored in postgres:

Antibody

Concise Descriptions

Gene Interactions

Gene Ontology

Gene Regulation

Picture

Transgenes

Variation Phenotype

I think Molecule and Phenotype are also in the OA, but I'm not sure what you consider gene associations (purposely not mentioning Paper and Person data). -- J
