Attaching Genes to Papers

From WormBaseWiki

Revision as of 22:39, 8 December 2010

Gene Associations Based Upon Abstracts

When papers are added to postgres using the Enter New Papers function of the Paper Editor, the corresponding abstracts are scanned, via a script, for matches to loci, sequence names, and synonyms.

Postgres tables used for this are:

gin_locus

gin_seqname

gin_synonyms

To view the contents of these tables, perform the following type of query using the referenceform.cgi:

SELECT * FROM gin_locus;


The script is called by paper_editor.cgi when new PMID papers are entered from XML; it calls the &processXmlIds subroutine in /home/postgres/work/pgpopulation/pap_papers/new_papers/pap_match.pm. If the entry is new, the subroutine looks at the abstract field and splits the abstract into words on spaces. For each word it removes the characters , ( ) ; and then tries to match the result against the list built from the three tables above. If the word matches, it adds the entry with 'Inferred_automatically\t\"Abstract read $word\"' as the evidence. -- J
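The matching described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl code in pap_match.pm: the function name match_abstract, the known_names set (a stand-in for the combined contents of the three gin_ tables), and the exact form of the evidence string are assumptions made for illustration.

```python
def match_abstract(abstract, known_names):
    """Return (word, evidence) pairs for abstract words found in known_names.

    Mirrors the description above: split on spaces, delete the characters
    , ( ) ; from each word, then look the word up in the name list.
    """
    strip_chars = str.maketrans("", "", ",();")
    matches = []
    for raw in abstract.split(" "):
        word = raw.translate(strip_chars)
        if word and word in known_names:
            # Assumed rendering of the evidence text quoted in the description.
            evidence = f'Inferred_automatically\t"Abstract read {word}"'
            matches.append((word, evidence))
    return matches


# Hypothetical name list and abstract, for illustration only.
names = {"fem-1", "daf-16", "lin-5"}
text = "Mutations in fem-1(hc17ts) and daf-16, but not lin-5; were examined."
found = [word for word, _ in match_abstract(text, names)]
```

In this sketch daf-16 and lin-5 are found, but fem-1 is not: deleting the parentheses turns fem-1(hc17ts) into fem-1hc17ts, which is not in the list. That is the kind of miss discussed in the next section.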

Updating the script:

The script that associates genes based upon abstracts misses some genes because of the way they are written in the abstract.

An idea on what to change:

1) With the exception of a dash (-), split text from punctuation and then look for matches to the approved gene list.

This wouldn't quite work, since seqnames often have periods in them (and some loci / synonyms?). Further, the query below, which looks for any character that is not a letter, digit, or period, returns some names with underscores. Are these correct? Yes, I checked these gene names in WB and they are correct. If so, we can't split on everything that is not a dash or a space. We'd at least have to leave periods and underscores alone as well, but is there more to the rules, or do we not care so much about these exceptions? I think, to be on the safe side, we should not split on periods or underscores.

SELECT * FROM gin_seqname WHERE gin_seqname ~ '[^A-Z0-9.a-z]';

joinkey  | gin_seqname | gin_timestamp
---------+-------------+-------------------------------
00201656 | CE7X_3.3    | 2010-12-08 05:24:28.383862-08
00008349 | CE7X_3.1    | 2010-12-08 03:10:46.802746-08
00008350 | CE7X_3.2    | 2010-12-08 03:10:46.848005-08
00017153 | E_BE45912.2 | 2010-12-08 03:18:55.463388-08
00043233 | E_BE45912.1 | 2010-12-08 03:44:16.195364-08
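The same kind of check can be run outside postgres to see which names a given split rule would break. The following is a minimal Python sketch; the function name names_with_other_chars and its parameters are hypothetical, and the sample names are the ones from the result above.

```python
import re


def names_with_other_chars(names, allowed_extra):
    """Return names containing a character that is not a letter, a digit,
    or one of the characters in allowed_extra."""
    pattern = re.compile("[^A-Za-z0-9" + re.escape(allowed_extra) + "]")
    return [name for name in names if pattern.search(name)]


seqnames = ["CE7X_3.3", "CE7X_3.1", "CE7X_3.2", "E_BE45912.2", "E_BE45912.1"]

# Allowing only periods flags every name, because of the underscores.
flagged_period_only = names_with_other_chars(seqnames, ".")

# Allowing periods and underscores flags none of them.
flagged_period_underscore = names_with_other_chars(seqnames, "._")
```

Any name the function returns contains a character the tokenizer would split on, so that name could no longer match as a single word.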

Here are some loci that have an allele-like format (?). Yes, these must be gene names from other species that have a phenotype but are not cloned, so the name isn't mapped to a corresponding elegans gene. Again, not common, but probably safer not to exclude them.

SELECT * FROM gin_locus WHERE gin_locus ~ '[^A-Z0-9.a-z\-]';

00000791 | Cre-unc(bd201) | 2010-12-08 03:01:55.622337-08
00000313 | Cbr-dpy(s1281) | 2010-12-08 03:01:15.792311-08
00000314 | Cbr-dpy(bd101) | 2010-12-08 03:01:15.852879-08
00000346 | Cbr-him(bd102) | 2010-12-08 03:01:16.602787-08
00000347 | Cbr-him(bd103) | 2010-12-08 03:01:16.682897-08
00000348 | Cbr-him(s1290) | 2010-12-08 03:01:16.762942-08
00000349 | Cbr-him(bd104) | 2010-12-08 03:01:16.84322-08
00000350 | Cbr-unc(s1270) | 2010-12-08 03:01:16.903372-08
00000351 | Cbr-unc(s1275) | 2010-12-08 03:01:16.982889-08

If you then look at the gin_synonyms list you'll see a ton of entries with underscores, and the following query has 67 results:

SELECT * FROM gin_synonyms WHERE gin_synonyms ~ '[^A-Z0-9.a-z_\-]';

So we can change the code to work like you suggest, but then we'd certainly not catch any of these (definitely not the huge number with periods in them). Let me know.

For example, fem-1(hc17ts) would become fem-1 ( hc17ts )

DAF-18/PTEN would become DAF-18 / PTEN
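A minimal Python sketch of this splitting rule, adjusted per the discussion above so that dashes, periods, and underscores stay inside a word. The names TOKEN_RE, separate_punctuation, and match_names are hypothetical, and stripping a trailing period (to handle a name at the end of a sentence) is an added assumption, not something stated above.

```python
import re

# A token is either a run of letters, digits, period, underscore and dash,
# or a single other non-space character (which becomes its own token).
TOKEN_RE = re.compile(r"[A-Za-z0-9._\-]+|[^A-Za-z0-9._\-\s]")


def separate_punctuation(text):
    """Put spaces around punctuation other than dash, period and underscore."""
    return " ".join(TOKEN_RE.findall(text))


def match_names(text, known_names):
    """Return known names found in text, in order of first appearance."""
    found = []
    for token in separate_punctuation(text).split():
        candidate = token.rstrip(".")  # assumption: drop sentence-final periods
        if candidate in known_names and candidate not in found:
            found.append(candidate)
    return found


example_1 = separate_punctuation("fem-1(hc17ts)")
example_2 = separate_punctuation("DAF-18/PTEN")
```

Here example_1 is "fem-1 ( hc17ts )" and example_2 is "DAF-18 / PTEN", as in the two examples above. One limitation of this sketch: loci such as Cre-unc(bd201) would also be split at the parentheses, so they would need separate handling if they are not to be excluded.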


Papers where gene associations have been missed for testing the script:

WBPaper00035164 - missed BLI-4

WBPaper00035239 - missed CATP-5

WBPaper00035289 - missed SPP-5

WBPaper00035423 - missed PAR-1

WBPaper00035449 - missed gas-1, isp-1, daf-2, sod-2

WBPaper00035474 - missed fem-1, fem-3

WBPaper00035490 - missed daf-16

WBPaper00035559 - missed NMY-2, GPR-1/2, LIN-5

WBPaper00037741 - missed DAF-18, PHA-4, HSF-1, SIR-2.1, AAK-2

WBPaper00037686 - missed bus-2, bus-4, bus-12, srf-3, bus-8, bus-17

WBPaper00037794 - missed MIG-14


This will update the script we run on incoming papers, but we will need to decide what to do about previously entered papers. About a year ago, Juancarlos wrote a script to retroactively associate proteins with papers; perhaps we could similarly modify that script to run on previous papers? Sure, we could: we'd run it to see what new matches come up, you'd look at them, and then we could read them into postgres. -- J
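The review step for previously entered papers could look roughly like this. It is a Python sketch under assumptions: the function name new_associations and both input dictionaries are hypothetical, and the existing associations shown are invented for illustration rather than taken from postgres.

```python
def new_associations(proposed, existing):
    """Return, per paper, the proposed gene names not already associated.

    proposed and existing both map a paper ID to a list of gene names.
    Papers with nothing new are left out of the report.
    """
    report = {}
    for paper_id, names in proposed.items():
        already = set(existing.get(paper_id, ()))
        missing = sorted(set(names) - already)
        if missing:
            report[paper_id] = missing
    return report


# Hypothetical inputs: matches from a rerun versus associations on record.
proposed = {
    "WBPaper00035474": ["fem-1", "fem-3", "tra-2"],
    "WBPaper00035490": ["daf-16"],
}
existing = {"WBPaper00035474": ["tra-2"]}
for_review = new_associations(proposed, existing)
```

Only the entries in for_review would go to a curator, and anything approved could then be read into postgres.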

Gene Associations Based on Curated Data

Data types for which curation is stored in postgres:

Antibody

Concise Descriptions

Gene Interactions

Gene Ontology

Gene Regulation

Picture

Transgenes

Variation Phenotype

I think Molecule and Phenotype are also in the OA, but I'm not sure what you consider gene associations (purposely not mentioning Paper and Person data). -- J
