Paper Tables in Postgres

From WormBaseWiki
Revision as of 17:42, 2 May 2011 by Vanaukenk (talk | contribs)

List of Paper Tables in Postgres (Alphabetical)

pap_affiliation

Contains the affiliation (location) of one or more authors of the paper, meeting abstract, or gazette article

For papers, this table mostly contains legacy information that was imported from the CGC when WormBase started curating paper data, because PubMed currently does not curate all author affiliations for papers.

For meeting abstracts and gazette articles, this table contains the full list of affiliations, but the affiliations are not mapped to specific individuals.

pap_author

Contains the author_id

pap_author_index

author_id as joinkey

pap_author_possible

author_id as joinkey

pap_author_sent

author_id as joinkey

pap_author_verified

author_id as joinkey

pap_contained_in

Contains the WBPaper IDs for books in which book chapters are found.

pap_contains (Delete?)

Currently empty. May be able to delete.

Information is already found in the pap_contained_in table.

pap_curation_flags

Contains one of five values that describe how the paper should be treated in subsequent pipelines, its RNAi data, and the status of its gene-paper associations.

functional_annotation - this tag is placed on papers that are used to compose concise descriptions, but for which there is no curatable C. elegans data.

genestudied_done - this tag is placed on papers when the gene-paper associations are complete.

Phenotype2GO - this tag is placed on papers that will be used to create Phenotype2GO annotations.

rnai_curation -

rnai_int_done -

pap_day

The day the paper was published. Note that not all papers have an associated publication day; some have only a month and year, or only a year.

For a PubMed-indexed paper, the day is taken from the Day tag inside the PubDate element of the corresponding paper XML:

<PubDate>

<Year>2011</Year>

<Month>Feb</Month>

<Day>4</Day>

</PubDate>
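
As a minimal sketch of how these date fields might be read (Python standard library only; the function name parse_pub_date is hypothetical and this is not the script WormBase actually uses), missing Month or Day tags come back as None, which matches papers that have only a year or only a month and year:

```python
import xml.etree.ElementTree as ET

def parse_pub_date(xml_text):
    """Return (year, month, day) from a PubDate element; missing parts are None."""
    pub_date = ET.fromstring(xml_text)
    # Accept either a bare <PubDate> fragment or a larger record containing one.
    if pub_date.tag != "PubDate":
        pub_date = pub_date.find(".//PubDate")
    if pub_date is None:
        return (None, None, None)
    return tuple(pub_date.findtext(tag) for tag in ("Year", "Month", "Day"))

example = "<PubDate><Year>2011</Year><Month>Feb</Month><Day>4</Day></PubDate>"
print(parse_pub_date(example))  # ('2011', 'Feb', '4')
```

The values are returned as strings exactly as they appear in the XML; any conversion for storage in pap_day, pap_month, or the year table is left out of this sketch.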

pap_editor

This table holds the information on editors of paper objects, mostly books. The current ACeDB ?Paper model dumps Editors as ?Text, but there is not a separate ?Editor class like there is for ?Author and ?Person. Future work may be to create an ?Editor class.

Current stats:

142 editors for 137 books

15 books without an editor

344 book chapters

PubMed has recently started indexing book chapters and differentiating between authors of the chapter and editors of the book. This is done in the XML by designating a Type on each AuthorList (example using an excerpt from PMID:21413247):


<AuthorList Type="editors">

<Author>

<LastName>Riddle</LastName>

<ForeName>Donald L</ForeName>

<Initials>DL</Initials>

</Author>

</AuthorList>


<AuthorList Type="authors">

<Author>

<LastName>Ambros</LastName>

<ForeName>Victor</ForeName>

<Initials>V</Initials>

</Author>

</AuthorList>
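
The distinction between the two list types can be illustrated with a short Python sketch. The function name names_by_list_type, the wrapping Book element, and the choice to treat an untyped AuthorList as authors are all assumptions made for the example, not a description of the WormBase import scripts:

```python
import xml.etree.ElementTree as ET

def names_by_list_type(xml_text):
    """Map each AuthorList Type to a list of 'ForeName LastName' strings."""
    root = ET.fromstring(xml_text)
    result = {}
    for author_list in root.iter("AuthorList"):
        # Assumption: a list without a Type attribute holds authors.
        list_type = author_list.get("Type", "authors")
        for author in author_list.findall("Author"):
            fore = author.findtext("ForeName", default="")
            last = author.findtext("LastName", default="")
            result.setdefault(list_type, []).append(f"{fore} {last}".strip())
    return result

# The <Book> wrapper only makes the two excerpts a single XML document.
sample = """<Book>
<AuthorList Type="editors"><Author><LastName>Riddle</LastName>
<ForeName>Donald L</ForeName><Initials>DL</Initials></Author></AuthorList>
<AuthorList Type="authors"><Author><LastName>Ambros</LastName>
<ForeName>Victor</ForeName><Initials>V</Initials></Author></AuthorList>
</Book>"""

print(names_by_list_type(sample))
# {'editors': ['Donald L Riddle'], 'authors': ['Victor Ambros']}
```

Names found under the "editors" key would be the candidates for pap_editor, while those under "authors" belong with the author tables.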

pap_electronic_path

This table holds the path (location) of a paper, including its associated supplemental information.

When a paper's PDF is downloaded, this value is populated by a script.

pap_erratum_for (Delete?)

There is no data in this table (see below).

pap_erratum_in

This table contains the WBPaper IDs for errata or corrections that have been published for a paper object.

pap_fulltext_url

This table contains the URL to the full text of a paper. This information is not routinely captured for papers, but is used for the new on-line Worm Breeder's Gazette.

pap_gene

This table contains the list of genes that are associated with a paper. Paper-gene information is added automatically by a script that processes incoming abstracts, automatically from the author first-pass form, and manually by curators.

pap_identifier

This table contains all identifiers for a paper. There are numerous types of identifiers; past worm meetings have not used a consistent format. It would be desirable to have all meeting and WBG abstract identifiers adhere to the same format, but that is currently not a high priority. There are also numerous errors and typos that need to be corrected.

Types of identifiers and examples:

8-digit number - this denotes the WBPaper ID for papers merged into the existing valid paper.

pmid - this denotes the corresponding PubMed identifier, e.g. pmid21471153

cgc - this denotes the old cgc paper identifiers, e.g. cgc5654. These identifiers were discontinued after WormBase took over responsibility for paper curation.

medline - this denotes old, invalid MEDLINE IDs, e.g. med20110504


Sample rows showing the range of identifier formats:

joinkey  |         pap_identifier         | pap_order | pap_curator |         pap_timestamp
---------+--------------------------------+-----------+-------------+-------------------------------
00004204 | doi                            |         2 | two1823     | 2010-02-14 01:00:40.186078-08
00006525 | CSHSQB04p159                   |         1 | two1841     | 2005-07-19 16:20:14.089611-07
00010002 | cam3030                        |         1 | two1841     | 2005-07-19 16:20:14.385265-07
00010003 | cam3031                        |         1 | two1841     | 2005-07-19 16:20:14.394107-07
00010004 | cam3035                        |         1 | two1841     | 2005-07-19 16:20:14.404771-07
00010005 | cwbg11.2p57                    |         1 | two1841     | 2005-07-19 16:20:14.418583-07
00010006 | eawm2004ab1                    |         1 | two1841     | 2005-07-19 16:20:14.419828-07
00001858 | med94089766                    |         1 |             | 2005-10-05 17:41:41.552402-07
00012010 | mcwm2000ab98                   |         1 | two1841     | 2005-07-19 16:20:37.492126-07
00012294 | mwwm02abs102296                |         1 | two1841     | 2005-07-19 16:20:40.731929-07
00011845 | isbn0-14-051288-8              |         1 | two1841     | 2005-07-19 16:20:35.35042-07
00013640 | wb2001p658                     |         1 | two1841     | 2005-07-19 16:20:52.630007-07
00013642 | wbg1.1p12                      |         1 | two1841     | 2005-07-19 16:20:52.641639-07
00017551 | wm02ab137                      |         1 | two1841     | 2005-07-19 16:21:37.745368-07
00017552 | wm02ab6                        |         1 | two1841     | 2005-07-19 16:21:37.746729-07
00017553 | wm2000ab73                     |         1 | two1841     | 2005-07-19 16:21:37.747903-07
00017554 | wm2000ab76
00025147 | doi10.1534/genetics.104.036137 |         2 | two1823     | 2010-03-21 01:02:04.12676-07
00026601 | othCur_Bio_2005_935-941        |         3 |             | 2005-10-02 16:24:59.592117-07
00026893 | doi10.1534/genetics.105.043497 |         2 | two1823     | 2010-03-21 01:02:51.75882-07
00027222 | 10.1895/wormbook.1.104.1       |         2 | two22       | 2006-04-28 20:28:39.557101-07
00027223 | 10.1895/wormbook.1.54.1        |         1 | two22       | 2006-04-28 20:28:39.98071-07
00027224 | 10.1895/wormbook.1.79.1        |         1 | two22       | 2006-04-2
00027773 | devevowm06abs6771              |         1 | two480      | 2006-07-28 00:13:09.776414-07
00028065 | neubehwm06abs10073             |         1 | two480      | 2006-07-29 22:13:37.836857-07
00028597 | genomed2                       |         1 | two480      | 2006-10-25 12:18:43.489753-07
00029417 | austwm07abs1                   |         1 | two480      | 2007-05-26 11:26:48.89684-07
00032536 | aging2008aging14671            |         1 | two480      | 2009-02-04 14:46:35.708278-08
00032674 | neuro2008neuro11621            |         1 | two480      | 2009-02-04 17:26:54.045923-08
00034719 | 34759                          |         3 | two712      | 2009-07-27 17:07:23.896376-07

Other identifier forms noted: ecwm, evowm2010_ab, devgenewm2010_ab, neurowm2010_ab2, malewm2010_ab205, agingwm2010_ab1

Query for identifiers on valid papers that match none of the patterns listed:

SELECT * FROM pap_identifier WHERE pap_identifier !~ '^pmid[0-9]'
AND pap_identifier !~ '^cgc[0-9]'
AND pap_identifier !~ '^med[0-9]'
AND pap_identifier !~ '^00[0-9]'
AND pap_identifier !~ '^cam[0-9]'
AND pap_identifier !~ '^eawm[0-9]'
AND pap_identifier !~ '^ med[0-9]'
AND pap_identifier !~ '^pimd[0-9]'
AND pap_identifier !~ '^ecwm[0-9]'
AND pap_identifier !~ '^euwm[0-9]'
AND pap_identifier !~ '^jwm[0-9]'
AND pap_identifier !~ '^mcwm[0-9]'
AND pap_identifier !~ '^mwwm[0-9]'
AND pap_identifier !~ '^wb[0-9]'
AND pap_identifier !~ '^wbg[0-9]'
AND pap_identifier !~ '^wcwm[0-9]'
AND pap_identifier !~ '^wm[0-9]'
AND pap_identifier !~ '^devevowm[0-9]'
AND pap_identifier !~ '^neubehwm[0-9]'
AND pap_identifier !~ '^austwm[0-9]'
AND pap_identifier !~ '^aging[0-9]'
AND pap_identifier !~ '^agingwm[0-9]'
AND pap_identifier !~ '^neuro[0-9]'
AND pap_identifier !~ '^neurowm[0-9]'
AND pap_identifier !~ '^evowm[0-9]'
AND pap_identifier !~ '^devgenewm[0-9]'
AND pap_identifier !~ '^malewm[0-9]'
AND pap_identifier !~ '^10'
AND pap_identifier !~ '^doi[0-9]'
AND pap_identifier !~ '^doi$'
AND joinkey IN (SELECT joinkey FROM pap_status WHERE pap_status = 'valid')

joinkey  |          pap_identifier          | pap_order | pap_curator |         pap_timestamp
---------+----------------------------------+-----------+-------------+-------------------------------
00006525 | CSHSQB04p159                     |         1 | two1841     | 2005-07-19 16:20:14.089611-07
00011845 | isbn0-14-051288-8                |         1 | two1841     | 2005-07-19 16:20:35.35042-07
00026601 | othCur_Bio_2005_935-941          |         3 |             | 2005-10-02 16:24:59.592117-07
00028597 | genomed2                         |         1 | two480      | 2006-10-25 12:18:43.489753-07
00035148 | doi 10.1534/genetics.109.108654  |         1 | two555      | 2009-08-26 11:53:42.274019-07
00036062 | doi:10.1016/j.rbmret.2008.04.001 |         1 | two1843     | 2010-03-23 07:35:08.825693-07
00010005 | cwbg11.2p57                      |         1 | two1841     | 2005-07-19 16:20:14.418583-07
00034719 | 34759                            |         3 | two712      | 2009-07-27 17:07:23.896376-07
00037634 | ISBN-13:978-0470519691           |         1 | two1843     | 2010-09-28 12:49:06.158974-07
(9 rows)

The same query without the '^ med[0-9]', '^pimd[0-9]', and '^doi$' patterns:

SELECT * FROM pap_identifier WHERE pap_identifier !~ '^pmid[0-9]'
AND pap_identifier !~ '^cgc[0-9]'
AND pap_identifier !~ '^med[0-9]'
AND pap_identifier !~ '^00[0-9]'
AND pap_identifier !~ '^cam[0-9]'
AND pap_identifier !~ '^eawm[0-9]'
AND pap_identifier !~ '^ecwm[0-9]'
AND pap_identifier !~ '^euwm[0-9]'
AND pap_identifier !~ '^jwm[0-9]'
AND pap_identifier !~ '^mcwm[0-9]'
AND pap_identifier !~ '^mwwm[0-9]'
AND pap_identifier !~ '^wb[0-9]'
AND pap_identifier !~ '^wbg[0-9]'
AND pap_identifier !~ '^wcwm[0-9]'
AND pap_identifier !~ '^wm[0-9]'
AND pap_identifier !~ '^devevowm[0-9]'
AND pap_identifier !~ '^neubehwm[0-9]'
AND pap_identifier !~ '^austwm[0-9]'
AND pap_identifier !~ '^aging[0-9]'
AND pap_identifier !~ '^agingwm[0-9]'
AND pap_identifier !~ '^neuro[0-9]'
AND pap_identifier !~ '^neurowm[0-9]'
AND pap_identifier !~ '^evowm[0-9]'
AND pap_identifier !~ '^devgenewm[0-9]'
AND pap_identifier !~ '^malewm[0-9]'
AND pap_identifier !~ '^10'
AND pap_identifier !~ '^doi[0-9]'
AND joinkey IN (SELECT joinkey FROM pap_status WHERE pap_status = 'valid')
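
The same prefix check can be sketched outside the database, for example to screen a list of identifiers before loading them. This is only an illustration in Python: the names KNOWN_PREFIXES and unrecognized are hypothetical, the patterns are copied from the query directly above, and the pap_status = 'valid' condition is not reproduced because it requires the database.

```python
import re

# Prefix patterns copied from the query above (each is anchored with ^ below).
KNOWN_PREFIXES = [
    "pmid[0-9]", "cgc[0-9]", "med[0-9]", "00[0-9]", "cam[0-9]", "eawm[0-9]",
    "ecwm[0-9]", "euwm[0-9]", "jwm[0-9]", "mcwm[0-9]", "mwwm[0-9]", "wb[0-9]",
    "wbg[0-9]", "wcwm[0-9]", "wm[0-9]", "devevowm[0-9]", "neubehwm[0-9]",
    "austwm[0-9]", "aging[0-9]", "agingwm[0-9]", "neuro[0-9]", "neurowm[0-9]",
    "evowm[0-9]", "devgenewm[0-9]", "malewm[0-9]", "10", "doi[0-9]",
]
KNOWN = re.compile("^(?:" + "|".join(KNOWN_PREFIXES) + ")")

def unrecognized(identifiers):
    """Return the identifiers that match none of the known prefix patterns."""
    return [ident for ident in identifiers if not KNOWN.search(ident)]

sample = [
    "pmid21471153", "cgc5654", "wm02ab137", "doi10.1534/genetics.104.036137",
    "doi 10.1534/genetics.109.108654", "isbn0-14-051288-8", "34759",
]
print(unrecognized(sample))
# ['doi 10.1534/genetics.109.108654', 'isbn0-14-051288-8', '34759']
```

Like the Postgres !~ operator, the match is case-sensitive, so an identifier such as ISBN-13:978-0470519691 is reported as unrecognized.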

pap_ignore (Eventually Delete?)

This table flags papers that are either non-worm or used only for functional annotation, and that can therefore be ignored for the purposes of collecting PDFs, curating authors, etc.

Information in this table might be able to be transferred to pap_curation_flags and this table subsequently deleted.

testdb=# SELECT * FROM pap_ignore WHERE joinkey NOT IN (SELECT joinkey FROM pap_curation_flags WHERE pap_curation_flags = 'functional_annotation');

joinkey  |         pap_ignore         | pap_order | pap_curator |         pap_timestamp
---------+----------------------------+-----------+-------------+-------------------------------
00006530 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.484369-07
00012980 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.498247-07
00013144 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.441194-07
00013545 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.492718-07
00013634 | non worm                   |         1 | two1        | 2008-10-17 13:35:14.162157-07
00028285 | functional annotation only |         1 | two567      | 2008-10-08 17:33:41.648074-07
00030743 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.504099-07
00031377 | non worm                   |         1 | two1        | 2008-10-12 00:48:35.574224-07

pap_internal_comment

This table contains any internal comments that curators may wish to make about the paper.

Currently, there are 114 comments, all added manually.

pap_journal

This table contains the journal in which a paper was published.

For PubMed-indexed journals, the name of the journal is contained within the Journal tag.

Note that not all papers in Postgres use the proper ISOAbbreviation, but we are trying to clean this up as much as possible.

<Journal>

<Title>Cell</Title>

<ISOAbbreviation>Cell</ISOAbbreviation>

</Journal>

Type override?

pap_month

This table contains the month a paper was published. See pap_day table info above for PubMed XML.

In the editor, the full month name is spelled out, so any three-letter abbreviation in the PubMed XML is converted to the full name for display in the editor.

For the ACeDB dumping script, the month is converted to a number.
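
Both conversions can be sketched in a few lines of Python. The names MONTHS, month_full_name, and month_number are hypothetical, and the sketch assumes the stored value is either an English three-letter abbreviation or a full English month name; it is not the code used by the editor or the dumping script.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def month_full_name(value):
    """Expand a three-letter abbreviation such as 'Feb'; full names pass through."""
    for name in MONTHS:
        if value in (name, name[:3]):
            return name
    raise ValueError(f"unrecognized month: {value!r}")

def month_number(value):
    """Return 1-12 for either an abbreviation or a full month name."""
    return MONTHS.index(month_full_name(value)) + 1

print(month_full_name("Feb"), month_number("Feb"))  # February 2
```

Raising an error on an unrecognized value, rather than guessing, keeps typos in the month field visible instead of silently dumping a wrong number.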

pap_pages

This table contains the page numbers for a paper. It can contain both numbers and letters.

pap_primary_data

This table records whether a paper has primary research data. This tag is new as of 2009 and was put in place as a means to potentially sort papers by their likelihood of having curatable data. New papers are given primary_data as a default value, but curators can change the value to not_primary in the paper editor. Because PubMed does not assign a Paper Type until a paper is fully indexed, which can be weeks or months after the paper first appears in PubMed, this tag allows us to categorize papers regardless of Paper Type.

There are three values:

primary - given to all papers that contain experimental data

not_primary - given to all papers, such as review articles, that don't contain experimental data

not_designated - given to meeting abstracts since these often report experimental data but are not typically curated

History Tables

All paper history tables are in the format h_pap_X where X is the name of the corresponding table above.

