Paper Tables in Postgres

From WormBaseWiki
Jump to navigationJump to search

Back to Paper Pipeline

List of Paper Tables in Postgres (Alphabetical)

pap_affiliation

Contains the affiliation (location) of one or more authors of the paper, meeting abstract, or gazette article

For papers, this table mostly contains legacy information that was imported from the CGC when WormBase starting curating paper data, as PubMed currently does not curate all author affiliations for papers.

For meeting abstracts and gazette articles, this table contains the full list of affiliations, but the affiliations are not mapped to specific individuals.

pap_author

Contains the author_id and the orders of the authors.

pap_author_index

author_id as joinkey, contains the author name. When coming from pmid it is "<Last_name><space><First_initial (and sometimes middle initials)>". Note that while the format in the XML is :

<Author>

<LastName>Riddle</LastName>

<ForeName>Donald L</ForeName>

<Initials>DL</Initials>

</Author>

We're only capturing the Initials and LastName tags in the postgres tables.

pap_author_possible

author_id as joinkey, possible people assigned to that author ID, and a pap_join number to associate with that author_id in pap_author_sent and pap_author_verified

pap_author_sent

author_id as joinkey, a pap_join number to associate with that author_id in pap_author_possible and pap_author_verified, and whether that person was SENT an email to verify their connection.

pap_author_verified

author_id as joinkey, a pap_join number to associate with that author_id in pap_author_possible and pap_author_sent, a verification can be YES or NO followed by the person that said so. The pap_curator is the person who verified it, either the actual person, Cecilia if she did it manually, or Raymond if verified by the script that connects by labs or lineage.

pap_contained_in

Contains the WBPaper IDs for books in which book chapters are found.

pap_contains (Delete?)

Currently empty. May be able to delete.

Information is already found in the pap_contained_in table.

pap_curation_flags

Contains one of five values for information on how the paper should be treated in subsequent pipelines, RNAi data, and the status of the gene-paper associations.

non_nematode (formerly functional_annotation_only) - this tag is placed on papers that are used to compose concise descriptions, but for which there is no curatable C. elegans data.

genestudied_done - this tag is placed on papers when the gene-paper associations are complete.

Phenotype2GO - this tag is placed on papers that will be used to create Phenotype2GO annotations.

rnai_curation - this tag is placed on papers that are flagged for RNAi curation. These papers are displayed on the paper editor (by clicking on the "RNAi Curation" button). The RNAi curation table on the paper editor keeps track of papers that have been curated for RNAi.

rnai_int_done - this tag was used to mark the RNAi papers that had been curated for RNAi-based interactions. We no longer use this tag as it was used inconsistently by curators and did not reflect what was actually curated. Now if a paper has been checked out we assume that it has/will be curated completely for all aspects of RNAi.

pap_day

The day the paper is published. Note that not all papers have an associated publication day, some just have month and year, or just year.

If a PubMed-indexed paper, the day is taken from the day tag in the corresponding paper XML:

<PubDate>

<Year>2011</Year>

<Month>Feb</Month>

<Day>4</Day>

</PubDate>

pap_editor

This table holds the information on editors of paper objects, mostly books. The current ACeDB ?Paper model dumps Editors as ?Text, but there is not a separate ?Editor class like there is for ?Author and ?Person. Future work may be to create an ?Editor class.

Current stats:

142 editors for 137 books

15 books without an editor

344 book chapters

PubMed has recently started indexing book chapters and differentiating between authors of the chapter and editors of the book. This is done in the XML by designating and Author Type (example using an excerpt from PMID:21413247):


<AuthorList Type="editors">

<Author>

<LastName>Riddle</LastName>

<ForeName>Donald L</ForeName>

<Initials>DL</Initials>

</Author>

</AuthorList>


<AuthorList Type="authors">

<Author>

<LastName>Ambros</LastName>

<ForeName>Victor</ForeName>

<Initials>V</Initials>

</Author>

</AuthorList>

pap_electronic_path

This table holds the path (location) of a paper, including its associated supplemental information.

When a paper's pdf is downloaded, this value is populated via a script.

pap_erratum_for (Delete?)

There is no data in this table (see below).

pap_erratum_in

This table contains the WBPaper IDs for errata or corrections that have been published for a paper object.

pap_fulltext_url

This table contains the URL to the full text of a paper. This information is not routinely captured for papers, but is used for the new on-line Worm Breeder's Gazette.

pap_gene

This table contains the list of genes that are associated with a paper. Paper-gene information is added automatically by a script that processes incoming abstracts, automatically from the author first-pass form, and manually by curators.

pap_identifier

This table contains all identifiers for a paper. There are numerous types of identifiers; past worm meetings have not used a consistent format. It would be desirable to have all meeting and WBG abstract identifiers adhere to the same format, but that is currently not a high priority. There are also numerous errors and typos that need to be corrected.

Types of identifiers and examples:

8-digit number - this denotes the WBPaper ID for papers merged into the existing valid paper.

pmid - this denotes the corresponding PubMed identifier, e.g. pmid21471153

cgc - this denotes the old cgc paper identifiers, e.g. cgc5654. These identifiers were discontinued after WormBase took over responsibility for paper curation.

medline - this denotes old, invalid MEDLINE IDs, e.g. med20110504


00004204 | doi                                       |         2 | two1823     | 2010-02-14 01:00:40.186078-08
00006525 | CSHSQB04p159                              |         1 | two1841     | 2005-07-19 16:20:14.089611-07

[4/25/11 5:45:21 PM] j chan: 00010002 | cam3030 | 1 | two1841 | 2005-07-19 16:20:14.385265-07

00010003 | cam3031                                   |         1 | two1841     | 2005-07-19 16:20:14.394107-07
00010004 | cam3035                                   |         1 | two1841     | 2005-07-19 16:20:14.404771-07
00010005 | cwbg11.2p57                               |         1 | two1841     | 2005-07-19 16:20:14.418583-07
00010006 | eawm2004ab1                               |         1 | two1841     | 2005-07-19 16:20:14.419828-07

[4/25/11 5:45:58 PM] j chan: 00001858 | med94089766 | 1 | | 2005-10-05 17:41:41.552402-07 [4/25/11 5:46:19 PM] j chan: ecwm [4/25/11 5:47:01 PM] j chan: 00012010 | mcwm2000ab98 | 1 | two1841 | 2005-07-19 16:20:37.492126-07 [4/25/11 5:47:10 PM] j chan: 00012294 | mwwm02abs102296 | 1 | two1841 | 2005-07-19 16:20:40.731929-07 [4/25/11 5:47:28 PM] j chan: 00011845 | isbn0-14-051288-8 | 1 | two1841 | 2005-07-19 16:20:35.35042-07 [4/25/11 5:47:47 PM] j chan: 00013640 | wb2001p658 | 1 | two1841 | 2005-07-19 16:20:52.630007-07 [4/25/11 5:47:55 PM] j chan: 00013642 | wbg1.1p12 | 1 | two1841 | 2005-07-19 16:20:52.641639-07 [4/25/11 5:48:42 PM] j chan: 00017551 | wm02ab137 | 1 | two1841 | 2005-07-19 16:21:37.745368-07

00017552 | wm02ab6                                   |         1 | two1841     | 2005-07-19 16:21:37.746729-07
00017553 | wm2000ab73                                |         1 | two1841     | 2005-07-19 16:21:37.747903-07
00017554 | wm2000ab76

[4/25/11 5:49:18 PM] j chan: 00025147 | doi10.1534/genetics.104.036137 | 2 | two1823 | 2010-03-21 01:02:04.12676-07

00026601 | othCur_Bio_2005_935-941                   |         3 |             | 2005-10-02 16:24:59.592117-07
00026893 | doi10.1534/genetics.105.043497            |         2 | two1823     | 2010-03-21 01:02:51.75882-07
00027222 | 10.1895/wormbook.1.104.1                  |         2 | two22       | 2006-04-28 20:28:39.557101-07
00027223 | 10.1895/wormbook.1.54.1                   |         1 | two22       | 2006-04-28 20:28:39.98071-07
00027224 | 10.1895/wormbook.1.79.1                   |         1 | two22       | 2006-04-2

[4/25/11 5:49:49 PM] j chan: 00027773 | devevowm06abs6771 | 1 | two480 | 2006-07-28 00:13:09.776414-07 [4/25/11 5:50:14 PM] j chan: 00028065 | neubehwm06abs10073 | 1 | two480 | 2006-07-29 22:13:37.836857-07 [4/25/11 5:50:41 PM] j chan: 00028597 | genomed2 | 1 | two480 | 2006-10-25 12:18:43.489753-07 [4/25/11 5:50:50 PM] j chan: 00029417 | austwm07abs1 | 1 | two480 | 2007-05-26 11:26:48.89684-07 [4/25/11 5:50:57 PM] j chan: 00032536 | aging2008aging14671 | 1 | two480 | 2009-02-04 14:46:35.708278-08 [4/25/11 5:51:18 PM] j chan: 00032674 | neuro2008neuro11621 | 1 | two480 | 2009-02-04 17:26:54.045923-08 [4/25/11 5:51:40 PM] j chan: 00034719 | 34759 | 3 | two712 | 2009-07-27 17:07:23.896376-07 [4/25/11 5:53:34 PM] j chan: evowm2010_ab [4/25/11 5:53:39 PM] j chan: devgenewm2010_ab [4/25/11 5:54:00 PM] j chan: neurowm2010_ab2 [4/25/11 5:54:21 PM] j chan: malewm2010_ab205 [4/25/11 5:54:59 PM] j chan: agingwm2010_ab1 [4/25/11 5:55:49 PM] j chan: SELECT * FROM pap_identifier WHERE pap_identifier !~ '^pmid[0-9]'

AND pap_identifier !~ '^cgc[0-9]'
AND pap_identifier !~ '^med[0-9]'
AND pap_identifier !~ '^00[0-9]'
AND pap_identifier !~ '^cam[0-9]'
AND pap_identifier !~ '^eawm[0-9]'
AND pap_identifier !~ '^ med[0-9]'
AND pap_identifier !~ '^pimd[0-9]'
AND pap_identifier !~ '^ecwm[0-9]'
AND pap_identifier !~ '^euwm[0-9]'
AND pap_identifier !~ '^jwm[0-9]'
AND pap_identifier !~ '^mcwm[0-9]'
AND pap_identifier !~ '^mwwm[0-9]'
AND pap_identifier !~ '^wb[0-9]'
AND pap_identifier !~ '^wbg[0-9]'
AND pap_identifier !~ '^wcwm[0-9]'
AND pap_identifier !~ '^wm[0-9]'
AND pap_identifier !~ '^devevowm[0-9]'
AND pap_identifier !~ '^neubehwm[0-9]'
AND pap_identifier !~ '^austwm[0-9]'
AND pap_identifier !~ '^aging[0-9]'
AND pap_identifier !~ '^agingwm[0-9]'
AND pap_identifier !~ '^neuro[0-9]'
AND pap_identifier !~ '^neurowm[0-9]'
AND pap_identifier !~ '^evowm[0-9]'
AND pap_identifier !~ '^devgenewm[0-9]'
AND pap_identifier !~ '^malewm[0-9]'
AND pap_identifier !~ '^10'
AND pap_identifier !~ '^doi[0-9]'
AND pap_identifier !~ '^doi$'
AND joinkey IN (SELECT joinkey FROM pap_status WHERE pap_status = 'valid')

[4/25/11 5:56:01 PM] j chan: joinkey | pap_identifier | pap_order | pap_curator | pap_timestamp


+----------------------------------+-----------+-------------+-------------------------------

00006525 | CSHSQB04p159                     |         1 | two1841     | 2005-07-19 16:20:14.089611-07
00011845 | isbn0-14-051288-8                |         1 | two1841     | 2005-07-19 16:20:35.35042-07
00026601 | othCur_Bio_2005_935-941          |         3 |             | 2005-10-02 16:24:59.592117-07
00028597 | genomed2                         |         1 | two480      | 2006-10-25 12:18:43.489753-07
00035148 | doi 10.1534/genetics.109.108654  |         1 | two555      | 2009-08-26 11:53:42.274019-07
00036062 | doi:10.1016/j.rbmret.2008.04.001 |         1 | two1843     | 2010-03-23 07:35:08.825693-07
00010005 | cwbg11.2p57                      |         1 | two1841     | 2005-07-19 16:20:14.418583-07
00034719 | 34759                            |         3 | two712      | 2009-07-27 17:07:23.896376-07
00037634 | ISBN-13:978-0470519691           |         1 | two1843     | 2010-09-28 12:49:06.158974-07

(9 rows) [4/25/11 5:57:08 PM] j chan: SELECT * FROM pap_identifier WHERE pap_identifier !~ '^pmid[0-9]'

AND pap_identifier !~ '^cgc[0-9]'
AND pap_identifier !~ '^med[0-9]'
AND pap_identifier !~ '^00[0-9]'
AND pap_identifier !~ '^cam[0-9]'
AND pap_identifier !~ '^eawm[0-9]'
AND pap_identifier !~ '^ecwm[0-9]'
AND pap_identifier !~ '^euwm[0-9]'
AND pap_identifier !~ '^jwm[0-9]'
AND pap_identifier !~ '^mcwm[0-9]'
AND pap_identifier !~ '^mwwm[0-9]'
AND pap_identifier !~ '^wb[0-9]'
AND pap_identifier !~ '^wbg[0-9]'
AND pap_identifier !~ '^wcwm[0-9]'
AND pap_identifier !~ '^wm[0-9]'
AND pap_identifier !~ '^devevowm[0-9]'
AND pap_identifier !~ '^neubehwm[0-9]'
AND pap_identifier !~ '^austwm[0-9]'
AND pap_identifier !~ '^aging[0-9]'
AND pap_identifier !~ '^agingwm[0-9]'
AND pap_identifier !~ '^neuro[0-9]'
AND pap_identifier !~ '^neurowm[0-9]'
AND pap_identifier !~ '^evowm[0-9]'
AND pap_identifier !~ '^devgenewm[0-9]'
AND pap_identifier !~ '^malewm[0-9]'
AND pap_identifier !~ '^10'
AND pap_identifier !~ '^doi[0-9]'
AND joinkey IN (SELECT joinkey FROM pap_status WHERE pap_status = 'valid')

pap_ignore (Eventually Delete?)

This table holds information about what papers are non worm or for functional annotation that can be ignored for the purposes of collecting pdfs, curating authors, etc.

Information in this table might be able to be transferred to pap_curation_flags and this table subsequently deleted.

testdb=# SELECT * FROM pap_ignore WHERE joinkey NOT IN (SELECT joinkey FROM pap_curation_flags WHERE pap_curation_flags = 'functional_annotation');

joinkey  |         pap_ignore         | pap_order | pap_curator |         pap_timestamp         

+----------------------------+-----------+-------------+-------------------------------

00006530 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.484369-07
00012980 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.498247-07
00013144 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.441194-07
00013545 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.492718-07
00013634 | non worm                   |         1 | two1        | 2008-10-17 13:35:14.162157-07
00028285 | functional annotation only |         1 | two567      | 2008-10-08 17:33:41.648074-07
00030743 | non worm                   |         1 | two1        | 2008-10-12 00:48:32.504099-07
00031377 | non worm                   |         1 | two1        | 2008-10-12 00:48:35.574224-07

Follow-up:

00006530 - paper on nomenclature of mediator subunits

some of the authors are worm people, but this is not a worm research article

the pap_ignore comment is this case probably reflected that we didn't need to do full author-person curation

this would now be handled with the author priority flag

I still think we'd want the pdf

00012980 - Drosophila atrophin paper, 2002

no elegans experimental data; the only actual mention of C. elegans is to say that elegans doesn't have an Atrophin homolog

I also can't find this paper in the concise description file, so it doesn't seem to have been used as evidence there

Bottom line - I don't know how this paper was used, but will mark as functional annotation so that it doesn't filter into any other pipelines

I don't want to delete it, though, if I don't know where else it might have been used in WormBase this paper would not typically be added to WB

00013144 - paper on ASPM and cerebral cortex size

paper mentions ASPM-1 as a homolog and shows alignments although there is no experimental elegans data

was used in the aspm-1 concise description

marked as functional annotation

00013545 - paper on regulation of PLC-mediated signalling in vivo by CDP-diacylglycerol synthase

no mention of elegans in the paper

was used in the C33H5.18 concise description

marked as functional annotation

00013634 - paper on p53 and spermatogenesis in mice

no mention of elegans in the paper

was used in the cep-1 concise description

marked as functional annotation

00028285 - no bibliographic information for this paper in postgres

checked WormBase - this paper was merged with WBPaper00013108 which was used for concise descriptions

00013108 has an alternative identifier of 00028285

made 00028285 invalid

00030743 - this paper does contain worm data, using C. elegans as a killing model for C. neoformans

okay to delete the non worm tag

00031377 - this paper has no curatable C. elegans data, but does describe a microscopy technique that is used in studying worms

okay to delete the non worm tag


==pap_internal_comment

This table contains any internal comments that curators may wish to make about the paper.

Currently, there are 114 comments, all added manually.

These comments are not dumped in the .ace file and are used for internal purposes only.

pap_journal

This table contains the journal in which a paper was published.

For PubMed-indexed journals, the name of the journal is contained within the Journal tag.

Note that not all papers in postgres use the proper ISOAbbreviation, but we are trying to clean this up as much as possible.

<Journal>

<Title>Cell</Title>

<ISOAbbreviation>Cell</ISOAbbreviation>

</Journal>

Type override?

pap_month

This table contains the month a paper was published. See pap_day table info above for PubMed XML.

In the editor, the full month name is spelled out, so any three-letter abbreviations in the PubMed XML is converted to the full name for display in the editor.

For the ACeDB dumping script, the month is converted to a number.

pap_pages

This table contains the page numbers for a paper. It can contains both numbers and letters.

pap_primary_data

This table contains information about whether a paper has primary research data. This tag is new as of 2009 and was put in place as a means to potentially sort papers based upon the likelihood of having curatable data. New papers are given primary_data as a default value, but curators can change, on the paper editor, the value to not_primary. Since PubMed does not assign Paper Type until a paper is fully indexed which can be weeks or months after a paper first appears in PubMed, this tag allows us to categorize papers regardless of Paper Type.

There are three values:

primary - given to all papers that contain experimental data

not_primary - given to all papers, such as review articles, that don't contain experimental data

not_designated - given to meeting abstracts since these often report experimental data but are not typically curated

Retrieved for Textpresso via:

/data2/data-processing/download/celegans/scripts/scripts_to_fetch_data_from_postgres/primary.com

pap_publisher

This table contains information on the publisher of books. It is currently entered manually.

pap_pubmed_final

This table has two values that are displayed solely in the paper editor and are not dumped in the .ace file:

final

not_final

It indicates whether the paper has been fully indexed by PubMed and therefore, whether the bibliographic information is complete.

pap_remark

This tag contains remarks about a paper that are meant to be included in WormBase.

They may include comments about a paper or meeting abstract that we have made in response to communication with a user.

There are currently nine paper entries that contain a remark.

pap_species

This table has been created to record the taxon ID of the nematode species studied in the paper. For example:

    00049513 | 6239        |        1 | two10877    | 2016-04-25 07:21:32.83144-07

For new papers, curators will populate this table when approving papers by selecting from an autocomplete list on the paper editor.

For existing papers, the table will be populated via two scripts:

    1) populate_pap_species_from_gene_species.pl - this script uses the species information from genes already associated with papers to 
       add correct species information
       Output is in: gene_method
    2) populate_pap_species_from_abstract_species.pl - this script performs string matching in titles and abstracts to add correct 
       species information
       Output is in: abstract_method
       Note that this script will likely need an expanded list for pattern matching
       Also include: C. elegans, C. briggsae
    Current location of scripts on the sandbox: /home/postgres/work/pgpopulation/pap_papers/20160401_species_from_gene/

Results of preliminary analysis:

       00000029 - no genes associated, no species in abstracts, species only in full text
       00000129 - no genes associated, species in title and abstract - captured by abstract_method script

pap_species_index

This table maps NCBI taxon IDs to species names. For example:

    6239    | Caenorhabditis elegans |          | two1843    | 2016-04-01 10:40:18.953828-07

pap_status

This table contains the current status of a paper.

There are two values:

valid

invalid

Papers are made invalid only when they are found to be duplicates of existing papers.

pap_title

This table contains the title of the paper object.

For PubMed-indexed papers, this information derives from the ArticleTitle tag in the corresponding XML:

<ArticleTitle>The zinc-finger protein SEA-2 regulates larval developmental timing and adult lifespan in C. elegans.</ArticleTitle>

pap_type

This table contains the numerical values for Paper Type.

Papers may have more than one type.

A more detailed discussion of Paper Type and the rules for assignment are at the links below:

Type - mapping of current to future Type

Type - mapping incoming Type

pap_type_index

This table contains the mappings between the numerical values of type to the text values. Current values:

1       | Journal_article       |           | two1841     | 2005-07-19 16:18:18.797932-07
2       | Review                |           | two1841     | 2005-07-19 16:18:18.799328-07
3       | Meeting_abstract      |           | two1841     | 2005-07-19 16:18:18.79984-07
4       | Gazette_article       |           | two1841     | 2005-07-19 16:18:18.800304-07
5       | Book_chapter          |           | two1841     | 2005-07-19 16:18:18.800755-07
6       | News                  |           | two1841     | 2005-07-19 16:18:18.801227-07
7       | Email                 |           | two1841     | 2005-07-19 16:18:18.801711-07
8       | Book                  |           | two1843     | 2010-01-22 13:44:25.465733-08
9       | Historical_article    |           | two1843     | 2010-01-22 13:44:25.465733-08
10      | Comment               |           | two1841     | 2005-07-19 16:18:18.803449-07
11      | Letter                |           | two1841     | 2005-07-19 16:18:18.803871-07
12      | Monograph             |           | two1841     | 2005-07-19 16:18:18.804305-07
13      | Editorial             |           | two1841     | 2005-07-19 16:18:18.804739-07
14      | Published_erratum     |           | two1841     | 2005-07-19 16:18:18.805171-07
15      | Retracted_publication |           | two1843     | 2010-01-22 13:44:25.465733-08
16      | Technical_report      |           | two1843     | 2010-01-22 13:44:25.465733-08
17      | Other                 |           | two324      | 2005-08-03 14:16:58.133568-07
18      | WormBook              |           | two1823     | 2006-06-02 13:58:02.301589-07
19      | Interview             |           | two1843     | 2009-07-23 16:11:32.147294-07
20      | Lectures              |           | two1843     | 2009-07-23 16:11:50.877118-07
21      | Congresses            |           | two1843     | 2009-07-23 16:12:06.780768-07
22      | Interactive_tutorial  |           | two1843     | 2009-07-23 16:12:20.299137-07
23      | Biography             |           | two1843     | 2010-01-22 13:44:25.465733-08
24      | Directory

pap_volume

This table contains the value for the volume of a journal in which a publication appears. The value can contain both numbers and letters.

For PubMed-indexed papers, the volume information is found in the XML as follows:

<Journal> <JournalIssue CitedMedium="Internet"> <Volume>138</Volume> </JournalIssue> <?Journal>

pap_year

This table contains the value for the year in which a publication is published. The value is numerical.

For PubMed-indexed papers, the year information is found in the XML as shown for pap_day.

History Tables

All paper history tables are in the format h_pap_X where X is the name of the corresponding table above.


Back to Paper Pipeline