Dumping Script

Overview:

Papers are dumped in .ace file format on every Thursday whose date is in the 20s or 30s (the 20th through the 31st of the month).
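
The date rule above can be sketched as follows. This is only an illustration in Python, not the logic of the actual wrapper script; it assumes that "20- or 30-something" means a day of the month from 20 through 31, and the function name is_dump_thursday is hypothetical.

```python
from datetime import date


def is_dump_thursday(day: date) -> bool:
    """Return True if `day` is a Thursday on the 20th-31st of its month.

    Assumption: "20- or 30-something" means day-of-month >= 20.
    """
    thursday = 3  # date.weekday() counts Monday as 0
    return day.weekday() == thursday and day.day >= 20


# 20 May 2010 was a Thursday in the 20s; 13 May 2010 was a Thursday too early in the month.
print(is_dump_thursday(date(2010, 5, 20)))
print(is_dump_thursday(date(2010, 5, 13)))
```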

Every Thursday at 4 am, a cronjob on spica copies the file from tazendra into the Data_from_Kimberly directory on spica.


Details:

The papers cronjob is on the acedb account:

0 2 * * thu /home/postgres/work/citace_upload/papers/wrapper.pl

The dumping script lives here:

/home/postgres/work/citace_upload/papers/dumpPapAce.pl

The papers.ace file is dumped automatically at 2 am on the Thursday of the upload and copied to spica at 4 am on that same Thursday.

The dumping script checks for dead gene IDs attached to papers and comments them out in the .ace file until a curator fixes them or deletes them from postgres.
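
A minimal sketch of the commenting-out step, written in Python for illustration rather than taken from dumpPapAce.pl. It assumes that gene lines in the dump look like Gene "WBGene00000001", that the dead gene IDs are already available as a set, and that lines are disabled with //, the comment marker in .ace files. The function name and the sample IDs are hypothetical.

```python
import re

# Matches a Gene tag line and captures the gene ID (the line layout is an assumption).
GENE_LINE = re.compile(r'^Gene\s+"(WBGene[0-9]{8})"')


def comment_out_dead_genes(ace_lines, dead_genes):
    """Prefix Gene lines that reference a dead gene with the .ace comment marker."""
    result = []
    for line in ace_lines:
        match = GENE_LINE.match(line)
        if match and match.group(1) in dead_genes:
            result.append("// " + line)  # line stays visible for the curator
        else:
            result.append(line)
    return result


lines = [
    'Paper : "WBPaper00000001"',
    'Gene "WBGene00000001"',
    'Gene "WBGene00000002"',
]
print("\n".join(comment_out_dead_genes(lines, {"WBGene00000002"})))
```

Keeping the line in the file as a comment, rather than dropping it, leaves a visible trace of which connections still need curator attention.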

  • How the script works:
    • General points:
      • Some fields/tags are single value (e.g., Journal, Volume) and some fields/tags are multi-value (e.g., Identifier, Gene). The full list of fields classified according to single- or multi-value is listed at the top of the script.
      • There is close, but not exact, correspondence between the paper table names in postgres (pap_nnnnn) and the tag names in the ?Paper model for ACeDB. At the top of the script, there is a list that maps postgres paper table names to ?Paper model tag names.
      • Paper types are indicated using an index table, pap_type_index. The same approach is used for pap_author_index and for the possible evidence for author-person connections.
      • Data from the different paper tables is queried and put into a hash.
      • Dead genes are identified by querying the gin_dead table and are NOT put into the hash.
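
The general points above can be sketched roughly as follows. This is an illustration in Python, not the script itself: the table names pap_journal, pap_volume, pap_identifier and pap_gene, the table-to-tag mapping, and the sample rows are assumptions made for the example, and the real lists are the ones at the top of the script.

```python
# Hypothetical excerpts of the lists kept at the top of the script.
SINGLE_VALUE = {"pap_journal", "pap_volume"}
MULTI_VALUE = {"pap_identifier", "pap_gene"}
TABLE_TO_TAG = {
    "pap_journal": "Journal",
    "pap_volume": "Volume",
    "pap_identifier": "Identifier",
    "pap_gene": "Gene",
}


def build_paper_hash(rows, dead_genes):
    """Collect (paper_id, table, value) rows into {paper_id: {tag: value(s)}}.

    Single-value tags keep one value; multi-value tags accumulate a list.
    Gene values found in `dead_genes` are skipped.
    """
    papers = {}
    for paper_id, table, value in rows:
        if table == "pap_gene" and value in dead_genes:
            continue  # dead genes are not put into the hash
        tag = TABLE_TO_TAG[table]
        entry = papers.setdefault(paper_id, {})
        if table in SINGLE_VALUE:
            entry[tag] = value
        elif table in MULTI_VALUE:
            entry.setdefault(tag, []).append(value)
    return papers


rows = [
    ("00000001", "pap_journal", "Genetics"),
    ("00000001", "pap_gene", "WBGene00000001"),
    ("00000001", "pap_gene", "WBGene00000002"),
]
print(build_paper_hash(rows, dead_genes={"WBGene00000002"}))
```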

The AQL query that finds all dead genes in WB is:

select all class gene where ->Species like "*elegans" and ->Status like "Dead"

The dumping script also checks that all associated genes are in the format WBGenennnnnnnn, where each of the eight n's is a digit.
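
The format check can be expressed as a regular expression. The sketch below is in Python for illustration and is not taken from the dumping script; the function name is_valid_wbgene is hypothetical.

```python
import re

# "WBGene" followed by exactly eight ASCII digits.
WBGENE_PATTERN = re.compile(r"WBGene[0-9]{8}")


def is_valid_wbgene(gene_id: str) -> bool:
    """Return True if the whole string is WBGene plus eight digits."""
    return WBGENE_PATTERN.fullmatch(gene_id) is not None


print(is_valid_wbgene("WBGene00000001"))  # eight digits
print(is_valid_wbgene("WBGene0000001"))   # only seven digits
```

Using fullmatch rather than a plain search means IDs with extra leading or trailing characters are rejected as well.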

