Updating Postgres with New WS Information

From WormBaseWiki
Jump to navigationJump to search

UPDATES IN PROGRESS (2022-01-07)

This page documents how information from the latest WS release is used to populate postgres tables on tazendra for curation.

Scripts

Dumping .ace files from latest WS on spica

  • The script that dumps select .ace files from the latest WS is on spica here:
    • /home/citpub/wen/WSDump/forKimberly.pl
  • The .ace files dumped are listed below. Currently, these files contain all members of a given class, regardless of species.
    • WSCDS.ace - used to populate gin_ tables
    • WSExpr_pattern.ace - used to make paper-data type connections for UniProt
    • WSGene.ace - used to populate gin_ tables
    • WSGO_annotation.ace - used to make paper-data type connections for UniProt
    • WSRNAi.ace - used to make paper-data type connections for UniProt
    • WSVariation.ace - used to supplement variation-gene connections not found in the nightly geneace dump; also for UniProt?

Mirroring dumped .ace files on spica to tazendra

  • The script that syncs select .ace files from the latest WS on spica to tazendra is here:
    • /home/citpub/wen/WSDump/rsyncWSDump2tazendra.sh
  • This script uses password-less rsync, to copy all .ace files from spica:/home/citpub/wen/WSDump/ to tazendra:/home2/acedb/cron/dump_from_ws/files/
  • Note that as of 2022-01-05, older files were archived here: tazendra:/tmp/files
    • These files will go away after a tazendra reboot

Populating relevant postgres tables with WS data

  • In the home2/acedb/cron/dump_from_ws/files directory, there is a file, latestDate, that logs the dates on which the existing .ace files are replaced with the more current files from the latest WS release
    • Question for Juancarlos - how does the latestDate file get updated?
  • This script populates:
    • gin_molname
    • gin_sequence
    • gin_seqprot
    • gin_molname
    • obo_name_exprcluster.pg - might not be needed
    • obo_data_exprcluster.pg - might not be needed

Files in /home2/acedb/cron/dump_from_ws/files/ are generated by Wen and uploaded here.

Date last reviewed: 2022-01-07

ARCHIVED OLD DOCUMENTATION

There are 2 cronjobs and 3 scripts:

in the acedb account:

cronjob 1 on script 1 (runs daily at 3am PST) :

0 3 * * * /home/acedb/cron/update_ws_tazendra.pl

Also triggers to run 2nd script : /home/acedb/cron/dump_from_ws.sh # dump .ace files from WS to update postgres tables

and write to a timestamp file my $dateOfWsDumpFile = /home3/acedb/cron/dump_from_ws/files/latestDate; echo "$date" >> $dateOfWsDumpFile;

in the postgres account :

cronjob 2 on script 3 (runs daily at 5am PST) :

0 5 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/populate_pg_from_ws.pl

This script went live on tazendra on 2013 10 21

Script 1

0 3 * * * /home/acedb/cron/update_ws_tazendra.pl

Every day at 3 am, check ftp.sanger.ac.uk for a new WS release. If there's a new one:

  • download
  • install
  • put date in /home3/acedb/cron/dump_from_ws/files/latestDate
  • run /home/acedb/cron/dump_from_ws.sh

Script 2

/home/acedb/cron/dump_from_ws.sh

  • This script will write to files in directory /home3/acedb/cron/dump_from_ws/files/
  • Use tace to access latest WS
  • Dump whole Expression_cluster class into WSExpression_cluster.ace
  • Dump whole CDS class into WSCDS.ace
  • Dump whole Gene class into WSGene.ace
  • Dump whole Feature class into WSFeature.ace
cd /home3/acedb/cron/dump_from_ws/files
export ACEDB=/home3/acedb/ws/acedb
#Start dumping files
/home/acedb/bin/tace <<END_TACE
find Expression_cluster
show -a -f WSExpression_cluster.ace
find CDS
show -a -f WSCDS.ace
find Gene
show -a -f WSGene.ace
find Feature
show -a -f WSFeature.ace
quit
END_TACE

Script 3

/home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/populate_pg_from_ws.pl


Script 3 generates files to populate postgres tables for some gin_ tables and obo_ tables (for 'exprcluster' and 'feature') when the date in the timestamp file (see above) is more recent than the latest timestamp in gin_molname. In other words, if WS is more recent than the postgres timestamp, the tables are updated (Lines 34 - 55). Then populates postgres based on those files (it's more time efficient to populate the whole table from flatfiles, but requires doing this from the postgres account instead of the acedb account ; conversely, WS is in the acedb account, so the files generated by the scripts above have to be in that account).


use Ace;
use strict;
use diagnostics;
use Jex;
use DBI;

my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n";
my $result = '';

my $ws_files_dir = '/home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/files/';      # for testing on mangolassi
# my $ws_files_dir = '/home3/acedb/cron/dump_from_ws/files/';   # on tazendra to go live
my $pg_files_dir = '/home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/files/';

my $latestUpdateFile = '/home3/acedb/cron/dump_from_ws/files/latestDate';
open (IN, "<$latestUpdateFile") or die "Cannot open $latestUpdateFile : $!";
my ($wsUpdateDate) = <IN>; chomp $wsUpdateDate;
close (IN) or die "Cannot close $latestUpdateFile : $!";

$result = $dbh->prepare( "SELECT * FROM gin_molname ORDER BY gin_timestamp DESC;" );
$result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
my @row = $result->fetchrow(); my $pgTimestamp = $row[2];
my ($year, $month, $day) = $pgTimestamp =~ m/^(\d{4})-(\d{2})-(\d{2})/;
my $fromPgSimpleDate = $year . $month . $day;

# if ($wsUpdateDate > $fromPgSimpleDate) { print "do stuff\n"; } else { print "do nothing\n"; }

my $timestamp = &getPgDate();
if ($wsUpdateDate > $fromPgSimpleDate) {                        # if ws was updated more recently than postgres timestamp, do the updates
  &processGeneCds();
  &processExprCluster();
  &processFeature();
#   print "DONE\n";
}


Data type specific updating

Code to update Gene Information in gin_ tables

For genes (gin_ tables), this script is species-restricted for C. elegans only (we can change this, if needed).

This script populates information that can only be derived from the WS build:

  1. Corresponding_transcript
  2. Corresponding_CDS
  3. Corresponding_protein
  4. Molecular_name

If the WS timestamp is more recent than the postgres timestamp, then the subroutine processGeneCds (Line 57) will run.

Values and timestamps are written to a file, then the postgres tables are deleted, values from the files are copied to the postgres tables, and then the files are closed.

For populating gin_protein, the subroutine looks at the CDS objects (WSCDS.ace) and retrieves the Corresponding_protein. To map genes to proteins, CDS and protein are matched (gin_seqprot), genes are matched to CDS (gin_seqprot) and then mapped to protein (gin_protein).

For populating gin_molname (Molecular_name), the subroutine looks at gene objects (WSGene.ace) and adds the value of Molecular_name to gin_molname.

For populating gin_sequence (Corresponding_transcript and Corresponding_CDS) the subroutine looks at gene objects (WSGene.ace) and adds the value to gin_sequence.


We'll know when WS141 comes out (should be mid-to-late December 2013) and cronjob 1 picks it up, if this pipeline works automatically.


sub processGeneCds {
  my @pgcommands;
  my %cdsToProt;
  my $molfile = $pg_files_dir . 'gin_molname.pg';
  my $seqfile = $pg_files_dir . 'gin_sequence.pg';
  my $protfile = $pg_files_dir . 'gin_protein.pg';
  my $seqprotfile = $pg_files_dir . 'gin_seqprot.pg';
  open (MOL, ">$molfile") or die "Cannot create $molfile : $!";
  open (SEQ, ">$seqfile") or die "Cannot create $seqfile : $!";
  open (PRO, ">$protfile") or die "Cannot create $protfile : $!";
  open (SPR, ">$seqprotfile") or die "Cannot create $seqprotfile : $!";
  $/ = "";
  my $infile = $ws_files_dir . 'WSCDS.ace';
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/CDS : \"/);
    $entry =~ s/\\//g;                          # take out all backslashes
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $cds = '';
    if ($header =~ m/CDS : \"([^"]+)\"/) { $cds = $1; }
    foreach my $line (@lines) {
      if ($line =~ m/^Corresponding_protein\t \"(.*)\"/) { $cdsToProt{$cds} = $1; } }
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $infile = $ws_files_dir . 'WSGene.ace';
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/Gene : \"/);
    $entry =~ s/\\//g;                          # take out all backslashes
    my %data;
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $joinkey = '';
    if ($header =~ m/Gene : \"WBGene([^"]+)\"/) { $joinkey = $1; }
    my @tags = qw( Molecular_name Corresponding_transcript Corresponding_CDS );
    foreach my $line (@lines) {
      foreach my $tag (@tags) {
        if ($line =~ m/^$tag\t \"(.*)\"/) { $data{$tag}{$1}++; } } }
    foreach my $molname (sort keys %{ $data{"Molecular_name"} }) {
      print MOL qq($joinkey\t$molname\t$timestamp\n); }
    foreach my $sequence (sort keys %{ $data{"Corresponding_transcript"} }) {
      print SEQ qq($joinkey\t$sequence\t$timestamp\n); }
    foreach my $sequence (sort keys %{ $data{"Corresponding_CDS"} }) {
      print SEQ qq($joinkey\t$sequence\t$timestamp\n);
      if ($cdsToProt{$sequence}) {
        my $protein = $cdsToProt{$sequence};
        print SPR qq($joinkey\t$sequence\t$protein\t$timestamp\n);
        print PRO qq($joinkey\t$protein\t$timestamp\n); } }
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $/ = "\n";
  close (MOL) or die "Cannot close $molfile : $!";
  close (SEQ) or die "Cannot close $seqfile : $!";
  close (PRO) or die "Cannot close $protfile : $!";
  close (SPR) or die "Cannot close $seqprotfile : $!";
  push @pgcommands, qq(DELETE FROM gin_molname;);
  push @pgcommands, qq(DELETE FROM gin_sequence;);
  push @pgcommands, qq(DELETE FROM gin_protein;);
  push @pgcommands, qq(DELETE FROM gin_seqprot;);
  push @pgcommands, qq(COPY gin_molname  FROM '$molfile';);
  push @pgcommands, qq(COPY gin_sequence FROM '$seqfile';);
  push @pgcommands, qq(COPY gin_protein  FROM '$protfile';);
  push @pgcommands, qq(COPY gin_seqprot  FROM '$seqprotfile';);

  foreach my $pgcommand (@pgcommands) {
    print qq($pgcommand\n);
    $dbh->do( $pgcommand );
  } # foreach my $pgcommand (@pgcommands)
} # sub processGeneCds

Code to update Expression Cluster Information in obo_*_exprcluster tables

sub processExprCluster {
  my @pgcommands;
  my @tags = qw( Description Reference Remark );
  my %tags; foreach (@tags) { $tags{$_}++; }
  my $name_file = $pg_files_dir . 'obo_name_exprcluster.pg';
  my $data_file = $pg_files_dir . 'obo_data_exprcluster.pg';
  open (NAME, ">$name_file") or die "Cannot create $name_file : $!";
  open (DATA, ">$data_file") or die "Cannot create $data_file : $!";
  my $infile = $ws_files_dir . 'WSExpression_cluster.ace';
  $/ = "";
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/Expression_cluster : \"/);
    $entry =~ s/\\//g;                          # take out all backslashes
    my %data;
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $id = '';
    if ($header =~ m/Expression_cluster : \"([^"]+)\"/) { $id = $1; }
    next if ($id =~ m/^WBPaper00029359/);       # skip for Karen, chronograms.  2013 10 01
    my @all_data; push @all_data, qq(id: $id); push @all_data, qq(name: "$id");
    foreach my $line (@lines) {
      foreach my $tag (@tags) {
        if ($line =~ m/^$tag\t \"(.*)\"/) { $data{$tag}{$1}++; } } }
    foreach my $tag (@tags) {
      foreach my $data (sort keys %{ $data{$tag} }) {
        my $lctag = lc($tag);
        push @all_data, qq($lctag: "$data"); } }
    my $all_data = join"\\n", @all_data;
    if ($all_data =~ m/\'/) { $all_data =~ s/\'/''/g; }
    print NAME qq($id\t$id\t$timestamp\n);
    print DATA qq($id\t$all_data\t$timestamp\n);
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $/ = "\n";
  close (DATA) or die "Cannot close $data_file : $!";
  close (NAME) or die "Cannot close $name_file : $!";
  push @pgcommands, qq(DELETE FROM obo_name_exprcluster;);
  push @pgcommands, qq(DELETE FROM obo_data_exprcluster;);
  push @pgcommands, qq(COPY obo_name_exprcluster FROM '$name_file';);
  push @pgcommands, qq(COPY obo_data_exprcluster FROM '$data_file';);

  foreach my $pgcommand (@pgcommands) {
    print qq($pgcommand\n);
    $dbh->do( $pgcommand );
  } # foreach my $pgcommand (@pgcommands)
} # sub processExprCluster

Code to update Sequence Feature Information in obo_*_feature tables

sub processFeature {
  my @pgcommands;
  my @tags = qw( Public_name DNA_text Species Defined_by_paper Associated_with_gene Associated_with_Interaction Associated_with_expression_pattern Bound_by_product_of Transcription_factor Method );

The code above specifies which ACEDB tags we are pulling ?Feature object data from.

  my %allowedMethods; my %expectedMethods; my %unexpectedMethods;
  my @allowedMethods = qw( binding_site binding_site_region DNAseI_hypersensitive_site enhancer histone_binding_site_region promoter regulatory_region TF_binding_site TF_binding_site_region );
  my @rejectedMethods = qw( Corrected_genome_sequence_error Genome_sequence_error history history_feature micro_ORF polyA_signal_sequence polyA_site segmental_duplication SL1 SL2 three_prime_UTR transcription_end_site transcription_start_site TSS_region );

  foreach (@allowedMethods) { $allowedMethods{$_}++; $expectedMethods{$_}++; }
  foreach (@rejectedMethods) { $expectedMethods{$_}++; }

The code above specifies which ?Feature class 'Method' entries are allowed (entered into @allowedMethods), rejected (entered into @rejectedMethods), expected (combination of allowed and rejected), or unexpected (anything 'Method' entry that isn't in the expected list).

  my %tags; foreach (@tags) { $tags{$_}++; }
  my $name_file = $pg_files_dir . 'obo_name_feature.pg';
  my $data_file = $pg_files_dir . 'obo_data_feature.pg';
  open (NAME, ">$name_file") or die "Cannot create $name_file : $!";
  open (DATA, ">$data_file") or die "Cannot create $data_file : $!";

We're writing the obo_name_feature/obo_data_feature postgres tables, so we're writing to corresponding flatfiles first. obo_name_feature table contains ?Feature object names only (e.g. 'WBsf019089') and the obo_data_feature table contains all info displayed for a ?Feature object in the 'Term Info' panel of the OA.

  my $infile = $ws_files_dir . 'WSFeature.ace';
  $/ = "";
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/Feature : \"WBsf/);
    next unless ($entry =~ m/Method\s+\"(.*)\"/);

The code looks for the 'WSFeature.ace' .ACE file to read in all of the feature data from the WS release. The code ignores any .ACE entries that do not contain the text 'Feature : "WBsf' or that are missing a 'Method' tag value (i.e. there needs to be an entry for the 'Method' tag, like 'TF_binding_site').

    my $method = $1;
    unless ($expectedMethods{$method}) { $unexpectedMethods{$method}++; }
    next unless ($allowedMethods{$method});

If the 'Method' entry is not expected, it gets added to the list of unexpected 'Method' entries. If the 'Method' entry is not in the '%allowedMethods' hash, then the whole ?Feature object entry (.ace entry) will get skipped.

    $entry =~ s/\\//g;                          # take out all backslashes

The code removes any backslashes that might be present in the .ACE file, as ACEDB can often add backslash 'escape' characters for certain text entries/characters.

    my %data;
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $id = '';
    if ($header =~ m/Feature : \"(WBsf\d+)\"/) { $id = $1; }

For each .ACE entry the code will pull out the header info such as the 'WBsf' ID. If the naming scheme (or model) changes such that we don't refer to ?Feature objects by the 'WBsf' prefix, then this code will need to be changed accordingly.

    my @all_data; push @all_data, qq(id: $id); push @all_data, qq(name: "$id");
    foreach my $line (@lines) {
      foreach my $tag (@tags) {
        if ($line =~ m/^$tag\t \"(.*)\"/) { $data{$tag}{$1}++; } } }
    foreach my $tag (@tags) {
      foreach my $data (sort keys %{ $data{$tag} }) {
        my $lctag = lc($tag);
        push @all_data, qq($lctag: "$data"); } }

For each line in the .ACE file, the data tag is checked to see if it is an 'allowed' tag. If so, the value for the tag is entered into a filter for the Term Info, with the lower-cased tag as the Term Info tag.

    my $all_data = join"\\n", @all_data;
    if ($all_data =~ m/\'/) { $all_data =~ s/\'/''/g; }

All lines entered for the Term Info are separated by an escaped '\n' ('\\n') so that they go into Postgres as a valid newline character '\n'. All apostrophes (') are escaped for Postgres into double apostrophes ().

    print NAME qq($id\t$id\t$timestamp\n);
    print DATA qq($id\t$all_data\t$timestamp\n);

All data entered have the timestamp that was determined previously by the script (see above), and all data is written into the corresponding files.

  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $/ = "\n";
  close (DATA) or die "Cannot close $data_file : $!";
  close (NAME) or die "Cannot close $name_file : $!";
  push @pgcommands, qq(DELETE FROM obo_name_feature;);
  push @pgcommands, qq(DELETE FROM obo_data_feature;);
  push @pgcommands, qq(COPY obo_name_feature FROM '$name_file';);
  push @pgcommands, qq(COPY obo_data_feature FROM '$data_file';);

The files are closed, the data from the Postgres tables are deleted, and copied from the corresponding files.

  if (scalar %unexpectedMethods > 0) {
    my $body = join", ", keys %unexpectedMethods;
    my $user = 'populate_pg_from_ws.pl';
    my $email = 'cgrove@caltech.edu';
    my $subject = 'unexpected feature methods';
    &mailer($user, $email, $subject, $body);
  } # if (scalar %unexpectedMethods > 0)

If any unexpected 'Method' entries are encountered, an e-mail will be sent to Chris Grove.

  foreach my $pgcommand (@pgcommands) {
    print qq($pgcommand\n);
    $dbh->do( $pgcommand );
  } # foreach my $pgcommand (@pgcommands)
} # sub processFeature