Updating Postgres with New WS Information

From WormBaseWiki
Revision as of 23:46, 14 November 2013 by Cgrove (talk | contribs)
Jump to navigationJump to search

This page is dedicated to documenting the script:

populate_pg_from_ws.pl

located at:

/home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates

which runs as a cronjob at 5am Pacific time every day:

0 5 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/populate_pg_from_ws.pl

Opening Code

use Ace;
use strict;
use diagnostics;
use Jex;
use DBI;

my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n";
my $result = '';

my $ws_files_dir = '/home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/files/';      # for testing on mangolassi
# my $ws_files_dir = '/home3/acedb/cron/dump_from_ws/files/';   # on tazendra to go live
my $pg_files_dir = '/home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/files/';

my $latestUpdateFile = '/home3/acedb/cron/dump_from_ws/files/latestDate';
open (IN, "<$latestUpdateFile") or die "Cannot open $latestUpdateFile : $!";
my ($wsUpdateDate) = <IN>; chomp $wsUpdateDate;
close (IN) or die "Cannot close $latestUpdateFile : $!";

$result = $dbh->prepare( "SELECT * FROM gin_molname ORDER BY gin_timestamp DESC;" );
$result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
my @row = $result->fetchrow(); my $pgTimestamp = $row[2];
my ($year, $month, $day) = $pgTimestamp =~ m/^(\d{4})-(\d{2})-(\d{2})/;
my $fromPgSimpleDate = $year . $month . $day;

# if ($wsUpdateDate > $fromPgSimpleDate) { print "do stuff\n"; } else { print "do nothing\n"; }

my $timestamp = &getPgDate();
if ($wsUpdateDate > $fromPgSimpleDate) {                        # if ws was updated more recently than postgres timestamp, do the updates
  &processGeneCds();
  &processExprCluster();
  &processFeature();
#   print "DONE\n";
}

Code to update Gene Information in gin_ tables

sub processGeneCds {
  my @pgcommands;
  my %cdsToProt;
  my $molfile = $pg_files_dir . 'gin_molname.pg';
  my $seqfile = $pg_files_dir . 'gin_sequence.pg';
  my $protfile = $pg_files_dir . 'gin_protein.pg';
  my $seqprotfile = $pg_files_dir . 'gin_seqprot.pg';
  open (MOL, ">$molfile") or die "Cannot create $molfile : $!";
  open (SEQ, ">$seqfile") or die "Cannot create $seqfile : $!";
  open (PRO, ">$protfile") or die "Cannot create $protfile : $!";
  open (SPR, ">$seqprotfile") or die "Cannot create $seqprotfile : $!";
  $/ = "";
  my $infile = $ws_files_dir . 'WSCDS.ace';
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/CDS : \"/);
    $entry =~ s/\\//g;                          # take out all backslashes
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $cds = '';
    if ($header =~ m/CDS : \"([^"]+)\"/) { $cds = $1; }
    foreach my $line (@lines) {
      if ($line =~ m/^Corresponding_protein\t \"(.*)\"/) { $cdsToProt{$cds} = $1; } }
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $infile = $ws_files_dir . 'WSGene.ace';
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/Gene : \"/);
    $entry =~ s/\\//g;                          # take out all backslashes
    my %data;
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $joinkey = '';
    if ($header =~ m/Gene : \"WBGene([^"]+)\"/) { $joinkey = $1; }
    my @tags = qw( Molecular_name Corresponding_transcript Corresponding_CDS );
    foreach my $line (@lines) {
      foreach my $tag (@tags) {
        if ($line =~ m/^$tag\t \"(.*)\"/) { $data{$tag}{$1}++; } } }
    foreach my $molname (sort keys %{ $data{"Molecular_name"} }) {
      print MOL qq($joinkey\t$molname\t$timestamp\n); }
    foreach my $sequence (sort keys %{ $data{"Corresponding_transcript"} }) {
      print SEQ qq($joinkey\t$sequence\t$timestamp\n); }
    foreach my $sequence (sort keys %{ $data{"Corresponding_CDS"} }) {
      print SEQ qq($joinkey\t$sequence\t$timestamp\n);
      if ($cdsToProt{$sequence}) {
        my $protein = $cdsToProt{$sequence};
        print SPR qq($joinkey\t$sequence\t$protein\t$timestamp\n);
        print PRO qq($joinkey\t$protein\t$timestamp\n); } }
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $/ = "\n";
  close (MOL) or die "Cannot close $molfile : $!";
  close (SEQ) or die "Cannot close $seqfile : $!";
  close (PRO) or die "Cannot close $protfile : $!";
  close (SPR) or die "Cannot close $seqprotfile : $!";
  push @pgcommands, qq(DELETE FROM gin_molname;);
  push @pgcommands, qq(DELETE FROM gin_sequence;);
  push @pgcommands, qq(DELETE FROM gin_protein;);
  push @pgcommands, qq(DELETE FROM gin_seqprot;);
  push @pgcommands, qq(COPY gin_molname  FROM '$molfile';);
  push @pgcommands, qq(COPY gin_sequence FROM '$seqfile';);
  push @pgcommands, qq(COPY gin_protein  FROM '$protfile';);
  push @pgcommands, qq(COPY gin_seqprot  FROM '$seqprotfile';);

  foreach my $pgcommand (@pgcommands) {
    print qq($pgcommand\n);
    $dbh->do( $pgcommand );
  } # foreach my $pgcommand (@pgcommands)
} # sub processGeneCds

Code to update Expression Cluster Information in obo_*_exprcluster tables

sub processExprCluster {
  my @pgcommands;
  my @tags = qw( Description Reference Remark );
  my %tags; foreach (@tags) { $tags{$_}++; }
  my $name_file = $pg_files_dir . 'obo_name_exprcluster.pg';
  my $data_file = $pg_files_dir . 'obo_data_exprcluster.pg';
  open (NAME, ">$name_file") or die "Cannot create $name_file : $!";
  open (DATA, ">$data_file") or die "Cannot create $data_file : $!";
  my $infile = $ws_files_dir . 'WSExpression_cluster.ace';
  $/ = "";
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/Expression_cluster : \"/);
    $entry =~ s/\\//g;                          # take out all backslashes
    my %data;
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $id = '';
    if ($header =~ m/Expression_cluster : \"([^"]+)\"/) { $id = $1; }
    next if ($id =~ m/^WBPaper00029359/);       # skip for Karen, chronograms.  2013 10 01
    my @all_data; push @all_data, qq(id: $id); push @all_data, qq(name: "$id");
    foreach my $line (@lines) {
      foreach my $tag (@tags) {
        if ($line =~ m/^$tag\t \"(.*)\"/) { $data{$tag}{$1}++; } } }
    foreach my $tag (@tags) {
      foreach my $data (sort keys %{ $data{$tag} }) {
        my $lctag = lc($tag);
        push @all_data, qq($lctag: "$data"); } }
    my $all_data = join"\\n", @all_data;
    if ($all_data =~ m/\'/) { $all_data =~ s/\'/''/g; }
    print NAME qq($id\t$id\t$timestamp\n);
    print DATA qq($id\t$all_data\t$timestamp\n);
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $/ = "\n";
  close (DATA) or die "Cannot close $data_file : $!";
  close (NAME) or die "Cannot close $name_file : $!";
  push @pgcommands, qq(DELETE FROM obo_name_exprcluster;);
  push @pgcommands, qq(DELETE FROM obo_data_exprcluster;);
  push @pgcommands, qq(COPY obo_name_exprcluster FROM '$name_file';);
  push @pgcommands, qq(COPY obo_data_exprcluster FROM '$data_file';);

  foreach my $pgcommand (@pgcommands) {
    print qq($pgcommand\n);
    $dbh->do( $pgcommand );
  } # foreach my $pgcommand (@pgcommands)
} # sub processExprCluster

Code to update Sequence Feature Information in obo_*_feature tables

sub processFeature {
  my @pgcommands;
  my @tags = qw( Public_name DNA_text Species Defined_by_paper Associated_with_gene Associated_with_Interaction Associated_with_expression_pattern Bound_by_product_of Transcription_factor Method );
  my %allowedMethods; my %expectedMethods; my %unexpectedMethods;
  my @allowedMethods = qw( binding_site binding_site_region DNAseI_hypersensitive_site enhancer histone_binding_site_region promoter regulatory_region TF_binding_site TF_binding_site_region );
  my @rejectedMethods = qw( Corrected_genome_sequence_error Genome_sequence_error history history_feature micro_ORF polyA_signal_sequence polyA_site segmental_duplication SL1 SL2 three_prime_UTR transcription_end_site transcription_start_site TSS_region );

  foreach (@allowedMethods) { $allowedMethods{$_}++; $expectedMethods{$_}++; }
  foreach (@rejectedMethods) { $expectedMethods{$_}++; }
  my %tags; foreach (@tags) { $tags{$_}++; }
  my $name_file = $pg_files_dir . 'obo_name_feature.pg';
  my $data_file = $pg_files_dir . 'obo_data_feature.pg';
  open (NAME, ">$name_file") or die "Cannot create $name_file : $!";
  open (DATA, ">$data_file") or die "Cannot create $data_file : $!";
  my $infile = $ws_files_dir . 'WSFeature.ace';
  $/ = "";
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/Feature : \"WBsf/);
    next unless ($entry =~ m/Method\s+\"(.*)\"/);
    my $method = $1;
    unless ($expectedMethods{$method}) { $unexpectedMethods{$method}++; }
    next unless ($allowedMethods{$method});
    $entry =~ s/\\//g;                          # take out all backslashes
    my %data;
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $id = '';
    if ($header =~ m/Feature : \"(WBsf\d+)\"/) { $id = $1; }
    my @all_data; push @all_data, qq(id: $id); push @all_data, qq(name: "$id");
    foreach my $line (@lines) {
      foreach my $tag (@tags) {
        if ($line =~ m/^$tag\t \"(.*)\"/) { $data{$tag}{$1}++; } } }
    foreach my $tag (@tags) {
      foreach my $data (sort keys %{ $data{$tag} }) {
        my $lctag = lc($tag);
        push @all_data, qq($lctag: "$data"); } }
    my $all_data = join"\\n", @all_data;
    if ($all_data =~ m/\'/) { $all_data =~ s/\'/''/g; }
    print NAME qq($id\t$id\t$timestamp\n);
    print DATA qq($id\t$all_data\t$timestamp\n);
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $/ = "\n";
  close (DATA) or die "Cannot close $data_file : $!";
  close (NAME) or die "Cannot close $name_file : $!";
  push @pgcommands, qq(DELETE FROM obo_name_feature;);
  push @pgcommands, qq(DELETE FROM obo_data_feature;);
  push @pgcommands, qq(COPY obo_name_feature FROM '$name_file';);
  push @pgcommands, qq(COPY obo_data_feature FROM '$data_file';);

  if (scalar %unexpectedMethods > 0) {
    my $body = join", ", keys %unexpectedMethods;
    my $user = 'populate_pg_from_ws.pl';
    my $email = 'cgrove@caltech.edu';
    my $subject = 'unexpected feature methods';
    &mailer($user, $email, $subject, $body);
  } # if (scalar %unexpectedMethods > 0)

  foreach my $pgcommand (@pgcommands) {
    print qq($pgcommand\n);
    $dbh->do( $pgcommand );
  } # foreach my $pgcommand (@pgcommands)
} # sub processFeature


For updating from latest WS release:

There are 2 cronjobs and 3 scripts:

in the acedb account:

cronjob 1 on script 1 (runs daily at 3am PST) : 0 3 * * * /home/acedb/cron/update_ws_tazendra.pl

Also triggers to run 2nd script : /home/acedb/cron/dump_from_ws.sh # dump .ace files from WS to update postgres tables

and write to a timestamp file my $dateOfWsDumpFile = /home3/acedb/cron/dump_from_ws/files/latestDate; echo "$date" >> $dateOfWsDumpFile;

in the postgres account :

cronjob 2 on script 3 (runs daily at 5am PST) : 0 5 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/populate_pg_from_ws.pl

This script went live on tazendra on 2013 10 21

This script generates files to populate postgres tables for some gin_ tables and obo_ tables for exprcluster when the date in the timestamp file (see above) is more recent than the latest timestamp in gin_molname. In other words, if WS is more recent than the postgres timestamp, the tables are updated (Lines 34 - 55).

For genes (gin_ tables), this script is species-restricted for C. elegans only (we can change this, if needed).

This script populates information that can only be derived from the WS build:

  1. Corresponding_transcript
  2. Corresponding_CDS
  3. Corresponding_protein
  4. Molecular_name

If the WS timestamp is more recent than the postgres timestamp, then the subroutine processGeneCds (Line 57) will run.

Values and timestamps are written to a file, then the postgres tables are deleted, values from the files are copied to the postgres tables, and then the files are closed.

For populating gin_protein, the subroutine looks at the CDS objects (WSCDS.ace) and retrieves the Corresponding_protein. To map genes to proteins, CDS and protein are matched (gin_seqprot), genes are matched to CDS (gin_seqprot) and then mapped to protein (gin_protein).

For populating gin_molname (Molecular_name), the subroutine looks at gene objects (WSGene.ace) and adds the value of Molecular_name to gin_molname.

For populating gin_sequence (Corresponding_transcript and Corresponding_CDS) the subroutine looks at gene objects (WSGene.ace) and adds the value to gin_sequence.


We'll know when WS141 comes out (should be mid-to-late December 2013) and cronjob 1 picks it up, if this pipeline works automatically.