Updating Postgres with New WS Information

From WormBaseWiki
Revision as of 23:48, 14 November 2013 by Cgrove (talk | contribs)
Jump to navigationJump to search

This page is dedicated to documenting the three scripts that update WS based on the Hinxton release, dump out .ace files for specific classes, and populate Postgres based on that data.


Script 3

populate_pg_from_ws.pl

located at:

/home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates

which runs as a cronjob at 5am Pacific time every day:

0 5 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/populate_pg_from_ws.pl


use Ace;
use strict;
use diagnostics;
use Jex;
use DBI;

my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n";
my $result = '';

my $ws_files_dir = '/home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/files/';      # for testing on mangolassi
# my $ws_files_dir = '/home3/acedb/cron/dump_from_ws/files/';   # on tazendra to go live
my $pg_files_dir = '/home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/files/';

my $latestUpdateFile = '/home3/acedb/cron/dump_from_ws/files/latestDate';
open (IN, "<$latestUpdateFile") or die "Cannot open $latestUpdateFile : $!";
my ($wsUpdateDate) = <IN>; chomp $wsUpdateDate;
close (IN) or die "Cannot close $latestUpdateFile : $!";

$result = $dbh->prepare( "SELECT * FROM gin_molname ORDER BY gin_timestamp DESC;" );
$result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
my @row = $result->fetchrow(); my $pgTimestamp = $row[2];
my ($year, $month, $day) = $pgTimestamp =~ m/^(\d{4})-(\d{2})-(\d{2})/;
my $fromPgSimpleDate = $year . $month . $day;

# if ($wsUpdateDate > $fromPgSimpleDate) { print "do stuff\n"; } else { print "do nothing\n"; }

my $timestamp = &getPgDate();
if ($wsUpdateDate > $fromPgSimpleDate) {                        # if ws was updated more recently than postgres timestamp, do the updates
  &processGeneCds();
  &processExprCluster();
  &processFeature();
#   print "DONE\n";
}


Data type specific updating

Code to update Gene Information in gin_ tables

sub processGeneCds {
  my @pgcommands;
  my %cdsToProt;
  my $molfile = $pg_files_dir . 'gin_molname.pg';
  my $seqfile = $pg_files_dir . 'gin_sequence.pg';
  my $protfile = $pg_files_dir . 'gin_protein.pg';
  my $seqprotfile = $pg_files_dir . 'gin_seqprot.pg';
  open (MOL, ">$molfile") or die "Cannot create $molfile : $!";
  open (SEQ, ">$seqfile") or die "Cannot create $seqfile : $!";
  open (PRO, ">$protfile") or die "Cannot create $protfile : $!";
  open (SPR, ">$seqprotfile") or die "Cannot create $seqprotfile : $!";
  $/ = "";
  my $infile = $ws_files_dir . 'WSCDS.ace';
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/CDS : \"/);
    $entry =~ s/\\//g;                          # take out all backslashes
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $cds = '';
    if ($header =~ m/CDS : \"([^"]+)\"/) { $cds = $1; }
    foreach my $line (@lines) {
      if ($line =~ m/^Corresponding_protein\t \"(.*)\"/) { $cdsToProt{$cds} = $1; } }
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $infile = $ws_files_dir . 'WSGene.ace';
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/Gene : \"/);
    $entry =~ s/\\//g;                          # take out all backslashes
    my %data;
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $joinkey = '';
    if ($header =~ m/Gene : \"WBGene([^"]+)\"/) { $joinkey = $1; }
    my @tags = qw( Molecular_name Corresponding_transcript Corresponding_CDS );
    foreach my $line (@lines) {
      foreach my $tag (@tags) {
        if ($line =~ m/^$tag\t \"(.*)\"/) { $data{$tag}{$1}++; } } }
    foreach my $molname (sort keys %{ $data{"Molecular_name"} }) {
      print MOL qq($joinkey\t$molname\t$timestamp\n); }
    foreach my $sequence (sort keys %{ $data{"Corresponding_transcript"} }) {
      print SEQ qq($joinkey\t$sequence\t$timestamp\n); }
    foreach my $sequence (sort keys %{ $data{"Corresponding_CDS"} }) {
      print SEQ qq($joinkey\t$sequence\t$timestamp\n);
      if ($cdsToProt{$sequence}) {
        my $protein = $cdsToProt{$sequence};
        print SPR qq($joinkey\t$sequence\t$protein\t$timestamp\n);
        print PRO qq($joinkey\t$protein\t$timestamp\n); } }
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $/ = "\n";
  close (MOL) or die "Cannot close $molfile : $!";
  close (SEQ) or die "Cannot close $seqfile : $!";
  close (PRO) or die "Cannot close $protfile : $!";
  close (SPR) or die "Cannot close $seqprotfile : $!";
  push @pgcommands, qq(DELETE FROM gin_molname;);
  push @pgcommands, qq(DELETE FROM gin_sequence;);
  push @pgcommands, qq(DELETE FROM gin_protein;);
  push @pgcommands, qq(DELETE FROM gin_seqprot;);
  push @pgcommands, qq(COPY gin_molname  FROM '$molfile';);
  push @pgcommands, qq(COPY gin_sequence FROM '$seqfile';);
  push @pgcommands, qq(COPY gin_protein  FROM '$protfile';);
  push @pgcommands, qq(COPY gin_seqprot  FROM '$seqprotfile';);

  foreach my $pgcommand (@pgcommands) {
    print qq($pgcommand\n);
    $dbh->do( $pgcommand );
  } # foreach my $pgcommand (@pgcommands)
} # sub processGeneCds

Code to update Expression Cluster Information in obo_*_exprcluster tables

sub processExprCluster {
  my @pgcommands;
  my @tags = qw( Description Reference Remark );
  my %tags; foreach (@tags) { $tags{$_}++; }
  my $name_file = $pg_files_dir . 'obo_name_exprcluster.pg';
  my $data_file = $pg_files_dir . 'obo_data_exprcluster.pg';
  open (NAME, ">$name_file") or die "Cannot create $name_file : $!";
  open (DATA, ">$data_file") or die "Cannot create $data_file : $!";
  my $infile = $ws_files_dir . 'WSExpression_cluster.ace';
  $/ = "";
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/Expression_cluster : \"/);
    $entry =~ s/\\//g;                          # take out all backslashes
    my %data;
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $id = '';
    if ($header =~ m/Expression_cluster : \"([^"]+)\"/) { $id = $1; }
    next if ($id =~ m/^WBPaper00029359/);       # skip for Karen, chronograms.  2013 10 01
    my @all_data; push @all_data, qq(id: $id); push @all_data, qq(name: "$id");
    foreach my $line (@lines) {
      foreach my $tag (@tags) {
        if ($line =~ m/^$tag\t \"(.*)\"/) { $data{$tag}{$1}++; } } }
    foreach my $tag (@tags) {
      foreach my $data (sort keys %{ $data{$tag} }) {
        my $lctag = lc($tag);
        push @all_data, qq($lctag: "$data"); } }
    my $all_data = join"\\n", @all_data;
    if ($all_data =~ m/\'/) { $all_data =~ s/\'/''/g; }
    print NAME qq($id\t$id\t$timestamp\n);
    print DATA qq($id\t$all_data\t$timestamp\n);
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $/ = "\n";
  close (DATA) or die "Cannot close $data_file : $!";
  close (NAME) or die "Cannot close $name_file : $!";
  push @pgcommands, qq(DELETE FROM obo_name_exprcluster;);
  push @pgcommands, qq(DELETE FROM obo_data_exprcluster;);
  push @pgcommands, qq(COPY obo_name_exprcluster FROM '$name_file';);
  push @pgcommands, qq(COPY obo_data_exprcluster FROM '$data_file';);

  foreach my $pgcommand (@pgcommands) {
    print qq($pgcommand\n);
    $dbh->do( $pgcommand );
  } # foreach my $pgcommand (@pgcommands)
} # sub processExprCluster

Code to update Sequence Feature Information in obo_*_feature tables

sub processFeature {
  my @pgcommands;
  my @tags = qw( Public_name DNA_text Species Defined_by_paper Associated_with_gene Associated_with_Interaction Associated_with_expression_pattern Bound_by_product_of Transcription_factor Method );
  my %allowedMethods; my %expectedMethods; my %unexpectedMethods;
  my @allowedMethods = qw( binding_site binding_site_region DNAseI_hypersensitive_site enhancer histone_binding_site_region promoter regulatory_region TF_binding_site TF_binding_site_region );
  my @rejectedMethods = qw( Corrected_genome_sequence_error Genome_sequence_error history history_feature micro_ORF polyA_signal_sequence polyA_site segmental_duplication SL1 SL2 three_prime_UTR transcription_end_site transcription_start_site TSS_region );

  foreach (@allowedMethods) { $allowedMethods{$_}++; $expectedMethods{$_}++; }
  foreach (@rejectedMethods) { $expectedMethods{$_}++; }
  my %tags; foreach (@tags) { $tags{$_}++; }
  my $name_file = $pg_files_dir . 'obo_name_feature.pg';
  my $data_file = $pg_files_dir . 'obo_data_feature.pg';
  open (NAME, ">$name_file") or die "Cannot create $name_file : $!";
  open (DATA, ">$data_file") or die "Cannot create $data_file : $!";
  my $infile = $ws_files_dir . 'WSFeature.ace';
  $/ = "";
  open (IN, "<$infile") or die "Cannot open $infile : $!";
  while (my $entry = <IN>) {
    next unless ($entry =~ m/Feature : \"WBsf/);
    next unless ($entry =~ m/Method\s+\"(.*)\"/);
    my $method = $1;
    unless ($expectedMethods{$method}) { $unexpectedMethods{$method}++; }
    next unless ($allowedMethods{$method});
    $entry =~ s/\\//g;                          # take out all backslashes
    my %data;
    my (@lines) = split/\n/, $entry;
    my $header = shift @lines;
    my $id = '';
    if ($header =~ m/Feature : \"(WBsf\d+)\"/) { $id = $1; }
    my @all_data; push @all_data, qq(id: $id); push @all_data, qq(name: "$id");
    foreach my $line (@lines) {
      foreach my $tag (@tags) {
        if ($line =~ m/^$tag\t \"(.*)\"/) { $data{$tag}{$1}++; } } }
    foreach my $tag (@tags) {
      foreach my $data (sort keys %{ $data{$tag} }) {
        my $lctag = lc($tag);
        push @all_data, qq($lctag: "$data"); } }
    my $all_data = join"\\n", @all_data;
    if ($all_data =~ m/\'/) { $all_data =~ s/\'/''/g; }
    print NAME qq($id\t$id\t$timestamp\n);
    print DATA qq($id\t$all_data\t$timestamp\n);
  } # while (my $entry = <IN>)
  close (IN) or die "Cannot close $infile : $!";
  $/ = "\n";
  close (DATA) or die "Cannot close $data_file : $!";
  close (NAME) or die "Cannot close $name_file : $!";
  push @pgcommands, qq(DELETE FROM obo_name_feature;);
  push @pgcommands, qq(DELETE FROM obo_data_feature;);
  push @pgcommands, qq(COPY obo_name_feature FROM '$name_file';);
  push @pgcommands, qq(COPY obo_data_feature FROM '$data_file';);

  if (scalar %unexpectedMethods > 0) {
    my $body = join", ", keys %unexpectedMethods;
    my $user = 'populate_pg_from_ws.pl';
    my $email = 'cgrove@caltech.edu';
    my $subject = 'unexpected feature methods';
    &mailer($user, $email, $subject, $body);
  } # if (scalar %unexpectedMethods > 0)

  foreach my $pgcommand (@pgcommands) {
    print qq($pgcommand\n);
    $dbh->do( $pgcommand );
  } # foreach my $pgcommand (@pgcommands)
} # sub processFeature


For updating from latest WS release:

There are 2 cronjobs and 3 scripts:

in the acedb account:

cronjob 1 on script 1 (runs daily at 3am PST) : 0 3 * * * /home/acedb/cron/update_ws_tazendra.pl

Also triggers to run 2nd script : /home/acedb/cron/dump_from_ws.sh # dump .ace files from WS to update postgres tables

and write to a timestamp file my $dateOfWsDumpFile = /home3/acedb/cron/dump_from_ws/files/latestDate; echo "$date" >> $dateOfWsDumpFile;

in the postgres account :

cronjob 2 on script 3 (runs daily at 5am PST) : 0 5 * * * /home/postgres/work/pgpopulation/obo_oa_ontologies/ws_updates/populate_pg_from_ws.pl

This script went live on tazendra on 2013 10 21

This script generates files to populate postgres tables for some gin_ tables and obo_ tables for exprcluster when the date in the timestamp file (see above) is more recent than the latest timestamp in gin_molname. In other words, if WS is more recent than the postgres timestamp, the tables are updated (Lines 34 - 55).

For genes (gin_ tables), this script is species-restricted for C. elegans only (we can change this, if needed).

This script populates information that can only be derived from the WS build:

  1. Corresponding_transcript
  2. Corresponding_CDS
  3. Corresponding_protein
  4. Molecular_name

If the WS timestamp is more recent than the postgres timestamp, then the subroutine processGeneCds (Line 57) will run.

Values and timestamps are written to a file, then the postgres tables are deleted, values from the files are copied to the postgres tables, and then the files are closed.

For populating gin_protein, the subroutine looks at the CDS objects (WSCDS.ace) and retrieves the Corresponding_protein. To map genes to proteins, CDS and protein are matched (gin_seqprot), genes are matched to CDS (gin_seqprot) and then mapped to protein (gin_protein).

For populating gin_molname (Molecular_name), the subroutine looks at gene objects (WSGene.ace) and adds the value of Molecular_name to gin_molname.

For populating gin_sequence (Corresponding_transcript and Corresponding_CDS) the subroutine looks at gene objects (WSGene.ace) and adds the value to gin_sequence.


We'll know when WS141 comes out (should be mid-to-late December 2013) and cronjob 1 picks it up, if this pipeline works automatically.