Difference between revisions of "New 2012 Curation Status"

From WormBaseWiki
Jump to navigationJump to search
Line 14: Line 14:
 
= Screenshots of the Curation Status Form =
 
= Screenshots of the Curation Status Form =
  
<pre>    MAIN PAGE    </pre>
+
 
 +
== Main Page ==
  
  
Line 20: Line 21:
  
  
-----------  '''Specific Paper Page'''  -----------
+
== Specific Paper Page ==
  
  

Revision as of 19:35, 6 December 2012

Curation Status & Statistics Form (2012)

The sandbox/testing form can be found here


The CGI code is located on Tazendra/Mangolassi here:

/home/postgres/public_html/cgi-bin/curation_status.cgi



Screenshots of the Curation Status Form

Main Page

Curation Status Form Main Page2.png


Specific Paper Page

Curation Status Form Specific Paper Page2.png

Precanned Comments

The keys stored in postgres for the hashes of these values are only numbers, so the descriptions/titles can change or be updated and still apply retroactively.

Code:

sub populatePremadeComments {
  $premadeComments{"1"} = "SVM Positive, Curation Negative";
  $premadeComments{"2"} = "pre-made comment #2";
  $premadeComments{"3"} = "pre-made comment #3";}

So, as of now:

| Key | Comment |
| 1 | "SVM Positive, Curation Negative" |
| 2 | "pre-made comment #2" |
| 3 | "pre-made comment #3" |


Hence, if a completely new comment is desired, a new key will need to be made and there after associated with that new comment. Also, old keys should never be recycled and documentation describing what each key refers to should be maintained in this Wiki.


New Result

Each paper-data-type pair can be assigned a "New Result" indicating its status as curated (or not) or validated (or not), and if validated, positive or negative for the particular paper-data-type pair. These results can be entered via the Add Results Page or directly in the Detailed Results of Papers page via the "New Results" column. The code is below:

Code:

sub populateDonPosNegOptions {
  $donPosNegOptions{""}             = "";
  $donPosNegOptions{"curated"}      = "curated and positive";
  $donPosNegOptions{"positive"}     = "validated positive";
  $donPosNegOptions{"negative"}     = "validated negative";
  $donPosNegOptions{"notvalidated"} = "not validated";}

where "curated", "positive", "negative", and "notvalidated" are the keys (for the %donPosNegOptions hash table in the form code) that will be stored in postgres and the corresponding values (e.g. "curated and positive") are what will be displayed on the form.

Note that "" and "not validated" represent no data for that paper-data-type pair, but "not validated" is present as an option to overwrite accidental validations (it is impossible to go back to a blank "" field via the form).

Data Types

The form determines which data types exist via a 'populateDatatypes' subroutine in the form code. As of 12-5-2012, the form first collects all data types used in SVM from the 'cur_svmdata' postgres table (which, as of 12-5-2012, all also are identically named in the Author First Pass (AFP) and Curator First Pass (CFP) tables) and then supplements with other data types not in SVM but in AFP and CFP (as of 12-5-2012, all anatomy curation related data types) plus one additional data type ("geneticablation") not in SVM, AFP, or CFP.

Here is the code:

sub populateDatatypes {
  $result = $dbh->prepare( "SELECT DISTINCT(cur_datatype) FROM cur_svmdata " );
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
  while (my @row = $result->fetchrow) { $datatypesAfpCfp{$row[0]} = $row[0]; }
  $datatypesAfpCfp{'blastomere'}    = 'cellfunc';
  $datatypesAfpCfp{'exprmosaic'}    = 'siteaction';
  $datatypesAfpCfp{'geneticmosaic'} = 'mosaic';
  $datatypesAfpCfp{'laserablation'} = 'ablationdata';
  foreach my $datatype (keys %datatypesAfpCfp) { $datatypes{$datatype}++; }
  $datatypes{'geneticablation'}++;
} # sub populateDatatypes


As for the data types currently (12-5-2012) NOT in SVM but IN AFP and CFP, the data type name is different between the Curation Status form and the AFP and CFP forms. So, the data types named "cellfunc", "siteaction", "mosaic", and "ablationdata" in the AFP and CFP tables are respectively named "blastomere", "exprmosaic", "geneticmosaic", "laserablation" in the Curation Status form.

The IMPORTANT thing here is: if, at some point, the data types are changed (added, renamed, etc.), and the code is not updated in kind, the form will likely break. Curators should tell Juancarlos/Chris/Daniela to update the code.

new datatypes should be accounted in this code :

  • - no svm, no afp/cfp : add to %datatypes hash like 'geneticablation'.
  • - no svm, yes afp/cfp : add to %datatypesAfpCfp + %datatypes hashes like 'blastomere'
  • - yes svm, yes afp/cfp : add to code to populate cur_svmdata, which will populate in the SELECT query
  • - yes svm, no afp/cfp : add to code to populate cur_svmdata, which will populate in the SELECT query, but also subsequently delete from %datatypesAfpCfp (to prevent a postgres query to a non-existing table which will crash the form)


Creating PDF links to papers

In the Detailed Results of Papers page, each paper ID is linked to its corresponding PDF document using the code below:

Code:

sub populatePdf {
  $result = $dbh->prepare( "SELECT * FROM pap_electronic_path WHERE pap_electronic_path IS NOT NULL");
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
  my %temp;
  while (my @row = $result->fetchrow) {
    my ($data, $isPdf) = &makePdfLinkFromPath($row[1]);
    $temp{$row[0]}{$isPdf}{$data}++; }
  foreach my $joinkey (sort keys %temp) {
    my @pdfs;
    foreach my $isPdf (reverse sort keys %{ $temp{$joinkey} }) {
      foreach my $pdfLink (sort keys %{ $temp{$joinkey}{$isPdf} }) {
        push @pdfs, $pdfLink; } }
    my ($pdfs) = join"<br/>", @pdfs;
    $pdf{$joinkey} = $pdfs;
  } # foreach my $joinkey (sort keys %temp)
} # sub populatePdf

sub makePdfLinkFromPath {
  my ($path) = shift;
  my ($pdf) = $path =~ m/\/([^\/]*)$/;
  my $isPdf = 0; if ($pdf =~ m/\.pdf$/) { $isPdf++; }           # kimberly wants .pdf files on top, so need to flag to sort
  my $link = 'http://tazendra.caltech.edu/~acedb/daniel/' . $pdf;
  my $data = "<a href=\"$link\" target=\"new\">$pdf</a>"; return ($data, $isPdf); }


Note the table name ("pap_electronic_path"), the URL path ("http://tazendra.caltech.edu/~acedb/daniel/"), and (because of the code 'target=\"new\"') that the link will open a new window or tab. Also note that opening another link on the original page (e.g. Detailed Results of Papers page) will open that link in that same new window/tab, clearing out what you had opened previously.

Creating hyperlinks to PubMed paper pages

In the Detailed Results of Papers page each PubMed ID is linked to its corresponding PubMed webpage using the code below:

Code:

sub populatePmid {
  $result = $dbh->prepare( "SELECT * FROM pap_identifier WHERE pap_identifier ~ 'pmid'" );
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
  my %temp;
  while (my @row = $result->fetchrow) { if ($row[0]) {
    my ($data) = &makeNcbiLinkFromPmid($row[1]);
    $temp{$row[0]}{$data}++; } }
  foreach my $joinkey (sort keys %temp) {
    my ($pmids) = join"<br/>", keys %{ $temp{$joinkey} };
    $pmid{$joinkey} = $pmids;
  } # foreach my $joinkey (sort keys %temp)
} # sub populatePmid
sub makeNcbiLinkFromPmid {
  my $pmid = shift;
  my ($id) = $pmid =~ m/(\d+)/;
  my $link = 'http://www.ncbi.nlm.nih.gov/pubmed/' . $id;
  my $data = "<a href=\"$link\" target=\"new\">$pmid</a>"; return $data; }

Note the table name ("pap_identifier"), the table specifier ("WHERE pap_identifier ~ 'pmid'"), the URL path ("http://www.ncbi.nlm.nih.gov/pubmed/"), and (because of the code 'target=\"new\"') that the link will open a new window or tab. Also note that opening another link on the original page (e.g. Detailed Results of Papers page) will open that link in that same new window/tab, clearing out what you had opened previously.

Populating the Journal Names

Journal names for each paper are populated via the following code:

sub populateJournal {
  $result = $dbh->prepare( "SELECT * FROM pap_journal WHERE pap_journal IS NOT NULL" );
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
  while (my @row = $result->fetchrow) { if ($row[0]) { $journal{$row[0]} = $row[1]; } }
} # sub populateJournal


Note the table "pap_journal".


Loading Data into the Form

On the Curation Statistics Options Page, the Specific Paper Page, or the Pre-populated Specific Paper Page, curators have the option to specify what flagging methods (SVM, AFP, and/or CFP), curation sources (Ontology Annotator or cur_curdata [which is the data generated from this form]), and/or data types (e.g. geneint, rnai) they would like to view.

There are separate hashes for storing the different types of data, all of which have a key of datatype, subkey paperID, sub-subkeys of other things depending on the hash (see individual subsections below).

There is an option to select specific datatype, in which case only the data for those datatypes is loaded. Similarly if only some paperIDs have been selected, only those paperIDs are loaded.

Loading curatable papers

Only papers that have a 'valid' pap_status value and a 'primary' pap_primary_data value are considered curatable. These are stored in the %curatablePapers hash. ( paperID => status )

sub populateCuratablePapers {
  my $query = "SELECT * FROM pap_status WHERE pap_status = 'valid' AND joinkey IN (SELECT joinkey FROM pap_primary_data WHERE pap_primary_data = 'primary')";
  $result = $dbh->prepare( $query );
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
  while (my @row = $result->fetchrow) { $curatablePapers{$row[0]} = $row[1]; }
} # sub populateCuratablePapers

Loading afp_ data

Populate %afpEmailed, %afpData, %afpFlagged, %afpPos, %afpNeg.

for each of the chosen datatypes, if they are allowed in %datatypeAfpCfp, query the corresponding afp_ postgres table, and if it's a curatable paper store the value in the %afpData hash (data type, paper ID => AFP result).

Query afp_email and if it's a curatable paper store in %afpEmailed hash ( paperID => 1 ) for afp emailed statistics.

Query afp_lasttouched to see if a paper has been flagged for afp. Skip if it's not a curatable paper. For all %chosenDatatypes store in %afpFlagged ( datatype, paperID => 1 )

For each of the %afpFlagged datatypes that have been chosen (%chosenDatatypes), if there is an %afpData value, store in %afpPos hash ( positive flag for afp ), otherwise store in %afpNeg hash (negative flag for afp ) ( datatype, paperID => 1 )

sub populateAfpData {
  foreach my $datatype (sort keys %chosenDatatypes) {
    next unless $datatypesAfpCfp{$datatype};
    my $pgtable_datatype = $datatypesAfpCfp{$datatype};
    $result = $dbh->prepare( "SELECT * FROM afp_$pgtable_datatype" );
    $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
    while (my @row = $result->fetchrow) {
      next unless ($curatablePapers{$row[0]});
      $afpData{$datatype}{$row[0]} = $row[1]; }
  } # foreach my $datatype (sort keys %chosenDatatypes)

  $result = $dbh->prepare( "SELECT * FROM afp_email" );
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
  while (my @row = $result->fetchrow) {
    next unless ($curatablePapers{$row[0]});
    $afpEmailed{$row[0]}++; }
  $result = $dbh->prepare( "SELECT * FROM afp_lasttouched" );
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
  while (my @row = $result->fetchrow) {
    next unless ($curatablePapers{$row[0]});
    foreach my $datatype (sort keys %chosenDatatypes) {
      $afpFlagged{$datatype}{$row[0]}++; } }
  foreach my $datatype (sort keys %chosenDatatypes) {
    foreach my $joinkey (sort keys %{ $afpFlagged{$datatype} }) {
      if ($afpData{$datatype}{$joinkey}) { $afpPos{$datatype}{$joinkey}++; }
        else { $afpNeg{$datatype}{$joinkey}++; } } }
} # sub populateAfpData

Loading cfp_ data

Populate %cfpData, %cfpFlagged, %cfpPos, %cfpNeg.

for each of the chosen datatypes, if they are allowed in %datatypeAfpCfp, query the corresponding cfp_ postgres table, and if it's a curatable paper store the value in the %cfpData hash (data type, paper ID => CFP result).

Query cfp_curator to see if a paper has been flagged for cfp. Skip if it's not a curatable paper. For all %chosenDatatypes store in %cfpFlagged ( datatype, paperID => 1 )

For each of the %cfpFlagged datatypes that have been chosen (%chosenDatatypes), if there is an %cfpData value, store in %cfpPos hash ( positive flag for cfp ), otherwise store in %cfpNeg hash (negative flag for cfp ) ( datatype, paperID => 1 )

sub populateCfpData {
  foreach my $datatype (sort keys %chosenDatatypes) {
    next unless $datatypesAfpCfp{$datatype};
    my $pgtable_datatype = $datatypesAfpCfp{$datatype};
    $result = $dbh->prepare( "SELECT * FROM cfp_$pgtable_datatype" );
    $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
    while (my @row = $result->fetchrow) {
      next unless ($curatablePapers{$row[0]});
      $cfpData{$datatype}{$row[0]} = $row[1]; }
  } # foreach my $datatype (sort keys %chosenDatatypes)

  $result = $dbh->prepare( "SELECT * FROM cfp_curator" );
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
  while (my @row = $result->fetchrow) {
    next unless ($curatablePapers{$row[0]});
    foreach my $datatype (sort keys %chosenDatatypes) {
      $cfpFlagged{$datatype}{$row[0]}++; } }
  foreach my $datatype (sort keys %chosenDatatypes) {
    foreach my $joinkey (sort keys %{ $cfpFlagged{$datatype} }) {
      if ($cfpData{$datatype}{$joinkey}) { $cfpPos{$datatype}{$joinkey}++; }
        else { $cfpNeg{$datatype}{$joinkey}++; } } }
} # sub populateCfpData

Loading svm data

Populate %svmData hash.

For each of the chosen datatypes, query the cur_svmdata table where cur_datatype is that datatype, and sort by cur_date so that we always have the latest value for a given paper-data-type pair. The svm result is the 4th column, the paper ID is the first column. skip papers that are not %curatablePapers. store in %svmData ( datatype, paper => svm_result ). cur_svmdata could have multiple results for a given paper-data-type pair, we'll consider only the most recent result (by the directory name/date on Yuling's machine).

sub populateSvmData {
#     $result = $dbh->prepare( "SELECT * FROM cur_svmdata ORDER BY cur_datatype, cur_date" );   # always doing for all datatypes vs looping for chosen takes 4.66vs 2.74 secs
  foreach my $datatype (sort keys %chosenDatatypes) {
    $result = $dbh->prepare( "SELECT * FROM cur_svmdata WHERE cur_datatype = '$datatype' ORDER BY cur_date" );
      # table stores multiple dates for same paper-datatype in case we want to see multiple results later.  if it didn't and we didn't order it would take 2.05 vs 2.74 secs, so not worth changing the way we're storing data
    $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
    while (my @row = $result->fetchrow) {
      my $joinkey = $row[0]; my $svmdata = $row[3];
      next unless ($curatablePapers{$row[0]});
      $svmData{$datatype}{$joinkey} = $svmdata; } }
} # sub populateSvmData

Loading OA data

Populate %objsCurated and %oaData hashes.

Each datatype is stored in different tables and has to be queried separately. The queries are mostly the same.

  if ($chosenDatatypes{'newmutant'}) {
    $result = $dbh->prepare( "SELECT * FROM app_variation" );
    $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
    while (my @row = $result->fetchrow) { $objsCurated{'newmutant'}{$row[1]}++; }
    $result = $dbh->prepare( "SELECT * FROM app_paper" );
    $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
    while (my @row = $result->fetchrow) {
      my (@papers) = $row[1] =~ m/WBPaper(\d+)/g;
      foreach my $paper (@papers) {
        $oaData{'newmutant'}{$paper} = 'curated'; } } }

and similarly for other datatypes.

The example above is for the datatype 'newmutant'. If that datatype is a %chosenDatatypes, query app_variation and store in %objsCurated ( datatype, object => 1 ), then query app_paper matching for WBPaper IDs, and associating to %oaData ( datatype, paperID => 'curated' ).

For other datatypes :

  • overexpr : objects from app_transgene ; %oaData from app_paper WHERE joinkey IN (SELECT joinkey FROM app_transgene WHERE app_transgene IS NOT NULL AND app_transgene != ), meaning papers where the postgresID has a corresponding transgene that exists in app_transgene.
  • antibody : objects from abp_name ; %oaData from abp_paper
  • otherexpr : objects from exp_name ; %oaData from exp_paper
  • genereg : objects from grg_name; %oaData from grg_paper
  • geneint : objects from int_name; %oaData from int_paper
  • rnai : objects from rna_name; %oaData from rna_paper
  • blastomere : objects from wbb_wbbtf WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Blastomere_isolation') ; %oaData from wbb_reference WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Blastomere_isolation')
  • exprmosaic : objects from wbb_wbbtf WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Expression_mosaic') ; %oaData from wbb_reference WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Expression_mosaic')
  • geneticablation : objects from wbb_wbbtf WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Genetic_ablation') ; %oaData from wbb_reference WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Genetic_ablation')
  • geneticmosaic : objects from wbb_wbbtf WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Genetic_mosaic') ; %oaData from wbb_reference WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Genetic_mosaic')
  • laserablation : objects from wbb_wbbtf WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Laser_ablation') ; %oaData from wbb_reference WHERE joinkey IN (SELECT joinkey FROM wbb_assay WHERE wbb_assay = 'Laser_ablation')

Loading cur_curdata

cur_curdata: this captures all data entered through this form, meaning paper ID, data type, curator ID, validation status (e.g. "curated and positive"), pre-canned comment, and/or free text comment (and timestamp). Note: this table only stores data (and associated paper-data-type pairs) that has been manually entered through this form.

Code:

sub populateCurCurData {
  $result = $dbh->prepare( "SELECT * FROM cur_curdata ORDER BY cur_timestamp" );        # in case multiple values get in for a paper-datatype (shouldn't happen), keep the latest
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n";
  while (my @row = $result->fetchrow) {
    next unless ($chosenPapers{$row[0]} || $chosenPapers{all});
    next unless ($chosenDatatypes{$row[1]});
    $curData{$row[1]}{$row[0]}{curator}    = $row[2];
    $curData{$row[1]}{$row[0]}{donposneg}  = $row[3];
    $curData{$row[1]}{$row[0]}{selcomment} = $row[4];
    $curData{$row[1]}{$row[0]}{txtcomment} = $row[5];
    $curData{$row[1]}{$row[0]}{timestamp}  = $row[6]; }
} # sub populateCurCurData


When populating curator data from curation status, read the cur_curdata postgres table, skip datatypes that were not chosen, skip papers that were not chosen. Store data in the %curData hash, key is datatype, subkey is paperID, then valuekeys are curator, donposneg (curator result of curated, validatedPos, validatedNeg, notValidated), select comment, text comment, timestamp.

cur_curdata can only have one result for a specific paper-data-type pair, if a new result is entered it will overwrite the previous result.