Difference between revisions of "Detailed Documentation of Form and Scripts"

Revision as of 21:26, 16 August 2013

Currently, on mangolassi the ccc.cgi and other scripts and files are here:

azurebrd/public_html/cgi-bin/forms/ccc

accession
- This is a file that maps WB WBPaper IDs to PMIDs.
- For WB, this file is generated each time the search is performed.
ccc_celegans_2013only
- I believe these are old test files that can be deleted.
ccc.cgi
- This is the code for the curation form.
- ccc.cgi Documentation
ccc.js
- This is the ccc form javascript code.
- If we want to change the number of characters needed to begin autocomplete, we can do that here.
c_elegans.WS234.xrefs.txt and c_elegans.WS236.xrefs.txt
- These are files generated with each WB build that were used to create the WB gpi file.
generate_gpi.pl
- This is the script that will be used to manually generate a new gpi file for WB.
jquery
- This directory contains...
notes
- This file contains a short bit about mapping WBGene IDs to UniProtKB accessions for the gpi file.
scripts
- This directory contains:
  - accession file - This is a mapping file that contains the mappings between WBPaper IDs and PMIDs as well as TAIR doc IDs and PMIDs. Note that we need PMIDs to send annotations to Protein2GO.
  - create_ccc_pgcuration.pl - This Perl script creates the following two Postgres tables for a given Textpresso source file:

ccc_sentenceclassification
  ccc_mod text,
  ccc_file text,
  ccc_paper text, 
  ccc_section text,
  ccc_sentnum text, 
  ccc_sentenceclassification text,
  ccc_comment text,
  ccc_curator text, 
  ccc_timestamp text,

ccc_sentenceannotation
  ccc_mod text,
  ccc_file text,
  ccc_paper text,
  ccc_section text,
  ccc_sentnum text,
  ccc_geneproduct text,
  ccc_component text,
  ccc_goterm text,
  ccc_evidencecode text,
  ccc_with text,
  ccc_alreadycurated text,
  ccc_comment text,
  ccc_valid text,
  ccc_ptgoid text,
  ccc_curator text,
  ccc_timestamp

create_ccc_pgindices.pl - This script creates the following two tables for each Textpresso source file:
- ccc_geneprodindex - this table contains the list of gene products mentioned in the sentence mapped to a MOD ID and a UniProtKB ID.
  - ccc_mod ccc_file ccc_paper ccc_section ccc_sentnum ccc_'table' ccc_timestamp
- ccc_componentindex - this table lists, for each sentence, the cellular components that matched the Textpresso cellular component category
  - ccc_mod ccc_file ccc_paper ccc_section ccc_sentnum ccc_'table' ccc_timestamp
gpi files - these files contain gene names, synonyms, and MOD and UniProtKB identifiers. For file format specifications, see:

GO's gpi file format specification

- dictyBase_07032013.gpi
- TAIR1_gpi
- worm_gpi - the file is currently named ws234_gpi and needs to be renamed, but I don't have permission to do this.
meh - this looks like a test file for ccc_geneprodindex for TAIR.
old_tables - this is a file that lists the names of the tables used for the previous version of the CCC curation forms.
out - this looks like another test file for ccc_geneprodindex for TAIR.
populate_ccc_pg_indices.pg.tair1 and populate_ccc_pg_indices.pg.worm1 - these look like the output files for the populate_ccc_pg_indices.pl for TAIR and WB. Note that for TAIR there are some sentences that were not processed properly.
populate_ccc_pg_indices.pl - this script populates the two tables: ccc_componentindex and ccc_geneprodindex

http://textpresso-dev.caltech.edu/ccc_results/accession

my ($pmid, $modid) = split/\s+/, $line;

   $accession_map{$pmid} = $modid;

The &populateTextpressoAccession subroutine does this for each line of the accession file: it gets the pmid and modid and stores the mapping of pmid to modid in the %accession_map hash.

The &popTextpressoChars subroutine makes a mapping of Textpresso character codes to literal characters in the %textpresso_chars hash, e.g. _DQ_ maps to " and _PLS_ maps to +.
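The two lookups described above can be sketched as follows. This is only an illustration, written in Python rather than the Perl of the actual script. The function names load_accession_map and decode_textpresso are hypothetical, the table holds only the two codes documented above (_DQ_ and _PLS_), and the sample accession line in the usage note is made up.

```python
import re


def load_accession_map(lines):
    """Build a pmid -> modid mapping from whitespace-separated lines."""
    accession_map = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Assumption: blank or incomplete lines are simply skipped.
            continue
        pmid, modid = fields[0], fields[1]
        accession_map[pmid] = modid
    return accession_map


# Only the two codes mentioned in this documentation; the real
# %textpresso_chars hash presumably holds more entries.
TEXTPRESSO_CHARS = {"_DQ_": '"', "_PLS_": "+"}


def decode_textpresso(text, table=TEXTPRESSO_CHARS):
    """Replace every Textpresso code in text with its literal character."""
    pattern = re.compile("|".join(re.escape(code) for code in table))
    return pattern.sub(lambda match: table[match.group(0)], text)
```

For example, load_accession_map(["23460676 celegans:WBPaper00010001"]) gives a one-entry mapping keyed on the pmid, and decode_textpresso turns _PLS_ back into +.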

The %alreadyIndexed hash has all files that have already been indexed in the pg table ccc_geneprodindex:

$result = $dbh->prepare( "SELECT DISTINCT(ccc_file) FROM ccc_geneprodindex" );
$result->execute();
while (my @row = $result->fetchrow) { $alreadyIndexed{$row[0]}++; }   ### from ccc_geneprodindex, getting only unique files

There are 2 gpi files for 2 mods:

$gpi_files{'tair'} = 'TAIR1_gpi';
$gpi_files{'worm'} = 'ws234_gpi';

for each gpi file, for each line

   my ($dbObjId, $dbObjSym, $dbObjName, $dbObjSyn, $dbObjType, $taxon, $parObjId, $dbXref, $geneProdProp) = split/\t/, $line;

If there's a $dbXref matching UniProtKB:\w+, store each match in the @uni array.

Upcase the dbObjSym as the pairName.

Create a group of pairName + dbObjId + uniprots (from @uni).

Join groups into a single value joined by |.

Lowercase the dbObjSym as lc_dbObjSym, and use this as the key to create the mapping of mod -> dbObjSym -> group in the %geneprodToGroup hash.

If there's a dbObjSyn, split on | to get synonyms. Each synonym is treated individually: for each synonym, make the pairName be dbObjSym(synonym), and uppercase it.

For each uniprot, also make the group pairName + dbObjId + uniprot, join with |, and add it to the %geneprodToGroup hash of mod -> synonym -> group.
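The gpi indexing steps above can be sketched roughly as follows, in Python rather than the script's Perl. Several details are assumptions, not facts from the script: that one group is built per UniProtKB match, that the stored match keeps its UniProtKB: prefix, and that synonym keys are lowercased like the symbol key. The names new_index and index_gpi_line are hypothetical, and the identifiers in the usage note are made up.

```python
import re
from collections import defaultdict

UNIPROT_RE = re.compile(r"UniProtKB:\w+")


def new_index():
    """Nested mapping: mod -> lowercased name -> group -> count."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(int)))


def index_gpi_line(line, mod, geneprod_to_group):
    """Add the groups for one tab-separated gpi line to the index."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (9 - len(fields))  # pad short lines to 9 columns
    (db_obj_id, db_obj_sym, _name, db_obj_syn, _type,
     _taxon, _parent, db_xref, _prop) = fields[:9]

    # Every UniProtKB cross-reference found in the dbXref column.
    uniprots = UNIPROT_RE.findall(db_xref)

    def add(key, pair_name):
        # Assumption: one group per UniProtKB match; with no match,
        # nothing is added for this name.
        for uniprot in uniprots:
            group = "|".join((pair_name, db_obj_id, uniprot))
            geneprod_to_group[mod][key][group] += 1

    # Symbol: key is the lowercased symbol, pairName the uppercased one.
    add(db_obj_sym.lower(), db_obj_sym.upper())

    # Synonyms: each is handled individually, pairName is SYMBOL(SYNONYM).
    if db_obj_syn:
        for synonym in db_obj_syn.split("|"):
            add(synonym.lower(), f"{db_obj_sym}({synonym})".upper())
```

With a made-up line whose symbol is abc-1, ID is MOD:0001, synonym is xyz-9 and cross-reference is UniProtKB:P00001, the index would hold ABC-1|MOD:0001|UniProtKB:P00001 under abc-1 and ABC-1(XYZ-9)|MOD:0001|UniProtKB:P00001 under xyz-9.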

There are 3 mods:

my @mods = qw( dicty worm tair );

1. For each mod, read all files in the source directory (source/<mod>/<anything>):

my (@infiles) = <../source/${mod}/*>;

/home/azurebrd/public_html/cgi-bin/forms/ccc/source/<mod>/<filename>

1. Any file in here will get read into the form - if we want other files here, we need to specify the format
2. If file has already been indexed, will skip; if not, will open the file.

my ($score, $ident, $geneprods, $component, $sentence) = split/\t/, $line;

1. $ident = paper type, paper number, section, sentence number, e.g. PMID:23460676:introduction:4

Split $geneprods on | to get the list of gene products. For each $geneprod:

if ($geneprodToGroup{$mod}{$geneprod}) {
   foreach my $group (sort keys %{ $geneprodToGroup{$mod}{$geneprod} }) {
      $good{$group}++; } }
else { $bad{$geneprod}++; }

1. Right now we're suppressing any bad mappings, but we could print them.

$byIdent{$mod}{$filename}{$ident}{componentindex}{$component}++;

The %byIdent hash has mappings of mod->filename->ident->componentindex to components. It also has mappings of mod->filename->ident->geneprodindex to group.

1. First - gives a list of components for that sentence for column 3.
2. Second - gives a list of gene product identifiers.

Split ident on : to get:

- paper - <paper_type>:<paper_number>
- section
- sentnum

1. For paper info display, MOD to paper mappings are stored in the %papers hash:

$papers{$mod}{$paper}++;
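The handling of one source-file line described above can be sketched as follows, in Python rather than the script's Perl. The function names are hypothetical, and the sketch assumes that gene products are looked up exactly as they appear in the source line (the real script may normalize case first). The sample values in the usage note are made up, apart from the ident example taken from this page.

```python
def parse_ident(ident):
    """Split an ident such as PMID:23460676:introduction:4."""
    paper_type, paper_number, section, sentnum = ident.split(":")
    return f"{paper_type}:{paper_number}", section, sentnum


def classify_geneprods(geneprods, groups_for_mod):
    """Count mapped groups (good) and unmapped gene products (bad)."""
    good, bad = {}, {}
    for geneprod in geneprods.split("|"):
        groups = groups_for_mod.get(geneprod)
        if groups:
            for group in sorted(groups):
                good[group] = good.get(group, 0) + 1
        else:
            bad[geneprod] = bad.get(geneprod, 0) + 1
    return good, bad


def parse_source_line(line, groups_for_mod):
    """Parse one tab-separated Textpresso source line into a record."""
    score, ident, geneprods, component, sentence = (
        line.rstrip("\n").split("\t")
    )
    paper, section, sentnum = parse_ident(ident)
    good, bad = classify_geneprods(geneprods, groups_for_mod)
    return {
        "paper": paper,
        "section": section,
        "sentnum": sentnum,
        "component": component,
        "good": good,
        "bad": bad,
    }
```

Here groups_for_mod stands in for $geneprodToGroup{$mod}. The bad dictionary corresponds to the mappings that the script currently suppresses, so printing it would be one way to review them.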

For geneprodindex and componentindex, their values are joined with a <tab> and then stored in ccc_<table> as mod, file, paper, section, sentnum, <data>.

/home/azurebrd/public_html/cgi-bin/forms/ccc/source/<mod>/pmid_data.<mod>

This file maps pmid to mod id to title to abstract:

$pmid, $modid, $title, $abstract

The file is <tab> separated. PMIDs that are already in the file are recorded so that they are not duplicated:

$already_in_paper_info{$pmid}++;

The same file is then reopened to append new values. For each pmid from the %papers{mod} hash:

1. Skip if we already have a value in the file.
2. Get the modId from the Textpresso %accession_map hash (http://textpresso-dev.caltech.edu/ccc_results/accession), e.g. celegans:WBPaper00010001 or TAIR:10.
3. Split the modId on : to get urlMod + number.
4. Use these two things to get title and abstract information from Textpresso (lines 164-189):

my $url = 'http://textpresso-dev.caltech.edu/' . $urlMod . '/tdb/' . $urlMod . '/txt/bib-all/' . $num;

5. If we have both title and abstract, add them to this file.

Note that if the paper display ever stops working for a mod, check the Textpresso URLs that are in these lines of code.
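When checking the Textpresso URLs, it can help to reproduce the URL construction outside the script. The sketch below does this in Python rather than Perl, following the $url line documented above. The function names are hypothetical, and skipping pmids that have no entry in the accession map is an assumption about the script's behaviour.

```python
BASE_URL = "http://textpresso-dev.caltech.edu/"


def textpresso_bib_url(modid):
    """Build the bib-all URL from a modid such as celegans:WBPaper00010001."""
    url_mod, num = modid.split(":", 1)
    return BASE_URL + url_mod + "/tdb/" + url_mod + "/txt/bib-all/" + num


def urls_to_fetch(pmids, already_in_paper_info, accession_map):
    """List (pmid, url) pairs for pmids not yet in the pmid_data file."""
    todo = []
    for pmid in sorted(pmids):
        if pmid in already_in_paper_info:
            continue  # already in the file, don't duplicate
        modid = accession_map.get(pmid)
        if modid is None:
            continue  # assumption: pmids without a modid are skipped
        todo.append((pmid, textpresso_bib_url(modid)))
    return todo
```

Calling textpresso_bib_url("celegans:WBPaper00010001") yields the address the script would request for that paper, which can then be opened in a browser to see whether Textpresso still serves it.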

Other notes on the script:

- There is a subroutine to convert Textpresso codes to human-readable characters.
- Lines 35 and 36 have the gpi file name, which will need to be updated for WormBase.
- Line 93 skips a file if it is already indexed in postgres.

