Difference between revisions of "Detailed Documentation of Form and Scripts"
Revision as of 21:26, 16 August 2013
Currently, on mangolassi the ccc.cgi and other scripts and files are here:
azurebrd/public_html/cgi-bin/forms/ccc
- accession
  - This is a file that maps WB WBPaper IDs to PMIDs.
  - For WB, this file is generated each time the search is performed.
- ccc_celegans_2013only
  - I believe these are old test files that can be deleted.
- ccc.cgi
  - This is the code for the curation form.
  - ccc.cgi Documentation
- ccc.js
  - This is the ccc form javascript code.
  - If we want to change the number of characters needed to begin autocomplete, we can do that here.
- c_elegans.WS234.xrefs.txt and c_elegans.WS236.xrefs.txt
  - These are files generated with each WB build that were used to create the WB gpi file.
- generate_gpi.pl
  - This is the script that will be used to manually generate a new gpi file for WB.
- jquery
  - This directory contains...
- notes
  - This file contains a short bit about mapping WBGene IDs to UniProtKB accessions for the gpi file.
- scripts
  - This directory contains:
    - accession file - This is a mapping file that contains the mappings between WBPaper IDs and PMIDs as well as TAIR doc IDs and PMIDs. Note that we need PMIDs to send annotations to Protein2GO.
    - create_ccc_pgcuration.pl - This Perl script creates two postgres tables for a given Textpresso source file:
      - ccc_sentenceclassification: ccc_mod text, ccc_file text, ccc_paper text, ccc_section text, ccc_sentnum text, ccc_sentenceclassification text, ccc_comment text, ccc_curator text, ccc_timestamp text
      - ccc_sentenceannotation: ccc_mod text, ccc_file text, ccc_paper text, ccc_section text, ccc_sentnum text, ccc_geneproduct text, ccc_component text, ccc_goterm text, ccc_evidencecode text, ccc_with text, ccc_alreadycurated text, ccc_comment text, ccc_valid text, ccc_ptgoid text, ccc_curator text, ccc_timestamp
    - create_ccc_pgindices.pl - This script creates the following two tables for each Textpresso source file:
      - ccc_geneprodindex - this table contains the list of gene products mentioned in the sentence mapped to a MOD ID and a UniProtKB ID.
        - ccc_mod ccc_file ccc_paper ccc_section ccc_sentnum ccc_'table' ccc_timestamp
      - ccc_componentindex - this table lists, for each sentence, the cellular components that matched the Textpresso cellular component category.
        - ccc_mod ccc_file ccc_paper ccc_section ccc_sentnum ccc_'table' ccc_timestamp
    - gpi files - these files contain gene names, synonyms, and MOD and UniProtKB identifiers. For file format specifications, see: GO's gpi file format specification
      - dictyBase_07032013.gpi
      - TAIR1_gpi
      - worm_gpi - This file name needs to be changed from ws234_gpi, but I don't have permission to do this.
    - meh - this looks like a test file for ccc_geneprodindex for TAIR.
    - old_tables - this is a file that lists the names of the tables used for the previous version of the CCC curation forms.
    - out - this looks like another test file for ccc_geneprodindex for TAIR.
    - populate_ccc_pg_indices.pg.tair1 and populate_ccc_pg_indices.pg.worm1 - these look like the output files of populate_ccc_pg_indices.pl for TAIR and WB. Note that for TAIR there are some sentences that were not processed properly.
    - populate_ccc_pg_indices.pl - this script populates the two tables ccc_componentindex and ccc_geneprodindex. The notes below walk through it.
Accession file: http://textpresso-dev.caltech.edu/ccc_results/accession
The pmid -> modid mapping is built in the &populateTextpressoAccession subroutine:
my ($pmid, $modid) = split/\s+/, $line;
$accession_map{$pmid} = $modid;
This gets the pmid and modid from each line and makes a mapping of pmid TO modid in the %accession_map hash.
&popTextpressoChars makes a mapping of Textpresso character codes to literal characters in the %textpresso_chars hash, e.g. it maps _DQ_ to " and _PLS_ to +
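The two lookups above can be sketched roughly as follows. This is only an illustration, not the script itself: the sample lines and the pairing of PMIDs with mod IDs are made up, decode_textpresso is a hypothetical helper name, and only the two character codes mentioned above are included.

```perl
use strict;
use warnings;

# Made-up accession lines in the "<pmid> <modid>" shape implied by the split above.
my @accession_lines = (
  "23460676 celegans:WBPaper00010001",
  "12345678\tTAIR:10",
);

# Build the pmid -> modid mapping.
my %accession_map;
foreach my $line (@accession_lines) {
  chomp $line;
  next unless $line =~ /\S/;              # ignore blank lines
  my ($pmid, $modid) = split /\s+/, $line;
  $accession_map{$pmid} = $modid;
}

# Only the two codes named in the notes; the real table is presumably longer.
my %textpresso_chars = ( '_DQ_' => '"', '_PLS_' => '+' );

# decode_textpresso is a hypothetical helper, not a subroutine from the script.
sub decode_textpresso {
  my ($text) = @_;
  foreach my $code (keys %textpresso_chars) {
    $text =~ s/\Q$code\E/$textpresso_chars{$code}/g;
  }
  return $text;
}
```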
The %alreadyIndexed hash has all files that have already been indexed in the pg table ccc_geneprodindex:
$result = $dbh->prepare( "SELECT DISTINCT(ccc_file) FROM ccc_geneprodindex" );
while (my @row = $result->fetchrow) { $alreadyIndexed{$row[0]}++; }    # from ccc_geneprodindex, getting only unique files
Two gpi files for two mods:
$gpi_files{'tair'} = 'TAIR1_gpi';
$gpi_files{'worm'} = 'ws234_gpi';
For each gpi file, for each line:
my ($dbObjId, $dbObjSym, $dbObjName, $dbObjSyn, $dbObjType, $taxon, $parObjId, $dbXref, $geneProdProp) = split/\t/, $line;
If there's a $dbXref matching UniProtKB:\w+, store each match in the @uni array.
Uppercase the dbObjSym to get the pairName.
Create a group of pairName + dbObjId + uniprots (from @uni).
Join the groups into a single value joined by |
Lowercase the dbObjSym as lc_dbObjSym.
- Use this as a key to create the mapping of mod -> dbObjSym -> group in the hash %geneprodToGroup
If there's a dbObjSyn, split it on | to get the synonyms. For each synonym, make the pairName be dbObjSym(synonym), and uppercase it.
- Each synonym is treated individually.
For each uniprot, also make the group pairName + dbObjId + uniprot, join with |, and add it to the %geneprodToGroup hash of mod -> synonym -> group.
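A minimal sketch of the gpi steps above, under some assumptions: the gpi line and its identifiers are made up, add_gpi_line is a hypothetical helper name, each group is laid out as PAIRNAME|dbObjId|uniprot with one group per UniProtKB match, and synonym keys are lowercased the same way as the symbol key. The real script may differ on any of these.

```perl
use strict;
use warnings;

# Made-up gpi line with the nine tab-separated columns listed above.
my @gpi_lines = (
  join("\t", 'WB:WBGene99999999', 'abc-1', 'example protein', 'xyz-2|foo-3',
       'protein', 'taxon:6239', '', 'UniProtKB:AAA111|UniProtKB:BBB222', ''),
);

my %geneprodToGroup;    # mod -> lowercased name -> group

# add_gpi_line is a hypothetical helper; the group layout is an assumption.
sub add_gpi_line {
  my ($mod, $line) = @_;
  my ($dbObjId, $dbObjSym, $dbObjName, $dbObjSyn, $dbObjType, $taxon,
      $parObjId, $dbXref, $geneProdProp) = split /\t/, $line, -1;
  my @uni  = defined $dbXref ? $dbXref =~ /UniProtKB:(\w+)/g : ();
  my $syns = defined $dbObjSyn ? $dbObjSyn : '';

  # Primary symbol: pairName is the uppercased symbol, key is the lowercased one.
  my $pairName = uc $dbObjSym;
  foreach my $uniprot (@uni) {
    $geneprodToGroup{$mod}{ lc $dbObjSym }{ join('|', $pairName, $dbObjId, $uniprot) }++;
  }

  # Each synonym is treated individually, with pairName SYMBOL(SYNONYM).
  foreach my $syn (grep { length } split /\|/, $syns) {
    my $synPair = uc "$dbObjSym($syn)";
    foreach my $uniprot (@uni) {
      $geneprodToGroup{$mod}{ lc $syn }{ join('|', $synPair, $dbObjId, $uniprot) }++;
    }
  }
}

add_gpi_line('worm', $_) for @gpi_lines;
```

Storing the groups as hash keys keeps them unique and matches the later lookup that iterates over sort keys %{ $geneprodToGroup{$mod}{$geneprod} }.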
my @mods = qw( dicty worm tair );
Three mods.
- For each mod, read all files in the source directory:
my (@infiles) = <../source/${mod}/*>;
This matches source/<mod>/<anything>, that is:
/home/azurebrd/public_html/cgi-bin/forms/ccc/source/<mod>/<filename>
- Any file in here will get read into the form. If we want other files here, we need to specify the format.
- If a file has already been indexed, it is skipped; if not, the file is opened.
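The directory scan and the skip check could look roughly like this. It is a sketch only: a temporary directory with made-up file names stands in for ../source, and it assumes that ccc_file stores the bare file name rather than the full path, which the notes above do not say.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Stand-in for ../source: a temporary directory with made-up file names.
my $source = tempdir(CLEANUP => 1);
mkdir "$source/worm" or die "mkdir: $!";
foreach my $name (qw(batch_old batch_new)) {
  open my $fh, '>', "$source/worm/$name" or die "open: $!";
  close $fh;
}

# Assumption: keys are bare file names, as if read from ccc_geneprodindex.ccc_file.
my %alreadyIndexed = ( batch_old => 1 );

my @to_index;
foreach my $mod (qw(worm)) {
  foreach my $infile (sort(bsd_glob("$source/$mod/*"))) {
    my $filename = basename($infile);
    next if $alreadyIndexed{$filename};    # already in postgres, skip
    push @to_index, $filename;             # otherwise this file would be opened and parsed
  }
}
```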
For each line of a source file:
my ($score, $ident, $geneprods, $component, $sentence) = split/\t/, $line;
- $ident = paper type, paper number, section, sentence number, e.g. PMID:23460676:introduction:4
Split geneprods on | to get the list of gene products, then for each $geneprod:
if ($geneprodToGroup{$mod}{$geneprod}) {
  foreach my $group (sort keys %{ $geneprodToGroup{$mod}{$geneprod} }) {
    $good{$group}++; } }
else { $bad{$geneprod}++; }
- Right now we're suppressing any bad mappings, but we could print them.
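As an illustration of the good/bad split, here is a small self-contained sketch. The lookup table and the source line are made up, and the final print to STDERR is just one way the bad mappings could be reported if we ever wanted to stop suppressing them.

```perl
use strict;
use warnings;

# Made-up lookup table in the shape described above: mod -> gene product -> group.
my %geneprodToGroup = (
  worm => { 'abc-1' => { 'ABC-1|WB:WBGene99999999|AAA111' => 1 } },
);

# Made-up source line: score, ident, geneprods, component, sentence.
my $line = join("\t", '0.9', 'PMID:23460676:introduction:4', 'abc-1|nope-9',
                'nucleus', 'ABC-1 localizes to the nucleus .');

my $mod = 'worm';
my (%good, %bad);
my ($score, $ident, $geneprods, $component, $sentence) = split /\t/, $line;

foreach my $geneprod (split /\|/, $geneprods) {
  if ($geneprodToGroup{$mod}{$geneprod}) {
    foreach my $group (sort keys %{ $geneprodToGroup{$mod}{$geneprod} }) {
      $good{$group}++;
    }
  }
  else { $bad{$geneprod}++; }
}

# Optional reporting of names that did not map to any group.
print STDERR "unmapped gene product: $_\n" for sort keys %bad;
```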
$byIdent{$mod}{$filename}{$ident}{componentindex}{$component}++;
The %byIdent hash has mappings of mod -> filename -> ident -> componentindex to components, and also mappings of mod -> filename -> ident -> geneprodindex to group.
- First - gives a list of components for that sentence for column 3.
- Second - gives a list of gene product identifiers.
Split ident on : to get:
paper <paper_type>:<paper_number>
section
sentnum
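Splitting the ident and filling %byIdent might be sketched like this. The helper name parse_ident, the file name, and the sample values are hypothetical; the ident itself is the example given above.

```perl
use strict;
use warnings;

# parse_ident is a hypothetical helper: it splits an ident into paper, section, sentnum.
sub parse_ident {
  my ($ident) = @_;
  my ($paper_type, $paper_number, $section, $sentnum) = split /:/, $ident, 4;
  return ("$paper_type:$paper_number", $section, $sentnum);
}

my $ident = 'PMID:23460676:introduction:4';
my ($paper, $section, $sentnum) = parse_ident($ident);

# Made-up values for one sentence.
my ($mod, $filename) = ('worm', 'batch_new');
my %byIdent;
$byIdent{$mod}{$filename}{$ident}{componentindex}{'nucleus'}++;
$byIdent{$mod}{$filename}{$ident}{geneprodindex}{'ABC-1|WB:WBGene99999999|AAA111'}++;

# The two index types recorded for this sentence.
my @tables = sort keys %{ $byIdent{$mod}{$filename}{$ident} };
```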
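- For paper info display
- MOD to paper mappings
$papers{$mod}{$paper}++;
The %papers hash holds these mappings.
For geneprodindex and componentindex, for each of their values we join them with a <tab> and then store them in ccc_<table> as: mod, file, paper, section, sentnum, <data>
/home/azurebrd/public_html/cgi-bin/forms/ccc/source/<mod>/pmid_data.<mod>
- pmid to mod id to title to abstract
$pmid, $modid, $title, $abstract
<tab> separated
- already in - don't duplicate
$already_in_paper_info{$pmid}++;
Reopen the same file to append new values.
For each pmid from the %papers{mod} hash:
Skip it if we already have a value in the file.
Get the modId from the Textpresso %accession_map hash.
Get the title and abstract; if we have both, add them to this file.
The mod IDs from http://textpresso-dev.caltech.edu/ccc_results/accession have the form urlMod:number, for example:
celegans:WBPaper00010001
TAIR:10
Split on : to get urlMod + number, then build the URL:
my $url = 'http://textpresso-dev.caltech.edu/' . $urlMod . '/tdb/' . $urlMod . '/txt/bib-all/' . $num;
- Use these two things to get title and abstract information from Textpresso
- lines 164-189
- Note that if the paper display ever stops working for a mod, check the Textpresso URLs that are in these lines of code.
Subroutine - to convert Textpresso codes to human-readable characters.
Lines 35 and 36 have the gpi file name, which will need to be updated for WormBase.
Line 93 - skip if already indexed in postgres.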
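When checking those Textpresso URLs, it can help to see what the pattern produces for a given mod ID. The sketch below wraps the URL line quoted from the script in a hypothetical helper, textpresso_bib_url, and combines it with the "already in, don't duplicate" check; the pmid_data line, titles, and PMID pairings are made up, and no request is actually sent.

```perl
use strict;
use warnings;

# textpresso_bib_url is a hypothetical helper around the URL pattern used in the script.
sub textpresso_bib_url {
  my ($modid) = @_;                              # e.g. celegans:WBPaper00010001
  my ($urlMod, $num) = split /:/, $modid, 2;
  return 'http://textpresso-dev.caltech.edu/' . $urlMod . '/tdb/' . $urlMod . '/txt/bib-all/' . $num;
}

# Made-up pmid_data line: pmid, modid, title, abstract (tab separated).
my @existing = ( join("\t", '23460676', 'celegans:WBPaper00010001', 'A title', 'An abstract') );

# Record which pmids already have paper info so they are not duplicated.
my %already_in_paper_info;
foreach my $line (@existing) {
  my ($pmid) = split /\t/, $line;
  $already_in_paper_info{$pmid}++;
}

# Made-up pmid -> modid mapping, as in %accession_map.
my %accession_map = ( '23460676' => 'celegans:WBPaper00010001', '12345678' => 'TAIR:10' );

# URLs that would still need to be looked up.
my @to_fetch;
foreach my $pmid (sort keys %accession_map) {
  next if $already_in_paper_info{$pmid};         # already in the file
  push @to_fetch, textpresso_bib_url($accession_map{$pmid});
}
```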