Difference between revisions of "Detailed Documentation of Form and Scripts"

From WormBaseWiki
 
(14 intermediate revisions by the same user not shown)
Line 10: Line 10:
 
*ccc.cgi
 
 
**This is the code for the curation form.
 
**[[ccc.cgi Documentation]]
+
**See the [[User_Guide_for_Curators]] for documentation of how the form works from a user's perspective.
 +
**See the [[ccc.cgi documentation]] for more specific details about the code.
 
*ccc.js
 
 
**This is the ccc form javascript code.
 
Line 62: Line 63:
 
*gpi files - these files contain gene names, synonyms, and MOD and UniProtKB identifiers.  For file format specifications, see:
 
 
  [http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format GO's gpi file format specification]
 
**dictyBase_07032013.gpi
+
**dictyBase_07032013.gpi '''Also change this file name to dicty_gpi for simplicity?'''
 
**TAIR1_gpi
 
**worm_gpi '''This file name needs to be changed from ws234_gpi, but I don't have permission to do this.'''
+
**worm_gpi '''This file name needs to be changed from ws238_gpi, but I don't have permission to do this.'''
 
*meh - this looks like a test file for ccc_geneprodindex for TAIR.
 
 
*old_tables - this is a file that lists the names of the tables used for the previous version of the CCC curation forms.
 
Line 70: Line 71:
 
*populate_ccc_pg_indices.pg.tair1 and populate_ccc_pg_indices.pg.worm1 - these look like the output files of populate_ccc_pg_indices.pl for TAIR and WB.  Note that for TAIR there are some sentences that were not processed properly.
 
 
*populate_ccc_pg_indices.pl - this script populates the two tables: ccc_componentindex and ccc_geneprodindex
 
http://textpresso-dev.caltech.edu/ccc_results/accession
+
**Inputs needed:
 +
***Textpresso source files
 +
***Mapping file of PMID to MOD Accession for WB and TAIR only (so far).  dictyBase docIDs are the same as the PMIDs.
 +
****'''This is available from a textpresso-dev URL that needs to be updated with each search run.'''
 +
***gpi files - see above. '''Need to establish where the updated files will be located (i.e., where I can put them) and also update the names of the variables in the script to be more generic.'''  This script maps the gene product names and/or synonyms used to MOD and all UniProtKB IDs.
 +
**This script takes the raw Textpresso output and generates a human readable version of the sentences, as well as creating mappings of the gene product names or synonyms to IDs and generating a file with paper titles and abstracts for display on the curation form.
 +
*source
 +
**This directory contains directories for each MOD that has a CCC implementation.
 +
**In each MOD's directories are the source files from Textpresso searches and the pmid_data file that maps PMIDs to MOD paper identifiers as well as paper titles and abstracts.
 +
*test.html -
 +
*ws234_tablemaker_info - the results of the tablemaker query for gene names and status for creating a gpi file.
  
&populateTextpressoAccession gets the pmid and modid from each line of the accession data and stores the mapping of pmid to modid in the %accession_map hash:

 my ($pmid, $modid) = split /\s+/, $line;
 $accession_map{$pmid} = $modid;

&popTextpressoChars stores the mapping of Textpresso character codes to literal characters in the %textpresso_chars hash, e.g. _DQ_ maps to " and _PLS_ maps to +.
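The two mappings described above can be illustrated with a short Python sketch. It is not the Perl script itself: the function names are hypothetical, the sample values in the usage note are made up, and only the two character codes named in these notes are included (the real %textpresso_chars hash presumably holds more).

```python
# Illustrative sketch only; the production code is the Perl script.
# Only the two codes mentioned in the notes are listed here (assumption:
# the real %textpresso_chars hash contains further entries).
TEXTPRESSO_CHARS = {"_DQ_": '"', "_PLS_": "+"}


def decode_textpresso(text, chars=TEXTPRESSO_CHARS):
    """Replace each Textpresso character code with its literal character."""
    for code, literal in chars.items():
        text = text.replace(code, literal)
    return text


def load_accession_map(lines):
    """Map pmid -> modid from whitespace-separated lines (first two fields)."""
    accession_map = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or incomplete lines
            accession_map[fields[0]] = fields[1]
    return accession_map
```

For example, `decode_textpresso("a _PLS_ b")` gives `a + b`, and `load_accession_map(["111 DOC1"])` gives a one-entry mapping from `111` to `DOC1`.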
 
  
 
The %alreadyIndexed hash holds all files that have already been indexed in the postgres table ccc_geneprodindex. The query gets only the unique file names from ccc_geneprodindex:

 $result = $dbh->prepare( "SELECT DISTINCT(ccc_file) FROM ccc_geneprodindex" );
 $result->execute();
 while (my @row = $result->fetchrow) { $alreadyIndexed{$row[0]}++; }
 
 
There are two gpi files for two MODs:

 $gpi_files{'tair'} = 'TAIR1_gpi';
 $gpi_files{'worm'} = 'ws234_gpi';
 
 
For each gpi file, each line is split on tabs into its nine columns:

 my ($dbObjId, $dbObjSym, $dbObjName, $dbObjSyn, $dbObjType, $taxon, $parObjId, $dbXref, $geneProdProp) = split /\t/, $line;

*If $dbXref contains matches for UniProtKB:\w+, each match is stored in the @uni array.
*The uppercased dbObjSym is used as the pairName.
*A group is created from pairName + dbObjId + uniprots (from @uni), and the groups are joined into a single value with |.
*The lowercased dbObjSym (lc_dbObjSym) is used as the key for the mapping of mod -> dbObjSym -> group in the %geneprodToGroup hash.
*If there is a dbObjSyn, it is split on | to get the synonyms. Each synonym is treated individually: its pairName is dbObjSym(synonym), uppercased.
*For each uniprot, the group pairName + dbObjId + uniprot is also created, joined with |, and added to the %geneprodToGroup hash as mod -> synonym -> group.
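The gpi indexing steps above can be sketched in Python as follows. This is a hedged illustration, not the Perl script: the notes do not spell out the exact layout of a group, so the sketch assumes the parts of a group are joined with | and that one group is kept per UniProtKB ID; lowercasing the synonym key is likewise an assumption, and the function names and sample identifiers are hypothetical.

```python
import re
from collections import defaultdict

UNIPROT_RE = re.compile(r"UniProtKB:\w+")


def build_groups(pair_name, db_obj_id, uniprots):
    """Return one 'PAIRNAME|dbObjId|UniProtKB:...' group per UniProtKB ID.

    Assumption: parts are joined with '|'; with no UniProtKB ID the group
    is just 'PAIRNAME|dbObjId'.
    """
    pair_name = pair_name.upper()
    if not uniprots:
        return {f"{pair_name}|{db_obj_id}"}
    return {f"{pair_name}|{db_obj_id}|{uniprot}" for uniprot in uniprots}


def index_gpi(mod, lines, geneprod_to_group=None):
    """Fill mod -> lowercased symbol or synonym -> set of groups."""
    if geneprod_to_group is None:
        geneprod_to_group = defaultdict(lambda: defaultdict(set))
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # assumption: skip header or incomplete lines
        db_obj_id, symbol, _name, synonyms, _type, _taxon, _parent, xref, _prop = cols[:9]
        uniprots = UNIPROT_RE.findall(xref)
        geneprod_to_group[mod][symbol.lower()] |= build_groups(symbol, db_obj_id, uniprots)
        for synonym in filter(None, synonyms.split("|")):
            pair_name = f"{symbol}({synonym})"
            geneprod_to_group[mod][synonym.lower()] |= build_groups(pair_name, db_obj_id, uniprots)
    return geneprod_to_group
```

Keeping a set of groups per key means a name shared by several gene products simply accumulates all of its candidate identifiers.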
 
 
There are three MODs:

 my @mods = qw( dicty worm tair );

For each mod, all files in its source directory are read:

 my (@infiles) = <../source/${mod}/*>;

This matches source/<mod>/<anything>, i.e. /home/azurebrd/public_html/cgi-bin/forms/ccc/source/<mod>/<filename>

Any file in this directory will get read into the form, so if we want other files here, we need to specify the format.

If a file has already been indexed, it is skipped; if not, it is opened.
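A minimal Python sketch of this file-selection step is shown below. It assumes that the already-indexed set holds bare file names as stored in ccc_file, which these notes do not confirm; the function name is hypothetical.

```python
from pathlib import Path

MODS = ("dicty", "worm", "tair")


def files_to_index(source_root, already_indexed, mods=MODS):
    """List (mod, path) for every file under source/<mod>/ not yet indexed.

    Assumption: already_indexed contains bare file names.
    """
    todo = []
    for mod in mods:
        mod_dir = Path(source_root) / mod
        if not mod_dir.is_dir():
            continue  # no source directory for this MOD
        for path in sorted(mod_dir.iterdir()):
            if path.is_file() and path.name not in already_indexed:
                todo.append((mod, path))
    return todo
```

Because every file in the directory is picked up, anything placed there that is not a Textpresso source file would also be read, which is why the format of any additional files would need to be specified.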
 
 
 
Each line of a source file is split on tabs:

 my ($score, $ident, $geneprods, $component, $sentence) = split /\t/, $line;

$ident is the paper type, paper number, section, and sentence number, e.g. PMID:23460676:introduction:4

The geneprods list is obtained by splitting $geneprods on |. Each geneprod is then looked up in %geneprodToGroup:

 if ($geneprodToGroup{$mod}{$geneprod}) {
   foreach my $group (sort keys %{ $geneprodToGroup{$mod}{$geneprod} }) {
     $good{$group}++; } }
 else { $bad{$geneprod}++; }

Right now we're suppressing any bad mappings, but we could print them.
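The lookup above can be illustrated with a small Python sketch that returns the unmapped names instead of suppressing them. The function name and the sample values used with it are hypothetical, and the lookup uses the gene product name exactly as it appears in the line, mirroring the Perl fragment.

```python
def classify_geneprods(mod, line, geneprod_to_group):
    """Split one source line and sort its gene products into good and bad.

    Returns (ident, component, good_groups, bad_names).
    """
    # maxsplit=4 keeps any stray tab inside the sentence field intact
    score, ident, geneprods, component, sentence = line.rstrip("\n").split("\t", 4)
    good, bad = set(), set()
    for geneprod in geneprods.split("|"):
        groups = geneprod_to_group.get(mod, {}).get(geneprod)
        if groups:
            good.update(groups)  # every group mapped to this name
        else:
            bad.add(geneprod)  # no mapping in the gpi-derived index
    return ident, component, sorted(good), sorted(bad)
```

Printing the returned bad names would be one way to review gene product names that have no entry in the gpi files.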
 
 
 $byIdent{$mod}{$filename}{$ident}{componentindex}{$component}++;

The %byIdent hash has mappings of mod -> filename -> ident -> componentindex to components, and also mappings of mod -> filename -> ident -> geneprodindex to groups. The first gives a list of components for that sentence for column 3. The second gives a list of gene product identifiers.

$ident is split on : into:

*paper: <paper_type>:<paper_number>
*section
*sentnum
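As a small illustration of the ident split, here is a Python sketch (the function name is hypothetical) that turns an ident into the paper, section, and sentence number:

```python
def split_ident(ident):
    """Split '<paper_type>:<paper_number>:<section>:<sentnum>' into three parts."""
    paper_type, paper_number, section, sentnum = ident.split(":")
    return f"{paper_type}:{paper_number}", section, sentnum
```

With the example ident from these notes, `split_ident("PMID:23460676:introduction:4")` returns the paper `PMID:23460676`, the section `introduction`, and the sentence number `4` (as a string).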
 
 
For the paper info display and the MOD to paper mappings, each paper is recorded in the %papers hash:

 $papers{$mod}{$paper}++;

For geneprodindex and componentindex, the values for each sentence are joined with a <tab> and then stored in ccc_<table> as: mod, file, paper, section, sentnum, <data>
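The row-building step can be sketched in Python as below. The nested dictionary stands in for the %byIdent hash; sorting the values before joining them is an assumption made so the output is stable, and the function name is hypothetical.

```python
def index_rows(by_ident):
    """Turn mod -> file -> ident -> table -> values into rows for ccc_<table>.

    Each row is (table, mod, file, paper, section, sentnum, data), where data
    is the tab-joined list of values for that sentence.
    """
    rows = []
    for mod, files in by_ident.items():
        for filename, idents in files.items():
            for ident, tables in idents.items():
                paper_type, paper_number, section, sentnum = ident.split(":")
                paper = f"{paper_type}:{paper_number}"
                for table in ("geneprodindex", "componentindex"):
                    values = tables.get(table)
                    if values:
                        data = "\t".join(sorted(values))  # sorted: assumption
                        rows.append((f"ccc_{table}", mod, filename, paper, section, sentnum, data))
    return rows
```

Each returned tuple corresponds to one insert into ccc_geneprodindex or ccc_componentindex.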
 
 
Paper information is kept in /home/azurebrd/public_html/cgi-bin/forms/ccc/source/<mod>/pmid_data.<mod>

Each line maps pmid to mod id to title to abstract, <tab> separated:

 $pmid, $modid, $title, $abstract

PMIDs that are already in the file are recorded so that they are not duplicated:

 $already_in_paper_info{$pmid}++;

The same file is then reopened to append new values. For each pmid from the %papers{mod} hash:

*skip it if we already have a value in the file
*get the modId from the textpresso %accession_map hash
*get the title and abstract; if we have both, add the entry to this file
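The append-only update of the pmid_data file can be sketched in Python as follows. The `fetch_title_abstract` callback is hypothetical and stands in for the Textpresso lookup; falling back to the PMID when there is no accession mapping is an assumption based on the statement that dictyBase docIDs are the same as the PMIDs.

```python
def append_paper_info(path, pmids, accession_map, fetch_title_abstract):
    """Append 'pmid<TAB>modid<TAB>title<TAB>abstract' lines for new PMIDs only.

    fetch_title_abstract(modid) is a hypothetical callback returning
    (title, abstract). Returns the list of PMIDs that were added.
    """
    already_in = set()
    try:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                already_in.add(line.rstrip("\n").split("\t", 1)[0])
    except FileNotFoundError:
        pass  # first run: nothing stored yet

    added = []
    with open(path, "a", encoding="utf-8") as out:
        for pmid in pmids:
            if pmid in already_in:
                continue  # already in the file, do not duplicate
            # Assumption: with no mapping, the MOD id equals the PMID
            modid = accession_map.get(pmid, pmid)
            title, abstract = fetch_title_abstract(modid)
            if title and abstract:  # only store complete entries
                out.write(f"{pmid}\t{modid}\t{title}\t{abstract}\n")
                added.append(pmid)
    return added
```

Opening the file in append mode leaves existing entries untouched, so rerunning the step only adds papers that are new since the previous run.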
 
 
Titles and abstracts are looked up by urlMod and number. Accession URL:

http://textpresso-dev.caltech.edu/ccc_results/accession

Accessions look like:

 celegans:WBPaper00010001
 TAIR:10

Each accession is split on : to get urlMod + number, and these two things are used to get title and abstract information from Textpresso:

 my $url = 'http://textpresso-dev.caltech.edu/' . $urlMod . '/tdb/' . $urlMod . '/txt/bib-all/' . $num;

This is in lines 164-189. Note that if the paper display ever stops working for a mod, check the Textpresso URLs that are in these lines of code.
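The URL construction can be mirrored in a few lines of Python. This sketch only reproduces the string concatenation quoted above; the function name is hypothetical, and whether the script normalizes urlMod (for example its case) before building the URL is not stated in these notes.

```python
def bib_url(accession):
    """Build the Textpresso bibliography URL from '<urlMod>:<number>'."""
    url_mod, num = accession.split(":", 1)
    return (
        "http://textpresso-dev.caltech.edu/" + url_mod
        + "/tdb/" + url_mod + "/txt/bib-all/" + num
    )
```

For `celegans:WBPaper00010001` this gives `http://textpresso-dev.caltech.edu/celegans/tdb/celegans/txt/bib-all/WBPaper00010001`, which is the kind of URL to check if the paper display stops working.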
 
 
There is a subroutine to convert Textpresso codes to human readable text.

Lines 35 and 36 have the gpi file names, which will need to be updated for WormBase.

Line 93: skip if already indexed in postgres.
  
  

Latest revision as of 15:11, 18 February 2014

Currently, on mangolassi the ccc.cgi and other scripts and files are here:

azurebrd/public_html/cgi-bin/forms/ccc

  • accession
    • This is a file that maps WB WBPaper IDs to PMIDs.
    • For WB, this file is generated each time the search is performed.
  • ccc_celegans_2013only
    • I believe these are old test files that can be deleted.
  • ccc.cgi
  • ccc.js
    • This is the ccc form javascript code.
    • If we want to change the number of characters needed to begin autocomplete, we can do that here.
  • c_elegans.WS234.xrefs.txt and c_elegans.WS236.xrefs.txt
    • These are files generated with each WB build that were used to create the WB gpi file.
  • generate_gpi.pl
    • This is the script that will be used to manually generate a new gpi file for WB.
  • jquery
    • This directory contains...
  • notes
    • This file contains a short bit about mapping WBGene IDs to UniProtKB accessions for the gpi file.
  • scripts
    • This directory contains:
      • accession file - This is a mapping file that contains the mappings between WBPaper IDs and PMIDs as well as TAIR doc IDs and PMIDs. Note that we need PMIDs to send annotations to Protein2GO.
      • create_ccc_pgcuration.pl - This perl script creates two postgres tables for a given Textpresso source file:

ccc_sentenceclassification
  ccc_mod text,
  ccc_file text,
  ccc_paper text,
  ccc_section text,
  ccc_sentnum text,
  ccc_sentenceclassification text,
  ccc_comment text,
  ccc_curator text,
  ccc_timestamp text

ccc_sentenceannotation
  ccc_mod text,
  ccc_file text,
  ccc_paper text,
  ccc_section text,
  ccc_sentnum text,
  ccc_geneproduct text,
  ccc_component text,
  ccc_goterm text,
  ccc_evidencecode text,
  ccc_with text,
  ccc_alreadycurated text,
  ccc_comment text,
  ccc_valid text,
  ccc_ptgoid text,
  ccc_curator text,
  ccc_timestamp
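As an illustration of the first table's layout, the following Python sketch creates ccc_sentenceclassification in an in-memory SQLite database. SQLite is only a stand-in here: the real tables live in postgres and are created by create_ccc_pgcuration.pl, and the helper name is hypothetical.

```python
import sqlite3

# Columns of ccc_sentenceclassification as listed above; all are text.
SENTENCECLASSIFICATION_COLUMNS = [
    "ccc_mod", "ccc_file", "ccc_paper", "ccc_section", "ccc_sentnum",
    "ccc_sentenceclassification", "ccc_comment", "ccc_curator", "ccc_timestamp",
]


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement in which every column is text."""
    column_defs = ", ".join(f"{column} text" for column in columns)
    return f"CREATE TABLE {table} ({column_defs})"


# SQLite in memory stands in for the postgres database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql("ccc_sentenceclassification", SENTENCECLASSIFICATION_COLUMNS))
```

The first five columns (mod, file, paper, section, sentnum) identify a sentence and are shared with the index tables.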
  • create_ccc_pgindices.pl - This script creates the following two tables for each Textpresso source file:
    • ccc_geneprodindex - this table contains the list of gene products mentioned in the sentence mapped to a MOD ID and a UniProtKB ID.
      • ccc_mod ccc_file ccc_paper ccc_section ccc_sentnum ccc_'table' ccc_timestamp
    • ccc_componentindex - this table lists, for each sentence, the cellular components that matched the Textpresso cellular component category
      • ccc_mod ccc_file ccc_paper ccc_section ccc_sentnum ccc_'table' ccc_timestamp
  • gpi files - these files contain gene names, synonyms, and MOD and UniProtKB identifiers. For file format specifications, see:
GO's gpi file format specification
    • dictyBase_07032013.gpi Also change this file name to dicty_gpi for simplicity?
    • TAIR1_gpi
    • worm_gpi This file name needs to be changed from ws238_gpi, but I don't have permission to do this.
  • meh - this looks like a test file for ccc_geneprodindex for TAIR.
  • old_tables - this is a file that lists the names of the tables used for the previous version of the CCC curation forms.
  • out - this looks like another test file for ccc_geneprodindex for TAIR.
  • populate_ccc_pg_indices.pg.tair1 and populate_ccc_pg_indices.pg.worm1 - these look like the output files of populate_ccc_pg_indices.pl for TAIR and WB. Note that for TAIR there are some sentences that were not processed properly.
  • populate_ccc_pg_indices.pl - this script populates the two tables: ccc_componentindex and ccc_geneprodindex
    • Inputs needed:
      • Textpresso source files
      • Mapping file of PMID to MOD Accession for WB and TAIR only (so far). dictyBase docIDs are the same as the PMIDs.
        • This is available from a textpresso-dev URL that needs to be updated with each search run.
      • gpi files - see above. Need to establish where the updated files will be located (i.e., where I can put them) and also update the names of the variables in the script to be more generic. This script maps the gene product names and/or synonyms used to MOD and all UniProtKB IDs.
    • This script takes the raw Textpresso output and generates a human readable version of the sentences, as well as creating mappings of the gene product names or synonyms to IDs and generating a file with paper titles and abstracts for display on the curation form.
  • source
    • This directory contains directories for each MOD that has a CCC implementation.
    • In each MOD's directories are the source files from Textpresso searches and the pmid_data file that maps PMIDs to MOD paper identifiers as well as paper titles and abstracts.
  • test.html -
  • ws234_tablemaker_info - the results of the tablemaker query for gene names and status for creating a gpi file.




