Documentation for workflow and scripts
Revision as of 18:52, 28 January 2016
Location of directories
On brahma.textpresso.org:
- Scripts are at:
/media/data1/jdone/concise_descriptions/concise_descriptions/parallel (soft link: concise_descriptions_scripts)
- All scripts write the data to the files here:
/media/data1/jdone/concise_descriptions/www/concise_descriptions (soft link: concise_descriptions_data)
/var/www/html/concise_descriptions points to the same directory
- GitHub-ready files, i.e. the latest version of the scripts that is not in development:
/media/data1/jdone/concise_descriptions/concise_descriptions/automated_descriptions/automated_descriptions (soft link: latest_version_scripts)
On Textpresso-dev.caltech.edu:
Data at: /data2/srv/textpresso-dev.caltech.edu/www/docroot/concise_descriptions
Scripts at: /data2/srv/textpresso-dev.caltech.edu/www/cgi-bin/concise_descriptions
Scripts
The software for automated concise descriptions is available online: https://github.com/WormBase/automated_descriptions
The requirements are outlined here: http://wiki.wormbase.org/index.php/Generation_of_automated_descriptions
- Make sure that the files in the data directory report the current production release, the release of the sources, and the list of WormBase-supported species. The first two files need to be updated every release:
- production_release.txt holds the production release for the output of the concise descriptions. (needs updating every release)
- release.txt holds the release information for the sources. (needs updating every release)
- species.txt lists, as tab-separated values, the species abbreviation, project name, full name and gene prefix for each species (needs updating only when new species are added). For example:
c_briggsae PRJNA10731 Caenorhabditis briggsae Cbr
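As a quick check that species.txt is well formed, its columns can be read back in the shell. This is only a sketch: the helper name list_species is hypothetical, and it assumes the four columns appear in the order described above and are separated by tab characters.

```shell
# Sketch: print one summary line per species from a tab-separated file.
# list_species is a hypothetical helper; the column order follows the text above.
list_species() (
    tab=$(printf '\t')
    # Split on tabs only, so the full species name may contain spaces.
    while IFS="$tab" read -r abbrev project full_name prefix || [ -n "$abbrev" ]; do
        [ -z "$abbrev" ] && continue   # skip blank lines
        printf '%s (%s): project %s, gene prefix %s\n' \
            "$full_name" "$abbrev" "$project" "$prefix"
    done < "$1"
)
```

It would be called as `list_species species.txt`; a line that prints with empty fields probably contains spaces where tabs were intended.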
- Go to the directory that includes the scripts:
/home/ranjana/concise_descriptions_scripts
- The following files must be edited when there are changes, e.g., a change of the Postgres IP address or of the ace server port (this will not happen every release, but watch for it).
At: /home/ranjana/concise_descriptions_scripts
- db_ip.txt is the IP address of the SQL database that holds much of the daily updated information for WormBase.
- html.txt is the location of the docroot or /var/www directory for the output.
- parallel_path.txt is the path to where GNU Parallel is installed, different for Ubuntu and RedHat.
- cgi.txt is the location of the scripts.
- acedb_port.txt is the port of the acedb server.
- acedb_host.txt is the IP address/hostname of the acedb server.
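Before a run it can help to confirm that each of these settings files exists and is not empty. The sketch below assumes that every file holds its value on the first line, which this page does not state; read_setting is a hypothetical helper and not part of the repository.

```shell
# Sketch: read a one-line setting from a configuration file, failing loudly
# if the file is missing or empty. read_setting is a hypothetical helper, and
# the one-value-per-file layout is an assumption.
read_setting() (
    file="$1"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        exit 1
    fi
    # Use the first line only; strip carriage returns and trailing whitespace.
    head -n 1 "$file" | tr -d '\r' | sed 's/[[:space:]]*$//'
)
```

For example, `db_ip=$(read_setting db_ip.txt) || exit 1` would stop a wrapper script early instead of letting a later Postgres query fail.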
- In: /home/ranjana/concise_descriptions_scripts
- Create the directories needed by the scripts for a given production release:
$ ./create_release_directories_parallel.pl WS253
(Sometimes 'parallel' makes a mistake, so run the command twice!)
Scripts for source files
- There are input files that must be downloaded and pre-processed, formatted for input; the following scripts must be run:
./biomart_query.pl
./download_gene_lists_elegans.pl
./download_orthologs_all_parallel.pl
./download_gene_associations_parallel_all.pl
./go_terms_only.pl
./get_alt_id_terms_only.pl
./go_obo_to_go_ace.pl
./list_dead_genes.pl
./create_curated_gene_list.pl (queries Postgres)
./list_uncurated_genes.pl
./create_gene_list.pl (queries Postgres)
./parse_gene_lists_elegans.pl
./parse_orthologs_all_parallel.pl
./acedb_gene_class.pl (queries ace server)
./publication_gene.pl (queries Textpresso-dev server, looking at indices, takes 1-2 hours)
./download_wbbt_elegans.pl (downloads the anatomy obo file from the WB FTP site)
./download_anatomy_elegans.pl (downloads the anatomy association file from the ..)
./generate_wbbt_obo_terms_elegans.pl (creates a txt file from the anatomy obo file)
./generate_wbbt_obo_synonyms_elegans.pl (creates a txt file with synonyms)
./00_wbbt_obo_to_wbbt_ace.pl (creates a .ace file with all parents and child terms)
./01_wbbt_obo_to_wbbt_ace_names.pl (same as above with names)
./neuron_xlsx_read.pl (reads the Excel spreadsheet from O. Hobert)
./download_hgnc_family_elegans.pl (downloads the HGNC gene families file using LWP::Simple)
./convert_hgnc_gene_family_xlsx.pl (not part of the generation of automated descriptions, but creates the Excel spreadsheet of human gene families for viewing purposes)
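Because this list is long, a pre-flight check that every script is present and executable can save a failed run partway through. The following is a minimal sketch; check_scripts is a hypothetical helper and is not one of the repository's scripts.

```shell
# Sketch: report any script in a list that is missing or not executable.
# check_scripts is a hypothetical helper; pass it the script paths to verify.
check_scripts() (
    status=0
    for script in "$@"; do
        if [ ! -x "$script" ]; then
            echo "not found or not executable: $script" >&2
            status=1
        fi
    done
    exit "$status"
)
```

A call such as `check_scripts ./biomart_query.pl ./download_gene_lists_elegans.pl ./go_terms_only.pl` returns a non-zero status and names each problem script on standard error.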
Scripts for generating descriptions
- ./create_sentence_tissue_expressions_elegans.pl
- Creates sentences for tissue expression for C. elegans
- ./create_GO_sentences_elegans_species_parallel_all.pl
- Creates sentences for non-elegans species for the orthologous elegans gene and the processes in which it's involved.
- ./create_sentence_multiple_orthologs_species_all_parallel_all.pl
- This is a wrapper that runs several scripts; it creates the orthology sentences for all species, including elegans, and reads the popularity file.
- ./create_GO_sentences_species_parallel_all.pl
- Creates the sentences for process, function and sub-cellular localization for all species including elegans.
- ./concatenate_sentences_species_parallel_all.pl
- Concatenates all sentences for the different semantic categories for all species
- ./generate_OA_concise_descriptions_parallel_all.pl
- Generates the file for import of the automated descriptions into the OA.
To create the concise descriptions do:
(Leave a space between the script names and the ampersands, and do not introduce newline characters.)
./create_sentence_tissue_expressions_elegans.pl && ./create_GO_sentences_elegans_species_parallel_all.pl && ./create_sentence_multiple_orthologs_species_all_parallel_all.pl && ./create_GO_sentences_species_parallel_all.pl && ./concatenate_sentences_species_parallel_all.pl && ./generate_OA_concise_descriptions_parallel_all.pl
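The '&&' operator runs the next script only if the previous one exited successfully, so the chain stops at the first failure. The same behaviour can be sketched as a small shell function that also reports which step failed; run_chain is a hypothetical helper, not part of the repository.

```shell
# Sketch: run commands one after another and stop at the first failure,
# which is what joining them with '&&' does. run_chain is a hypothetical helper.
run_chain() (
    for step in "$@"; do
        echo "running: $step" >&2
        "$step" || { echo "failed: $step" >&2; exit 1; }
    done
)
```

It would be invoked with the six script paths as arguments, in the same order as in the command above. Writing the steps as separate arguments avoids the problem of stray newline characters in a long one-line command.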
One can separately create a file named 'WBGenes_descriptions_for_manual.txt', which holds the automated descriptions for only those genes that have been curated manually.
./create_curated_gene_sentences_elegans.pl
Finally, a report is written to detail the number of concise descriptions for each species for a given production release:
./total_description_count.pl WS253
Finding gene popularity (number of mentions in papers)
If one wants to create a listing of Textpresso publications associated with genes in either the 'results' section or the 'body' (if the paper cannot be sectioned), then one would execute the following:
$ ./publication_paper_gene_results.pl
$ ./publication_paper_gene_body.pl
If one wants to examine gene popularity for C. elegans in the literature, one can type:
$ Rscript textpresso_gene_popularity.R WS250
The output will be a PDF file named textpresso_gene_popularity.pdf.
If one would like to create a flow chart of the Perl scripts available in this repository, one can run:
$ ./graph_perl.pl ./my_perl_script.pl jpg ./my_output_directory
This script will produce a text file (my_perl_script.txt), an HTML file (my_perl_script.html) and (if there's enough memory available) a jpg file (my_perl_script.jpg) outlining the flow of data within the script. These files are written into ./my_output_directory.
Other formats besides jpg are available: bmp canon dot gv xdot xdot1.2 xdot1.4 cgimage cmap eps exr fig gd gd2 gif gtk ico imap cmapx imap_np cmapx_np ismap jp2 jpg jpeg jpe pct pict pdf pic plain plain-ext png pov ps ps2 psd sgi svg svgz tga tif tiff tk vml vmlz vrml wbmp webp xlib x11
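Since the format is passed as a plain argument, a typo is easy to make. A sketch of a check against the formats listed above follows; is_supported_format is a hypothetical helper, and whether graph_perl.pl validates the argument itself is not stated here.

```shell
# Sketch: check a requested output format against the formats listed above
# before calling graph_perl.pl. is_supported_format is a hypothetical helper.
is_supported_format() {
    # Reject empty arguments and arguments containing spaces outright.
    case "$1" in
        ""|*" "*) return 1 ;;
    esac
    formats="bmp canon dot gv xdot xdot1.2 xdot1.4 cgimage cmap eps exr fig gd gd2 gif gtk ico"
    formats="$formats imap cmapx imap_np cmapx_np ismap jp2 jpg jpeg jpe pct pict pdf pic plain"
    formats="$formats plain-ext png pov ps ps2 psd sgi svg svgz tga tif tiff tk vml vmlz vrml"
    formats="$formats wbmp webp xlib x11"
    # Pad with spaces so that only whole words match.
    case " $formats " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_supported_format png && ./graph_perl.pl ./my_perl_script.pl png ./my_output_directory` would only start the script for a listed format.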