Documentation for workflow and scripts
Latest revision as of 17:36, 18 August 2017
Current SOP
(Modified Aug 2016) On: <username>@textpresso-dev.caltech.edu
All scripts at: /data1/Users/liyuling/Curator_related/concise_descriptions/concise_descriptions_scripts
All data (by release) at: /data1/Users/liyuling/Curator_related/concise_descriptions/concise_descriptions/release
Run all scripts using 'sudo' in the above directory
1. Set the version in production_release.txt to the version of the upcoming citace data upload for which the files are being generated, e.g. WS256. Set the version in release.txt to the version of the source files, which is usually one less than the number in production_release.txt, e.g. WS255. Both files are here:
/data1/Users/liyuling/Curator_related/concise_descriptions/
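The relationship between the two version files in step 1 can be sketched in shell. This is only an illustration: prev_release and set_release_files are hypothetical helpers, not part of the pipeline, and the sketch assumes that each file holds nothing but the release name on a single line.

```shell
# prev_release WSnnn: print the release one below WSnnn (e.g. WS256 -> WS255).
prev_release() {
    pr_num="${1#WS}"
    case "$1" in
        WS[0-9]*) ;;
        *) echo "expected a release such as WS256, got '$1'" >&2; return 1 ;;
    esac
    case "$pr_num" in
        *[!0-9]*) echo "expected a release such as WS256, got '$1'" >&2; return 1 ;;
    esac
    echo "WS$(( $pr_num - 1 ))"
}

# set_release_files WSnnn DIR: write both version files into DIR.
# Assumption: each file contains only the release name.
set_release_files() {
    srf_src="$(prev_release "$1")" || return 1
    printf '%s\n' "$1"       > "$2/production_release.txt"
    printf '%s\n' "$srf_src" > "$2/release.txt"
}

# Example (not run here):
#   set_release_files WS256 /data1/Users/liyuling/Curator_related/concise_descriptions
```

Check the existing contents of both files before overwriting them, since the real files may use a different layout.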
2. Create the directories needed by the scripts for a given production release:
./create_release_directories_parallel.pl WS256
Make sure to run this script twice so it creates all 9 required directories.
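Whether the second run was needed can be checked by counting directories. The sketch below assumes that the 9 required directories are immediate subdirectories of the release directory, which this page does not state; count_subdirs and check_release_dirs are hypothetical helper names.

```shell
# count_subdirs DIR: print the number of immediate subdirectories of DIR.
count_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | wc -l | tr -d ' '
}

# check_release_dirs DIR EXPECTED: succeed only if DIR holds at least
# EXPECTED immediate subdirectories.
check_release_dirs() {
    crd_found="$(count_subdirs "$1")"
    if [ "$crd_found" -lt "$2" ]; then
        echo "only $crd_found of $2 directories found in $1" >&2
        return 1
    fi
    echo "$crd_found directories found in $1"
}

# Example (not run here; the path is the release directory named above):
#   ./create_release_directories_parallel.pl WS256
#   ./create_release_directories_parallel.pl WS256
#   check_release_dirs /data1/Users/liyuling/Curator_related/concise_descriptions/concise_descriptions/release/WS256 9
```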
3. To download all required input/source files and do some pre-processing, run ./get_source_files_wrapper.sh, which runs a total of 26 scripts:
./biomart_query.pl
./download_gene_lists_elegans.pl
./download_geneIDs_all_parallel.pl
./download_orthologs_all_parallel.pl
./download_gene_associations_parallel_all.pl
./go_terms_only.pl
./get_alt_id_terms_only.pl
./go_obo_to_go_ace.pl
./list_dead_genes.pl
./create_curated_gene_list.pl (queries Postgres)
./list_uncurated_genes.pl
./create_gene_list.pl (queries Postgres)
./parse_gene_lists_elegans.pl
./parse_orthologs_all_parallel.pl
./acedb_gene_class.pl (queries ace server)
./publication_gene.pl (queries Textpresso-dev server, looking at indices, takes 1-2 hours)
./download_wbbt_elegans.pl (downloads the anatomy obo file from the WB FTP site)
./download_anatomy_elegans.pl (downloads the anatomy association file from the ..)
./generate_wbbt_obo_terms_elegans.pl (creates a txt file from the anatomy obo file)
./generate_wbbt_obo_synonyms_elegans.pl (creates a txt file with synonyms)
./00_wbbt_obo_to_wbbt_ace.pl (creates a .ace file with all parents and child terms)
./01_wbbt_obo_to_wbbt_ace_names.pl (same as above with names)
./neuron_xlsx_read.pl (reads the excel spreadsheet from O. Hobert)
./download_hgnc_family_elegans.pl (downloads the HGNC gene families file using LWP::Simple)
./convert_hgnc_gene_family_xlsx.pl (not part of the generation of automated descriptions, but creates the excel spreadsheet for human gene families for viewing purposes)
./download_expression_cluster_summary_all_parallel.pl (downloads expression cluster data for all species)
Takes about 2.5 hrs to run.
Check that directories and source files have been created and downloaded.
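One quick way to do this check is to look for files that exist but are empty, which usually points to a failed download. This is a minimal sketch, not part of the pipeline; list_empty_files is a hypothetical helper, and an empty file is not necessarily an error for every source.

```shell
# list_empty_files DIR: print every regular file under DIR that has size 0.
list_empty_files() {
    find "$1" -type f -size 0
}

# Example (not run here):
#   list_empty_files /data1/Users/liyuling/Curator_related/concise_descriptions/concise_descriptions/release/WS256
```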
4. To create the concise descriptions run the wrapper: ./create_sentences_wrapper.sh which runs nine scripts:
(Keep a space between the script names and the ampersands; do not introduce newline characters.)
./create_sentence_tissue_expressions_elegans.pl && ./create_GO_sentences_elegans_species_parallel_all.pl && ./create_sentence_multiple_orthologs_species_all_parallel_all.pl && ./create_GO_sentences_species_parallel_all.pl && ./create_gene_regulation_expression_cluster_sentences_species_parallel_all.pl && ./create_molecule_regulation_expression_cluster_sentences_species_parallel_all.pl && ./create_anatomy_expression_cluster_sentences_species_parallel_all.pl && ./concatenate_sentences_species_parallel_all.pl && ./generate_OA_concise_descriptions_parallel_all.pl
Takes about 6+ hours to run (used to take 4.5 hrs).
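The && chain stops at the first script that fails. The same behaviour, with a per-script timing line added, can be sketched as a loop. This is not the actual content of create_sentences_wrapper.sh; run_in_order is a hypothetical function written only to illustrate the idea.

```shell
# run_in_order CMD...: run each command in turn, report how long it took,
# and stop at the first failure (the same semantics as chaining with &&).
run_in_order() {
    for rio_cmd in "$@"; do
        rio_start="$(date +%s)"
        echo "starting $rio_cmd"
        if ! "$rio_cmd"; then
            echo "FAILED: $rio_cmd; later scripts were not run" >&2
            return 1
        fi
        echo "finished $rio_cmd in $(( $(date +%s) - $rio_start )) s"
    done
}

# Example (not run here):
#   run_in_order ./create_sentence_tissue_expressions_elegans.pl \
#                ./create_GO_sentences_elegans_species_parallel_all.pl
```

The timing lines make it easier to see which script accounts for the change from 4.5 to 6+ hours.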
5. To create a report with the number of descriptions for each species, run:
./total_description_count.pl WS253
Location of directories
All files: Textpresso-dev.caltech.edu:
Data at: /data2/srv/textpresso-dev.caltech.edu/www/docroot/concise_descriptions
Scripts at: /data2/srv/textpresso-dev.caltech.edu/www/cgi-bin/concise_descriptions
Initial requirements
The software for automated concise descriptions is available online: https://github.com/WormBase/automated_descriptions
The requirements are outlined here: http://wiki.wormbase.org/index.php/Generation_of_automated_descriptions
- Make sure that the files in the data directory report the current production release, the release of the sources, and the list of WormBase-supported species. The first two files need to be updated every release:
  - production_release.txt holds the production release for the output of the concise descriptions (needs updating every release).
  - release.txt holds the release of the sources, which is the production release minus 1 (needs updating every release).
  - species.txt lists the tab-separated values of the species abbreviation, project name, full name and gene prefix for each species (needs updating only when new species are added).
c_briggsae PRJNA10731 Caenorhabditis briggsae Cbr
- Go to the directory that includes the scripts:
/home/ranjana/concise_descriptions_scripts
- The following files must be edited when there are changes, e.g. a change of the Postgres IP address or of the ace server port (this will not happen every release, but watch for it).
At: /home/ranjana/concise_descriptions_scripts
- db_ip.txt is the IP address of the SQL database that holds much of the daily updated information for WormBase.
- html.txt is the location of the docroot or /var/www directory for the output.
- parallel_path.txt is the path to where GNU Parallel is installed, different for Ubuntu and RedHat.
- cgi.txt is the location of the scripts.
- acedb_port.txt is the port of the acedb server.
- acedb_host.txt is the IP address/hostname of the acedb server.
- In: /home/ranjana/concise_descriptions_scripts
- Create the directories needed by the scripts for a given production release:
$ ./create_release_directories_parallel.pl WS253
(Sometimes 'parallel' makes a mistake, run twice!)
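The species.txt layout described above (four tab-separated columns, with a full name that itself contains a space) can be read in shell by splitting on tabs only. A minimal sketch; read_species is a hypothetical helper and the output format is made up for illustration.

```shell
# read_species FILE: print one summary line per species entry.
# Splitting on tabs only keeps "Caenorhabditis briggsae" in a single field.
read_species() {
    rs_tab="$(printf '\t')"
    while IFS="$rs_tab" read -r rs_abbr rs_project rs_name rs_prefix || [ -n "$rs_abbr" ]; do
        [ -n "$rs_abbr" ] || continue   # skip blank lines
        echo "$rs_abbr: project=$rs_project name=$rs_name prefix=$rs_prefix"
    done < "$1"
}

# Example (not run here):
#   read_species species.txt
```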
Scripts for source files
- There are input files that must be downloaded and pre-processed, formatted for input; the following scripts must be run:
./biomart_query.pl
./download_gene_lists_elegans.pl
./download_geneIDs_all_parallel.pl
./download_orthologs_all_parallel.pl
./download_gene_associations_parallel_all.pl
./go_terms_only.pl
./get_alt_id_terms_only.pl
./go_obo_to_go_ace.pl
./list_dead_genes.pl
./create_curated_gene_list.pl (queries Postgres)
./list_uncurated_genes.pl
./create_gene_list.pl (queries Postgres)
./parse_gene_lists_elegans.pl
./parse_orthologs_all_parallel.pl
./acedb_gene_class.pl (queries ace server)
./publication_gene.pl (queries Textpresso-dev server, looking at indices, takes 1-2 hours)
./download_wbbt_elegans.pl (downloads the anatomy obo file from the WB FTP site)
./download_anatomy_elegans.pl (downloads the anatomy association file from the ..)
./generate_wbbt_obo_terms_elegans.pl (creates a txt file from the anatomy obo file)
./generate_wbbt_obo_synonyms_elegans.pl (creates a txt file with synonyms)
./00_wbbt_obo_to_wbbt_ace.pl (creates a .ace file with all parents and child terms)
./01_wbbt_obo_to_wbbt_ace_names.pl (same as above with names)
./neuron_xlsx_read.pl (reads the excel spreadsheet from O. Hobert)
./download_hgnc_family_elegans.pl (downloads the HGNC gene families file using LWP::Simple)
./convert_hgnc_gene_family_xlsx.pl (not part of the generation of automated descriptions, but creates the excel spreadsheet for human gene families for viewing purposes)
./download_expression_cluster_summary_all_parallel.pl (downloads expression cluster data for all species)
Scripts for generating descriptions
- ./create_sentence_tissue_expressions_elegans.pl
  - Creates sentences for tissue expression for C. elegans.
- ./create_GO_sentences_elegans_species_parallel_all.pl
  - Creates sentences for non-elegans species for the orthologous elegans gene and the processes in which it's involved.
- ./create_sentence_multiple_orthologs_species_all_parallel_all.pl
  - This is a wrapper that runs several scripts; it creates the orthology sentences for all species including elegans and reads the popularity file.
- ./create_GO_sentences_species_parallel_all.pl
  - Creates the sentences for process, function and sub-cellular localization for all species including elegans.
- ./concatenate_sentences_species_parallel_all.pl
  - Concatenates all sentences for the different semantic categories for all species.
Old SOP
We no longer use Brahma as a development or production server:
On brahma.textpresso.org:
- Scripts are at:
/media/data1/jdone/concise_descriptions/concise_descriptions/parallel
soft-link: concise_descriptions_scripts
- All scripts write the data to the files here: concise_descriptions_data
above soft-links to: /media/data1/jdone/concise_descriptions/www/concise_descriptions
/var/www/html/concise_descriptions points to the same directory
- GitHub ready files, the latest version (not in development) of scripts:
latest_version_scripts
above soft-links to:
/media/data1/jdone/concise_descriptions/concise_descriptions/automated_descriptions/automated_descriptions
- ./generate_OA_concise_descriptions_parallel_all.pl
  - Generates the file for import of the automated descriptions into the OA.
To create the concise descriptions run:
The wrapper written by Yuling for WS254 (April 2016) calls all nine of the scripts below; run it using sudo:
./create_sentences_wrapper.sh
(Keep a space between the script names and the ampersands; do not introduce newline characters.)
./create_sentence_tissue_expressions_elegans.pl && ./create_GO_sentences_elegans_species_parallel_all.pl && ./create_sentence_multiple_orthologs_species_all_parallel_all.pl && ./create_GO_sentences_species_parallel_all.pl && ./create_gene_regulation_expression_cluster_sentences_species_parallel_all.pl && ./create_molecule_regulation_expression_cluster_sentences_species_parallel_all.pl && ./create_anatomy_expression_cluster_sentences_species_parallel_all.pl && ./concatenate_sentences_species_parallel_all.pl && ./generate_OA_concise_descriptions_parallel_all.pl
Takes about 4.5 hours to run; it took about 6 hr 20 min for the WS257 upload (10.12.2016).
Script for a separate file for manually curated genes
One can create a separate file named 'WBGenes_descriptions_for_manual.txt' that holds the automated descriptions for only those genes that have been curated manually.
./create_curated_gene_sentences_elegans.pl
Script for Numbers
Finally, a report is written to detail the number of concise descriptions for each species for a given production release:
./total_description_count.pl WS253
Changes to software, files, workflow
Aug 2017: WS262
- Hinxton switches from Sanger FTP site to EBI FTP site
- Old URL ftp://ftp.sanger.ac.uk/pub/wormbase/releases needs to be switched to ftp://ftp.ebi.ac.uk/pub/databases/wormbase/releases in all scripts that fetch the source files
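Finding and updating the scripts that still use the old URL could be done along these lines. This is a sketch under assumptions: list_old_url_files and switch_ftp_url are hypothetical helpers, and the sketch assumes the URL appears literally in the scripts rather than being built from parts.

```shell
OLD_URL='ftp://ftp.sanger.ac.uk/pub/wormbase/releases'
NEW_URL='ftp://ftp.ebi.ac.uk/pub/databases/wormbase/releases'

# list_old_url_files DIR: print the files under DIR that still contain the old URL.
list_old_url_files() {
    grep -rlF "$OLD_URL" "$1"
}

# switch_ftp_url FILE: replace the old URL with the new one in FILE.
# A copy is kept as FILE.bak; writing back into FILE keeps its permissions.
switch_ftp_url() {
    cp -p "$1" "$1.bak" || return 1
    sed 's|ftp://ftp\.sanger\.ac\.uk/pub/wormbase/releases|'"$NEW_URL"'|g' "$1.bak" > "$1"
}

# Example (not run here):
#   list_old_url_files . | while read -r f; do switch_ftp_url "$f"; done
```

Review the .bak copies with diff before deleting them.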
June 2016: WS255
- The main Biomart web service was still down, so we used the mirror site as recommended; the change is in biomartWebExample.pl: "http://uswest.ensembl.org/biomart/martservice"
- Added a wrapper to get all source files: get_source_files_wrapper.sh, which contains 26 scripts in total.
- For future runs, both "production_release.txt" and "release.txt" need to be changed, where production_release = release + 1
- Outputs on the web are now linked directly to the output files: /data1/Users/liyuling/Curator_related/concise_descriptions/concise_descriptions/release
April 2016: WS254
- The Biomart web service was down, so the WS253 HumanIDs_mart_export.org.txt was used. The protocol for the Biomart web service may have changed, so check in advance for WS255.
- A new wrapper, create_sentences_wrapper.sh, was written to run all 9 of the create* scripts.
- Yuling modified ./create_sentence_tissue_expressions_elegans.pl, due to changes in anatomy_association_WS253:
  - it will look for IDA instead of IEP
  - it will ignore all lines with the 'Enrich' qualifier in column 4, as this is the expression cluster data
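The two filtering rules above can be illustrated with awk. This is not the code in create_sentence_tissue_expressions_elegans.pl, which is a Perl script. The sketch assumes that the association file is tab-delimited and that the evidence code appears as a whole field somewhere on the line; only the position of the qualifier (column 4) comes from this page. filter_anatomy_associations is a hypothetical name.

```shell
# filter_anatomy_associations FILE: print the lines that carry the IDA
# evidence code and do not have the 'Enrich' qualifier in column 4.
filter_anatomy_associations() {
    awk -F '\t' '
        $4 == "Enrich" { next }     # expression cluster data: skip
        {
            # assumption: the evidence code is a whole field, column unknown
            for (i = 1; i <= NF; i++)
                if ($i == "IDA") { print; next }
        }
    ' "$1"
}

# Example (not run here; the file name is a placeholder):
#   filter_anatomy_associations anatomy_association_file.txt
```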
Finding gene popularity (number of mentions in papers)
If one wants to create a listing of Textpresso publications associated with genes in either the 'results' section or the 'body' (if the paper can not be sectioned), then one would execute the following:
$ ./publication_paper_gene_results.pl
$ ./publication_paper_gene_body.pl
If one wants to examine gene popularity for C. elegans in the literature, one can type:
$ Rscript textpresso_gene_popularity.R WS253
The output will be a PDF file named textpresso_gene_popularity.pdf.
If one would like to create a flow chart of one of the perl scripts available in this repository, one can run:
$ ./graph_perl.pl ./my_perl_script.pl jpg ./my_output_directory
This script will produce a text file (my_perl_script.txt), an HTML file (my_perl_script.html) and (if there's enough memory available) a jpg file (my_perl_script.jpg) outlining the flow of data within the script. These files are written into ./my_output_directory.
Other formats besides jpg are available: bmp canon dot gv xdot xdot1.2 xdot1.4 cgimage cmap eps exr fig gd gd2 gif gtk ico imap cmapx imap_np cmapx_np ismap jp2 jpg jpeg jpe pct pict pdf pic plain plain-ext png pov ps ps2 psd sgi svg svgz tga tif tiff tk vml vmlz vrml wbmp webp xlib x11
Software configuration management
Pushing files to GitHub:
Copy any newly developed files or scripts from /home/ranjana/concise_descriptions_scripts to /home/ranjana/latest_version_scripts:
$ cp /home/ranjana/concise_descriptions_scripts/README.md /home/ranjana/latest_version_scripts
$ cd /home/ranjana/latest_version_scripts
git is installed and updated on brahma.textpresso.org and textpresso-dev.caltech.edu:
$ which git
$ git --version
$ git init (done once)
$ git remote add origin https://github.com/WormBase/automated_descriptions.git (only done to change github repository)
$ git add README.md
$ git commit -m "Add Readme"
$ git push
To check if your changes were uploaded, open https://github.com/WormBase/automated_descriptions in a browser.
To examine the number of lines of code (without comments or spaces), use cloc or sloccount:
$ cloc --version
$ sloccount --version
$ cloc /media/data1/jdone/concise_descriptions/concise_descriptions/automated_descriptions/automated_descriptions
$ sloccount /media/data1/jdone/concise_descriptions/concise_descriptions/automated_descriptions/automated_descriptions
Old documentation
Moving files to Textpresso-dev for production:
http://textpresso-dev.caltech.edu/concise_descriptions/ shows /home/ranjana/automated_descriptions_data which is actually a soft link to /data2/srv/textpresso-dev.caltech.edu/www/docroot/concise_descriptions
1. On ranjana@brahma.textpresso.org, go to:
concise_descriptions_data/release
For the WS254 release folder, rsync was used:
ranjana@brahma:~/concise_descriptions_data/release$ rsync -r WS254 ranjana@textpresso-dev.caltech.edu:/home/ranjana/data/.
This gives the following type of message for all 9 species:
skipping non-regular file "WS254/b_malayi/orthology/input_files/b_malayi.orthologs.txt"
Takes a couple of minutes, less than 5.
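rsync prints "skipping non-regular file" when it meets something other than a regular file or directory and was not told how to handle it. With plain -r that most often means symbolic links, so the orthologs files are most likely soft links, although this page does not confirm it. The sketch below lists the links in a release folder so they can be reviewed before copying; list_symlinks is a hypothetical helper.

```shell
# list_symlinks DIR: print every symbolic link under DIR.
list_symlinks() {
    find "$1" -type l
}

# Example (not run here):
#   list_symlinks WS254
# If the listed files are needed on the target, rsync's standard options
# -l (--links, copy links as links) or -L (--copy-links, copy the files the
# links point to) can be added, e.g.:
#   rsync -rL WS254 ranjana@textpresso-dev.caltech.edu:/home/ranjana/data/.
```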
Previously used scp to do the same:
Then use 'scp -r' to copy the entire release directory, e.g. WS253, to ranjana@textpresso-dev.caltech.edu:/home/ranjana/data
This takes about an hour to 1.5 hrs.
2. Log on to Textpresso-dev,
then move (mv) WS253 from /home/ranjana/data to /home/ranjana/automated_descriptions_data/release
In: /home/ranjana/automated_descriptions_data/release
mv ~/data/WS253/ .