Documentation for workflow and scripts

From WormBaseWiki

Revision as of 22:27, 16 November 2015

The software for automated concise descriptions is available online: https://github.com/WormBase/automated_descriptions

The requirements are outlined here: http://wiki.wormbase.org/index.php/Generation_of_automated_descriptions

The following files must be created in the cgi-bin directory, or wherever the scripts for concise descriptions are hosted:

db_ip.txt holds the IP address of the SQL database that stores much of the daily updated information for WormBase.
html.txt holds the location of the docroot or /var/www directory for the output.
parallel_path.txt holds the path to where GNU Parallel is installed.
cgi.txt holds the location of the scripts.
acedb_port.txt holds the port number of the acedb server.
acedb_host.txt holds the IP address or hostname of the acedb server.
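
Before running anything, it can help to confirm that all six files are in place. The following is a minimal sketch, not part of the repository: the helper name check_config_files is hypothetical, and it only tests that each file exists and is non-empty, not that its contents are valid.

```sh
# Sketch: verify that the six configuration files listed above exist
# and are non-empty in a given directory. check_config_files is a
# hypothetical helper, not a script shipped with the repository.
check_config_files() {
    config_dir=$1
    missing=0
    for name in db_ip.txt html.txt parallel_path.txt cgi.txt \
                acedb_port.txt acedb_host.txt; do
        # -s is true only when the file exists and has a size above zero
        if [ ! -s "$config_dir/$name" ]; then
            echo "missing or empty: $config_dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_config_files . from the scripts directory returns a non-zero status and names each file that still needs to be created.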

Make sure that the following files in the docroot or /var/www directory, where the data are stored, report the current production release, the release of the sources, and the list of WormBase-supported species:

 production_release.txt holds the production release for the output of the concise descriptions.
 release.txt            holds the release information for the sources.
 species.txt            lists, one species per line, the tab-separated values of the species abbreviation,
                        project name, full name and gene prefix, for example:
                        c_briggsae	PRJNA10731	Caenorhabditis briggsae	Cbr
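
To illustrate the species.txt layout, here is a small sketch that looks up the gene prefix for one species. It assumes exactly the four tab-separated columns described above; the function name species_gene_prefix is hypothetical and is not taken from the repository.

```sh
# Sketch: print the gene prefix for one species from species.txt.
# Assumes four tab-separated columns per line:
# abbreviation, project name, full name, gene prefix.
species_gene_prefix() {
    species_file=$1
    abbreviation=$2
    # Split on tabs only, because the full name contains a space.
    awk -F '\t' -v abbr="$abbreviation" '$1 == abbr { print $4 }' "$species_file"
}
```

With the example line above in species.txt, species_gene_prefix species.txt c_briggsae prints Cbr.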

The first step is to change to the directory that contains the scripts.

Then create the directories needed by the scripts for a given production release:

$ ./create_release_directories_parallel.pl WS250
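
A mistyped release name would presumably create directories under the wrong name, so a quick check before the call may be useful. The sketch below assumes that release names always have the form WS followed by digits, as in the WS250 example; that pattern and the helper name is_release_name are assumptions, not something the scripts are known to enforce.

```sh
# Sketch: succeed only for names of the form WS<digits>, e.g. WS250.
# The naming pattern is an assumption based on the example above.
is_release_name() {
    case $1 in
        WS|WS*[!0-9]*) return 1 ;;  # no digits, or a non-digit after WS
        WS[0-9]*)      return 0 ;;  # WS followed by digits only
        *)             return 1 ;;  # anything that does not start with WS
    esac
}
```

For example: is_release_name "$release" && ./create_release_directories_parallel.pl "$release"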

The input files must then be downloaded and formatted by running the following scripts:

$ ./biomart_query.pl
$ ./download_gene_lists_elegans.pl
$ ./download_orthologs_all_parallel.pl
$ ./download_gene_associations_parallel_all.pl
$ ./go_terms_only.pl
$ ./get_alt_id_terms_only.pl
$ ./go_obo_to_go_ace.pl
$ ./list_dead_genes.pl
$ ./create_curated_gene_list.pl
$ ./list_uncurated_genes.pl
$ ./create_gene_list.pl
$ ./parse_gene_lists_elegans.pl
$ ./parse_orthologs_all_parallel.pl
$ ./acedb_gene_class.pl
$ ./publication_gene.pl
$ ./download_wbbt_elegans.pl
$ ./download_anatomy_elegans.pl
$ ./generate_wbbt_obo_terms_elegans.pl
$ ./generate_wbbt_obo_synonyms_elegans.pl
$ ./00_wbbt_obo_to_wbbt_ace.pl
$ ./01_wbbt_obo_to_wbbt_ace_names.pl
$ ./neuron_xlsx_read.pl
$ ./download_hgnc_family_elegans.pl
$ ./convert_hgnc_gene_family_xlsx.pl
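
Typing two dozen commands by hand makes it easy to miss a failure partway through. The following sketch runs a list of scripts from the current directory and stops at the first one that exits with a non-zero status. It assumes the order listed above is the intended order and that each script signals failure through its exit status; run_in_order is a hypothetical helper, not part of the repository.

```sh
# Sketch: run the named scripts from the current directory, in the
# order given, and stop at the first failure.
run_in_order() {
    for script in "$@"; do
        echo "running $script" >&2
        "./$script" || {
            echo "failed: $script" >&2
            return 1
        }
    done
}
```

It would be called with the script names in the order above, for example run_in_order biomart_query.pl download_gene_lists_elegans.pl, continuing through the rest of the list.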

Then the creation of the concise descriptions begins:

$ ./create_sentence_tissue_expressions_elegans.pl && ./create_GO_sentences_elegans_species_parallel_all.pl \
  && ./create_sentence_multiple_orthologs_species_all_parallel_all.pl && ./create_GO_sentences_species_parallel_all.pl \
  && ./concatenate_sentences_species_parallel_all.pl && ./generate_OA_concise_descriptions_parallel_all.pl

For curation purposes only, one can create a file named "WBGenes_descriptions_for_manual.txt" to list genes that have been curated manually with updated gene ontology and orthology data:

$ ./create_curated_gene_sentences_elegans.pl

Finally, a report is written to detail the number of concise descriptions for each species for a given production release:

$ ./total_description_count.pl WS250

If one wants to create a listing of Textpresso publications associated with genes in either the 'results' section or the 'body' (if the paper cannot be sectioned), one can execute the following:

$ ./publication_paper_gene_results.pl
$ ./publication_paper_gene_body.pl

If one wants to examine gene popularity for C. elegans in the literature, one can type:

$ Rscript textpresso_gene_popularity.R WS250

The output will be a PDF file named textpresso_gene_popularity.pdf.

If one would like to create a flow chart of one of the Perl scripts available in this repository, one can type:

$ ./graph_perl.pl ./my_perl_script.pl jpg ./my_output_directory

This script will produce a text file (my_perl_script.txt), an HTML file (my_perl_script.html) and (if enough memory is available) a jpg file (my_perl_script.jpg) outlining the flow of data within the script. These files are written into ./my_output_directory.

Other formats besides jpg are available: bmp canon dot gv xdot xdot1.2 xdot1.4 cgimage cmap eps exr fig gd gd2 gif gtk ico imap cmapx imap_np cmapx_np ismap jp2 jpg jpeg jpe pct pict pdf pic plain plain-ext png pov ps ps2 psd sgi svg svgz tga tif tiff tk vml vmlz vrml wbmp webp xlib x11
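
The naming of the output files can be sketched as follows. This only mirrors the example above (script name without its .pl suffix, plus .txt, .html and the chosen image format); it has not been checked against graph_perl.pl itself, and graph_output_paths is a hypothetical helper.

```sh
# Sketch: list the output paths one would expect from graph_perl.pl,
# inferred from the example above rather than from the script itself.
graph_output_paths() {
    script_path=$1
    image_format=$2
    output_dir=$3
    # Strip the directory and the .pl suffix:
    # ./my_perl_script.pl becomes my_perl_script
    base=$(basename "$script_path" .pl)
    for ext in txt html "$image_format"; do
        echo "$output_dir/$base.$ext"
    done
}
```

Such a list could be used to check, after a run, which of the three files were actually written, since the image file may be missing when memory is short.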

Back To Generation of automated descriptions