Difference between revisions of "Website:JBrowse set-up on dev.wormbase.org"

From WormBaseWiki

Revision as of 19:27, 5 January 2015

These are the directions for setting up a new instance of JBrowse on dev.wormbase.org, which is where the current production version of JBrowse is hosted.

Set up JBrowse

Get the current JBrowse release and put it somewhere with a fair amount of disk space (the current build is about 45GB). This is the directory JBrowse will be served from, so it will need to be made accessible to Apache when the setup is complete. The actual setup of JBrowse is quite easy; in the directory where the JBrowse release has been unzipped, do this:

 ./setup.sh

without root privileges. The setup script will use local::lib to install any needed perl prerequisites as well as format some sample data (which can be deleted later from the sample_data directory).

Set up build configuration

From a fresh checkout of the website-admin repository, copy the build config file, website-admin/jbrowse/conf/c_elegans.jbrowse.conf, to a temporary working directory. This directory is where GFF files will be copied and parsed, and where the JBrowse script flatfile-to-json.pl will be run. While the config file is named "c_elegans..." it is actually organism agnostic and will be used for every organism. c_elegans.jbrowse.conf is a fairly simple ini-style config file, with a general section at the top which you will need to edit, and sections for each track below, which you generally will not need to edit unless a new data type is added that no organism in JBrowse has used before (for example, if a track is already configured for C. elegans but is added for C. briggsae, you won't have to do anything in the config file for the build script to automatically include that track for C. briggsae).

This is a sample from the top of the c_elegans.jbrowse.conf:

 release=246
 filedir=/usr/local/ftp/pub/wormbase/releases/
 nosplitgff = 1
 usenice=1
 skipprepare=0
 jbrowsedir=/home/scain/scain/jbrowse-test
 allstats=/usr/local/wormbase/website/tharris/conf/gbrowse/releases/WS246/ALL_SPECIES.stats
 includes=/home/scain/scain/website-admin/jbrowse/jbrowse/data/c_elegans/includes
 functions=/home/scain/scain/website-admin/jbrowse/jbrowse/data/functions.conf
 organisms=/home/scain/scain/website-admin/jbrowse/jbrowse/data/organisms.conf
 glyphs=/home/scain/scain/website-admin/jbrowse/jbrowse/src/JBrowse/View/FeatureGlyph

Most of these items you won't have to change often or at all. The primary things you need to change are the "release" entry and the "allstats" entry (though the need for the allstats entry may go away, as the path to it should be inferable from the release number). The paths to items in the website-admin directory should be updated to point to your current checkout to ensure that the most current configuration is used. The other items probably won't ever need to be changed, but here's an explanation of the non-obvious ones:

  • nosplit - (not currently used) Some of the formatting steps can take a very long time with large GFF files, so one of the intermediate steps the script will take is to split the GFF file into multiple files based on the reference sequence. This may be less desirable when the genome consists of 10,000 contigs than when there are 6 chromosomes.
  • usenice - Run all of the commands with the Unix nice command to bump their priority down in the command scheduler.
  • skipprepare - Skips the running of the prepare-refseqs.pl JBrowse script. Generally there isn't much point in turning this on, even if you are rerunning the build in a directory where prepare-refseqs.pl has already been run for a given data set, as it is much faster than formatting track data.
  • jbrowsedir - Path to the JBrowse directory that data will be served from.

Track specific configurations

As noted above, you generally won't have to edit any of the track specific configuration stanzas, but I'll describe the options here for when they need updating. Here are two example track configurations demonstrating the available options:

[genes]
grep=\tWormBase
prefix=wormbase_genes
type=gene:WormBase,gene:WormBase_imported
label=Curated Genes
index=1

[genes_protein_coding]
label=Curated Genes (protein coding)
altfile=genes
type=mRNA:WormBase,mRNA:WormBase_imported
index=1

Options:

Section name (in square brackets): this corresponds to the name of the track configuration file in the includes directory (so if the section name is "genes" there will be a "genes.json" track configuration file). It also corresponds to the name of the track file in the GBrowse build procedure.

grep - A perl-style regular expression that will be used to create a "sub-GFF" file that contains only the features needed for building this track. (The build script uses "grep -P" under the hood.)

prefix - The string that will be used to create a temporary file for the sub-GFF file; the file name will be "prefix_originalfilename". In practice, you won't see this, since it will be deleted after an organism's JBrowse instance has been successfully built.

type - A comma-delimited list of the "type:source" GFF features that are included in this track. The source is not required.

label - Corresponds to GBrowse's key parameter--it's the human-readable string for identifying a track; in WormBase's GBrowse, this string also does double duty as it is used in URLs as well for specifying tracks to include.

index - Whether to include this track when building the name index used for searching and autocompletion.

altfile - The name of another section where a sub-GFF file has already been built that can be used for this track too. The above example shows an instance where this is useful: since a GFF file containing all of the features needed to build the Curated Genes track has already been created, that GFF file will also have all of the features needed to create the protein coding genes track. Note that the order in which these track configuration stanzas appear in the configuration file doesn't matter, as all of the GFF files are created before any data are formatted.

Additionally, there is one configuration option that is available but not currently used: origfile - this option tells the build script to use the unmodified GFF file when running flatfile-to-json.pl. This option could be used when it is too difficult to write a regular expression that creates a suitable sub-GFF file (really it means your regex-fu is lacking).
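The grep and prefix options described above can be illustrated with a small shell sketch. This is not the code from make_jbrowse.pl; the function name make_subgff is hypothetical, and writing the sub-GFF file next to the original file is an assumption. It only shows how a "grep -P" pattern and a prefix combine to produce a "prefix_originalfilename" file:

```shell
#!/bin/sh
# Hypothetical helper (not part of make_jbrowse.pl): build a sub-GFF file
# from a Perl-style regular expression and a prefix.
# Usage: make_subgff PATTERN PREFIX GFF_FILE
make_subgff() {
    pattern="$1"
    prefix="$2"
    gff="$3"
    # Assumption: the sub-GFF file is written beside the original file.
    out="$(dirname "$gff")/${prefix}_$(basename "$gff")"
    # grep -P (GNU grep) interprets \t in the pattern as a tab character.
    grep -P "$pattern" "$gff" > "$out"
    echo "$out"
}

# Example, using the [genes] stanza shown above (the GFF file name is a placeholder):
#   make_subgff '\tWormBase' wormbase_genes some_species.gff
```

Running the same pattern by hand against a GFF file is a quick way to check that a new track stanza selects the features you expect before starting a full build.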

Create a multi-species build script

make_jbrowse.pl

The main driver of the JBrowse build procedure is the make_jbrowse.pl perl script, located in the website-admin repository at website-admin/jbrowse/bin. This script has several command line options:

       --config         Path to ini-style config file (required)
       --species        Name of species (should use same format as ALL_SPECIES.stats)
       --filedir        Path to parent directory where releases and files can be found
       --gfffile        Path to an input GFF file
       --fastafile      Path to an input FASTA file
       --datadir        Relative path (from jbrowse root) to jbrowse data dir
       --jbrowsedir     Path to jbrowse directory
       --nosplit        Don't split GFF file by reference sequence
       --usenice        Run formatting commands with Unix nice
       --skipfilesplit  Don't split files or use grep to make subfiles
       --skipprepare    Don't run prepare-refseqs.pl
       --allstats       Path to the ALL_SPECIES.stats file
       --quiet          Limit output to errors

Most of these items can also be specified in the config file described above and have the same meanings, but if they are specified on the command line they will override the config file. The only option that is absolutely required is --config, and in practice --species is also required (if you are reusing the config file to build multiple species in one go, it makes sense to specify the species on the command line). The --quiet option is also nice in production, since by default every command that gets executed (the greps that create GFF files, prepare-refseqs.pl and flatfile-to-json.pl) is echoed to stdout. With --quiet turned on, there is significantly less output; generally information is only emitted when a new run of make_jbrowse.pl starts and when there are warnings about missing data or configurations. All of the other parameters are either specified in the configuration file or inferred from those data, like the paths to GFF and FASTA files for a particular species.

Warnings from make_jbrowse.pl

There are two warnings that make_jbrowse.pl may emit:

  • WARNING: TRACK WITH DATA BUT NO CONFIG: $key

This happens when there is a non-zero entry for a type:source in the ALL_SPECIES.stats file for a species and the TRACK column does not say "NO CONFIG AVAILABLE", indicating that this data type is represented in GBrowse (and there is a track configuration for it), but there is no entry for it in the build config; see [[#

  • MISSING INCLUDE FILE: $section.json

This occurs when there is a non-zero entry for a type:source in the ALL_SPECIES.stats file for a species and the TRACK column does not say "NO CONFIG AVAILABLE", indicating that this data type is represented in GBrowse (and there is a track configuration for it), but there is no json track configuration file for this data.
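If the build output has been captured to a file, the two warnings can be tallied with a short shell function. This is only a sketch: the function name count_warnings and the log file name are hypothetical, and the only thing taken from the script is the warning text quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count the two make_jbrowse.pl warnings in a captured log.
# Usage: count_warnings LOGFILE
count_warnings() {
    log="$1"
    # grep -c prints the number of matching lines (0 if there are none).
    no_config=$(grep -c 'TRACK WITH DATA BUT NO CONFIG' "$log")
    missing_include=$(grep -c 'MISSING INCLUDE FILE' "$log")
    echo "no_config=$no_config missing_include=$missing_include"
}

# Example (the log file name is a placeholder):
#   count_warnings jbrowse_build.log
```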

Create a "multi-species" build shell script and run it (best done in a screen session, and/or with stdout and stderr captured to a file; it will take about a day).
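A minimal sketch of what such a multi-species script might look like follows. The species names, log file name and relative paths are placeholders rather than the values used on dev.wormbase.org, and the DRY_RUN switch is an addition for illustration; only the --config, --species and --quiet options come from the list above.

```shell
#!/bin/sh
# Sketch of a multi-species build wrapper. All values below are placeholders
# (assumptions); adjust them to your checkout and working directory.
MAKE_JBROWSE="${MAKE_JBROWSE:-website-admin/jbrowse/bin/make_jbrowse.pl}"
CONFIG="${CONFIG:-c_elegans.jbrowse.conf}"
LOG="${LOG:-jbrowse_build.log}"
# 1 = only print the commands (default in this sketch); 0 = actually run them.
DRY_RUN="${DRY_RUN:-1}"
# Placeholder names; use the species names as they appear in ALL_SPECIES.stats.
SPECIES_LIST="${SPECIES_LIST:-species_one species_two}"

build_species() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "perl $MAKE_JBROWSE --config $CONFIG --species $1 --quiet"
    else
        # Append stdout and stderr to the log so warnings can be reviewed later.
        perl "$MAKE_JBROWSE" --config "$CONFIG" --species "$1" --quiet >>"$LOG" 2>&1
    fi
}

for species in $SPECIES_LIST; do
    build_species "$species" || echo "build failed for $species" >&2
done
```

Building one species per invocation with the same config file follows the advice above about specifying --species on the command line. Set DRY_RUN=0 once the printed commands look right.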

Update apache config to point at new build.