Website:JBrowse set-up on dev.wormbase.org

From WormBaseWiki
Revision as of 15:11, 26 July 2016 by Scain (talk | contribs) (→‎Update apache config)

These are the directions for setting up a new instance of JBrowse on dev.wormbase.org, which is where the current production version of JBrowse is hosted.

Set up JBrowse

Get the current JBrowse release and put it somewhere with a fair amount of disk space (the current build is about 45GB). This directory is where JBrowse will be served from, so it will need to be made accessible to Apache when the setup is complete. The actual setup of JBrowse is quite easy; in the directory where the JBrowse release has been unzipped, do this:

 ./setup.sh

without root privileges. The setup script will use local::lib to install any needed perl prerequisites as well as format some sample data (which can be deleted later from the sample_data directory). It may also emit a warning about failing to build legacy wiggle or bam support--ignore this, we don't need it.
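Since the build needs roughly 45GB, it can be worth checking free space before unpacking. The following is a minimal sketch, not part of the JBrowse release; the function name has_space is hypothetical, and it assumes a POSIX df and awk.

```shell
# has_space DIR MIN_GB: succeed if DIR's filesystem has at least MIN_GB free.
# (Hypothetical helper, not part of JBrowse or website-admin.)
has_space() {
    dir=$1 min_gb=$2
    # df -P prints one line per filesystem and -k reports 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, `has_space . 45 || echo "less than 45GB free here"` run from the directory where the release will be unzipped.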

Set up build configuration

From a fresh checkout of the website-admin repository, copy the build config file, website-admin/jbrowse/conf/c_elegans.jbrowse.conf, to a temporary working directory. This directory is where GFF files will be copied and parsed, and where the JBrowse script flatfile-to-json.pl will be run. While the config file is named "c_elegans..." it is actually organism agnostic and will be used for every organism. c_elegans.jbrowse.conf is a fairly simple ini-style config file, with a general section at the top, which you will need to edit, and sections for each track below. You generally will not need to edit the track sections unless a new data type is added that no organism in JBrowse has used before (for example, if a track is already configured for C. elegans but is added for C. briggsae, you won't have to do anything in the config file for the build script to automatically include that track from C. briggsae).

This is a sample from the top of the c_elegans.jbrowse.conf:

 release=246
 filedir=/usr/local/ftp/pub/wormbase/releases/
 nosplitgff = 1
 usenice=1
 skipprepare=0
 jbrowsedir=/home/scain/scain/jbrowse-test
 allstats=/usr/local/wormbase/website/tharris/conf/gbrowse/releases/WS246/ALL_SPECIES.stats
 includes=/home/scain/scain/website-admin/jbrowse/jbrowse/data/c_elegans/includes
 functions=/home/scain/scain/website-admin/jbrowse/jbrowse/data/functions.conf
 organisms=/home/scain/scain/website-admin/jbrowse/jbrowse/data/organisms.conf
 glyphs=/home/scain/scain/website-admin/jbrowse/jbrowse/src/JBrowse/View/FeatureGlyph

Most of these items you won't have to change often or at all. The primary things you need to change are the "release" entry and the "allstats" entry (though the need for the allstats entry may go away, as the path to it should be inferable from the release number). The paths to items in the website-admin directory should be updated to point to your current checkout to ensure that the most current configuration is used. The other items probably won't ever need to be changed, but here's an explanation of the non-obvious ones:

  • nosplit (nosplitgff in the sample above) - (not currently used) Some of the formatting steps can take a very long time with large GFF files, so one of the intermediate steps the script can take is to split the GFF file into multiple files based on the reference sequence. Splitting may be less desirable when the genome consists of 10,000 contigs than when there are 6 chromosomes.
  • usenice - Run all of the commands with the Unix nice command to bump their priority down in the command scheduler.
  • skipprepare - Skips the running of the prepare-refseqs.pl JBrowse script. Generally there isn't much point in turning this on, even if you are rerunning the build in a directory where prepare-refseqs.pl has already been run for a given data set, as it is much faster than formatting track data. The prepare-refseqs.pl script formats a FASTA file for JBrowse and is the first step in building a JBrowse instance for a data set.
  • jbrowsedir - Path to the JBrowse directory that data will be served from.
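Updating the "release" and "allstats" entries for a new release can be scripted. This is a minimal sketch rather than an existing tool: the function name set_release is hypothetical, and it assumes the allstats path contains a "releases/WS###/" component as in the sample above and that the release is a plain number.

```shell
# set_release CONF RELEASE: rewrite the release number in a copy of the
# build config. (Hypothetical helper; a backup is left in CONF.bak.)
set_release() {
    conf=$1 rel=$2
    sed -i.bak \
        -e "s/^\([[:space:]]*release[[:space:]]*=[[:space:]]*\).*/\1$rel/" \
        -e "s#^\([[:space:]]*allstats[[:space:]]*=.*/releases/WS\)[0-9]*/#\1$rel/#" \
        "$conf"
}
```

For example, `set_release c_elegans.jbrowse.conf 254` changes release=246 to release=254 and WS246 to WS254 in the allstats path, leaving the other entries alone.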

Track specific configurations

As noted above, you generally won't have to edit any of the track specific configuration stanzas, but I'll describe the options here for when they need updating. Here are two example track configurations demonstrating the available options:

 [genes]
 grep=\tWormBase
 prefix=wormbase_genes
 type=gene:WormBase,gene:WormBase_imported
 label=Curated Genes
 index=1

 [genes_protein_coding]
 label=Curated Genes (protein coding)
 altfile=genes
 type=mRNA:WormBase,mRNA:WormBase_imported
 index=1
Options:

Section name (in square brackets): this corresponds to the name of the track configuration file in the includes directory (so if the section name is "genes" there will be a "genes.json" track configuration file). It also corresponds to the name of the track file in the GBrowse build procedure.

grep - A perl-style regular expression that will be used to create a "sub-GFF" file that contains only the features needed for building this track. (The build script uses "grep -P" under the hood.)

prefix - The string that will be used to create a temporary file for the sub-GFF file; the file name will be "prefix_originalfilename". In practice, you won't see this, since it will be deleted after an organism's JBrowse instance has been successfully built.

type - A comma-delimited list of the "type:source" GFF features that are included in this track. The source is not required.

label - Corresponds to GBrowse's key parameter--it's the human-readable string for identifying a track; in WormBase's GBrowse, this string also does double duty as it is used in URLs as well for specifying tracks to include.

index - Whether to include this track when building the name index used for searching and autocompletion (set to 1 in the examples above).

altfile - The name of another section where a sub-GFF file has already been built that can be used for this track too. The above example shows an instance where this is useful: since a GFF file that has all of the features needed to build the Curated Genes track has already been created, that GFF file will also have all of the features needed to create the protein coding genes track. Note that the order in which these track configuration stanzas appear in the configuration file doesn't matter, as all of the GFF files are created before any data are formatted.

Additionally, there is one configuration option that is available but not currently used: origfile - this option tells the build script to use the unmodified GFF for running flatfile-to-json.pl. This option could be used when it is too difficult to write a regular expression to create a suitable sub-GFF file (really it means your regex-fu is lacking).
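To illustrate how the grep and prefix options fit together, here is a rough sketch of the sub-GFF step for the [genes] stanza. It is not the code from the build script: the function name make_subgff is hypothetical, writing the output next to the input file is an assumption, and it relies on a grep that supports -P (GNU grep).

```shell
# make_subgff PATTERN PREFIX GFF: write the lines of GFF that match the
# perl-style PATTERN to "PREFIX_<original file name>" and print that path.
# (Hypothetical helper illustrating the grep/prefix options.)
make_subgff() {
    pattern=$1 prefix=$2 gff=$3
    out="$(dirname "$gff")/${prefix}_$(basename "$gff")"
    grep -P "$pattern" "$gff" > "$out"
    echo "$out"
}
```

For example, `make_subgff '\tWormBase' wormbase_genes c_elegans.gff` keeps every line containing a tab followed by "WormBase", which also picks up the WormBase_imported source named in the type option.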

Running the build

make_jbrowse.pl

The main driver of the JBrowse build procedure is the make_jbrowse.pl perl script, located in the website-admin repository at website-admin/jbrowse/bin. This script has several command line options:

       --config         Path to ini-style config file (required)
       --species        Name of species (should use same format as ALL_SPECIES.stats)
       --filedir        Path to parent directory where releases and files can be found
       --gfffile        Path to an input GFF file
       --fastafile      Path to an input FASTA file
       --datadir        Relative path (from jbrowse root) to jbrowse data dir
       --jbrowsedir     Path to jbrowse directory
       --nosplit        Don't split GFF file by reference sequence
       --usenice        Run formatting commands with Unix nice
       --skipfilesplit  Don't split files or use grep to make subfiles
       --skipprepare    Don't run prepare-refseqs.pl
       --allstats       Path to the ALL_SPECIES.stats file
       --quiet          Limit output to errors

Most of these items can also be specified in the config file described above and have the same meanings, but if they are specified on the command line they will override the config file. The only thing that is absolutely required is --config, and in practice, --species is also required (if you are reusing the config file for building multiple species in one go, it makes sense to specify the species on the command line). The --quiet option is also nice in production, since by default every command that gets executed (greps to create GFF files, prepare-refseqs.pl and flatfile-to-json.pl) is echoed to stdout. With --quiet turned on, there is significantly less output; generally, information is only emitted when a new run of make_jbrowse.pl is started and when there are warnings about missing data or configurations. All of the other parameters are either specified in the configuration file or are inferred from those data, like the paths to GFF and FASTA files for a particular species.

Warnings from make_jbrowse.pl

There are two warnings that make_jbrowse.pl may emit:

  • WARNING: TRACK WITH DATA BUT NO CONFIG: $key

This happens when there is a non-zero entry for a type:source in the ALL_SPECIES.stats file for a species and the TRACK column does not say "NO CONFIG AVAILABLE", indicating that this data type is represented in GBrowse (and there is a track configuration for it), but there is no entry for it in the build config; see #Track_specific_configurations above for information on how this is configured.

  • MISSING INCLUDE FILE: $section.json

This occurs when there is a non-zero entry for a type:source in the ALL_SPECIES.stats file for a species and the TRACK column does not say "NO CONFIG AVAILABLE", indicating that this data type is represented in GBrowse (and there is a track configuration for it), but there is no json track configuration for this data. The available track configuration files are in the website-admin repository at website-admin/jbrowse/jbrowse/data/c_elegans/includes. The format for track configurations is described in the GMOD wiki page for JBrowse configuration for CanvasFeature tracks, though the appropriate format can generally be inferred from the configuration for a similar track.
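If the output of a run has been captured to a file, the two warnings can be tallied with a short helper. This is only a sketch: the function name summarize_warnings is hypothetical, and it assumes the messages appear in the output exactly as written above.

```shell
# summarize_warnings LOG: count each distinct warning line in LOG,
# most frequent first. (Hypothetical helper.)
summarize_warnings() {
    log=$1
    grep -E 'WARNING: TRACK WITH DATA BUT NO CONFIG|MISSING INCLUDE FILE' "$log" \
        | sort | uniq -c | sort -rn
}
```

Each output line is a count followed by the warning text, so a type:source or json file that is missing for many species shows up once with a high count.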

Create a "multi-species" build shell script

The make_jbrowse.pl script runs for a single species at a time. Since we'd typically want to build a JBrowse instance for all species in WormBase, it makes sense to create a shell script to run the species consecutively, or to make multiple shell scripts to run several species in parallel on multiple cores. I use vim with regular expressions to write a simple shell script. First, I get a list of the available species:

 head -n 1 ALL_SPECIES.stats > buildall.sh

After removing "FEATURE SOURCE TYPE TRACK " from the beginning of the single line, I run this regular expression to get them all on individual lines:

 :. s/\s\+/\r/g  (ugh, hate vim regexes)

Then I run this regex to put in the commands:

 :% s/^/..\/bin\/make_jbrowse.pl --conf c_elegans.jbrowse.conf --quiet --species /

Of course, make sure the path to make_jbrowse.pl is correct.
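For anyone who would rather not do this in vim, the same file can be generated non-interactively. This is a sketch under the assumption that the header line of ALL_SPECIES.stats is whitespace-separated, starts with the four fields FEATURE SOURCE TYPE TRACK, and that species names contain no whitespace; the function name gen_buildall is hypothetical.

```shell
# gen_buildall STATS [SCRIPT] [CONF]: print one make_jbrowse.pl command
# per species named in the header line of STATS. (Hypothetical helper.)
gen_buildall() {
    stats=$1
    script=${2:-../bin/make_jbrowse.pl}
    conf=${3:-c_elegans.jbrowse.conf}
    # Fields 1-4 of the header are FEATURE SOURCE TYPE TRACK; the rest
    # are species names.
    head -n 1 "$stats" | awk -v script="$script" -v conf="$conf" '{
        for (i = 5; i <= NF; i++)
            print script " --conf " conf " --quiet --species " $i
    }'
}
```

Running `gen_buildall ALL_SPECIES.stats > buildall.sh` should produce the same commands as the vim steps above; the second and third arguments override the script path and config file name.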

Now, using something that will keep the script running after logging out (like nohup or running in a screen process), run the build script, collecting the output to check for warnings:

 bash buildall.sh &> buildall.out

On the current dev.wormbase.org hardware, this will run in under 24 hours (typically around 18 hours). When the script is done, the original GFF and FASTA files for each species will remain; they can be deleted manually, or if you are happy with the resulting JBrowse instance, the working directory can be deleted altogether.
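As one way of keeping the build alive after logout, the command above can be wrapped with nohup and put in the background. This is a sketch, not the procedure actually used on dev.wormbase.org; the function name run_build_detached is hypothetical.

```shell
# run_build_detached SCRIPT LOG: run SCRIPT under nohup in the background,
# sending stdout and stderr to LOG. (Hypothetical helper.)
run_build_detached() {
    script=$1 log=$2
    nohup bash "$script" > "$log" 2>&1 &
    echo "started PID $!; output in $log"
}
```

For example, `run_build_detached buildall.sh buildall.out` returns immediately, and the log can be followed with `tail -f buildall.out`.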

A note about killing the build process

Since the make_jbrowse.pl script uses the perl "system" command to spawn long-running processes, hitting Ctrl-C while a shell script like buildall.sh is running will not kill the whole run; it will only kill the process started by the current system command. Additionally, killing one invocation of make_jbrowse.pl just lets the shell script start the next one, so it will frequently take many Ctrl-Cs to get it to stop.

Update apache config

After the run has completed, update the apache configuration to point at the newly created jbrowse configuration. Configuring is quite easy, as all you need to do is make the jbrowse directory available to serve documents, so you need a DocumentRoot entry and a Directory configuration for it. See the existing entries for jbrowse in /etc/apache2/sites-enabled/000-default on dev.wormbase.org for examples.

Moving to production

When the build is complete on dev and it's ready to be moved to production, simply create a directory somewhere on gbrowse.wormbase.org and rsync it over, getting the tools/genome/ directories too (it's what apache expects and makes it easier to add items like jbrowse-simple, the version with the checkbox/hierarchical track selector in addition to the faceted track selector). For example, I typically build in /usr/local/wormbase/website/scain in a directory called ###_jbrowse_build, which has jbrowse and build directories in it. To rsync this to the gbrowse server, I do this:

 rsync -av scain@10.0.0.65:/usr/local/wormbase/website/scain/254_jbrowse_build/jbrowse/ .

in the newly created directory for the JBrowse build. I usually do this inside a screen process, since it will take a while to run. Also note that I'm using the internal AWS IP address for the rsync so that the data transfer is free.

After the transfer is complete, create a symlink in /usr/local/wormbase pointing at the new build:

 sudo ln -s /usr/local/wormbase/website/scain/254_jbrowse_build/jbrowse/ jbrowse-254

and when it's time for the release to go live, just rename the symlink to "jbrowse"; note that I usually keep the old symlink under a release-numbered name in case I need to quickly revert, so I do this:

 sudo mv jbrowse jbrowse-253; sudo mv jbrowse-254 jbrowse
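The swap can be wrapped with a few sanity checks so that a missing or already-saved symlink stops it before anything is renamed. This is a sketch only: the function name promote_release is hypothetical, it omits sudo, and it assumes GNU mv for the -T option (which keeps mv from descending into the directory a symlink points at).

```shell
# promote_release DIR OLD NEW: in DIR, rename "jbrowse" to "jbrowse-OLD"
# and "jbrowse-NEW" to "jbrowse". (Hypothetical helper; GNU mv assumed.)
promote_release() {
    dir=$1 old=$2 new=$3
    (
        cd "$dir" || exit 1
        # Refuse to run unless both expected symlinks are in place.
        [ -L jbrowse ] && [ -L "jbrowse-$new" ] || exit 1
        # Do not clobber a previously saved symlink.
        [ ! -e "jbrowse-$old" ] && [ ! -L "jbrowse-$old" ] || exit 1
        mv -T jbrowse "jbrowse-$old" && mv -T "jbrowse-$new" jbrowse
    )
}
```

For example, `promote_release /usr/local/wormbase 253 254`; reverting means renaming the two symlinks back by hand.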

Other items that might need updating

  • organisms.conf -- this file contains the configuration information that JBrowse uses to create the drop-down menu for the different organisms' JBrowse instances. The entries look like this:
 [datasets.a_ceylanicum_PRJNA231479]
 name   = a_ceylanicum_PRJNA231479
 url    = ?data=data/a_ceylanicum_PRJNA231479

So if an organism's "name" changes (that is, including the "PRJ..." part), then this file will need to be updated to reflect the new name and path. I should write a post-build validator that checks that each of these paths exists, so it can warn if something is missing. Also, there's no reason the "name" attribute can't be changed to something a little more human readable (like "A. ceylanicum (BioProject PRJNA231479)"); I just haven't bothered to do it. That's the text that ends up in the drop-down menu.

  • jbrowse.conf -- placeholder; I'm not sure if this will need to be modified after the build or not.
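A post-build check of the paths in organisms.conf could look something like the following. It is only a sketch of the idea, not an existing script: the function name check_organisms is hypothetical, and it assumes every "url" entry has the form "?data=<path>" with the path relative to the JBrowse root directory and free of whitespace.

```shell
# check_organisms CONF ROOT: report every "url = ?data=<path>" entry in
# CONF whose <path> is not a directory under ROOT. Returns non-zero if
# anything is missing. (Hypothetical helper.)
check_organisms() {
    conf=$1 root=$2 status=0
    # Pull the relative data path out of each url line.
    for rel in $(sed -n 's/^[[:space:]]*url[[:space:]]*=[[:space:]]*?data=//p' "$conf"); do
        if [ ! -d "$root/$rel" ]; then
            echo "MISSING DATA DIR: $rel"
            status=1
        fi
    done
    return $status
}
```

For example, `check_organisms organisms.conf /path/to/jbrowse` (path illustrative) prints nothing and succeeds when every listed organism has a data directory.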