Difference between revisions of "Software Life Cycle: 1. Updating The Development Server"

'''THIS DOCUMENT IS NOW DEPRECATED.  PLEASE REFER TO THE PROJECT DOCUMENTATION MAINTAINED ON GOOGLE DRIVE.'''
https://docs.google.com/a/wormbase.org/document/d/1oPpj8d5gibUc-gpUZorl6ETT5baE6mp-v2bMedKauiA/edit#
= Overview =
This document describes the process of staging a new release of WormBase on the development server. The process is fully automated; each step is described in more detail below to assist in troubleshooting.

The automated staging pipeline consists of:

* a harness that handles logging, error trapping, and basic shared functions
* a suite of modules -- one per step -- that either implement the step directly or call helper scripts
* helper scripts in Perl or shell that assist in implementation

You can use the pipeline in several ways:

* Launch the full pipeline via the control script. This is the preferred, fully automated method.
* Run individual steps in the context of the pipeline using the control scripts in steps/. This is useful if the pipeline fails at a specific point.
* Run helper scripts directly, outside of the logging facilities of the pipeline. This is useful if you need to rebuild something quickly.
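The division of labor between harness and steps can be sketched as follows. This is an illustrative reduction only, not the actual WormBase harness code; the step names, log layout, and `run_step` function are all hypothetical.

```shell
#!/bin/sh
# Minimal sketch of a staging harness: per-step logs, a master log, and
# error trapping, assuming a layout like logs/staging/WSXXX/<step>/.
# Everything here is illustrative, not the real pipeline's API.
RELEASE=WS230
LOGDIR=$(mktemp -d)/staging/$RELEASE

run_step() {
    step=$1; shift
    mkdir -p "$LOGDIR/$step"
    echo "INFO: begin $step" >> "$LOGDIR/master.log"
    # Run the step, capturing STDOUT and STDERR separately.
    if "$@" >> "$LOGDIR/$step/step.log" 2>> "$LOGDIR/$step/step.err"; then
        echo "INFO: end $step" >> "$LOGDIR/master.log"
    else
        echo "ERROR: $step failed" | tee -a "$LOGDIR/master.log" >> "$LOGDIR/master.err"
    fi
}

run_step create_directories mkdir -p "$LOGDIR/../scratch"
run_step failing_step false
grep -c ERROR "$LOGDIR/master.err"   # -> 1
```

Running steps through a wrapper like this is what lets a failed step be re-run individually later without losing its log history.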
  
 
= Document Conventions =
 
= Staging Pipeline Code =

The update pipeline code is available in the website-admin module on GitHub:
  
 
  tharris> git clone git@github.com:WormBase/website-admin.git
  tharris> cd website-admin/update

   lib/             -- the shared library suite that handles updates.
   staging/         -- code related to staging data on the development site.
   production/      -- code related to releasing data/code into production.
  
 
The contents are:
 
= Running the Update Pipeline =
  
== Log Files ==

The staging pipeline creates informative logs for each step of the process. Logs are located at:

  /usr/local/wormbase/logs/staging/WSXXX
      master.log  -- master log tracking all steps; useful for a meta-view of the pipeline. Contains INFO, WARN, ERROR, and FATAL messages.
      master.err  -- master error log tracking ERROR and FATAL messages encountered across all steps.

Each individual step also creates its own log files capturing STDOUT and STDERR, with informative messages from the pipeline. These are useful for tracking progress and triaging problems. For example:

  /usr/local/wormbase/logs/staging/WSXXX/build_blast_databases/
      step.log  -- step-specific log tracking everything from TRACE on up.
      step.err  -- step-specific error log tracking ERROR and FATAL messages. A good place to check if a step breaks.
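A quick way to triage a failed run is to look for steps whose error log is non-empty. The sketch below illustrates this against a fabricated log directory; substitute the real /usr/local/wormbase/logs/staging/WSXXX path in practice.

```shell
# Triage sketch: list any step whose step.err is non-empty.
# The directory layout mirrors the one described above; the data is fabricated.
LOGS=$(mktemp -d)                 # stand-in for /usr/local/wormbase/logs/staging/WSXXX
mkdir -p "$LOGS/build_blast_databases" "$LOGS/unpack_acedb"
echo "ERROR: xdformat not found" > "$LOGS/build_blast_databases/step.err"
: > "$LOGS/unpack_acedb/step.err"  # empty: the step succeeded

for err in "$LOGS"/*/step.err; do
    # -s is true when the file exists and has a size greater than zero
    [ -s "$err" ] && echo "FAILED: $(basename "$(dirname "$err")")"
done
```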
  
== Executing the Pipeline ==

A single script fires off all steps of the process. You should run it inside a screen session.

  tharris> screen
  tharris> ./stage_via_pipeline.pl WSXXX
    (to disconnect your screen)
  tharris> ^a ^d
    (to resume your screen)
  tharris> screen -r
 
 
Monitor progress of the update by following the master log file:

  tharris> tail -f /usr/local/wormbase/logs/staging/WSXXX/master.log

  [http://kb.iu.edu/data/acuy.html screen command reference]
 
  
== Update Steps ==

The steps that comprise the pipeline, the script that launches each one, and the module that implements it are listed below.

<table border="1" width="100%">
<tr><th>step</th><th>control script</th><th>module</th></tr>

<tr>
<td>Mirror a new release</td>
<td>steps/mirror_new_release.pl (manual)</td>
<td>W::U::Staging::MirrorNewRelease</td>
</tr>

<tr>
<td>Unpack ACeDB</td>
<td>steps/unpack_acedb.pl (manual)</td>
<td>W::U::Staging::UnpackAcedb</td>
</tr>

<tr>
<td>Create BLAST databases</td>
<td>steps/create_blast_databases.pl</td>
<td>W::U::Staging::CreateBlastDatabases</td>
</tr>

<tr>
<td>Create BLAT databases</td>
<td>steps/create_blat_databases.pl</td>
<td>W::U::Staging::CreateBlatDatabases</td>
</tr>

<tr>
<td>Load genomic GFF databases</td>
<td>steps/load_genomic_gff_databases.pl</td>
<td>W::U::Staging::LoadGenomicGFFDatabases</td>
</tr>

<tr>
<td>Unpack and load the ClustalW database</td>
<td>steps/unpack_clustalw_database.pl</td>
<td>W::U::Staging::UnpackClustalWDatabase</td>
</tr>

<tr>
<td>Compile Gene Summary resources</td>
<td>steps/compile_gene_resources.pl</td>
<td>W::U::Staging::CompileGeneResources</td>
</tr>

<tr>
<td>Compile Ontology resources</td>
<td>steps/compile_ontology_resources.pl</td>
<td>W::U::Staging::CompileOntologyResources</td>
</tr>

<tr>
<td>Compile Orthology resources</td>
<td>steps/compile_orthology_resources.pl</td>
<td>W::U::Staging::CompileOrthologyResources</td>
</tr>

<tr>
<td>Create commonly requested datasets</td>
<td>steps/dump_annotations.pl</td>
<td>W::U::Staging::DumpAnnotations</td>
</tr>

<tr>
<td>Go Live</td>
<td>steps/go_live.pl</td>
<td>W::U::Staging::GoLive</td>
</tr>

<tr>
<td>Convert GFF2 to GFF3</td>
<td>steps/convert_gff2togff3.pl</td>
<td>W::U::Staging::ConvertGFF2ToGFF3</td>
</tr>

<tr>
<td>Precache content</td>
<td>steps/precache_content.pl</td>
<td>W::U::Staging::PrecacheContent</td>
</tr>
</table>
  
Each step is described in more detail below.
=== Purge old releases ===

Clear out disk space by throwing away old releases:

  ./steps/purge_old_releases.sh WSXXX   # release to purge; clears out acedb, mysql, support databases, and the staging FTP site
  
=== Mirror a new release ===

New releases are mirrored directly from the Hinxton FTP site to the primary WormBase FTP site hosted on wb-dev:/usr/local/ftp. This process is run via cron but can also be run manually.

<pre>
# Via cron, mirror the next incremental release newer than what we already have:
./steps/mirror_new_release.pl

# Or mirror a specific release to /usr/local/ftp/pub/wormbase/releases/WS150:
./steps/mirror_new_release.pl --release WS150
</pre>
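When run without --release, the mirror step has to work out the next incremental WSNNN number itself. The arithmetic can be sketched as below; the real logic lives in mirror_new_release.pl, and the release number here is just an example.

```shell
# Sketch of the "next incremental release" arithmetic the mirror step performs.
current=WS229            # e.g. the newest release already on the local FTP site
num=${current#WS}        # strip the WS prefix -> 229
next=WS$((num + 1))      # -> WS230
echo "$next"             # prints: WS230
```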
  
=== Unpack AceDB ===

Unpack AceDB from the new release. Customize the new installation with the skeletal files located at /usr/local/wormbase/website/classic/wspec. You will need approximately 25 GB of disk space per release.

  via pipeline : ./steps/unpack_acedb.pl ${RELEASE}
  via helper   : helpers/unpack_acedb.sh ${RELEASE}
  Input  : Files staged at ${FTP}/releases/${RELEASE}/species
  Output : Unpacked AceDB files in ${ACEDB}/wormbase_${RELEASE}

When complete, you should have a new acedb directory containing:

    -- database
    -- wgf
    -- wquery
    -- wspec

Test the database by restarting the server and connecting to it:

  > ps -ax | grep acedb             ## get the acedb process number
  > kill -9 {AceDB proc number}     ## stop the current acedb process
  > sudo /etc/init.d/xinetd restart
  > saceclient localhost -port 2005
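A sanity check on the unpacked directory can be scripted. This sketch only verifies that the four subdirectories listed above exist; the scratch path stands in for ${ACEDB}/wormbase_WSXXX.

```shell
# Sanity-check sketch for a freshly unpacked AceDB directory: the four
# subdirectories listed above must all exist. Paths here are fabricated.
ACEDB=$(mktemp -d)/wormbase_WS230
mkdir -p "$ACEDB/database" "$ACEDB/wgf" "$ACEDB/wquery" "$ACEDB/wspec"

ok=yes
for d in database wgf wquery wspec; do
    [ -d "$ACEDB/$d" ] || { echo "missing: $d"; ok=no; }
done
echo "unpack check: $ok"
```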
  
=== Create BLAST databases ===

Build nucleotide and protein BLAST databases for species with genomic sequence and conceptual translations. In addition, for C. elegans and C. briggsae, we build BLAST databases for ESTs and "genes" (actually clones).

  Usage  : ./steps/create_blast_databases.pl ${RELEASE}
  Input  : Genomic sequence and protein FASTA files staged at:
              ${FTP}/releases/species/${SPECIES}.${RELEASE}.genomic.fa.gz
              ${FTP}/releases/species/${SPECIES}.${RELEASE}.protein.fa.gz
              Gene and EST sequences derived from AceDB
  Output : BLAST databases in ${WORMBASE}/databases/${RELEASE}/blast/${SPECIES}
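The per-species loop that this step performs can be sketched as a dry run. The BLAST formatter invocation varies by installation, so the command is only echoed here; the species names and staging paths are fabricated for illustration.

```shell
# Dry-run sketch of the per-species BLAST build loop described above.
FTP=$(mktemp -d)
RELEASE=WS230
for sp in c_elegans c_briggsae; do
    touch "$FTP/$sp.$RELEASE.genomic.fa.gz"   # fabricated staged FASTA files
done

for fasta in "$FTP"/*.genomic.fa.gz; do
    # Recover the species name by stripping the release/type suffix.
    species=$(basename "$fasta" ".$RELEASE.genomic.fa.gz")
    echo "would build: blast/$species from $fasta"
done
```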
=== Create BLAT databases ===

Build BLAT databases of genomic sequence.

  Usage  : ./steps/create_blat_databases.pl ${RELEASE}
  Input  : Genomic sequence FASTA files staged at
              ${FTP}/releases/species/${SPECIES}/${SPECIES}.${RELEASE}.genomic.fa.gz
  Output : BLAT .nib files in ${WORMBASE}/databases/${RELEASE}/blat/${SPECIES}
=== Load genomic GFF annotations ===

Convert GFF files into Bio::DB::GFF (GFF2) or Bio::DB::SeqFeature::Store (GFF3) databases.

  Usage : ./steps/load_genomic_gff_databases.pl ${RELEASE}
  Input : GFF and FASTA files staged at:
            GFF : ${FTP}/releases/species/${SPECIES}/${SPECIES}.${RELEASE}.gff[2|3].gz
            DNA : ${FTP}/releases/species/${SPECIES}/${SPECIES}.${RELEASE}.genomic.fa.gz
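The GFF2-versus-GFF3 dispatch implied above can be sketched as a choice of loader based on the staged file's extension. The loader names are the standard BioPerl load scripts, used here as stand-ins; the dispatch function itself is hypothetical.

```shell
# Sketch of choosing a loader per staged file: Bio::DB::GFF for GFF2,
# Bio::DB::SeqFeature::Store for GFF3.
pick_loader() {
    case $1 in
        *.gff2.gz) echo "bp_bulk_load_gff.pl" ;;     # Bio::DB::GFF loader
        *.gff3.gz) echo "bp_seqfeature_load.pl" ;;   # Bio::DB::SeqFeature::Store loader
        *)         echo "unknown" ;;
    esac
}

pick_loader c_elegans.WS230.gff2.gz    # -> bp_bulk_load_gff.pl
pick_loader c_japonica.WS230.gff3.gz   # -> bp_seqfeature_load.pl
```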
=== Unpack and load the CLUSTALW database ===

  Usage  : ./steps/unpack_clustal_database.pl ${RELEASE}
  Input  : ${FTP}/releases/${RELEASE}/COMPARATIVE_ANALYSIS/wormpep${RELEASE}.clw.sql.bz2
  Output : a new MySQL database called clustal_${RELEASE}
== Pre-compiled file documentation for Beta ==

NOTE: there is no longer any need to pre-compile the Gene and Interaction data resources.

=== Compile Ontology Resources ===

TODO: This step just copies files over; it could perhaps be avoided, but there may be a file-access issue.

  In WormBase.conf:
    association_count_file = /usr/local/wormbase/databases/%s/ontology/%s_association.%s.wb

Copy over the mirrored ontology files from the FTP directory. These files are used to calculate the counts of associated terms in the Ontology Browser.

  Usage : ./steps/compile_ontology_resources.pl ${RELEASE}
  Input : WB files staged at /usr/local/ftp/pub/wormbase/releases/WSXXX/ONTOLOGY:

*anatomy_association.RELEASE.wb
*gene_association.RELEASE.wb
*phenotype_association.RELEASE.wb

  Output : to ${WORMBASE}/databases/${RELEASE}/ontology:

*anatomy_association.RELEASE.wb
*gene_association.RELEASE.wb
*phenotype_association.RELEASE.wb
  
=== Compile Orthology Resources ===

  Usage : ./steps/compile_orthology_resources.pl ${RELEASE}
  Input : omim.txt and morbidmap files from OMIM
  Intermediate : gene2omim.txt (queries AceDB and builds the worm gene to human disease relationship based on the gene's human ortholog)
  Output : files in ${WORMBASE}/databases/${RELEASE}/orthology
      Disease.ace (loaded into Xapian; used for disease search and the disease page)
  
 
=== Compile Gene Resources ===

BROKEN

Create precompiled gene page files, specifically to populate the phenotype tables.

  Output : Files in ${WORMBASE}/databases/${RELEASE}/gene

*gene_rnai_pheno.txt (gene/gene)
*gene_xgene_pheno.txt (gene/gene)
*phenotype_id2name.txt (gene/gene)
*rnai_data.txt (gene/gene)
*variation_data.txt (gene/gene)
  
=== Compile Ontology Resources ===

TODO: This step relies on a number of external helper scripts that should ALL be folded into CompileGeneResources. They are located at staging/helpers/gene_summary.

Take the mirrored ontology files and compile them into the databases for the ontology searches.

  Usage : ./steps/compile_ontology_resources.pl ${RELEASE}
  Input : OBO files staged at /usr/local/ftp/pub/wormbase/releases/WSXXX/ONTOLOGY;
          compiled data files from the Compile Gene Resources step

*anatomy_association.RELEASE.wb
*anatomy_ontology.RELEASE.obo
*gene_association.RELEASE.wb
*gene_association.RELEASE.wb.cb
*gene_association.RELEASE.wb.ce
*gene_association.RELEASE.wb.cjp
*gene_association.RELEASE.wb.ppa
*gene_association.RELEASE.wb.rem
*gene_ontology.RELEASE.obo
*phenotype_association.RELEASE.wb
*phenotype_ontology.RELEASE.obo

  Output : to ${WORMBASE}/databases/${RELEASE}/ontology:

*id2association_counts.txt (ontology/tree_lister)
*id2name.txt (ontology/tree_lister)
*id2parents.txt (ontology/tree_lister)
*id2total_associations.txt (ontology/tree_lister)
*name2id.txt
*parent2ids.txt (ontology/tree_lister)
*search_data.txt
  
 
=== Compile Orthology Resources ===

Create precompiled orthology and disease display and search related files. This MUST be run after the ontology step above.

  Usage : ./steps/compile_orthology_resources.pl ${RELEASE}
  Input : AceDB data, omim.txt and morbidmap files from OMIM, ontology resource files

*gene_association.$RELEASE.wb.ce

  Intermediate:

*all_proteins.txt
*disease_page_data.txt
*full_disease_data.txt
*gene_id2go_bp.txt
*gene_id2go_mf.txt
*gene_id2phenotype.txt
*gene_list.txt
*hs_proteins.txt
*last_processed_gene.txt
*omim2disease.txt
*omim_id2disease_synonyms.txt
*omim_id2go_ids.txt
*omim_id2phenotypes.txt
*omim_reconfigured.txt
*ortholog_other_data.txt
*ortholog_other_data_hs_only.txt

  Output : Files in ${WORMBASE}/databases/${RELEASE}/orthology (consuming page given in parentheses)

*disease_search_data.txt (orthology/search)
*gene_id2omim_ids.txt (orthology/disease)
*go_id2omim_ids.txt (orthology/disease, ontology/gene)
*go_ids2descendants.txt (orthology/gene)
*hs_ensembl_id2omim.txt (orthology/gene)
*id2name.txt (orthology/disease, orthology/gene)
*name2id.txt (orthology/disease)
*omim_id2all_ortholog_data.txt (orthology/disease)
*omim_id2disease_desc.txt (orthology/disease)
*omim_id2disease_name.txt (orthology/disease, ontology/gene)
*omim_id2disease_notes.txt (orthology/disease)
*omim_id2disease.txt (orthology/gene)
*omim_id2gene_name.txt (orthology/search)
  
 
=== Compile Interaction Data ===

DEPRECATED. NO NEED TO MIGRATE THIS INTO THE NEW STAGING PIPELINE.

Create precompiled gene page files specifically to populate interaction listing pages.

*compiled_interaction_data.txt
  
=== Convert GFF2 into GFF3 ===

  Usage : ./steps/convert_gff2_to_gff3.pl ${RELEASE}

=== Create files of commonly requested datasets ===

  Usage  : ./steps/dump_annotations.pl ${RELEASE}
  Output : datasets in ${FTP}/releases/${RELEASE}/annotations and species/annotations

The staging harness will automatically run the scripts in annotation_dumpers/*. These scripts should abide by the following conventions:

    1. Be located in update/staging/annotation_dumpers.
    2. Be named either
          dump_species_*   for species-level data (like brief IDs)
          dump_resource_*  for resource-level data (like laboratories)
    3. Follow existing examples, including available parameters.
    4. Dump to STDERR and STDOUT.

    Notes:
    1. dump_species_* will be called for each species managed by WormBase
       and will end up in
          ${FTP_ROOT}/releases/[RELEASE]/species/[G_SPECIES]/annotation/[G_SPECIES].[RELEASE].[DESCRIPTION].txt
       dump_resource_* will be called once and end up in
          ${FTP_ROOT}/datasets-wormbase/wormbase.[RELEASE].[DESCRIPTION].txt
    2. The filename will be created by stripping off dump_species_ or dump_resource_.
       Species-specific resources will be prepended with the appropriate species.
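The naming convention above can be sketched as a small function. The `name_for` helper, the dumper names, and the species/release values are all illustrative, not part of the real harness.

```shell
# Sketch of the dumper-to-filename convention: strip the dump_species_ /
# dump_resource_ prefix, then build the output filename accordingly.
RELEASE=WS230
G_SPECIES=c_elegans

name_for() {
    script=$1
    case $script in
        dump_species_*)  echo "$G_SPECIES.$RELEASE.${script#dump_species_}.txt" ;;
        dump_resource_*) echo "wormbase.$RELEASE.${script#dump_resource_}.txt" ;;
    esac
}

name_for dump_species_brief_ids       # -> c_elegans.WS230.brief_ids.txt
name_for dump_resource_laboratories   # -> wormbase.WS230.laboratories.txt
```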
=== Create a GBrowse-driven genetic map ===

Notes...

  Usage : ./steps/load_gmap_gffdb.pl ${RELEASE}
=== Go Live ===

  steps/go_live.pl WSXXX

This script will:
* create a series of symlinks in the FTP site (for example, to maintain the virtually organized species/ directory)
* create "current" symlinks in the FTP site for easy access
* adjust symlinks to the MySQL GFF databases updated this release
* adjust the symlink at /usr/local/wormbase/acedb/wormbase to point to the newly unpacked wormbase_WSXXX acedb
* sync the staging FTP site to the production FTP site

If you omit the WSXXX on the command line, the script will simply organize the virtual directories on the FTP site up to and including the current release. MySQL and AceDB symlinks will not be created.
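The AceDB symlink adjustment performed at go-live time amounts to a symlink flip. This sketch reproduces it in a scratch directory instead of /usr/local/wormbase/acedb; the release numbers are examples.

```shell
# Sketch of the acedb/wormbase -> wormbase_WSXXX symlink flip.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/wormbase_WS229" "$ROOT/wormbase_WS230"
ln -s "$ROOT/wormbase_WS229" "$ROOT/wormbase"   # current release

# -sfn replaces the existing link in place rather than descending into it.
ln -sfn "$ROOT/wormbase_WS230" "$ROOT/wormbase"
readlink "$ROOT/wormbase"                        # now points at wormbase_WS230
```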
=== Branch the web code ===

For each major WS release, create a corresponding branch in the git repository. We branch the moment a new release is staged so that we can begin development against that model. This can be done from any repository.

  staging> cd /usr/local/wormbase/website/production
  staging> git pull
  // Create a tag:
  // staging> git tag -a -m "WSXXX" WSXXX HEAD
  // staging> git push --tags
  // Create a new branch tracking the remote master repository:
  staging> git branch --track WSXXX origin/master
  staging> git branch          // list all branches
  // Push the branch to the remote repository:
  staging> git push origin WSXXX
  staging> git push
  
= Steps to execute after a release has been staged =

== Precache content ==

Once a release has been successfully staged and tested, we pre-cache select computationally intensive content to a CouchDB instance located on the development server.

Precaching works as follows.

1. The primary Catalyst configuration file is read.

2. For each widget with "precache = true" set in the configuration, REST requests are constructed against staging.wormbase.org, which runs the NEW version of WormBase.

3A. The webapp returns HTML; the precache script stores it in the reference (production) CouchDB.

    OR

3B. The web app on staging.wormbase.org will automatically cache the result in the reference CouchDB (currently web6); the CouchDB that is written to can be configured in wormbase.conf.

4. The reference CouchDB is then replicated during the production release to each node, scaling horizontally.

5. During a production cycle, additional content is stored in the reference CouchDB; this is synced periodically to each node.

See [[Administration:WormBase_Production_Environment#CouchDB|CouchDB]] for details.
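Step 2 above (finding the widgets to precache) can be sketched as a config scan. The config snippet, widget names, gene ID, and REST URL shape below are fabricated for illustration; the real widget list comes from the primary Catalyst configuration file.

```shell
# Sketch: scan a Catalyst-style config for widgets marked precache = true
# and emit the REST URLs that would be requested against staging.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
<widget overview>
    precache = true
</widget>
<widget history>
    precache = false
</widget>
<widget expression>
    precache = true
</widget>
EOF

URLS=$(awk '/<widget/ { gsub(/[<>]/, ""); w = $2 }
            /precache *= *true/ { print "http://staging.wormbase.org/rest/widget/gene/WBGene00006763/" w }' "$CONF")
echo "$URLS"
```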
  
== Purge old releases ==

To purge previous releases from the production and staging nodes:

  staging/steps/purge_old_releases.pl --release WSXXX

This will remove the following:

  /usr/local/wormbase/acedb/wormbase_WSXXX
  /usr/local/wormbase/databases/WSXXX
  /usr/local/mysql/data/WSXXX

And on the staging host:

  /usr/local/ftp/pub/wormbase/releases/WSXXX
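Deciding which releases are purgeable can be scripted. This sketch keeps only the newest release directory and lists the rest; the directory names and the keep-one policy are illustrative, not the script's actual behavior.

```shell
# Sketch: list release directories that are candidates for purging,
# keeping the newest one. Fabricated data directory for illustration.
DATA=$(mktemp -d)
mkdir -p "$DATA/WS228" "$DATA/WS229" "$DATA/WS230"

# Sort release directories by name and drop the last (newest) entry.
purgeable=$(ls "$DATA" | sort | sed '$d')
echo "$purgeable"
```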
  
== Compiled file documentation and plans ==

<table border="1" width="100%">
<tr>
<th>Step</th>
<th>File</th>
<th>Description</th>
<th>WB2 update</th>
</tr>

<tr>
<td>Compile Gene Resources</td>
<td>gene_rnai_pheno.txt</td>
<td>
* many-to-many listing of Gene_ids to RNAi_ids and related phenotype IDs (if any)
* used in the classic gene summary for phenotype tables</td>
<td>TODO: update the appropriate method in Gene.pm to pull data for a given gene directly from Ace</td>
</tr>

<tr>
<td>Compile Gene Resources</td>
<td>gene_xgene_pheno.txt</td>
<td>
* many-to-many listing of Gene_ids to Transgene_ids and related phenotype IDs (if any)
* used in the classic gene summary for phenotype tables</td>
<td>TODO: update the appropriate method in Gene.pm to pull data for a given gene directly from Ace</td>
</tr>

<tr>
<td>Compile Gene Resources</td>
<td>phenotype_id2name.txt</td>
<td>
* listing of Phenotype_ids to Phenotype names
* used in the classic gene summary for phenotype tables in order to obviate the extraction of individual phenotype objects and their names</td>
<td>TODO: this function will be deprecated, since individual phenotype objects will be extracted</td>
</tr>

<tr>
<td>Compile Gene Resources</td>
<td>rnai_data.txt</td>
<td>
* listing of RNAi data for the RNAi table in gene/gene
* used in the classic gene summary for RNAi tables</td>
<td>TODO: update the appropriate method in Gene.pm to pull data for a given gene directly from Ace</td>
</tr>

<tr>
<td>Compile Gene Resources</td>
<td>variation_data.txt</td>
<td>
* many-to-many listing of Gene_ids to RNAi_ids and related phenotype IDs (if any)
* used in the classic gene summary for phenotype tables</td>
<td>TODO: update the appropriate method in Gene.pm to pull data for a given gene directly from Ace</td>
</tr>

<tr>
<td>Compile Ontology Resources</td>
<td>id2association_counts.txt</td>
<td>
* listing of ontology object ids (GO, Anatomy_term, Phenotype) to the number of annotations to the term
* used in tree_lister (browser)</td>
<td>retain for browser; move into a tied hash?</td>
</tr>

<tr>
<td>Compile Ontology Resources</td>
<td>id2name.txt</td>
<td>
* listing of ontology object ids (GO, Anatomy_term, Phenotype) to the term
* used in tree_lister (browser)</td>
<td>retain for browser; move into a tied hash?</td>
</tr>
</table>
 
  
===  Mirror annotation files from Sanger to the FTP site  ===
+
<tr>
 +
<td>Compile Ontology Resources </td>
 +
<td>id2parents.txt</td>
 +
<td>
 +
* one-to-many listing of ontology object ids (GO, Anatomy_term, Phenotype) to the parent terms and respective relationship
 +
* used in tree_lister (browser)
 +
<td>retain for browser, move into tied hash?</td>
 +
</tr>
  
Notes...
+
<tr>
 +
<td>Compile Ontology Resources </td>
 +
<td>id2total_associations.txt</td>
 +
<td>
 +
* listing of ontology object terms (GO, Anatomy_term, Phenotype) to the id
 +
* used in tree_lister (browser)
 +
<td>retain for browser, move into tied hash?</td>
 +
</tr>
  
  Usage: ./steps/mirror_annotations.pl {WSRELEASE}
+
<tr>
 +
<td>Compile Ontology Resources </td>
 +
<td>search_data.txt </td>
 +
<td>
 +
* pipe-delieneated data on each term including synonyms and annotations
 +
* Used in GO, Anatomy_term, and Phenotype searches
 +
<td>To be superceded by Xapian search</td>
 +
</tr>
  
== Compiled File Table ==
 
  
 +
<td>Compile Ontology Resources </td>
 +
<td>parent2ids.txt</td>
 +
<td>
 +
* one-to-many listing of ontology object ids (GO, Anatomy_term, Phenotype) to their immediate descendants term ids
 +
* used in tree_lister (browser)
 +
<td>retain for browser, move into tied hash?</td>
 +
</tr>
  
== Update Records ==
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>disease_search_data.txt </td>
 +
<td>
 +
* Pipe delineated file containing details on the diseases extracted from OMIM
 +
* used in disease search
 +
<td> Use data for Xapian search; work with Abby</td>
 +
</tr>
  
[[Update Matrix WS205]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>gene_id2omim_ids.txt</td>
 +
<td>
 +
* one-to-many listing of gene_ids to omim IDs
 +
* used in orthology/disease
 +
<td>Keep for disease object</td>
 +
</tr>
  
[[Update Matrix WS206]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>go_id2omim_ids.txt</td>
 +
<td>
 +
* one-to-many listing of gene_ids to omim IDs
 +
* used in orthology/disease and ontology/gene
 +
</td>
 +
<td>useful for further paralog data expansion and integration</td>
 +
</tr>
  
[[Update Matrix WS207]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>go_ids2descendants.txt</td>
 +
<td>
 +
* one-to-many listing of go ids to its list of the go ids of its descendants
 +
* plan was to use this data for paralog display in orthology/gene
 +
</td>
 +
<td>useful for further paralog data expansion and integration</td>
 +
</tr>
  
[[Update Matrix WS208]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>hs_ensembl_id2omim.txt</td>
 +
<td>
 +
* one-to-one listing of hs ensembl ids to omim ids
 +
* used in orthology/gene
 +
</td>
 +
<td>disease UI</td>
 +
</tr>
  
[[Update Matrix WS209]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>id2name.txt </td>
 +
<td>
 +
* listing of ontology object ids (GO, Anatomy_term, Phenotype) to the term
 +
* used in orthology/disease & orthology/gene
 +
</td>
 +
<td>useful for further paralog data expansion and integration(?)</td>
 +
</tr>
  
[[Update Matrix WS210]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>name2id.txt</td>
 +
<td>
 +
* listing of ontology object terms(GO, Anatomy_term, Phenotype) to the id
 +
* used in orthology/disease
 +
</td>
 +
<td>useful for further paralog data expansion and integration(?)</td>
 +
</tr>
  
[[Update Matrix WS211]]
 
  
[[Update Matrix WS212]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>omim_id2all_ortholog_data.txt </td>
 +
<td>
 +
* pipe delineated file containing details of the ortholog associated with the omim id
 +
* used in orthology/disease
 +
</td>
 +
<td>use to generate Xapian data; work with Abby</td>
 +
</tr>
  
[[Update Matrix WS213]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>omim_id2disease_desc.txt </td>
 +
<td>
 +
* one-to-one listing of omim ids and the disease description
 +
* used in orthology/disease
 +
</td>
 +
<td>use in Disease object model and UI</td>
 +
</tr>
  
[[Update Matrix WS214]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>omim_id2disease_name.txt</td>
 +
<td>
 +
* one-to-one listing of omim ids and the disease name
 +
* used in orthology/disease
 +
</td>
 +
<td>use in Disease object model and UI</td>
 +
</tr>
  
[[Update Matrix WS215]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>omim_id2disease_notes.txt </td>
 +
<td>
 +
* one-to-one listing of omim ids and the disease notes from omim
 +
* used in orthology/disease
 +
</td>
 +
<td>use in Disease object model and UI</td>
 +
</tr>
  
[[Update Matrix WS216]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>omim_id2disease.txt </td>
 +
<td>
 +
* one-to-one listing of omim ids and the disease names
 +
* used in orthology/disease
 +
</td>
 +
<td>use in Disease object model and UI</td>
 +
</tr>
  
[[Update Matrix WS217]]
+
<tr>
 +
<td>Compile Orthology Resources</td>
 +
<td>omim_id2gene_name.txt</td>
 +
<td>
 +
* one-to-many listing of omim ids to gene names
 +
* used in orthology/search
 +
</td>
 +
<td>probably deprecate in updating Disease object model</td>
 +
</tr>
  
[[Update Matrix WS220]]
 
  
[[Category:Developer documentation]]
+
</table>

Latest revision as of 03:11, 28 October 2013

THIS DOCUMENT IS NOW DEPRECATED. PLEASE REFER TO THE PROJECT DOCUMENTATION MAINTAINED ON GOOGLE DRIVE.

https://docs.google.com/a/wormbase.org/document/d/1oPpj8d5gibUc-gpUZorl6ETT5baE6mp-v2bMedKauiA/edit#


Overview

This document describes the process of staging a new release of WormBase on the development server.

The automated staging pipeline consists of:

  • a harness that handles logging, error trapping, and basic shared functions
  • a suite of modules -- one per step -- that implement the step or make calls to helper scripts
  • helper scripts in Perl or shell that assist in implementation

You can control the pipeline in several ways:

  • Launch the full pipeline via the control script, the preferred and automated method.
  • Run individual steps in the context of the pipeline using control scripts in steps/, useful if the pipeline fails at a specific point.
  • Directly run helper scripts outside of the logging facilities of the pipeline, useful if you need to rebuild something quickly.
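Concretely, the three modes look like this. This is a dry-run sketch: the commands are printed rather than executed, WS230 is a stand-in for the release at hand, and the step/helper names are examples drawn from the sections below.

```shell
# Dry-run sketch: print the command for each mode of running the pipeline.
RELEASE=WS230   # substitute the release being staged

echo "./stage_via_pipeline.pl $RELEASE"            # 1. full pipeline via the control script
echo "./steps/create_blast_databases.pl $RELEASE"  # 2. a single step, with pipeline logging
echo "helpers/unpack_acedb.sh $RELEASE"            # 3. a helper script directly, no logging
```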

Document Conventions

The current development server is

 wb-dev: wb-dev.oicr.on.ca (FQDN); aka: dev.wormbase.org

Where indicated, substitute the actual release ID (e.g., WS220) for WSXXX or ${RELEASE}.

System paths referred to in this document:

      FTP : /usr/local/ftp/pub/wormbase
 WORMBASE : /usr/local/wormbase
    ACEDB : /usr/local/wormbase/acedb

Staging Pipeline Code

The update pipeline code is available in the website-admin module on github:

tharris> git clone git@github.com:WormBase/website-admin.git
tharris> cd website-admin/update

 lib/         -- the shared library suite that handles updates
 staging/     -- code related to staging data on the development site
 production/  -- code related to releasing data/code into production

The contents are:

  logs/           -- logs directory for each step of the update
  bin/            -- Perl scripts for manually launching individual steps
  README.txt      -- directory listing
  updatelog.conf  -- configuration file for the update process
  update.sh       -- master script that fires off each step of the pipeline
  util/           -- various helper scripts for the update process

Running the Update Pipeline

Log Files

The Staging Pipeline creates informative logs for each step of the process. Logs are located at:

 /usr/local/wormbase/logs/staging/WSXXX
     master.log  -- Master log tracks all steps; useful for a meta-view of the pipeline. Contains INFO, WARN, ERROR, and FATAL messages.
     master.err  -- Master error log tracks ERROR and FATAL messages encountered across all steps.

Each individual step creates its own log file capturing STDERR and STDOUT containing informative messages from the pipeline. These are useful for tracking progress and triaging problems. For example:

 /usr/local/wormbase/logs/staging/WSXXX/build_blast_databases/
      step.log   -- step-specific log tracking everything from TRACE on up.
      step.err    -- step-specific error log tracking ERROR and FATAL messages. Good place to check if a step breaks.
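When a step breaks, the quickest triage is to pull the actionable messages out of the logs. A throwaway sample log is created below for illustration (the message text is invented); on wb-dev you would run the grep against the real files under /usr/local/wormbase/logs/staging/WSXXX.

```shell
# Triage sketch: grep the master log for actionable messages.
# A sample log is fabricated here so the example is self-contained.
LOGDIR=$(mktemp -d)
cat > "$LOGDIR/master.log" <<'EOF'
INFO  [CreateBlastDatabases] building c_elegans nucleotide database
ERROR [CreateBlastDatabases] formatter exited non-zero
FATAL [CreateBlastDatabases] step aborted
EOF

# Only ERROR and FATAL lines need attention; INFO is progress chatter.
grep -E 'ERROR|FATAL' "$LOGDIR/master.log"
```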

Executing the Pipeline

A single script fires off all steps of the process. You should run it inside a screen session so the update survives disconnects.

 tharris> screen
 tharris> ./stage_via_pipeline.pl WSXXX
   (to disconnect your screen)
 tharris> ^a ^d
   (to resume your screen)
 tharris> screen -r   
 

Monitor progress of the update by following the master log file:

tharris> tail -f /usr/local/wormbase/logs/staging/WSXXX/master.log
 See the screen command reference for help managing screen sessions.

Update Steps

The steps that comprise the pipeline, the script that launches each one, and the module that implements it are listed below.

 Step                                    Control script                         Module
 Mirror a new release                    steps/mirror_new_release.pl (manual)   W::U::Staging::MirrorNewRelease
 Unpack AceDB                            steps/unpack_acedb.pl (manual)         W::U::Staging::UnpackAcedb
 Create BLAST databases                  steps/create_blast_databases.pl        W::U::Staging::CreateBlastDatabases
 Create BLAT databases                   steps/create_blat_databases.pl         W::U::Staging::CreateBlatDatabases
 Load Genomic GFF databases              steps/load_genomic_gff_databases.pl    W::U::Staging::LoadGenomicGFFDatabases
 Unpack and Load the ClustalW database   steps/unpack_clustalw_database.pl      W::U::Staging::UnpackClustalWDatabase
 Compile Gene Summary resources          steps/compile_gene_resources.pl        W::U::Staging::CompileGeneResources
 Compile Ontology resources              steps/compile_ontology_resources.pl    W::U::Staging::CompileOntologyResources
 Compile Orthology resources             steps/compile_orthology_resources.pl   W::U::Staging::CompileOrthologyResources
 Create commonly requested datasets      steps/dump_annotations.pl              W::U::Staging::DumpAnnotations
 Go Live                                 steps/go_live.pl                       W::U::Staging::GoLive
 Convert GFF2 To GFF3                    steps/convert_gff2togff3.pl            W::U::Staging::ConvertGFF2ToGFF3
 Precache content                        steps/precache_content.pl              W::U::Staging::PrecacheContent


  • Compile orthology resources
  • Compile interaction resources
  • Build and load GFF patches
  • Create a GBrowse-driven genetic map
  • Create a GBrowse-driven physical map


Purge old releases

Clear out disk space by throwing away old releases.

./steps/purge_old_releases.sh WSXXX   // release to purge; clears out acedb, mysql, support DBs, and staging FTP

Mirror a new release

New releases are mirrored directly from the Hinxton FTP site to the primary WormBase FTP site hosted on wb-dev:/usr/local/ftp. This process is run via cron but can also be run manually.

 # Mirror the next incremental release newer than what we already have:
 # Cron: 
 ./steps/mirror_new_release.pl

 # Or mirror a specific release: 
 ./steps/mirror_new_release.pl --release WS150   // Mirror the WS150 release to /usr/local/ftp/pub/wormbase/releases/WS150

Unpack AceDB

Unpack AceDB from the new release. Customize the new installation with skeletal files located at /usr/local/wormbase/website/classic/wspec. You will need approximately 25 GB of disk space per release.

via pipeline: ./steps/unpack_acedb.pl ${RELEASE}
via helper : helpers/unpack_acedb.sh ${RELEASE}
  Input : Files staged at ${FTP}/releases/${RELEASE}/species
 Output : Unpacked AceDB files in ${ACEDB}/wormbase_${RELEASE} 

When complete, you should have a new acedb directory containing:

   -- database
   -- wgf
   -- wquery
   -- wspec

Test the database by:

> ps -ax | grep acedb ## to get acedb process number
> kill -9 {AceDB proc number} ## stop current acedb process
> sudo /etc/init.d/xinetd restart 
> saceclient localhost -port 2005

Create BLAST databases

Build nucleotide and protein BLAST databases for species with genomic sequence and conceptual translations. In addition, for C. elegans and C. briggsae, we build blast databases for ESTs and "genes" (actually clones).

  Usage : ./steps/create_blast_databases.pl ${RELEASE}
  Input : Genomic sequence and protein FASTA files staged at:
             ${FTP}/releases/${RELEASE}/species/${SPECIES}/${SPECIES}.${RELEASE}.genomic.fa.gz
             ${FTP}/releases/${RELEASE}/species/${SPECIES}/${SPECIES}.${RELEASE}.protein.fa.gz
             Gene and EST sequences derived from AceDB
 Output : BLAST databases in ${WORMBASE}/databases/${RELEASE}/blast/${SPECIES}.
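The per-species build boils down to decompressing the staged FASTA and indexing it. This dry-run sketch prints the commands rather than executing them; the actual helper may use different tooling (makeblastdb is the NCBI BLAST+ equivalent, shown here as an assumption).

```shell
# Dry-run sketch of one species' nucleotide build (printed, not executed).
RELEASE=WS230
SPECIES=c_elegans
FTP=/usr/local/ftp/pub/wormbase

SRC=$FTP/releases/$RELEASE/species/$SPECIES/$SPECIES.$RELEASE.genomic.fa.gz
echo "gunzip -c $SRC > $SPECIES.$RELEASE.genomic.fa"
echo "makeblastdb -in $SPECIES.$RELEASE.genomic.fa -dbtype nucl -title '$SPECIES $RELEASE genomic'"
```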

Create BLAT databases

Build BLAT databases of genomic sequence.

  Usage : ./steps/create_blat_databases.pl ${RELEASE}
  Input : Genomic sequence FASTA files staged at
             ${FTP}/releases/${RELEASE}/species/${SPECIES}/${SPECIES}.${RELEASE}.genomic.fa.gz
 Output : BLAT .nib files in ${WORMBASE}/databases/${RELEASE}/blat/${SPECIES}

Load genomic GFF annotations

Convert GFF files into Bio::DB::GFF (GFF2) or Bio::DB::SeqFeature::Store (GFF3) databases.

 Usage : ./steps/load_genomic_gff_databases.pl ${RELEASE}
 Input : GFF and FASTA files staged at:
           GFF : ${FTP}/releases/${RELEASE}/species/${SPECIES}/${SPECIES}.${RELEASE}.gff[2|3].gz
           DNA : ${FTP}/releases/${RELEASE}/species/${SPECIES}/${SPECIES}.${RELEASE}.genomic.fa.gz
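For the GFF3 case, the load amounts to creating a mysql database and feeding it the GFF and FASTA. This dry-run sketch prints the equivalent commands; bp_seqfeature_load.pl ships with BioPerl, and the database name is an assumption about the pipeline's naming convention.

```shell
# Dry-run sketch (printed, not executed) of loading one species' GFF3 into a
# Bio::DB::SeqFeature::Store mysql database.
RELEASE=WS230
SPECIES=c_elegans
DB=${SPECIES}_${RELEASE}   # assumed naming convention

echo "mysqladmin create $DB"
echo "bp_seqfeature_load.pl -c -d $DB $SPECIES.$RELEASE.gff3 $SPECIES.$RELEASE.genomic.fa"
```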

Unpack and Load the CLUSTALW database

Usage: ./steps/unpack_clustal_database.pl ${RELEASE}
Input: ${FTP}/releases/${RELEASE}/COMPARATIVE_ANALYSIS/wormpep${RELEASE}.clw.sql.bz2
Output: a new mysql database called clustal_${RELEASE}
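Since the input is a bzipped SQL dump, the unpack-and-load reduces to a decompress-and-pipe. A dry-run sketch (commands printed, not executed; mysql credentials are an assumption):

```shell
# Dry-run sketch: create the target database, then stream the dump into it.
RELEASE=WS230
echo "mysqladmin -u root create clustal_$RELEASE"
echo "bzcat wormpep$RELEASE.clw.sql.bz2 | mysql -u root clustal_$RELEASE"
```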

Pre-compiled file documentation for Beta

NOTE: Gene and Interaction data resources no longer need to be pre-compiled

Compile Ontology Resources

TODO: This step just copies files over; it could probably be skipped, but there may be a file-access issue.

 In Wormbase.conf 
 association_count_file = /usr/local/wormbase/databases/%s/ontology/%s_association.%s.wb
  

Copy over the mirrored ontology files from the FTP directory. These files are used to calculate the counts of associated terms in the Ontology Browser.

  Usage : ./steps/compile_ontology_resources.pl ${RELEASE}
  Input : WB files staged at: /usr/local/ftp/pub/wormbase/releases/WSXXX/ONTOLOGY
  • anatomy_association.RELEASE.wb
  • gene_association.RELEASE.wb
  • phenotype_association.RELEASE.wb
 Output : to ${WORMBASE}/databases/${RELEASE}/ontology:
  • anatomy_association.RELEASE.wb
  • gene_association.RELEASE.wb
  • phenotype_association.RELEASE.wb

Compile Orthology Resources

  Usage : ./steps/compile_orthology_resources.pl ${RELEASE}
  Input :  omim.txt and morbidmap files from OMIM
  Intermediate: gene2omim.txt  (query AceDB and build the worm gene to human disease relationship based on its human ortholog)
 Output : Files ${WORMBASE}/databases/${RELEASE}/orthology 
 Disease.ace (load into Xapian, use for disease/search and disease page)

Compile Gene Resources

BROKEN

Create precompiled gene page files specifically to populate the Phenotype tables.

  Usage : ./steps/compile_gene_resources.pl ${RELEASE}
  Input : AceDB data
 Output : Files ${WORMBASE}/databases/${RELEASE}/gene
  • gene_rnai_pheno.txt (gene/gene)
  • gene_xgene_pheno.txt (gene/gene)
  • phenotype_id2name.txt (gene/gene)
  • rnai_data.txt (gene/gene)
  • variation_data.txt (gene/gene)

Compile Ontology Resources

TODO: This step relies on a number of external helper scripts that should ALL be folded into CompileGeneResources. They are located at

staging/helpers/gene_summary 

Take the mirrored ontology files and compile them into the databases for the ontology searches.

  Usage : ./steps/compile_ontology_resources.pl ${RELEASE}
  Input : OBO files staged at: /usr/local/ftp/pub/wormbase/releases/WSXXX/ONTOLOGY
          compiled data files from Compile Gene Resources step
  • anatomy_association.RELEASE.wb
  • anatomy_ontology.RELEASE.obo
  • gene_association.RELEASE.wb
  • gene_association.RELEASE.wb.cb
  • gene_association.RELEASE.wb.ce
  • gene_association.RELEASE.wb.cjp
  • gene_association.RELEASE.wb.ppa
  • gene_association.RELEASE.wb.rem
  • gene_ontology.RELEASE.obo
  • phenotype_association.RELEASE.wb
  • phenotype_ontology.RELEASE.obo
 Output : to ${WORMBASE}/databases/${RELEASE}/ontology:
  • id2association_counts.txt (ontology/tree_lister)
  • id2name.txt (ontology/tree_lister)
  • id2parents.txt (ontology/tree_lister)
  • id2total_associations.txt (ontology/tree_lister)
  • name2id.txt
  • search_data.txt
  • parent2ids.txt (ontology/tree_lister)

Compile Orthology Resources

Create precompiled orthology and disease display and search related files

This MUST be run after the ontology step above.

  Usage : ./steps/compile_orthology_resources.pl ${RELEASE}
  Input : AceDB data, omim.txt and morbidmap files from OMIM, ontology resources files
  • gene_association.$RELEASE.wb.ce
  Intermediate: 
  • all_proteins.txt
  • disease_page_data.txt
  • full_disease_data.txt
  • gene_id2go_bp.txt
  • gene_id2go_mf.txt
  • gene_id2phenotype.txt
  • gene_list.txt
  • hs_proteins.txt
  • last_processed_gene.txt
  • omim2disease.txt
  • omim_id2go_ids.txt
  • omim_id2phenotypes.txt
  • omim_id2disease_synonyms.txt
  • omim_reconfigured.txt
  • ortholog_other_data.txt
  • ortholog_other_data_hs_only.txt
 Output : Files ${WORMBASE}/databases/${RELEASE}/orthology (summary page using files in parenthesis)
  • disease_search_data.txt (orthology/search)
  • gene_id2omim_ids.txt (orthology/disease)
  • go_id2omim_ids.txt (orthology/disease,ontology/gene)
  • go_ids2descendants.txt (orthology/gene)
  • hs_ensembl_id2omim.txt (orthology/gene)
  • id2name.txt (orthology/disease, orthology/gene)
  • name2id.txt (orthology/disease)
  • omim_id2all_ortholog_data.txt (orthology/disease)
  • omim_id2disease_desc.txt (orthology/disease)
  • omim_id2disease_name.txt (orthology/disease,ontology/gene)
  • omim_id2disease_notes.txt (orthology/disease)
  • omim_id2disease.txt (orthology/gene)
  • omim_id2gene_name.txt (orthology/search)

Compile Interaction Data

DEPRECATED. NO NEED TO MIGRATE THIS INTO THE NEW STAGING PIPELINE.

Create precompiled gene page files specifically to populate interaction listing pages.

  Usage : ./steps/compile_interaction_data.pl ${RELEASE}
  Input : AceDB interaction data
 Output : Files ${WORMBASE}/databases/${RELEASE}/interaction
  • compiled_interaction_data.txt

Convert GFF2 into GFF3

 Usage: ./steps/convert_gff2_to_gff3.pl ${RELEASE}

Create files of commonly requested datasets

 Usage: ./steps/dump_annotations.pl {WSRELEASE}
Output: datasets in ${FTP}/releases/${RELEASE}/annotations and species/annotations

The staging harness will automatically run scripts in annotation_dumpers/*. These scripts should abide by the following conventions:

   1. Be located in update/staging/annotation_dumpers
   2. Be named either
          dump_species_*   for species-level data (like brief IDs)
          dump_resource_*  for resource-level data (like laboratories)
   3. Follow existing examples, including available parameters.
   4. Dump to STDERR and STDOUT.

   Notes:

   1. dump_species_* will be called for each species managed by WormBase
      and will end up in
         ${FTP_ROOT}/releases/[RELEASE]/species/[G_SPECIES]/annotation/[G_SPECIES].[RELEASE].[DESCRIPTION].txt
      dump_resource_* will be called once and end up in
         ${FTP_ROOT}/datasets-wormbase/wormbase.[RELEASE].[DESCRIPTION].txt
   2. The filename will be created by stripping off dump_species_ or dump_resource_.
      Species-specific resources will be prepended with the appropriate species name.
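The conventions above can be sketched as a minimal dumper. This is a hypothetical skeleton (the name dump_resource_laboratories and the sample row are invented for illustration); a real dumper would query AceDB for its data.

```shell
#!/bin/sh
# Hypothetical skeleton of a resource-level dumper following the conventions:
#   - progress and diagnostics go to STDERR
#   - data rows go to STDOUT (the harness captures and names the output file)

echo "dumping laboratories..." >&2                           # log to STDERR
printf "%s\t%s\n" "CGC" "Caenorhabditis Genetics Center"     # data to STDOUT
```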


Create a GBrowse-driven genetic map

Notes...

 Usage: ./steps/load_gmap_gffdb.pl ${RELEASE}

Go Live

steps/go_live.pl WSXXX

This script will

  • create a series of symlinks in the FTP site (for example, to maintain the virtually organized species/ directory)
  • create "current" symlinks in the FTP site for easy access
  • adjust symlinks to the mysql GFF databases updated this release
  • adjust the /usr/local/wormbase/acedb/wormbase symlink to point to the newly unpacked wormbase_WSXXX
  • sync the staging FTP site to the production FTP site

If you omit the WSXXX on the command line, the script will simply organize the virtual directories on the ftp site up to and including the current release. MySQL and AceDB symlinks will not be created.
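The AceDB symlink adjustment works like this. The example runs in a scratch directory so it is safe to execute; on wb-dev the real prefix is /usr/local/wormbase/acedb, and WS229/WS230 stand in for the outgoing and incoming releases.

```shell
# Illustration of the go-live symlink flip, in a throwaway directory.
ACEDB=$(mktemp -d)
mkdir "$ACEDB/wormbase_WS229" "$ACEDB/wormbase_WS230"
ln -s "$ACEDB/wormbase_WS229" "$ACEDB/wormbase"   # current release

# Repoint the 'wormbase' symlink at the new release in one step;
# -n keeps ln from descending into the old symlinked directory.
ln -sfn "$ACEDB/wormbase_WS230" "$ACEDB/wormbase"
readlink "$ACEDB/wormbase"
```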

Branch the web code

For each major WS release, create a corresponding branch in the git repository. We branch the moment a new release is staged so that development against that release's data model can begin. This can be done from any repository.

staging> cd /usr/local/wormbase/website/production
staging> git pull
// Creating a tag...
// staging> git tag -a -m "WSXXX" WSXXX HEAD
// staging> git push --tags
// Create a new branch, one tracking the remote master repository
staging> git branch --track WSXXX origin/master
staging> git branch // list all branches
// Push the branch to the remote repository
staging> git push origin WSXXX
staging> git push

Steps to execute after a release has been staged

Precache content

Once a release has been successfully staged and tested, we pre-cache select computationally intensive content to a CouchDB instance located on the development server.

Precaching works as follows.

1. The primary Catalyst configuration file is read.

2. For each widget set to "precache = true" in config, REST requests will be constructed against staging.wormbase.org. This will be running the NEW version of WormBase.

3A. The webapp returns HTML; the precache script stores it in the reference (production) couchdb.

   OR

3B. The web app on staging.wormbase.org will automatically cache the result in the reference couchdb (currently web6); the couchdb that is written to can be configured in wormbase.conf.

4. The reference couchDB will then be replicated during production release to each node, scaling horizontally.

5. During a production cycle, additional content will be stored in the reference couchdb; this is synced periodically to each node.

See CouchDB for details.
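Steps 2 and 4 can be sketched as follows. This is a dry-run (commands printed, not executed); the widget URL, gene ID, hostnames, and database names are illustrative assumptions, while CouchDB's /_replicate endpoint is its standard replication API.

```shell
# Dry-run sketch: one precache request, then replication to a node.
RELEASE=ws230   # hypothetical couchdb database name for this release

# 2. Request a widget marked "precache = true" against the staging app:
echo "curl -s http://staging.wormbase.org/rest/widget/gene/WBGene00000001/overview"

# 4. Replicate the reference couchdb out to a production node:
echo "curl -X POST http://web6:5984/_replicate -H 'Content-Type: application/json' -d '{\"source\":\"$RELEASE\",\"target\":\"http://node1:5984/$RELEASE\"}'"
```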

Purge old releases

To purge previous releases from the production and staging nodes,

staging/steps/purge_old_releases.pl --release WSXXX

This will remove the following:

/usr/local/wormbase/acedb/wormbase_WSXXX
/usr/local/wormbase/databases/WSXXX
/usr/local/mysql/data/WSXXX

And on the staging host:

/usr/local/ftp/pub/wormbase/releases/WSXXX
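What the purge removes amounts to the four directories above. A dry-run sketch (the rm commands are echoed, not executed, and WS225 is a hypothetical release being retired):

```shell
# Dry-run sketch of the purge; drop the echo to actually delete.
RELEASE=WS225
for dir in \
    /usr/local/wormbase/acedb/wormbase_$RELEASE \
    /usr/local/wormbase/databases/$RELEASE \
    /usr/local/mysql/data/$RELEASE \
    /usr/local/ftp/pub/wormbase/releases/$RELEASE
do
    echo "rm -rf $dir"
done
```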

Compiled file documentation and plans

Step File Description WB2 update
Compile Gene Resources gene_rnai_pheno.txt
  • many-to-many listing of Gene_ids to RNAi_ids and Related phenotype ID. (or not)
  • Used in the classic gene summary for phenotype tables
TODO: update the appropriate method in Gene.pm to pull data for given gene directly from Ace
Compile Gene Resources gene_xgene_pheno.txt
  • many-to-many listing of Gene_ids to Transgene_ids and Related phenotype ID. (or not)
  • Used in the classic gene summary for phenotype tables
TODO: update the appropriate method in Gene.pm to pull data for given gene directly from Ace
Compile Gene Resources phenotype_id2name.txt
  • listing of Phenotype_ids to Phenotype names.
  • Used in the classic gene summary for phenotype tables in order to obviate the extraction of individual phenotype objects and their names
TODO: function will be deprecated since individual phenotype objects will be extracted
Compile Gene Resources rnai_data.txt
  • listing of RNAi data for the RNAi table in gene/gene
  • Used in the classic gene summary for RNAi tables
TODO: update the appropriate method in Gene.pm to pull data for given gene directly from Ace
Compile Gene Resources variation_data.txt
  • many-to-many listing of Gene_ids to Variation_ids and related phenotype ID. (or not)
  • Used in the classic gene summary for phenotype tables
TODO: update the appropriate method in Gene.pm to pull data for given gene directly from Ace
Compile Ontology Resources id2association_counts.txt
  • listing of ontology object ids (GO, Anatomy_term, Phenotype) to the number of annotations to the term
  • used in tree_lister (browser)
retain for browser, move into tied hash?
Compile Ontology Resources id2name.txt
  • listing of ontology object ids (GO, Anatomy_term, Phenotype) to the term
  • used in tree_lister (browser)
retain for browser, move into tied hash?
Compile Ontology Resources id2parents.txt
  • one-to-many listing of ontology object ids (GO, Anatomy_term, Phenotype) to the parent terms and respective relationship
  • used in tree_lister (browser)
retain for browser, move into tied hash?
Compile Ontology Resources id2total_associations.txt
  • listing of ontology object terms (GO, Anatomy_term, Phenotype) to the id
  • used in tree_lister (browser)
retain for browser, move into tied hash?
Compile Ontology Resources search_data.txt
  • pipe-delimited data on each term including synonyms and annotations
  • Used in GO, Anatomy_term, and Phenotype searches
To be superseded by Xapian search
Compile Ontology Resources parent2ids.txt
  • one-to-many listing of ontology object ids (GO, Anatomy_term, Phenotype) to their immediate descendants term ids
  • used in tree_lister (browser)
retain for browser, move into tied hash?
Compile Orthology Resources disease_search_data.txt
  • Pipe-delimited file containing details on the diseases extracted from OMIM
  • used in disease search
Use data for Xapian search; work with Abby
Compile Orthology Resources gene_id2omim_ids.txt
  • one-to-many listing of gene_ids to omim IDs
  • used in orthology/disease
Keep for disease object
Compile Orthology Resources go_id2omim_ids.txt
  • one-to-many listing of GO ids to OMIM IDs
  • used in orthology/disease and ontology/gene
useful for further paralog data expansion and integration
Compile Orthology Resources go_ids2descendants.txt
  • one-to-many listing of go ids to its list of the go ids of its descendants
  • plan was to use this data for paralog display in orthology/gene
useful for further paralog data expansion and integration
Compile Orthology Resources hs_ensembl_id2omim.txt
  • one-to-one listing of hs ensembl ids to omim ids
  • used in orthology/gene
disease UI
Compile Orthology Resources id2name.txt
  • listing of ontology object ids (GO, Anatomy_term, Phenotype) to the term
  • used in orthology/disease & orthology/gene
useful for further paralog data expansion and integration(?)
Compile Orthology Resources name2id.txt
  • listing of ontology object terms (GO, Anatomy_term, Phenotype) to the id
  • used in orthology/disease
useful for further paralog data expansion and integration(?)
Compile Orthology Resources omim_id2all_ortholog_data.txt
  • pipe-delimited file containing details of the ortholog associated with the omim id
  • used in orthology/disease
use to generate Xapian data; work with Abby
Compile Orthology Resources omim_id2disease_desc.txt
  • one-to-one listing of omim ids and the disease description
  • used in orthology/disease
use in Disease object model and UI
Compile Orthology Resources omim_id2disease_name.txt
  • one-to-one listing of omim ids and the disease name
  • used in orthology/disease
use in Disease object model and UI
Compile Orthology Resources omim_id2disease_notes.txt
  • one-to-one listing of omim ids and the disease notes from omim
  • used in orthology/disease
use in Disease object model and UI
Compile Orthology Resources omim_id2disease.txt
  • one-to-one listing of omim ids and the disease names
  • used in orthology/disease
use in Disease object model and UI
Compile Orthology Resources omim_id2gene_name.txt
  • one-to-many listing of omim ids to gene names
  • used in orthology/search
probably deprecate in updating Disease object model