Mining WormBase with Bio::DB::GFF

From WormBaseWiki
Jump to navigationJump to search

Mining WormBase with MySQL & PostgreSQL

The BioPerl library provides a simple relational schema and database access layer for querying genomic features. This is the access method of choice to use for mining WormBase for:

Spliced and unspliced genes. UTRs, upstream regions or introns. Searching for annotations that overlap one another, for example, finding all genes that are in an intron of another gene. Comparing syntenic regions of the C. elegans and C. briggsae genomes. You will need to install your own local database. Currently WormBase does not provide world access to the MySQL database of genome features (although this may change in the future). You must:


Install BioPerl and either MySQL or PostgreSQL. Read the manual pages on Bio::DB::GFF. This describes the API for loading and querying the database. Download the following flat files that describe the C. elegans and C. briggsae genomes: elegansWSXXXX.gff.gz GFF-format descriptions of sequence annotations on C. elegans. The elegansWSXXXX.gff.gz file contains complete annotations for WormBase release XXXX, where XXXX is the current release number. Other files in this directory contain the same information sorted by chromosome. CHROMOSOME_*.dna.gz Fasta-format files of C. elegans chromosome DNA. EST_Elegans.dna.gz Fasta-format files of C. elegans ESTs and cDNAs. briggsae_25.gff.gz GFF-format descriptions of sequence annotations on the C. briggsae genome. briggsae_25.fa.gz Fasta-format file of C. briggsae genomic contigs. After downloading the C. elegans FASTA files, you should combine them into a single file to facilitate loading. This is most easily done like this:

% zcat CHROMOSOME_*.dna.gz EST_Elegans.dna.gz > CElegans.fa

You do not have to do this with C. briggsae, because its DNA data is already combined into one file.

Once these are downloaded, create appropriately-named databases using the MySQL or PostgreSQL administrators' tools. You may create separate databases for each of the genomes (recommended) or put them together in the same database. Assuming that you have created two MySQL databases, one named "elegans" and the other named "briggsae", you will use the bulk_load_gff.pl tool to load them from the downloaded files:

% bulk_load_gff.pl -c -d elegans  -fasta CElegans.fa elegansWSXXXX.gff.gz
% bulk_load_gff.pl -c -d briggsae -fasta briggsae_25.fa.gz briggsae_25.gff.gz

The bulk_load_gff.pl program comes with BioPerl. Look for it in the subdirectory scripts/Bio-DB-GFF.

If you are using PostgreSQL, you should use the script "load_gff.pl", which works, but is very slow, or "pg_bulk_load_gff.pl," which is fast and optimized for PostgreSQL. They both have the same command-line syntax.

It might be neccessary to set TMPDIR enviroment variable to an appropriate temporary directory with enough space to hold the SQL files.

Once you've got the database loaded, you can write scripts to mine the data. For example, this script will find all named (3-letter) genes that are contained within the intron of another named or predicted gene and print out 100 bp upstream from their 5' end:

 
#!/usr/bin/perl
 
# find 3-letter named genes that are contained within the intron of another gene
use strict;
use Bio::DB::GFF;
  
 my $db = Bio::DB::GFF->new('elegans');
 my $intron_stream = $db->get_seq_stream('intron:curated');
 while (my $intron = $intron_stream->next_feature) {
  my @contained_genes = $intron->contained_features('gene:curated') or next;
   for my $gene (@contained_genes) {
    my $upstream = $gene->subseq(-99,0);  # 100 bp upstream - position 0 is 1 bp to left of translational start
    print $gene->name,"\t",$upstream->dna,"\n";
   }
}

The output starts like this:

fkh-6 acctccgtcttcacagttccgagaccccgccctcactcttagcttctgcataatccgttgtctcatttgacaccccctaccataaaaaaatacaataatc kin-31 aaaaaaaaatcgattttatcaaaaaacaatttatttcacatttttgtataactgacactcgtcagaattgtaaaaaccattaatttcatcgttgcattaa ... To fetch all gene models and print their coding regions and UTRs, use the following script:

 
#!/usr/bin/perl
 
use strict;
use Bio::DB::GFF;
  
 my $db  = Bio::DB::GFF->new(-dsn         => 'elegans',
			    -aggregators => 'gene_model{coding_exon,5_UTR,3_UTR/CDS}');
  
 my $gene_stream = $db->get_seq_stream('gene_model:curated');
  
 while (my $gene = $gene_stream->next_seq) {
  print $gene->name,"\n";
   for my $part ($gene->get_SeqFeatures) {
    print "\t",join("\t",$part->method,$part->start,$part->end),"\n";
   }
  print "\n";
 }

This will produce output like:

2L52.1 coding_exon II 1867 1911 1 coding_exon II 2506 2694 1 coding_exon II 2738 2888 1 coding_exon II 2931 3036 1 coding_exon II 3406 3552 1 coding_exon II 3802 3984 1 coding_exon II 4201 4663 1

2RSSE.1 5_UTR II 15268097 15268367 1 coding_exon II 15268368 15268441 1 coding_exon II 15269346 15269681 1 coding_exon II 15269747 15269918 1 coding_exon II 15270683 15270860 1 coding_exon II 15272930 15273201 1

See the documentation for the BioPerl Bio::DB::GFF class for details. Note the use of the aggregator "gene_model{coding_exon,5_UTR,3_UTR/CDS}", which says to aggregate parts of type coding_exon, 5_UTR, 3_UTR and CDS into a single feature of type "gene_model."