Test and training regions

From WormBaseWiki

How the nGASP test and training regions were chosen

This page describes how we chose the test and training regions for the nGASP project.

These regions were chosen by dividing the WS160 C. elegans genome into non-overlapping windows of 1 Mb, and discarding the 6 leftover regions of <1 Mb at the (high-coordinate) ends of the chromosomes. This left 96 1-Mb windows.
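The windowing step can be sketched as follows. This is only an illustration, not the script used for nGASP: the function name `full_windows`, the 1-based inclusive coordinates, and the example chromosome length are assumptions.

```python
def full_windows(chrom_length, window_size=1_000_000):
    """Return the (start, end) windows that fit entirely inside one chromosome.

    Coordinates are assumed to be 1-based and inclusive. The leftover
    region shorter than window_size at the high-coordinate end is dropped.
    """
    n_full = chrom_length // window_size  # number of complete windows
    return [(i * window_size + 1, (i + 1) * window_size) for i in range(n_full)]


# Hypothetical chromosome of 2.5 Mb: two full windows, the last 0.5 Mb is discarded.
example = full_windows(2_500_000)
```

Applying this to each chromosome in turn and concatenating the results gives the full list of windows.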

For each 1-Mb window, the gene density was calculated as the number of curated genes in that window. If a gene overlapped the border between two 1-Mb windows, it was included in the gene count for both neighbouring windows. One-third of the windows have gene densities of <=174 genes per Mb; we classified these as low gene density windows. One-third of the windows have gene densities of >=227 genes per Mb; we classified these as high gene density windows.
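A minimal sketch of the gene counting and the low/high classification is given below. The function names, the (start, end) tuple representation, and the way the tercile cut-offs are taken from the sorted values are all assumptions; the cut-offs actually reported above (174 and 227) came from the real WS160 data.

```python
def count_genes_per_window(genes, windows):
    """Count genes overlapping each window (1-based inclusive coordinates).

    A gene that spans a window border overlaps both windows and is
    therefore counted in both, as described in the text.
    """
    counts = [0] * len(windows)
    for g_start, g_end in genes:
        for i, (w_start, w_end) in enumerate(windows):
            if g_start <= w_end and g_end >= w_start:
                counts[i] += 1
    return counts


def classify_by_terciles(values):
    """Label each value 'low', 'medium' or 'high' using the bottom and top thirds.

    Assumption: the cut-offs are the largest value in the bottom third and
    the smallest value in the top third of the sorted list. With ties, more
    than one-third of the values may receive a given label.
    """
    if len(values) < 3:
        raise ValueError("need at least three values to form terciles")
    ranked = sorted(values)
    k = len(ranked) // 3
    low_cut, high_cut = ranked[k - 1], ranked[-k]
    labels = []
    for v in values:
        if v <= low_cut:
            labels.append("low")
        elif v >= high_cut:
            labels.append("high")
        else:
            labels.append("medium")
    return labels
```

For example, with two toy windows `(1, 100)` and `(101, 200)`, a gene at `(95, 105)` adds one to the count of each window.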

We also calculated the conservation with C. briggsae in each 1-Mb window, as the number of bases covered by strong WABA matches to C. briggsae. This gives a measure of the fraction of the DNA in the window that is strongly conserved with C. briggsae. Many WABA matches overlap, so a base pair covered by more than one WABA match was counted only once. One-third of the windows have <=54440 bp covered by strong WABA matches; we classified these as low conservation windows. One-third of the windows have >=77077 bp covered by strong WABA matches; we classified these as high conservation windows.
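Counting each covered base only once amounts to taking the union of the match intervals within a window. The sketch below shows one way to do this; the function name `covered_bases` and the 1-based inclusive (start, end) tuples are assumptions, and parsing of the WABA output itself is not shown.

```python
def covered_bases(matches, w_start, w_end):
    """Number of bases in [w_start, w_end] covered by at least one match.

    Matches are clipped to the window, sorted, and merged so that bases
    covered by several overlapping matches are counted only once.
    """
    clipped = sorted(
        (max(s, w_start), min(e, w_end))
        for s, e in matches
        if s <= w_end and e >= w_start
    )
    total = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Start of a new, non-overlapping block: add the finished block.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = s, e
        else:
            # Overlaps the current block: extend it if needed.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total
```

In a toy window `(1, 100)`, the matches `(10, 20)` and `(15, 30)` together cover 21 bases, not 27, because positions 15 to 20 are shared.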

The test set consists of 10 1-Mb windows:

  • 2 randomly chosen high-conservation, high-gene density autosomal windows (out of 10)
  • 2 randomly chosen high-conservation, low-gene density autosomal windows (out of 9)
  • 2 randomly chosen low-conservation, high-gene density autosomal windows (out of 12)
  • 2 randomly chosen low-conservation, low-gene density autosomal windows (out of 10)
  • 2 randomly chosen high-conservation, low-gene density X-chromosome windows (out of 7)

The training set also consists of 10 1-Mb windows, chosen with the same criteria as the test set. No test set window overlaps a training set window.
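The selection can be sketched as a stratified random draw. The code below is an illustration under assumptions: the page does not say how the random choice was made, so drawing four windows per category and splitting them between the two sets is just one way to guarantee that the sets do not overlap. The function name, category names and window identifiers are hypothetical; only the category sizes come from the list above.

```python
import random


def draw_test_and_training(strata, per_set=2, seed=None):
    """Randomly pick per_set test and per_set training windows from each category.

    strata maps a category name to a list of window identifiers. Sampling
    without replacement within a category keeps the two sets disjoint.
    """
    rng = random.Random(seed)
    test, training = [], []
    for name in sorted(strata):
        picked = rng.sample(strata[name], 2 * per_set)
        test.extend(picked[:per_set])
        training.extend(picked[per_set:])
    return test, training


# Hypothetical window identifiers; the category sizes are those listed above.
sizes = {
    "autosome_highcons_highgene": 10,
    "autosome_highcons_lowgene": 9,
    "autosome_lowcons_highgene": 12,
    "autosome_lowcons_lowgene": 10,
    "X_highcons_lowgene": 7,
}
strata = {name: [f"{name}_{i}" for i in range(n)] for name, n in sizes.items()}
test_set, training_set = draw_test_and_training(strata, seed=0)
```

Each returned set contains 10 windows, 2 from each of the 5 categories.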