GFF3specProposal

From WormBaseWiki
Revision as of 11:45, 3 November 2009 by Mh6 (talk | contribs) (another url fix)
Jump to navigationJump to search

Proposal to adapt the EnsEMBL cigar format to GFF3

This is a proposal and in no way implemented or accepted for the official GFF3 spec. It exists solely here on WormBase to get community input.

THE GAP ATTRIBUTE

The GFF3 Gap alignment format is based on the CigarString and BaseAlignFormat format used by Ensembl. The format is extended by frameshift attributes and formalises the usage of orientation attributes.

Protein and nucleotide alignment features typically consist of two
sequences, the reference sequence and the "target", and are not always
colinear.  For example, consider the following alignment between an
EST ("EST23") and a segment of the genome ("Chr3"):

        Chr3  (reference)  1 CAAGACCTAAACTGGAT-TCCAAT  23
        EST23 (target)     1 CAAGACCT---CTGGATATCCAAT  21

Previous versions of the GFF format would represent this alignment as
three colinear segments, but this made it difficult to reconstruct the
gapped alignment.  GFF3 recommends representing gapped alignments
explicitly with the "Gap" attribute.  The Gap attribute's format
consists of a series of (operation,length) pairs, for example "8M3D6M".

For segments of length 1 the number can be omitted, so "8M1D6M" is equal
to "8MD6M". 

Each operation is a single-letter
code:

       Code      Operation
       ----      ---------

        M        match
        I        insert a gap into the reference sequence
        D        insert a gap into the target (delete from reference)
        F        frameshift forward in the reference sequence
        R        frameshift reverse in the reference sequence

In the alignment between EST23 and Chr3 shown above, Chr3 is the
reference sequence referred to in the first column of the GFF3 file,
and EST23 is the sequence referred to by the Target attribute.  This
gives a Gap string of "8M3D6MI6M". The full GFF match line will
read:
          
   Chr3 . Match 1 23 . . . ID=Match1;Target=EST23 1 21;Gap=8M3D6MI6M

For protein to nucleotide matches, the M, I and D operations apply to
amino acid residues in the target and nucleotide base pairs in the
reference in a 1:3 residue.  That is, "2M" means to match two amino
residues in the target to six base pairs in the reference.  Hence this
alignment:

 100 atgaaggag---gttattgcgaatgtcggcggt
   1 M..K..E..V..V..I..-..N..V..G..G..

Corresponds to this GFF3 Line:

 ctg123 . nucleotide_to_protein 100 129 . + . ID=match008;Target=p101 1 10;Gap=3MI2MD4M

In addition, the Gap attribute provides <F>orward and <R>everse
frameshift operators to allow for frameshifts in the alignment.  These
are in nucleotide coordinates: a forward frameshift skips forward the
indicated number of base pairs, while a reverse frameshift moves
backwards.  Examples:


 100 atgaaggag---gttattgaatgtcggcggt     Gap=3MI2MF4M
   1 M..K..E..V..V..I...
                        N..V..G..G


 100 atgaaggag---gttataatgtcggcggt        Gap=3MI2MR4M
   1 M..K..E..V..V..I.
                      N..V..G..G

ALIGNMENTS

In the SO, an alignment between the reference sequence and another
sequence is called a "match".  In addition to the generic "match"
type, there are the subclasses "cDNA_match," "EST_match,"
"translated_nucleotide_match," "nucleotide_to_protein_match," and
"nucleotide_motif."

Matches typically contain gaps; matches broken up by large gaps are
usually called "HSPs" (high-scoring segment pair), and previous
incarnations of GFF have handled gapped alignments by breaking up the
alignment into a series of ungapped HSPs.

The SO does not have an HSP type.  Instead, gapped matches are
represented as a single feature that occupies a discontinuous location
on the reference sequence.  Figure 2 shows the same gene as before,
but with a new track added showing an alignment of a sequenced cDNA to
the genome.  For the purposes of illustration, we have shown the
regions of alignment to be exact across the three exons of the second
spliced transcript (EDEN.2).

Gff3_spec_Figure2.png

Figure 2

The recommended way to represent this alignment is with a single
feature of type "cDNA_match" and a Gap attribute that indicates that
the alignment is in three segments:

 ctg123 . cDNA_match 1050  9000  6.2e-45  +  . 
    ID=match00001;Target=cdna0123 12 2964;Gap=451M3499D501M1499D2001M

Parsed out, the Target attribute indicates that the sequence named
"cdna0123" between bases 12 and 2964 (in cdna coordinates) aligns to
bases 1050 to 9000 of ctg123.  The Gap attribute is easier to
read when spaces are inserted:

     451M	 match 451 bases
     3499D	 skip 3499 bases in the reference ctg123 sequence
     501M	 match the next 501 bases
     1499D	 skip 1499 bases in the reference ctg123
     2001M	 match the next 2001 bases

Note that the matched region is 2953 bases, which corresponds exactly
to the matching subsequence [12,2964] of the target.  Extra bases in
the cDNA which would cause gaps in the reference sequence would be
indicated using the CIGAR "I" notation.

Another important item to note is that the ID corresponds to the Match
and not to the target sequence.  This avoids the confusion that has
occurred in previous incarnations of GFF which made it impossible to
distinguish between a particular alignment of a target sequence to the
genome and all alignments of a target sequence to the genome.

A limitation of the Gap representation is that the entire alignment
shares the same score (column 6).  To give each component of the match
a separate score, it can be broken across multiple lines as shown
here:

 ctg123 . cDNA_match 1050  1500  5.8e-42  +  . ID=match00001;Target=cdna0123 12 462
 ctg123 . cDNA_match 5000  5500  8.1e-43  +  . ID=match00001;Target=cdna0123 463 963
 ctg123 . cDNA_match 7000  9000  1.4e-40  +  . ID=match00001;Target=cdna0123 964 2964

Notice that the ID is the same across each of the three lines,
indicating that these lines all refer to a single feature, the Match.
Each aligning segment, however has a distinct score and Target region.

The two types of representations can be mixed, allowing large aligned
segments to have their own GFF line and score, while small gaps within
them are represented using a Gap attribute.

Matches can align to either the + or the - strand of the reference
sequence.  This should be denoted in the seventh column of the GFF
line and *not* by changing the order of the start and end positions in
the Target attribute.  To illustrate this, Figure 3 adds an EST pair
to the annotation.  The two ESTs, mjm1123.5 and mum1123.3 correspond
to 5' and 3' EST reads from the same cDNA clone.  The following GFF3
lines describe them:

 ctg123 . EST_match 1200  3200  2.2e-30  +  . 
     ID=match00002;Target=mjm1123.5 5 106;Gap=301M1499I201M

 ctg123 . EST_match 5400  9000  7.4e-32  -  . 
     ID=match00003;Target=mjm1123.3 1 502;Gap=101M1499I401M

Please note that the subsequence indicated by the Target always uses
the coordinate system of the EST, regardless of the direction of the
alignment.  For the 3' EST, the seventh column contains a "-" to
indicate that the match is to the reverse complement of ctg123.  The
Gap attribute does not change as a consequence of this reverse
complementation, and is read from left to right in the usual manner.

An application may wish to group the EST pair into a single feature.
This can be accomplished by creating an implied cDNA_match that
extends from the left end of the first EST to the right end of the
last EST, and indicating that this cDNA match is the Parent of the two
ESTs. The parts of the match use the SO "match_part" term. A
match_part can be used as a subpart of any type of match.

 ctg123 . cDNA_match 1200  9000  .        .  . ID=cDNA00001

 ctg123 . match_part  1200  3200  2.2e-30  +  . 
      ID=match00002;Parent=cDNA00001;Target=mjm1123.5 5 106;Gap=301M1499I201M

 ctg123 . match_part  5400  9000  7.4e-32  -  . 
      ID=match00003;Parent=cDNA00001;Target=mjm1123.3 1 502;Gap=101M1499I401M

GFf3_spec_Figure3.png

FIGURE 3


TRANSCRIPT-RELATIVE ALIGNMENTS

The representation of strandedness in nucleotide-to-nucleotide and
protein-to-nucleotide alignments is a common source of confusion in
GFF files.  This section will attempt to explain it.

* Case #1: alignment to a + strand transcript

Consider a pair of EST matches to the genome:

 =============================  genome
     ------------------->       transcript
     ------>        <----       
      EST_A (5')    EST_B (3')

EST_A is a 5' EST and its sequence (as represented in a FASTA file,
for example) is in the same strand as the genomic sequence.  It is
represented as:

 ctg123 . EST_match  1000 1500 . + . ID=match001;Target=EST_A 1 500 +

The strand field in column #7 is "+" indicating that the match is to
the forward strand of the genome.  The optional strand field in the
Target attribute is also +, indicating that the alignment is to the
plus strand of the implied underlying transcript.

Let us now consider EST_B , which is a 3' EST. Its sequence as
represented in the FASTA file aligns to the reverse complement of the
genomic sequence.  It is represented as:

 ctg123 . EST_match  2000 2500 . + . ID=match002;Target=EST_B 1 500 -

The strand field in column #7 is "+" indicating that the match is to a
transcript feature on the forward of the genome.  The strand field in
the Target attribute is -, indicating that the EST sequence should be
reverse complemented in order to align to the underlying transcript.

* Case #2: alignment to a - strand transcript

Here is the opposite case:

 =============================  genome
     <--------------------      transcript
      ------>        <----       
      EST_D (3')  EST_C (5')

In this case, the 5' EST_C aligns to the reverse complement of the
forward strand of the genome, while the 3' EST_D aligns to the forward
strand directly.  These are represented as follows:

 ctg123 . EST_match  2000 2500 . - . ID=match001;Target=EST_C 1 500 +
 ctg123 . EST_match  1000 1500 . - . ID=match001;Target=EST_D 1 500 -

The first line indicates that the transcript is on the - strand of the
genome, and that EST_C aligns to the transcripts forward strand.  The
second line uses - in the 7th column to indicate that the transcript
is on the minus strand, and - in the Target field to indicate that
EST_D aligns to the minus strand of the transcript.

  ===================================================================
  Confused?  Just remember that for purposes of display, the source
  and target strands will be multipled together.  A +/+ or -/-
  alignment indicates that the reference sequence and the target
  sequence can be aligned directly.  A +/- or -/+ alignment indicates
  that the target must be reverse complemented in order to align to
  the plus strand of the reference sequence.  
  ===================================================================

A similar rule applies to TBLASTX alignments, which rely on matching
the six-frame translation of the source to the six-frame translation
of the target.  Consider the case of two genomes that align together
in the forward direction, whose alignment is supported by translations
of genes A and B, one of which is on the plus strand, and the other on
the minus strand:

  =============================>  genome X
       ------>        <----
       gene A          gene B
  =============================>  genome Y

These two alignments will be  represented as:

  X TBLASTX translated_nucleotide_match 1000 1500 . + . ID=matchA;Target=Y 500 1000 +
  X TBLASTX translated_nucleotide_match 2000 2500 . - . ID=matchB;Target=Y 1500 2000 -

Note that the first alignment is +/+ and the second is -/-.  Both
indicate that the sequences of genomes X and Y can be aligned
directly.

Now we look at the case of two genomes that align in the antiparallel
direction:

  =============================>  genome X
       ------>        <----
       gene A          gene B
  <=============================  genome Y


These two alignments will be represented as:

  X TBLASTX translated_nucleotide_match 1000 1500 . + . ID=matchA;Target=Y 500 1000 -
  X TBLASTX translated_nucleotide_match 2000 2500 . - . ID=matchB;Target=Y 1500 2000 +

The first match indicates that a plus strand feature of genome X
aligns to a minus strand feature of genome Y.  The second match
indicates that a minus strand feature of genome X aligns to a plus
strand feature of genome Y.  In both cases, the result is to align the
plus strand of genome X to the minus strand of genome Y.