Proposal to adapt the EnsEMBL cigar format to GFF3
This is a proposal and in no way implemented or accepted for the official GFF3 spec. It exists solely here on WormBase to get community input.
THE GAP ATTRIBUTE
The GFF3 Gap alignment format is based on the CigarString and BaseAlignFormat format used by Ensembl. The format is extended by frameshift attributes and formalises the usage of orientation attributes.
Protein and nucleotide alignment features typically consist of two sequences, the reference sequence and the "target", and are not always colinear. For example, consider the following alignment between an EST ("EST23") and a segment of the genome ("Chr3"): Chr3 (reference) 1 CAAGACCTAAACTGGAT-TCCAAT 23 EST23 (target) 1 CAAGACCT---CTGGATATCCAAT 21 Previous versions of the GFF format would represent this alignment as three colinear segments, but this made it difficult to reconstruct the gapped alignment. GFF3 recommends representing gapped alignments explicitly with the "Gap" attribute. The Gap attribute's format consists of a series of (operation,length) pairs, for example "8M3D6M". For segments of length 1 the number can be omitted, so "8M1D6M" is equal to "8MD6M". Each operation is a single-letter code: Code Operation ---- --------- M match I insert a gap into the reference sequence D insert a gap into the target (delete from reference) F frameshift forward in the reference sequence R frameshift reverse in the reference sequence In the alignment between EST23 and Chr3 shown above, Chr3 is the reference sequence referred to in the first column of the GFF3 file, and EST23 is the sequence referred to by the Target attribute. This gives a Gap string of "8M3D6MI6M". The full GFF match line will read: Chr3 . Match 1 23 . . . ID=Match1;Target=EST23 1 21;Gap=8M3D6MI6M For protein to nucleotide matches, the M, I and D operations apply to amino acid residues in the target and nucleotide base pairs in the reference in a 1:3 residue. That is, "2M" means to match two amino residues in the target to six base pairs in the reference. Hence this alignment: 100 atgaaggag---gttattgcgaatgtcggcggt 1 M..K..E..V..V..I..-..N..V..G..G.. Corresponds to this GFF3 Line: ctg123 . nucleotide_to_protein 100 129 . + . ID=match008;Target=p101 1 10;Gap=3MI2MD4M In addition, the Gap attribute provides <F>orward and <R>everse frameshift operators to allow for frameshifts in the alignment. These are in nucleotide coordinates: a forward frameshift skips forward the indicated number of base pairs, while a reverse frameshift moves backwards. Examples: 100 atgaaggag---gttattgaatgtcggcggt Gap=3MI2MF4M 1 M..K..E..V..V..I... N..V..G..G 100 atgaaggag---gttataatgtcggcggt Gap=3MI2MR4M 1 M..K..E..V..V..I. N..V..G..G
In the SO, an alignment between the reference sequence and another sequence is called a "match". In addition to the generic "match" type, there are the subclasses "cDNA_match," "EST_match," "translated_nucleotide_match," "nucleotide_to_protein_match," and "nucleotide_motif." Matches typically contain gaps; matches broken up by large gaps are usually called "HSPs" (high-scoring segment pair), and previous incarnations of GFF have handled gapped alignments by breaking up the alignment into a series of ungapped HSPs. The SO does not have an HSP type. Instead, gapped matches are represented as a single feature that occupies a discontinuous location on the reference sequence. Figure 2 shows the same gene as before, but with a new track added showing an alignment of a sequenced cDNA to the genome. For the purposes of illustration, we have shown the regions of alignment to be exact across the three exons of the second spliced transcript (EDEN.2).
The recommended way to represent this alignment is with a single feature of type "cDNA_match" and a Gap attribute that indicates that the alignment is in three segments: ctg123 . cDNA_match 1050 9000 6.2e-45 + . ID=match00001;Target=cdna0123 12 2964;Gap=451M3499D501M1499D2001M Parsed out, the Target attribute indicates that the sequence named "cdna0123" between bases 12 and 2964 (in cdna coordinates) aligns to bases 1050 to 9000 of ctg123. The Gap attribute is easier to read when spaces are inserted: 451M match 451 bases 3499D skip 3499 bases in the reference ctg123 sequence 501M match the next 501 bases 1499D skip 1499 bases in the reference ctg123 2001M match the next 2001 bases Note that the matched region is 2953 bases, which corresponds exactly to the matching subsequence [12,2964] of the target. Extra bases in the cDNA which would cause gaps in the reference sequence would be indicated using the CIGAR "I" notation. Another important item to note is that the ID corresponds to the Match and not to the target sequence. This avoids the confusion that has occurred in previous incarnations of GFF which made it impossible to distinguish between a particular alignment of a target sequence to the genome and all alignments of a target sequence to the genome. A limitation of the Gap representation is that the entire alignment shares the same score (column 6). To give each component of the match a separate score, it can be broken across multiple lines as shown here: ctg123 . cDNA_match 1050 1500 5.8e-42 + . ID=match00001;Target=cdna0123 12 462 ctg123 . cDNA_match 5000 5500 8.1e-43 + . ID=match00001;Target=cdna0123 463 963 ctg123 . cDNA_match 7000 9000 1.4e-40 + . ID=match00001;Target=cdna0123 964 2964 Notice that the ID is the same across each of the three lines, indicating that these lines all refer to a single feature, the Match. Each aligning segment, however has a distinct score and Target region. The two types of representations can be mixed, allowing large aligned segments to have their own GFF line and score, while small gaps within them are represented using a Gap attribute. Matches can align to either the + or the - strand of the reference sequence. This should be denoted in the seventh column of the GFF line and *not* by changing the order of the start and end positions in the Target attribute. To illustrate this, Figure 3 adds an EST pair to the annotation. The two ESTs, mjm1123.5 and mum1123.3 correspond to 5' and 3' EST reads from the same cDNA clone. The following GFF3 lines describe them: ctg123 . EST_match 1200 3200 2.2e-30 + . ID=match00002;Target=mjm1123.5 5 106;Gap=301M1499I201M ctg123 . EST_match 5400 9000 7.4e-32 - . ID=match00003;Target=mjm1123.3 1 502;Gap=101M1499I401M Please note that the subsequence indicated by the Target always uses the coordinate system of the EST, regardless of the direction of the alignment. For the 3' EST, the seventh column contains a "-" to indicate that the match is to the reverse complement of ctg123. The Gap attribute does not change as a consequence of this reverse complementation, and is read from left to right in the usual manner. An application may wish to group the EST pair into a single feature. This can be accomplished by creating an implied cDNA_match that extends from the left end of the first EST to the right end of the last EST, and indicating that this cDNA match is the Parent of the two ESTs. The parts of the match use the SO "match_part" term. A match_part can be used as a subpart of any type of match. ctg123 . cDNA_match 1200 9000 . . . ID=cDNA00001 ctg123 . match_part 1200 3200 2.2e-30 + . ID=match00002;Parent=cDNA00001;Target=mjm1123.5 5 106;Gap=301M1499I201M ctg123 . match_part 5400 9000 7.4e-32 - . ID=match00003;Parent=cDNA00001;Target=mjm1123.3 1 502;Gap=101M1499I401M
The representation of strandedness in nucleotide-to-nucleotide and protein-to-nucleotide alignments is a common source of confusion in GFF files. This section will attempt to explain it. * Case #1: alignment to a + strand transcript Consider a pair of EST matches to the genome: ============================= genome -------------------> transcript ------> <---- EST_A (5') EST_B (3') EST_A is a 5' EST and its sequence (as represented in a FASTA file, for example) is in the same strand as the genomic sequence. It is represented as: ctg123 . EST_match 1000 1500 . + . ID=match001;Target=EST_A 1 500 + The strand field in column #7 is "+" indicating that the match is to the forward strand of the genome. The optional strand field in the Target attribute is also +, indicating that the alignment is to the plus strand of the implied underlying transcript. Let us now consider EST_B , which is a 3' EST. Its sequence as represented in the FASTA file aligns to the reverse complement of the genomic sequence. It is represented as: ctg123 . EST_match 2000 2500 . + . ID=match002;Target=EST_B 1 500 - The strand field in column #7 is "+" indicating that the match is to a transcript feature on the forward of the genome. The strand field in the Target attribute is -, indicating that the EST sequence should be reverse complemented in order to align to the underlying transcript. * Case #2: alignment to a - strand transcript Here is the opposite case: ============================= genome <-------------------- transcript ------> <---- EST_D (3') EST_C (5') In this case, the 5' EST_C aligns to the reverse complement of the forward strand of the genome, while the 3' EST_D aligns to the forward strand directly. These are represented as follows: ctg123 . EST_match 2000 2500 . - . ID=match001;Target=EST_C 1 500 + ctg123 . EST_match 1000 1500 . - . ID=match001;Target=EST_D 1 500 - The first line indicates that the transcript is on the - strand of the genome, and that EST_C aligns to the transcripts forward strand. The second line uses - in the 7th column to indicate that the transcript is on the minus strand, and - in the Target field to indicate that EST_D aligns to the minus strand of the transcript. =================================================================== Confused? Just remember that for purposes of display, the source and target strands will be multipled together. A +/+ or -/- alignment indicates that the reference sequence and the target sequence can be aligned directly. A +/- or -/+ alignment indicates that the target must be reverse complemented in order to align to the plus strand of the reference sequence. =================================================================== A similar rule applies to TBLASTX alignments, which rely on matching the six-frame translation of the source to the six-frame translation of the target. Consider the case of two genomes that align together in the forward direction, whose alignment is supported by translations of genes A and B, one of which is on the plus strand, and the other on the minus strand: =============================> genome X ------> <---- gene A gene B =============================> genome Y These two alignments will be represented as: X TBLASTX translated_nucleotide_match 1000 1500 . + . ID=matchA;Target=Y 500 1000 + X TBLASTX translated_nucleotide_match 2000 2500 . - . ID=matchB;Target=Y 1500 2000 - Note that the first alignment is +/+ and the second is -/-. Both indicate that the sequences of genomes X and Y can be aligned directly. Now we look at the case of two genomes that align in the antiparallel direction: =============================> genome X ------> <---- gene A gene B <============================= genome Y These two alignments will be represented as: X TBLASTX translated_nucleotide_match 1000 1500 . + . ID=matchA;Target=Y 500 1000 - X TBLASTX translated_nucleotide_match 2000 2500 . - . ID=matchB;Target=Y 1500 2000 + The first match indicates that a plus strand feature of genome X aligns to a minus strand feature of genome Y. The second match indicates that a minus strand feature of genome X aligns to a plus strand feature of genome Y. In both cases, the result is to align the plus strand of genome X to the minus strand of genome Y.