GFF source methods

From WormBaseWiki

GFF source and feature

GFF2 description at the Sanger Institute

In the WormBase GFF files, genes are represented in several ways, each specified by a different source and feature (the second and third columns).

Gene spans

A gene span is the largest extent of a gene's transcripts, from the beginning of the 5' UTR of the most 5' transcript to the end of the 3' UTR of the most 3' transcript. Each gene is represented as a single line.

  • source = gene; feature = gene.

e.g. nlp-36 in WormBase:

CHROMOSOME_III  gene gene    9488630 9489091 .  +   .  Gene "WBGene00007185" ; Position "0.567795" ; Locus "nlp-36"
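
As a rough illustration of how the source and feature columns can be read programmatically, the sketch below splits one GFF2 line into its eight fixed columns and a dictionary of attributes. The function name `parse_gff2_line` and the returned dictionary layout are assumptions made for this example, not part of any WormBase tool. Real GFF2 files are tab-delimited; the sketch splits on any whitespace so that the example lines on this page also parse.

```python
def parse_gff2_line(line):
    """Split one GFF2 line into eight fixed columns plus an attribute dict.

    Hypothetical helper for illustration only.
    """
    # Split on any whitespace, at most 8 times, so the attribute column
    # (which itself contains spaces) stays in one piece.
    parts = line.strip().split(None, 8)
    seqname, source, feature, start, end, score, strand, frame = parts[:8]

    attributes = {}
    if len(parts) == 9:
        for item in parts[8].split(";"):
            item = item.strip()
            if not item:
                continue
            tag, _, value = item.partition(" ")
            # A tag such as Confirmed_EST can repeat, so keep a list per tag.
            attributes.setdefault(tag, []).append(value.strip().strip('"'))

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


line = ('CHROMOSOME_III gene gene 9488630 9489091 . + . '
        'Gene "WBGene00007185" ; Position "0.567795" ; Locus "nlp-36"')
record = parse_gff2_line(line)
is_gene_span = (record["source"], record["feature"]) == ("gene", "gene")
```

Checking the `(source, feature)` pair, as in the last line, is one way to pick out a single representation of a gene from the several described on this page.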

CDS

A CDS is the coding sequence of a gene from the start codon to the stop codon (so it does not include UTRs). A gene may have one or more CDSs. Each CDS is represented as a single line describing the start and end coordinates.

  • source = curated; feature = CDS.
CHROMOSOME_III  curated   CDS     9488634 9488986 .  +  .  CDS "B0464.3" ; WormPep "CE:CE00017" ;  Locus "nlp-36" ;  Status "Confirmed" ;  Gene "WBGene00007185" ;

The individual exons and introns are described with a single line per exon or intron.

So for the example three-exon gene:

  • source = curated; feature = exon;
CHROMOSOME_III  curated exon    9488634 9488695 .       +       .       CDS "B0464.3"
CHROMOSOME_III  curated exon    9488749 9488839 .       +       .       CDS "B0464.3"
CHROMOSOME_III  curated exon    9488891 9488986 .       +       .       CDS "B0464.3"
  • source = curated; feature = intron;
CHROMOSOME_III  curated intron  9488696 9488748 .       +       .       CDS "B0464.3" ; Confirmed_EST FM248941 ; Confirmed_EST OSTF051A3_1
CHROMOSOME_III  curated intron  9488840 9488890 .       +       .       CDS "B0464.3" ; Confirmed_EST yk1241c10.5 ; Confirmed_EST OSTF051A3_1
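
The intron coordinates above are exactly the gaps between consecutive exons. A minimal sketch of that relationship, assuming exons are given as inclusive `(start, end)` pairs (the function name `introns_from_exons` is made up for this example):

```python
def introns_from_exons(exons):
    """Return the inclusive (start, end) gaps between consecutive exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # GFF coordinates are 1-based and inclusive, so the intron starts
        # one base after the previous exon and ends one base before the next.
        introns.append((prev_end + 1, next_start - 1))
    return introns


cds_exons = [(9488634, 9488695), (9488749, 9488839), (9488891, 9488986)]
print(introns_from_exons(cds_exons))
# [(9488696, 9488748), (9488840, 9488890)]
```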

Exons are also represented with their coding phase (the eighth column):

  • source = curated; feature = coding_exon; (note: the actual coordinates are the same)
CHROMOSOME_III  curated coding_exon     9488634 9488695 .       +       0       CDS "B0464.3"
CHROMOSOME_III  curated coding_exon     9488749 9488839 .       +       1       CDS "B0464.3"
CHROMOSOME_III  curated coding_exon     9488891 9488986 .       +       0       CDS "B0464.3"
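
The phase values 0, 1, 0 in the eighth column can be reproduced from the exon lengths. The sketch below assumes the usual GFF convention that the phase is the number of bases to skip at the start of the exon to reach the first base of the next complete codon, and that the first coding exon starts at phase 0. The function name `coding_phases` is hypothetical.

```python
def coding_phases(exons, strand="+"):
    """Compute the phase of each coding exon, in transcription order.

    exons: inclusive (start, end) pairs covering only coding sequence.
    """
    # On the minus strand, transcription runs from high to low coordinates.
    ordered = sorted(exons, reverse=(strand == "-"))
    phases = []
    phase = 0  # assumption: the CDS begins with a complete codon
    for start, end in ordered:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after removing the skipped bases and whole codons
        # decide how many bases the next exon must contribute first.
        phase = (3 - (length - phase) % 3) % 3
    return phases


coding_exons = [(9488634, 9488695), (9488749, 9488839), (9488891, 9488986)]
print(coding_phases(coding_exons))
# [0, 1, 0]
```

The first exon is 62 bases long (20 codons plus 2 bases), so the second exon has to supply 1 base to finish that codon, which gives phase 1.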

Coding_transcript

Each CDS can have one or more Coding_transcripts. Where a CDS has multiple transcripts, they vary only in the UTRs. Coding_transcripts are the best equivalent to a full-length mRNA that we can build based on available evidence. They run from the transcription start site (e.g. SL1) to the polyA site. The full extent of a Coding_transcript is defined as a single line; this CDS has two Coding_transcripts:

  • source = Coding_transcript; feature = protein_coding_primary_transcript
CHROMOSOME_III  Coding_transcript    protein_coding_primary_transcript   9488630 9489087 .   +   .   Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript    protein_coding_primary_transcript   9488632 9489091 .   +   .   Transcript "B0464.3.2"

Compare these coordinates with the full gene span (above), which is larger than either transcript: it extends from 9488630 to 9489091, the outer extremities of the two Coding_transcripts.
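
Put another way, the gene span is the smallest start and largest end over all of a gene's transcripts. A short sketch, with the hypothetical name `gene_span` and transcripts given as inclusive `(start, end)` pairs:

```python
def gene_span(transcripts):
    """Return the (start, end) span enclosing every transcript of a gene."""
    starts = [start for start, _ in transcripts]
    ends = [end for _, end in transcripts]
    return min(starts), max(ends)


# B0464.3.1 and B0464.3.2 from the lines above
transcripts = [(9488630, 9489087), (9488632, 9489091)]
print(gene_span(transcripts))
# (9488630, 9489091)
```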

Each Coding_transcript is composed of the following feature types: coding_exons, introns, five_prime_UTR and three_prime_UTR.

Exons - source = Coding_transcript; feature = exon

CHROMOSOME_III  Coding_transcript       exon    9488630 9488695 .       +       .       Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript       exon    9488749 9488839 .       +       .       Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript       exon    9488891 9489087 .       +       .       Transcript "B0464.3.1"

and, as the CDS does, each Coding_transcript has a "coding_exon" equivalent

Exons - source = Coding_transcript; feature = coding_exon

CHROMOSOME_III  Coding_transcript       coding_exon     9488634 9488695 .       +       0       Transcript "B0464.3.1" ; CDS "B0464.3"
CHROMOSOME_III  Coding_transcript       coding_exon     9488749 9488839 .       +       1       Transcript "B0464.3.1" ; CDS "B0464.3"
CHROMOSOME_III  Coding_transcript       coding_exon     9488891 9488986 .       +       0       Transcript "B0464.3.1" ; CDS "B0464.3"
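
Comparing these lines with the transcript's exon lines shows that each coding_exon is the part of an exon that falls inside the CDS start and end coordinates. A minimal sketch of that clipping, using the made-up name `clip_exons_to_cds` and inclusive `(start, end)` pairs:

```python
def clip_exons_to_cds(exons, cds_start, cds_end):
    """Return the portion of each exon that lies within the CDS bounds."""
    coding = []
    for start, end in sorted(exons):
        lo = max(start, cds_start)
        hi = min(end, cds_end)
        if lo <= hi:  # exons lying entirely in a UTR are dropped
            coding.append((lo, hi))
    return coding


# Exons of transcript B0464.3.1 and the extent of CDS B0464.3
transcript_exons = [(9488630, 9488695), (9488749, 9488839), (9488891, 9489087)]
print(clip_exons_to_cds(transcript_exons, 9488634, 9488986))
# [(9488634, 9488695), (9488749, 9488839), (9488891, 9488986)]
```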

Introns - source = Coding_transcript; feature = intron

CHROMOSOME_III  Coding_transcript       intron  9488696 9488748 .       +       .       Transcript "B0464.3.1" ; Confirmed_EST FM248941 ; Confirmed_EST OSTF051A3_1
CHROMOSOME_III  Coding_transcript       intron  9488840 9488890 .       +       .       Transcript "B0464.3.1" ; Confirmed_EST yk1241c10.5 ; Confirmed_EST OSTF051A3_1

UTRs

  • source = Coding_transcript; feature = five_prime_UTR
  • source = Coding_transcript; feature = three_prime_UTR

Each non-coding exon, or non-coding part of an exon, is represented as a single line:

CHROMOSOME_III  Coding_transcript       five_prime_UTR  9488630 9488633 .       +       .       Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript       three_prime_UTR 9488987 9489087 .       +       .       Transcript "B0464.3.1"
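
The UTR lines cover whatever is left of the transcript's exons outside the CDS. The sketch below derives them under that assumption; the name `utr_segments` and the returned dictionary are illustrative only. On the minus strand the lower-coordinate side is the 3' end, so the two lists are swapped.

```python
def utr_segments(exons, cds_start, cds_end, strand="+"):
    """Split the non-coding parts of a transcript's exons into 5' and 3' UTRs."""
    low_side, high_side = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Part of the exon (or all of it) lies before the CDS.
            low_side.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Part of the exon (or all of it) lies after the CDS.
            high_side.append((max(start, cds_end + 1), end))
    if strand == "-":
        low_side, high_side = high_side, low_side
    return {"five_prime_UTR": low_side, "three_prime_UTR": high_side}


transcript_exons = [(9488630, 9488695), (9488749, 9488839), (9488891, 9489087)]
print(utr_segments(transcript_exons, 9488634, 9488986))
# {'five_prime_UTR': [(9488630, 9488633)], 'three_prime_UTR': [(9488987, 9489087)]}
```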