GFF source methods

From WormBaseWiki

Latest revision as of 23:15, 13 August 2010

GFF source and feature

GFF2 description at the Sanger Institute

In the WormBase GFF files, genes are represented in several ways, each specified by a different source and feature (the second and third columns).

Gene spans

This is the largest extent of a gene's transcripts, from the beginning of the 5' UTR of the most 5' transcript to the end of the 3' UTR of the most 3' transcript. Each gene is represented as a single line.

  • source = gene; feature = gene.

e.g. nlp-36 in WormBase

CHROMOSOME_III  gene gene    9488630 9489091 .  +   .  Gene "WBGene00007185" ; Position "0.567795" ; Locus "nlp-36"
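
As a rough illustration of how the columns on such a line might be read programmatically, the following Python sketch splits one line into the nine GFF2 columns and the trailing tag/value attributes. The names `GffRecord` and `parse_gff2_line` are hypothetical and not part of any WormBase tool, and the parser assumes that attribute values never contain a semicolon.

```python
from typing import List, NamedTuple, Tuple


class GffRecord(NamedTuple):
    seqname: str
    source: str    # second column, e.g. "gene" or "curated"
    feature: str   # third column, e.g. "gene" or "CDS"
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: List[Tuple[str, str]]  # (tag, value) pairs; tags may repeat


def parse_gff2_line(line: str) -> GffRecord:
    # Real GFF2 files are tab-delimited; splitting on any whitespace also
    # copes with the space-aligned examples shown on this page.
    cols = line.strip().split(None, 8)
    if len(cols) < 8:
        raise ValueError("expected at least 8 columns, got %d" % len(cols))
    attributes = []
    if len(cols) == 9:
        # Assumption: no attribute value contains a ';' character.
        for part in cols[8].split(";"):
            pieces = part.strip().split(None, 1)
            if not pieces:
                continue
            value = pieces[1].strip().strip('"') if len(pieces) > 1 else ""
            attributes.append((pieces[0], value))
    return GffRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], cols[7], attributes)


EXAMPLE = ('CHROMOSOME_III  gene gene    9488630 9489091 .  +   .  '
           'Gene "WBGene00007185" ; Position "0.567795" ; Locus "nlp-36"')

record = parse_gff2_line(EXAMPLE)
print(record.source, record.feature, record.start, record.end)
```

Keeping the attributes as a list of pairs rather than a dictionary preserves repeated tags, such as the multiple Confirmed_EST entries on intron lines.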

CDS

A CDS is the coding sequence of a gene from the start codon to the stop codon (so it does not include UTRs). A gene may have one or more CDSs. Each CDS is represented as a single line describing the start and end coordinates.

  • source = curated; feature = CDS.
CHROMOSOME_III  curated   CDS     9488634 9488986 .  +  .  CDS "B0464.3" ; WormPep "CE:CE00017" ;  Locus "nlp-36" ;  Status "Confirmed" ;  Gene "WBGene00007185" ;

The individual exons and introns are each described on a single line. For the example three-exon gene:

  • source = curated; feature = exon;
CHROMOSOME_III  curated exon    9488634 9488695 .       +       .       CDS "B0464.3"
CHROMOSOME_III  curated exon    9488749 9488839 .       +       .       CDS "B0464.3"
CHROMOSOME_III  curated exon    9488891 9488986 .       +       .       CDS "B0464.3"
  • source = curated; feature = intron;
CHROMOSOME_III  curated intron  9488696 9488748 .       +       .       CDS "B0464.3" ; Confirmed_EST FM248941 ; Confirmed_EST OSTF051A3_1
CHROMOSOME_III  curated intron  9488840 9488890 .       +       .       CDS "B0464.3" ; Confirmed_EST yk1241c10.5 ; Confirmed_EST OSTF051A3_1
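
In the lines above, each intron occupies exactly the gap between two consecutive exons. A minimal Python sketch of that relationship follows; the function name `introns_between` is hypothetical, and the code assumes 1-based inclusive coordinates and a sorted, non-overlapping exon list.

```python
def introns_between(exons):
    """Return the (start, end) gaps between consecutive exons.

    Assumes 1-based inclusive coordinates, with exons sorted by start
    position and not overlapping.
    """
    introns = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        # The intron begins one base after the previous exon ends and
        # finishes one base before the next exon starts.
        introns.append((prev_end + 1, next_start - 1))
    return introns


# Curated exon coordinates for CDS B0464.3, copied from the lines above.
CURATED_EXONS = [(9488634, 9488695), (9488749, 9488839), (9488891, 9488986)]

print(introns_between(CURATED_EXONS))
# [(9488696, 9488748), (9488840, 9488890)]
```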

Exons are also represented with their coding phase, given in the eighth column.

  • source = curated; feature = coding_exon (note: the coordinates are the same as for the exon lines above)
CHROMOSOME_III  curated coding_exon     9488634 9488695 .       +       0       CDS "B0464.3"
CHROMOSOME_III  curated coding_exon     9488749 9488839 .       +       1       CDS "B0464.3"
CHROMOSOME_III  curated coding_exon     9488891 9488986 .       +       0       CDS "B0464.3"
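
This page does not say how the phase values are derived. The Python sketch below assumes the usual GFF convention, in which the phase is the number of bases at the start of the exon that complete a codon begun in the previous exon; under that assumption it reproduces the 0, 1, 0 values shown above. The function name `coding_phases` is hypothetical.

```python
def coding_phases(exons):
    """Return the phase of each coding exon, given in transcription order.

    Assumes 1-based inclusive coordinates and that the first exon starts
    on the first base of the start codon.
    """
    phases = []
    phase = 0  # the first coding exon begins on a codon boundary
    for start, end in exons:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after the last complete codon are completed by
        # the first bases of the next exon.
        phase = (3 - (length - phase) % 3) % 3
    return phases


# coding_exon coordinates for CDS B0464.3, copied from the lines above.
CODING_EXONS = [(9488634, 9488695), (9488749, 9488839), (9488891, 9488986)]

print(coding_phases(CODING_EXONS))
# [0, 1, 0]
```

For a gene on the minus strand, the exons would need to be passed in transcription order, that is, from the highest coordinate to the lowest.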

Coding_transcript

Each CDS can have one or more Coding_transcripts. Where a CDS has multiple transcripts, they vary only in the UTRs.

Coding_transcripts are the best equivalent to a full-length mRNA that we can build based on available evidence. They go from the transcription start site (e.g. SL1) to the polyA site.

The full extent of a Coding_transcript is defined as a single line. This CDS has two Coding_transcripts:

  • source = Coding_transcript; feature = protein_coding_primary_transcript
CHROMOSOME_III  Coding_transcript    protein_coding_primary_transcript   9488630 9489087 .   +   .   Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript    protein_coding_primary_transcript   9488632 9489091 .   +   .   Transcript "B0464.3.2"

Compare these coordinates with the full gene span (above), which is larger than either transcript: it extends from 9488630 to 9489091, the outer extremities of the two Coding_transcripts.
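
The relationship between the transcripts and the gene span can be checked with a short Python sketch. The function name `gene_span` is hypothetical; it simply takes the smallest start and the largest end over all transcripts of a gene.

```python
def gene_span(transcripts):
    """Return (start, end) covering every transcript of one gene.

    `transcripts` maps a transcript name to its (start, end) coordinates.
    """
    starts = [start for start, _ in transcripts.values()]
    ends = [end for _, end in transcripts.values()]
    return min(starts), max(ends)


# Coordinates copied from the protein_coding_primary_transcript lines above.
TRANSCRIPTS = {
    "B0464.3.1": (9488630, 9489087),
    "B0464.3.2": (9488632, 9489091),
}

print(gene_span(TRANSCRIPTS))
# (9488630, 9489091)
```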

Each Coding_transcript is composed of the following feature types: coding_exons, introns, five_prime_UTR and three_prime_UTR.

Exons - source = Coding_transcript; feature = exon

CHROMOSOME_III  Coding_transcript       exon    9488630 9488695 .       +       .       Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript       exon    9488749 9488839 .       +       .       Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript       exon    9488891 9489087 .       +       .       Transcript "B0464.3.1"

As with the CDS, there is a "coding_exon" equivalent:

Exons - source = Coding_transcript; feature = coding_exon

CHROMOSOME_III  Coding_transcript       coding_exon     9488634 9488695 .       +       0       Transcript "B0464.3.1" ; CDS "B0464.3"
CHROMOSOME_III  Coding_transcript       coding_exon     9488749 9488839 .       +       1       Transcript "B0464.3.1" ; CDS "B0464.3"
CHROMOSOME_III  Coding_transcript       coding_exon     9488891 9488986 .       +       0       Transcript "B0464.3.1" ; CDS "B0464.3"

Introns - source = Coding_transcript; feature = intron

CHROMOSOME_III  Coding_transcript       intron  9488696 9488748 .       +       .       Transcript "B0464.3.1" ; Confirmed_EST FM248941 ; Confirmed_EST OSTF051A3_1
CHROMOSOME_III  Coding_transcript       intron  9488840 9488890 .       +       .       Transcript "B0464.3.1" ; Confirmed_EST yk1241c10.5 ; Confirmed_EST OSTF051A3_1

UTRs

  • source = Coding_transcript; feature = five_prime_UTR
  • source = Coding_transcript; feature = three_prime_UTR

Each non-coding exon segment is represented as a single line:

CHROMOSOME_III  Coding_transcript       five_prime_UTR  9488630 9488633 .       +       .       Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript       three_prime_UTR 9488987 9489087 .       +       .       Transcript "B0464.3.1"
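
The UTR coordinates above are the parts of the Coding_transcript exons that fall outside the CDS. A minimal Python sketch of that subtraction follows. The function name `utr_segments` is hypothetical, and the code assumes a plus-strand gene, as in this example; on the minus strand the five-prime and three-prime labels would be swapped.

```python
def utr_segments(exons, cds_start, cds_end):
    """Split transcript exons into 5' and 3' UTR segments (plus strand only).

    Coordinates are 1-based and inclusive. Returns two lists of
    (start, end) tuples: (five_prime, three_prime).
    """
    five_prime, three_prime = [], []
    for start, end in exons:
        if start < cds_start:
            # Part of the exon upstream of the start codon.
            five_prime.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Part of the exon downstream of the stop codon.
            three_prime.append((max(start, cds_end + 1), end))
    return five_prime, three_prime


# Exons of transcript B0464.3.1 and the extent of CDS B0464.3, from above.
TRANSCRIPT_EXONS = [(9488630, 9488695), (9488749, 9488839), (9488891, 9489087)]
CDS_START, CDS_END = 9488634, 9488986

five, three = utr_segments(TRANSCRIPT_EXONS, CDS_START, CDS_END)
print(five)   # [(9488630, 9488633)]
print(three)  # [(9488987, 9489087)]
```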