Difference between revisions of "Specifications for WB gpi file"

From WormBaseWiki
 
(34 intermediate revisions by the same user not shown)
Line 22: Line 22:
  
 
!gpi-version: 1.2
 
!gpi-version: 1.2
 +
 +
!Date: 2016-10-06
  
 
!Project_name: WormBase
 
!Project_name: WormBase
  
!!WB_release: WS256
+
!Release: WS256
  
!Contact Email: help@wormbase.org
+
!Contact_email: help@wormbase.org
  
 
!URL: http://www.wormbase.org
 
!URL: http://www.wormbase.org
 
!Date: 20161006
 
  
 
==Field Values==
 
==Field Values==
Line 45: Line 45:
 
! Example for WormBase (Gene OR Transcript OR Protein)
 
! Example for WormBase (Gene OR Transcript OR Protein)
 
! Tag in AceDB model  
 
! Tag in AceDB model  
! Column in UniProtKB Reference Proteome File
 
 
! Comment
 
! Comment
 
 
|-
 
|-
| 01 || DB || required || 1 || 1 || WB || n/a || n/a || n/a
+
| 01 || DB || required || 1 || 1 || WB || n/a || n/a
 
|-
 
|-
| 02 || DB_Object_ID || required || 1 || 2/17 || WBGene00006796 OR F28F12.2a OR WP:CE21219 || Gene OR Transcript OR Protein ID || n/a ||| n/a
+
| 02 || DB_Object_ID || required || 1 || 2/17 || WBGene00006796 OR F28F12.2a OR WP:CE21219 || Gene OR Transcript OR Protein ID || n/a
 
|-
 
|-
| 03 || DB_Object_Symbol || required || 1 || 3 || unc-62 OR unc-62 OR UNC-62 || Public_name in ?Gene model or capitalized version of Public_name in ?Gene model || n/a || n/a
+
| 03 || DB_Object_Symbol || required || 1 || 3 || unc-62 OR unc-62 OR UNC-62 || Public_name in ?Gene model or capitalized version of Public_name in ?Gene model || n/a
 
|-
 
|-
| 04 || DB_Object_Name || optional || 0 or 1 || 10 || n/a || n/a || n/a || n/a
+
| 04 || DB_Object_Name || optional || 0 or 1 || 10 || n/a || n/a || n/a
 
|-
 
|-
| 05 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || Other_name in ?Gene model or capitalized version of Other_name in ?Gene model || ceh-25 OR ceh-25 OR CEH-25 (showing one, but we would include all Other_name entries) || n/a || This is a little different from what we currently put in the GAF, but I think the slightly different purpose of this file (Noctua, GO website searches, text mining) makes it beneficial to include the Other_name entries.
+
| 05 || DB_Object_Synonym || optional || 0 or greater || 11 || Other_name in ?Gene model or capitalized version of Other_name in ?Gene model || ceh-25 OR ceh-25 OR CEH-25 (showing one, but we would include all Other_name entries) || This is a little different from what we currently put in the GAF, but I think the slightly different purpose of this file (Noctua, GO website searches, text mining) makes it beneficial to include the Other_name entries.  When there are multiple entries, they should be pipe-separated.
 
|-
 
|-
| 06 || DB_Object_Type || required || 1 || 12 || gene OR transcript OR protein || See Comment || n/a || For transcripts, could we use the Method tag?  Do all of the values in this tag correspond to SO terms?  It seems the CV of SO terms would be good to use here, if possible.
+
| 06 || DB_Object_Type || required || 1 || 12 || gene OR transcript OR protein || See Comment || For transcript, could we use the value in the Method tag?  Do all of the values in this tag correspond to SO terms?  It seems the CV of SO terms would be good to use here, if possible.
 
|-
 
|-
| 07 || Taxon || required || 1 || 13 || taxon:6239 || n/a || n/a || NCBI taxonomy ID for corresponding species of entity in Column 2.
+
| 07 || Taxon || required || 1 || 13 || taxon:6239 || n/a || NCBI taxonomy ID for corresponding species of entity in Column 2.
 
|-
 
|-
| 08 || Parent_Object_ID || optional || 0 or 1 || n/a || WB:WBGene00006796 || Gene ID || n/a || The WB gene ID will be the parent ID for each entry.
+
| 08 || Parent_Object_ID || optional || 0 or 1 || n/a || WB:WBGene00006796 || Gene ID || The WB gene ID will be the parent ID for each transcript and protein entry.  For gene entries, this field will be blank.
 
|-
 
|-
| 09 || DB_Xref(s) || optional || 0 or greater || n/a || UniProtKB: || UniProtKB: || n/a || '''WBGene''' entries will contain a DB_Xref to the UniProtKB Reference Proteome accession, where available.  The Reference Proteome accessions are available in this file:  ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/WORM/goa_worm.gpi.gz.  Unfortunately, the goa_worm file does not currently xref to WBGene IDs.   
+
| 09 || DB_Xref(s) || optional || 0 or greater || n/a || UniProtKB:Q9N5D6 (for gene) OR UniProtKB:Q9N5D6-1 (for protein) || For gene and transcript entries, see comment. || '''WBGene''' entries will contain a DB_Xref to the UniProtKB Reference Proteome accession, where available.  The Reference Proteome accessions are available in this file:  ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/WORM/goa_worm.gpi.gz.  Unfortunately, the goa_worm file does not currently xref to WBGene IDs.   
 
'''Transcript''' entries will contain a DB_xref to RNAcentral accessions, where available.   
 
'''Transcript''' entries will contain a DB_xref to RNAcentral accessions, where available.   
  
'''Protein''' entries will contain a DB_xref to the UniProtKB accession; this accession may be a reviewed Swiss-Prot accession or a TrEMBL accession (e.g., Q9N5D6-1 or A0A0K3ASC5).
+
'''Protein''' entries will contain a DB_xref to the UniProtKB accession; this accession will be an isoform accession of a reviewed Swiss-Prot entry or a TrEMBL accession (e.g., Q9N5D6-1 or A0A0K3ASC5).
 
|-
 
|-
| 10 || Gene_Product_Properties || optional || 0 or greater || n/a || n/a || n/a || n/a || n/a
+
| 10 || Gene_Product_Properties || optional || 0 or greater || n/a || n/a || n/a || Right now, I can't think of anything we'd need to put in this field, but that could change in future iterations.
 
|-
 
|-
 
|}
 
|}
  
=MGI=
+
==Example Entries==
 
 
These specifications are based on the documentation on the GO wiki:
 
 
 
http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format#Final_format_.2809_Jan_2013.29_2
 
 
 
The input file is a comma-separated text file converted from an Excel spreadsheet sent to us by MGI.
 
 
 
The file is called: mgi_gpi_input.txt  (the file has the carriage return symbols - need to remove?)
 
 
 
and is located on mangolassi here:  /home/acedb/kimberly/ccc/ccc_gpi/mgi/input_files
 
 
 
There are duplicate entries in the mgi_gpi_input file that we will need to remove before we create the gpi file.
 
 
 
 
 
 
 
  
 
{| cellspacing="2" border="1"
 
{| cellspacing="2" border="1"
 
|-
 
|-
! column
+
! DB
! name
+
! DB_Object_ID
! required?
+
! DB_Object_Symbol
! cardinality
+
! DB_Object_Name
! GAF column
+
! DB_Object_Synonym
! Example for MGI
+
! DB_Object_Type
! Column in mgi_gpi_input
+
! Taxon
! Value if not in mgi_gpi_input (i.e. fixed value)
+
! Parent_Object_ID
 +
! DB_Xref(s)
 +
! Gene_Product_Properties
 +
|-
 +
|WB
 +
|WBGene00006796
 +
|unc-62
 +
|
 +
|let-318 pipe nob-5 pipe ceh-25 pipe CELE_T28F12.2
 +
|gene
 +
|taxon:6239
 +
|
 +
|UniProtKB:Q9N5D6
 +
|
 
|-
 
|-
| 01 || DB_Object_ID || required || 1 || 2/17 || 1861229 || Column 1, stripped of 'MGI:' prefix || n/a
+
|WB
 +
|T28F12.2a
 +
|unc-62
 +
|
 +
|let-318 pipe nob-5 pipe ceh-25 pipe CELE_T28F12.2
 +
|coding_transcript
 +
|taxon:6239
 +
|WB:WBGene00006796
 +
|  
 +
|
 
|-
 
|-
| 02 || DB_Object_Symbol || required || 1 || 3 || Adam21 || Column 2, no changes|| n/a
+
|WB
 +
|WP:CE21219
 +
|UNC-62
 +
|
 +
|LET-318 pipe NOB-5 pipe CEH-25 pipe CELE_T28F12.2
 +
|protein
 +
|taxon:6239
 +
|WB:WBGene00006796
 +
|UniProtKB:Q9N5D6-1
 +
|
 
|-
 
|-
| 03 || DB_Object_Name || optional || 0 or 1 || 10 || a disintegrin and metallopeptidase domain 21 || Column 4, no changes || n/a
+
|WB
 +
|WP:CE50189
 +
|UNC-62
 +
|
 +
|LET-318 pipe NOB-5 pipe CEH-25 pipe CELE_T28F12.2
 +
|protein
 +
|taxon:6239
 +
|WB:WBGene00006796
 +
|UniProtKB:A0A0K3ASC5
 +
|
 
|-
 
|-
| 04 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || n/a || n/a || n/a
+
|WB
 +
|WBGene00000829
 +
|ctb-1
 +
|
 +
|CYTB
 +
|gene
 +
|taxon:6239
 +
|
 +
|UniProtKB:P24890
 +
|
 +
|
 
|-
 
|-
| 05 || DB_Object_Type || required || 1 || 12 || n/a || n/a || gene
+
|WB
 +
|MTCE.21
 +
|ctb-1
 +
|
 +
|CYTB
 +
|coding_transcript
 +
|taxon:6239
 +
|WB:WBGene00000829
 +
|
 +
|
 
|-
 
|-
| 06 || Taxon || required || 1 || 13 || n/a || n/a || taxon:10090
+
|WB
 +
|WP:CE35348
 +
|CTB-1
 +
|
 +
|CYTB
 +
|protein
 +
|taxon:6239
 +
|WB:WBGene00000829
 +
|
 +
|
 
|-
 
|-
| 07 || Parent_Object_ID || optional || 0 or 1 || - || MGI:1861229 || Column 1, no changes || n/a
+
|WB
 +
|WBGene00002993
 +
|lin-4
 +
|  
 +
|CELE_F59G1.6
 +
|gene
 +
|taxon:6239
 +
|
 +
|
 +
|
 
|-
 
|-
| 08 || DB_Xref(s) || optional || 0 or greater || - || UniProtKB:Q9JI76 || Column 3, add 'UniProtKB:' as prefix (note there are some entries with no value in Column 3)|| n/a
+
|WB
 +
|F59G1.6
 +
|lin-4
 +
|  
 +
|CELE_F59G1.6
 +
|pre_miRNA
 +
|taxon:6239
 +
|WB:WBGene00002993
 +
|RNAcentral:URS00001E2999
 +
|
 
|-
 
|-
| 09 || Gene_Product_Properties || optional || 0 or greater || - || n/a || n/a || n/a
+
|WB
 +
|F59G1.6a
 +
|lin-4
 +
|
 +
|CELE_F59G1.6
 +
|miRNA
 +
|taxon:6239
 +
|WB:WBGene00002993
 +
|RNAcentral:URS0000278C03
 +
|
 
|-
 
|-
 
|}
 
|}

Latest revision as of 20:46, 10 October 2016

gpi File

We will need to create a tab-delimited gpi file with each WormBase release.

Specifications Source

These specifications are based, in part, on the documentation in the GOC's go-annotation GitHub repository:

https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

and also on the content of files submitted here:

http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/gpad-gpi/submission/

File Name

wb_nematode.gpi.gz (all nematode species in WB)

Header

Header content:

!gpi-version: 1.2

!Date: 2016-10-06

!Project_name: WormBase

!Release: WS256

!Contact_email: help@wormbase.org

!URL: http://www.wormbase.org
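
The header above could be generated per release along these lines. This is only a sketch: the function names `gpi_header` and `write_gpi` and their parameters are hypothetical, and the rows are assumed to be already formatted as tab-delimited strings.

```python
import gzip
from datetime import date


def gpi_header(release: str, run_date: date) -> str:
    """Return the gpi header block for one WormBase release (e.g. "WS256")."""
    lines = [
        "!gpi-version: 1.2",
        f"!Date: {run_date.isoformat()}",  # YYYY-MM-DD, as in the header above
        "!Project_name: WormBase",
        f"!Release: {release}",
        "!Contact_email: help@wormbase.org",
        "!URL: http://www.wormbase.org",
    ]
    return "\n".join(lines) + "\n"


def write_gpi(path: str, release: str, run_date: date, rows: list) -> None:
    """Write the header plus pre-formatted tab-delimited rows to a gzipped file.

    Hypothetical helper: `rows` is assumed to hold one string per gpi entry.
    """
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(gpi_header(release, run_date))
        for row in rows:
            handle.write(row + "\n")
```

For example, `write_gpi("wb_nematode.gpi.gz", "WS256", date(2016, 10, 6), rows)` would produce a file with the header shown above.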

Field Values

{| cellspacing="2" border="1"
|-
! column
! name
! required?
! cardinality
! GAF column
! Example for WormBase (Gene OR Transcript OR Protein)
! Tag in AceDB model
! Comment
|-
| 01 || DB || required || 1 || 1 || WB || n/a || n/a
|-
| 02 || DB_Object_ID || required || 1 || 2/17 || WBGene00006796 OR T28F12.2a OR WP:CE21219 || Gene OR Transcript OR Protein ID || n/a
|-
| 03 || DB_Object_Symbol || required || 1 || 3 || unc-62 OR unc-62 OR UNC-62 || Public_name in ?Gene model or capitalized version of Public_name in ?Gene model || n/a
|-
| 04 || DB_Object_Name || optional || 0 or 1 || 10 || n/a || n/a || n/a
|-
| 05 || DB_Object_Synonym || optional || 0 or greater || 11 || ceh-25 OR ceh-25 OR CEH-25 (showing one, but we would include all Other_name entries) || Other_name in ?Gene model or capitalized version of Other_name in ?Gene model || This is a little different from what we currently put in the GAF, but I think the slightly different purpose of this file (Noctua, GO website searches, text mining) makes it beneficial to include the Other_name entries.  When there are multiple entries, they should be pipe-separated.
|-
| 06 || DB_Object_Type || required || 1 || 12 || gene OR transcript OR protein || See Comment || For transcripts, could we use the value in the Method tag?  Do all of the values in this tag correspond to SO terms?  It seems the CV of SO terms would be good to use here, if possible.
|-
| 07 || Taxon || required || 1 || 13 || taxon:6239 || n/a || NCBI taxonomy ID for the species of the entity in Column 2.
|-
| 08 || Parent_Object_ID || optional || 0 or 1 || n/a || WB:WBGene00006796 || Gene ID || The WB gene ID will be the parent ID for each transcript and protein entry.  For gene entries, this field will be blank.
|-
| 09 || DB_Xref(s) || optional || 0 or greater || n/a || UniProtKB:Q9N5D6 (for gene) OR UniProtKB:Q9N5D6-1 (for protein) || For gene and transcript entries, see comment. || '''WBGene''' entries will contain a DB_Xref to the UniProtKB Reference Proteome accession, where available.  The Reference Proteome accessions are available in this file:  ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/WORM/goa_worm.gpi.gz.  Unfortunately, the goa_worm file does not currently xref to WBGene IDs.

'''Transcript''' entries will contain a DB_Xref to RNAcentral accessions, where available.

'''Protein''' entries will contain a DB_Xref to the UniProtKB accession; this accession will be an isoform accession of a reviewed Swiss-Prot entry or a TrEMBL accession (e.g., Q9N5D6-1 or A0A0K3ASC5).
|-
| 10 || Gene_Product_Properties || optional || 0 or greater || n/a || n/a || n/a || Right now, I can't think of anything we'd need to put in this field, but that could change in future iterations.
|}
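
The column rules above could be applied when formatting one entry roughly as follows. This is a minimal sketch, not the production script: the function name `gpi_row` and its parameters are hypothetical, and joining DB_Xref(s) and Gene_Product_Properties with pipes is an assumption carried over from the rule stated for synonyms.

```python
def gpi_row(object_id, symbol, object_type, taxon="taxon:6239", name="",
            synonyms=(), parent_gene_id="", xrefs=(), properties=()):
    """Format one gpi entry as a 10-column, tab-delimited line (no newline)."""
    # Column 8: blank for gene entries, WB-prefixed gene ID otherwise.
    if object_type == "gene" or not parent_gene_id:
        parent = ""
    else:
        parent = f"WB:{parent_gene_id}"
    columns = [
        "WB",                  # 01 DB
        object_id,             # 02 DB_Object_ID
        symbol,                # 03 DB_Object_Symbol
        name,                  # 04 DB_Object_Name (optional)
        "|".join(synonyms),    # 05 DB_Object_Synonym, pipe-separated
        object_type,           # 06 DB_Object_Type
        taxon,                 # 07 Taxon
        parent,                # 08 Parent_Object_ID
        "|".join(xrefs),       # 09 DB_Xref(s); pipe separator is an assumption
        "|".join(properties),  # 10 Gene_Product_Properties; same assumption
    ]
    return "\t".join(columns)
```

Optional columns with no value are written as empty strings, so every line keeps exactly ten tab-separated fields.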

Example Entries

{| cellspacing="2" border="1"
|-
! DB
! DB_Object_ID
! DB_Object_Symbol
! DB_Object_Name
! DB_Object_Synonym
! DB_Object_Type
! Taxon
! Parent_Object_ID
! DB_Xref(s)
! Gene_Product_Properties
|-
| WB || WBGene00006796 || unc-62 || || let-318 pipe nob-5 pipe ceh-25 pipe CELE_T28F12.2 || gene || taxon:6239 || || UniProtKB:Q9N5D6 ||
|-
| WB || T28F12.2a || unc-62 || || let-318 pipe nob-5 pipe ceh-25 pipe CELE_T28F12.2 || coding_transcript || taxon:6239 || WB:WBGene00006796 || ||
|-
| WB || WP:CE21219 || UNC-62 || || LET-318 pipe NOB-5 pipe CEH-25 pipe CELE_T28F12.2 || protein || taxon:6239 || WB:WBGene00006796 || UniProtKB:Q9N5D6-1 ||
|-
| WB || WP:CE50189 || UNC-62 || || LET-318 pipe NOB-5 pipe CEH-25 pipe CELE_T28F12.2 || protein || taxon:6239 || WB:WBGene00006796 || UniProtKB:A0A0K3ASC5 ||
|-
| WB || WBGene00000829 || ctb-1 || || CYTB || gene || taxon:6239 || || UniProtKB:P24890 ||
|-
| WB || MTCE.21 || ctb-1 || || CYTB || coding_transcript || taxon:6239 || WB:WBGene00000829 || ||
|-
| WB || WP:CE35348 || CTB-1 || || CYTB || protein || taxon:6239 || WB:WBGene00000829 || ||
|-
| WB || WBGene00002993 || lin-4 || || CELE_F59G1.6 || gene || taxon:6239 || || ||
|-
| WB || F59G1.6 || lin-4 || || CELE_F59G1.6 || pre_miRNA || taxon:6239 || WB:WBGene00002993 || RNAcentral:URS00001E2999 ||
|-
| WB || F59G1.6a || lin-4 || || CELE_F59G1.6 || miRNA || taxon:6239 || WB:WBGene00002993 || RNAcentral:URS0000278C03 ||
|}
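
A generated file could be sanity-checked against the example entries with a small parser such as the one below. It is a sketch under two assumptions: that the word "pipe" in the examples stands for a literal `|` in the real file, and that the hypothetical names `GPI_COLUMNS`, `REQUIRED_COLUMNS` and `parse_gpi_line` are acceptable stand-ins for whatever the actual script uses.

```python
GPI_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Parent_Object_ID", "DB_Xref", "Gene_Product_Properties",
]
# Columns marked "required" in the Field Values table.
REQUIRED_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Type", "Taxon",
]


def parse_gpi_line(line):
    """Parse one gpi line into a dict; return None for '!' header lines."""
    if line.startswith("!"):
        return None
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(GPI_COLUMNS):
        raise ValueError(f"expected {len(GPI_COLUMNS)} columns, got {len(values)}")
    record = dict(zip(GPI_COLUMNS, values))
    missing = [col for col in REQUIRED_COLUMNS if not record[col]]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    # Multi-valued columns become lists; an empty column becomes an empty list.
    for col in ("DB_Object_Synonym", "DB_Xref", "Gene_Product_Properties"):
        record[col] = record[col].split("|") if record[col] else []
    return record
```

Rejecting lines with the wrong column count catches the most likely formatting slip, a dropped empty field, before the file is submitted.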