Latest revision as of 20:46, 10 October 2016

gpi File

We will need to create a tab-delimited gpi file with each WormBase release.

Specifications Source

These specifications are based, in part, on the documentation on the GOC's go-annotation github repository:

https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

and also on the content of files submitted here:

http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/gpad-gpi/submission/

File Name

wb_nematode.gpi.gz (all nematode species in WB)

Header

Header content:

!gpi-version: 1.2

!Date: 2016-10-06

!Project_name: WormBase

!Release: WS256

!Contact_email: help@wormbase.org

!URL: http://www.wormbase.org
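
As a minimal sketch of how the header above could be produced for each release, the Python below builds the six header lines and writes them, followed by already-formatted rows, to a gzip-compressed file such as wb_nematode.gpi.gz. The function names `gpi_header` and `write_gpi` are hypothetical and not part of any WormBase pipeline; the release string and date are passed in because they change with every build.

```python
import gzip
from datetime import date
from typing import Iterable


def gpi_header(release: str, generated: date) -> str:
    """Return the header lines from this spec as one newline-terminated string."""
    lines = [
        "!gpi-version: 1.2",
        f"!Date: {generated.isoformat()}",  # ISO date, e.g. 2016-10-06
        "!Project_name: WormBase",
        f"!Release: {release}",             # e.g. WS256
        "!Contact_email: help@wormbase.org",
        "!URL: http://www.wormbase.org",
    ]
    return "\n".join(lines) + "\n"


def write_gpi(path: str, release: str, generated: date, rows: Iterable[str]) -> None:
    """Write the header, then each pre-formatted tab-delimited row, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as handle:
        handle.write(gpi_header(release, generated))
        for row in rows:
            handle.write(row + "\n")
```

A call such as `write_gpi("wb_nematode.gpi.gz", "WS256", date(2016, 10, 6), rows)` would reproduce the header shown above.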

Field Values

column	name	required?	cardinality	GAF column	Example for WormBase (Gene OR Transcript OR Protein)	Tag in AceDB model	Comment
01	DB	required	1	1	WB	n/a	n/a
02	DB_Object_ID	required	1	2/17	WBGene00006796 OR T28F12.2a OR WP:CE21219	Gene OR Transcript OR Protein ID	n/a
03	DB_Object_Symbol	required	1	3	unc-62 OR unc-62 OR UNC-62	Public_name in ?Gene model or capitalized version of Public_name in ?Gene model	n/a
04	DB_Object_Name	optional	0 or 1	10	n/a	n/a	n/a
05	DB_Object_Synonym	optional	0 or greater	11	ceh-25 OR ceh-25 OR CEH-25 (showing one, but we would include all Other_name entries)	Other_name in ?Gene model or capitalized version of Other_name in ?Gene model	This is a little different from what we currently put in the GAF, but I think the slightly different purpose of this file (Noctua, GO website searches, text mining) makes it beneficial to include the Other_name entries. When there are multiple entries, they should be pipe-separated.
06	DB_Object_Type	required	1	12	gene OR transcript OR protein	See Comment	For transcript, could we use the value in the Method tag? Do all of the values in this tag correspond to SO terms? It seems the CV of SO terms would be good to use here, if possible.
07	Taxon	required	1	13	taxon:6239	n/a	NCBI taxonomy ID for corresponding species of entity in Column 2.
08	Parent_Object_ID	optional	0 or 1	n/a	WB:WBGene00006796	Gene ID	The WB gene ID will be the parent ID for each transcript and protein entry. For gene entries, this field will be blank.
09	DB_Xref(s)	optional	0 or greater	n/a	UniProtKB:Q9N5D6 (for gene) OR UniProtKB:Q9N5D6-1 (for protein)	For gene and transcript entries, see comment.	WBGene entries will contain a DB_Xref to the UniProtKB Reference Proteome accession, where available. The Reference Proteome accessions are available in this file: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/WORM/goa_worm.gpi.gz. Unfortunately, the goa_worm file does not currently cross-reference WBGene IDs. Transcript entries will contain a DB_Xref to RNAcentral accessions, where available. Protein entries will contain a DB_Xref to the UniProtKB accession; this accession will be either an isoform accession of a reviewed Swiss-Prot entry or a TrEMBL accession (e.g., Q9N5D6-1 or A0A0K3ASC5).
10	Gene_Product_Properties	optional	0 or greater	n/a	n/a	n/a	Right now, I can't think of anything we'd need to put in this field, but that could change in future iterations.
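
The column layout above can be sketched as a small Python record that serializes itself into one tab-delimited line. This is an illustration only: the class name `GpiEntry` and its attribute names are hypothetical, and joining DB_Xref(s) and Gene_Product_Properties with pipes is an assumption carried over from the rule stated for DB_Object_Synonym.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GpiEntry:
    """One row of the gpi file; attribute names are illustrative, not from AceDB."""
    db_object_id: str                                    # column 02, required
    symbol: str                                          # column 03, required
    object_type: str                                     # column 06, required
    taxon: str                                           # column 07, required
    db: str = "WB"                                       # column 01, required
    name: str = ""                                       # column 04, optional
    synonyms: List[str] = field(default_factory=list)    # column 05, 0 or greater
    parent_id: str = ""                                  # column 08, blank for genes
    xrefs: List[str] = field(default_factory=list)       # column 09, 0 or greater
    properties: List[str] = field(default_factory=list)  # column 10, 0 or greater

    def to_line(self) -> str:
        """Serialize the ten columns in order, pipe-joining multi-valued fields."""
        required = {
            "DB": self.db,
            "DB_Object_ID": self.db_object_id,
            "DB_Object_Symbol": self.symbol,
            "DB_Object_Type": self.object_type,
            "Taxon": self.taxon,
        }
        for label, value in required.items():
            if not value:
                raise ValueError(f"required column {label} is empty")
        columns = [
            self.db, self.db_object_id, self.symbol, self.name,
            "|".join(self.synonyms), self.object_type, self.taxon,
            self.parent_id, "|".join(self.xrefs), "|".join(self.properties),
        ]
        return "\t".join(columns)
```

Optional columns that have no value are written as empty strings, so every line keeps all ten tab-separated fields.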

Example Entries

DB	DB_Object_ID	DB_Object_Symbol	DB_Object_Synonym	DB_Object_Type	Taxon	Parent_Object_ID	DB_Xref(s)
WB	WBGene00006796	unc-62	let-318|nob-5|ceh-25|CELE_T28F12.2	gene	taxon:6239		UniProtKB:Q9N5D6
WB	T28F12.2a	unc-62	let-318|nob-5|ceh-25|CELE_T28F12.2	coding_transcript	taxon:6239	WB:WBGene00006796
WB	WP:CE21219	UNC-62	LET-318|NOB-5|CEH-25|CELE_T28F12.2	protein	taxon:6239	WB:WBGene00006796	UniProtKB:Q9N5D6-1
WB	WP:CE50189	UNC-62	LET-318|NOB-5|CEH-25|CELE_T28F12.2	protein	taxon:6239	WB:WBGene00006796	UniProtKB:A0A0K3ASC5
WB	WBGene00000829	ctb-1	CYTB	gene	taxon:6239		UniProtKB:P24890
WB	MTCE.21	ctb-1	CYTB	coding_transcript	taxon:6239	WB:WBGene00000829
WB	WP:CE35348	CTB-1	CYTB	protein	taxon:6239	WB:WBGene00000829
WB	WBGene00002993	lin-4	CELE_F59G1.6	gene	taxon:6239
WB	F59G1.6	lin-4	CELE_F59G1.6	pre_miRNA	taxon:6239	WB:WBGene00002993	RNAcentral:URS00001E2999
WB	F59G1.6a	lin-4	CELE_F59G1.6	miRNA	taxon:6239	WB:WBGene00002993	RNAcentral:URS0000278C03
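
To check entries like these against the parent rule in the Field Values table (gene entries leave Parent_Object_ID blank; transcript and protein entries point to their WB gene), a reader could use something like the Python sketch below. It assumes the full ten-column file layout rather than the abbreviated table shown here, and it pads rows whose trailing empty columns were trimmed, which is a convenience of this sketch and not something the spec requires. The names `parse_gpi_line` and `check_parent` are hypothetical.

```python
from typing import Dict, List

GPI_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Parent_Object_ID", "DB_Xref(s)", "Gene_Product_Properties",
]


def parse_gpi_line(line: str) -> Dict[str, str]:
    """Split one data line into a column-name -> value mapping."""
    fields = line.rstrip("\n").split("\t")
    # Assumption of this sketch: pad rows that lost trailing empty columns.
    fields += [""] * (len(GPI_COLUMNS) - len(fields))
    if len(fields) != len(GPI_COLUMNS):
        raise ValueError(f"expected {len(GPI_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(GPI_COLUMNS, fields))


def check_parent(entry: Dict[str, str]) -> List[str]:
    """Return a list of problems with Parent_Object_ID (empty if none)."""
    problems = []
    is_gene = entry["DB_Object_Type"] == "gene"
    parent = entry["Parent_Object_ID"]
    if is_gene and parent:
        problems.append("gene entries should leave Parent_Object_ID blank")
    if not is_gene and not parent.startswith("WB:WBGene"):
        problems.append("transcript and protein entries need a WB gene ID as parent")
    return problems
```

Treating every DB_Object_Type other than gene as a child entry keeps the check independent of which transcript type values (coding_transcript, pre_miRNA, miRNA, and so on) end up being used.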

Difference between revisions of "Specifications for WB gpi file"


@@ Line 1: / Line 1: @@
-These specifications are based on the documentation on the GO wiki:
+=gpi File=
-http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format#Final_format_.2809_Jan_2013.29_2
+We will need to create a tab-delimited gpi file with each WormBase release.
-We will need to create a tab-delimited file with each WormBase release using the information in AceDB and the xrefs file generated for C. elegans that is available on the ftp site:
+==Specifications Source==
-ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS234/species/c_elegans/
+These specifications are based, in part, on the documentation on the GOC's go-annotation github repository:
-The file is named according to the release, e.g., c_elegans.WS234.xrefs.txt.gz
+https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md
-Note that the xrefs file we use will have to be updated with each new WS release.
+and also on the content of files submitted here:
-(Unfortunately there is no one AceDB object or file that has all of the information we need.)
+http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/gpad-gpi/submission/
-Output will be sorted according to ascending WBGene ID and will contain a two-line header:
+==File Name==
-!gpi-version: 1.1
+wb_nematode.gpi.gz (all nematode species in WB)
-!namespace: WB
+==Header==
+Header content:
-For CCC curation form:
+!gpi-version: 1.2
-) For each gene name identified in a Textpresso sentence, we will look in Column 2 and Column 4 of gpi file for an exact match.
+!Date: 2016-10-06
-) The form will then display for each match, in the left-most box where gene name information is displayed, the gene name mapped to the parent object id in Column 7 and the UniProtKB: ID in Column 8, e.g. ace-1:WB:WBGene00000035:UniProtKB:P38433.  -- We're making these mappings static at the moment we get the sentence from textpresso, dealing with changing mappings between gene names and DBIDs or UniProt IDs is beyond the scope of what we want to deal with.  The sentence can get re-textpressoed in the future with a then-current mapping and re-curated (K&J) Always show all sentences from textpresso even if no genes map to a DBID and/nor a UniProt ID, if that means there are no genes at all in the gene select box, or there's curation to a gene not in the list, then add a comment to the sentence in curation, and add it to ptgo directly.  (K&J)
+!Project_name: WormBase
-) Note that the xrefs file is not sorted in order of WBGene ID, but the gpi file should be.
+!Release: WS256
-) Where it is stated that a column can have one or greater values, e.g. 'with', DB_Object_Synonym(s), DB_Xref(s), the values should be given as a pipe-separated list.
+!Contact_email: help@wormbase.org
-From Table Maker:
-Col 1: Genes_elegans
-Col 2: CGC_name
-Col 3: Sequence_name
-Col 4: Other_name
-Col 5: Live - this value is fixed, i.e. Table Maker search is restricted to only returning information for live elegans genes
-Col 6: Corresponding_pseudogene
-Filter out (omit, remove, skip) genes that have a value for Corresponding_pseudogene; these genes will not be included in the final gpi output file, since they don't encode functional proteins or ncRNA molecules.
-Use Table Maker file and xrefs.txt file for creating gpi file as outlined below.
+!URL: http://www.wormbase.org
+==Field Values==
@@ Line 60: / Line 43: @@
 ! cardinality
 ! GAF column
-! Example for UniProt
+! Example for WormBase (Gene OR Transcript OR Protein)
-! Example for WormBase
+! Tag in AceDB model
-! Tag in AceDB ?Gene model (Column in Table Maker (TM) output)
+! Comment
-! Column in xrefs file
+|-
-! Value if not in AceDB ?Gene model or xrefs file
+| 01 || DB || required || 1 || 1 || WB || n/a || n/a
+|-
+| 02 || DB_Object_ID || required || 1 || 2/17 || WBGene00006796 OR T28F12.2a OR WP:CE21219 || Gene OR Transcript OR Protein ID || n/a
+|-
+| 03 || DB_Object_Symbol || required || 1 || 3 || unc-62 OR unc-62 OR UNC-62 || Public_name in ?Gene model or capitalized version of Public_name in ?Gene model || n/a
+|-
+| 04 || DB_Object_Name || optional || 0 or 1 || 10 || n/a || n/a || n/a
+|-
+| 05 || DB_Object_Synonym || optional || 0 or greater || 11 || ceh-25 OR ceh-25 OR CEH-25 (showing one, but we would include all Other_name entries) || Other_name in ?Gene model or capitalized version of Other_name in ?Gene model || This is a little different from what we currently put in the GAF, but I think the slightly different purpose of this file (Noctua, GO website searches, text mining) makes it beneficial to include the Other_name entries.  When there are multiple entries, they should be pipe-separated.
+|-
+| 06 || DB_Object_Type || required || 1 || 12 || gene OR transcript OR protein || See Comment || For transcript, could we use the value in the Method tag?  Do all of the values in this tag correspond to SO terms?  It seems the CV of SO terms would be good to use here, if possible.
+|-
+| 07 || Taxon || required || 1 || 13 || taxon:6239 || n/a || NCBI taxonomy ID for corresponding species of entity in Column 2.
+|-
+| 08 || Parent_Object_ID || optional || 0 or 1 || n/a || WB:WBGene00006796 || Gene ID || The WB gene ID will be the parent ID for each transcript and protein entry.  For gene entries, this field will be blank.
+|-
+| 09 || DB_Xref(s) || optional || 0 or greater || n/a || UniProtKB:Q9N5D6 (for gene) OR UniProtKB:Q9N5D6-1 (for protein) || For gene and transcript entries, see comment. || '''WBGene''' entries will contain a DB_Xref to the UniProtKB Reference Proteome accession, where available.  The Reference Proteome accessions are available in this file:  ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/WORM/goa_worm.gpi.gz.  Unfortunately, the goa_worm file does not currently xref to WBGene IDs.
+'''Transcript''' entries will contain a DB_xref to RNAcentral accessions, where available.
+'''Protein''' entries will contain a DB_xref to the UniProtKB accession; this accession will be an isoform accession of a reviewed Swiss-Prot entry or a TrEMBL accession (e.g., Q9N5D6-1 or A0A0K3ASC5).
+|-
+| 10 || Gene_Product_Properties || optional || 0 or greater || n/a || n/a || n/a || Right now, I can't think of anything we'd need to put in this field, but that could change in future iterations.
+|-
+|}
+==Example Entries==
+{| cellspacing="2" border="1"
+|-
+! DB
+! DB_Object_ID
+! DB_Object_Symbol
+! DB_Object_Name
+! DB_Object_Synonym
+! DB_Object_Type
+! Taxon
+! Parent_Object_ID
+! DB_Xref(s)
+! Gene_Product_Properties
+|-
+|WB
+|WBGene00006796
+|unc-62
+|
+|let-318 pipe nob-5 pipe ceh-25 pipe CELE_T28F12.2
+|gene
+|taxon:6239
+|
+|UniProtKB:Q9N5D6
+|
 |-
-| 01 || DB_Object_ID || required || 1 || 2/17 || Q4VCS5-1 || WBGene00000035 || Genes_elegans, Column 1 of TM output || n/a || n/a
+|WB
+|T28F12.2a
+|unc-62
+|
+|let-318 pipe nob-5 pipe ceh-25 pipe CELE_T28F12.2
+|coding_transcript
+|taxon:6239
+|WB:WBGene00006796
+|
+|
 |-
-| 02 || DB_Object_Symbol || required || 1 || 3 || AMOT || ace-1 || CGC_name, Column 2 of TM output; if no CGC_name then Sequence_name, Column 3 of TM output || n/a || n/a
+|WB
+|WP:CE21219
+|UNC-62
+|
+|LET-318 pipe NOB-5 pipe CEH-25 pipe CELE_T28F12.2
+|protein
+|taxon:6239
+|WB:WBGene00006796
+|UniProtKB:Q9N5D6-1
+|
 |-
-| 03 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin || n/a || n/a || n/a || n/a
+|WB
+|WP:CE50189
+|UNC-62
+|
+|LET-318 pipe NOB-5 pipe CEH-25 pipe CELE_T28F12.2
+|protein
+|taxon:6239
+|WB:WBGene00006796
+|UniProtKB:A0A0K3ASC5
+|
 |-
-| 04 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT_HUMAN|KIAA1071|AMOT || ACE1 || If CGC_name exists, then Sequence_name, Column 3 of TM output, AND Other_name, Column 4 of TM output; If no CGC_name, but Sequence_name, then Other_name, Column 4 of TM output || For all, also check Column 4 of xrefs.txt, if it exists, strip number after second '.' and add all resulting unique values; for all also check Column 5 of xrefs.txt, add unique values prefaced with 'WP:' || n/a
+|WB
+|WBGene00000829
+|ctb-1
+|
+|CYTB
+|gene
+|taxon:6239
+|
+|UniProtKB:P24890
+|
+|
 |-
-| 05 || DB_Object_Type || required || 1 || 12 || protein || gene || n/a || n/a || gene
+|WB
+|MTCE.21
+|ctb-1
+|
+|CYTB
+|coding_transcript
+|taxon:6239
+|WB:WBGene00000829
+|
+|
 |-
-| 06 || Taxon || required || 1 || 13 || taxon:9606 || taxon:6239 || n/a || n/a || taxon:6239
+|WB
+|WP:CE35348
+|CTB-1
+|
+|CYTB
+|protein
+|taxon:6239
+|WB:WBGene00000829
+|
+|
 |-
-| 07 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:Q4VCS5 || WB:WBGene00000035 || WBGene ID from Column 1 of TM output prefaced with 'WB:' || n/a || n/a
+|WB
+|WBGene00002993
+|lin-4
+|
+|CELE_F59G1.6
+|gene
+|taxon:6239
+|
+|
+|
 |-
-| 08 || DB_Xref(s) || optional || 0 or greater || - || - || UniProtKB:P38433 || n/a || for all, add unique values from xrefs.txt as follows: Column 7 of xrefs.txt prefaced with 'CCD:' and Column 8 of xrefs.txt prefaced with "UniProtKB:" || n/a
+|WB
+|F59G1.6
+|lin-4
+|
+|CELE_F59G1.6
+|pre_miRNA
+|taxon:6239
+|WB:WBGene00002993
+|RNAcentral:URS00001E2999
+|
 |-
-| 09 || Gene_Product_Properties || optional || 0 or greater || - || See Note 4 below || n/a || n/a || n/a || n/a
+|WB
+|F59G1.6a
+|lin-4
+|
+|CELE_F59G1.6
+|miRNA
+|taxon:6239
+|WB:WBGene00002993
+|RNAcentral:URS0000278C03
+|
 |-
 |}