Difference between revisions of "Specifications for WB gpi file"

From WormBaseWiki

Revision as of 15:30, 6 October 2016

gpi File

We will need to create a tab-delimited gpi file with each WormBase release.

Specifications Source

These specifications are based, in part, on the documentation in the GOC's go-annotation GitHub repository:

https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

and also on the content of files submitted here:

http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/gpad-gpi/submission/

File Name

wb_nematode.gpi.gz (all nematode species in WB)

Header

Header content:

!gpi-version: 1.2

!Project_name: WormBase

!WB_release: WS256

!Contact Email: help@wormbase.org

!URL: http://www.wormbase.org

!Date: 20161006
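
Since the header changes with every release, it could be generated rather than typed by hand. The following is a minimal sketch, not the production script: the function name `gpi_header` is hypothetical, and it assumes every header line starts with a single `!`.

```python
from datetime import date


def gpi_header(release: str, when: date) -> str:
    """Return the gpi header block as one newline-terminated string.

    Assumption: each header line starts with a single '!'.
    """
    lines = [
        "!gpi-version: 1.2",
        "!Project_name: WormBase",
        f"!WB_release: {release}",
        "!Contact Email: help@wormbase.org",
        "!URL: http://www.wormbase.org",
        f"!Date: {when:%Y%m%d}",  # date formatted as YYYYMMDD
    ]
    return "\n".join(lines) + "\n"


# Example values taken from the header shown above.
print(gpi_header("WS256", date(2016, 10, 6)), end="")
```

Only the release name and the date vary between releases, so they are the only parameters.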

Field Values

{| class="wikitable"
! column !! name !! required? !! cardinality !! GAF column !! Example for WormBase (Gene OR Transcript OR Protein) !! Tag in AceDB model
|-
| 01 || DB || required || 1 || 1 || WB || n/a
|-
| 02 || DB_Object_ID || required || 1 || 2/17 || WBGene00006796 OR F28F12.2a OR WP:CE21219 || Gene OR Transcript OR Protein ID
|-
| 03 || DB_Object_Symbol || required || 1 || 3 || unc-62 OR unc-62 OR UNC-62 || Public_name in ?Gene model or capitalized version of Public_name in ?Gene model
|-
| 04 || DB_Object_Name || optional || 0 or 1 || 10 || n/a || n/a
|-
| 05 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || ceh-25 OR ceh-25 OR CEH-25 (showing one, but we would include all Other_name entries) || Other_name in ?Gene model or capitalized version of Other_name in ?Gene model
|-
| 06 || DB_Object_Type || required || 1 || 12 || gene OR transcript OR protein || n/a || n/a || gene
|-
| 07 || Taxon || required || 1 || 13 || taxon:6239 || n/a || n/a || taxon:6239
|-
| 08 || Parent_Object_ID || optional || 0 or 1 || - || WB:WBGene00000035 || WBGene ID from Column 1 of TM output prefixed with 'WB:' || n/a || n/a
|-
| 09 || DB_Xref(s) || optional || 0 or greater || - || - || UniProtKB:P38433 || n/a || for all, if value exists, add unique values from xrefs.txt as follows: Column 7 of xrefs.txt prefixed with 'CCD:' and Column 8 of xrefs.txt prefixed with 'UniProtKB:'; if no values, then skip || n/a
|-
| 10 || Gene_Product_Properties || optional || 0 or greater || - || See Note 4 below || n/a || n/a || n/a || n/a
|}
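
To make the column layout concrete, here is a rough sketch of how one WormBase line might be assembled. It is illustrative only. The function names are hypothetical. The sketch assumes xrefs.txt is tab-delimited and that an empty cell or `.` means "no value". It also assumes multi-valued fields are joined with `|`, as in the gpi 1.2 format.

```python
def xrefs_from_row(row):
    """Build DB_Xref values from one tab-split xrefs.txt row.

    Column 7 gets the 'CCD:' prefix and column 8 the 'UniProtKB:' prefix,
    as described in the table. Treating '' and '.' as "no value" is an
    assumption about the file, not something the table states.
    """
    out = []
    for index, prefix in ((6, "CCD:"), (7, "UniProtKB:")):
        value = row[index].strip() if len(row) > index else ""
        if value and value != ".":
            out.append(prefix + value)
    return out


def wb_gpi_line(object_id, symbol, object_type, synonyms=(),
                parent_id="", xrefs=(), taxon="taxon:6239"):
    """Return one tab-delimited gpi line with the ten columns listed above."""
    unique_xrefs = list(dict.fromkeys(xrefs))  # drop repeats, keep first-seen order
    fields = [
        "WB",                    # 01 DB
        object_id,               # 02 DB_Object_ID
        symbol,                  # 03 DB_Object_Symbol
        "",                      # 04 DB_Object_Name (n/a)
        "|".join(synonyms),      # 05 DB_Object_Synonym(s)
        object_type,             # 06 DB_Object_Type
        taxon,                   # 07 Taxon
        parent_id,               # 08 Parent_Object_ID
        "|".join(unique_xrefs),  # 09 DB_Xref(s)
        "",                      # 10 Gene_Product_Properties (left empty here)
    ]
    return "\t".join(fields)


# Gene-level example using the IDs from the table.
print(wb_gpi_line("WBGene00006796", "unc-62", "gene", synonyms=["ceh-25"]))
```

Empty optional columns are written as empty strings, so every line keeps the same number of tabs.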

MGI

These specifications are based on the documentation on the GO wiki:

http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format#Final_format_.2809_Jan_2013.29_2

The input file is a comma-separated text file converted from an Excel spreadsheet sent to us by MGI.

The file is called mgi_gpi_input.txt. It contains carriage-return characters, which may need to be removed.

It is located on mangolassi here: /home/acedb/kimberly/ccc/ccc_gpi/mgi/input_files

There are duplicate entries in the mgi_gpi_input file that we will need to remove before we create the gpi file.
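
One possible way to handle both the carriage returns and the duplicates in a single pass is sketched below. The function name `read_unique_rows` is hypothetical, and the sketch assumes that "duplicate" means a row whose cells are all identical to an earlier row.

```python
import csv
import io


def read_unique_rows(text):
    """Parse comma-separated text, dropping carriage returns and duplicate rows."""
    # Normalise Windows (\r\n) and bare (\r) line endings to \n.
    cleaned = text.replace("\r\n", "\n").replace("\r", "\n")
    seen = set()
    rows = []
    for row in csv.reader(io.StringIO(cleaned)):
        key = tuple(cell.strip() for cell in row)
        if not any(key) or key in seen:
            continue  # skip blank lines and exact repeats
        seen.add(key)
        rows.append(list(key))
    return rows


# The same entry twice, with carriage returns, collapses to a single row.
sample = "MGI:1861229,Adam21,Q9JI76,a disintegrin and metallopeptidase domain 21\r\n" * 2
print(read_unique_rows(sample))
```

If duplicates should instead be judged on the MGI ID alone, the key would be the first cell rather than the whole row.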



{| class="wikitable"
! column !! name !! required? !! cardinality !! GAF column !! Example for MGI !! Column in mgi_gpi_input !! Value if not in mgi_gpi_input (i.e. fixed value)
|-
| 01 || DB_Object_ID || required || 1 || 2/17 || 1861229 || Column 1, stripped of 'MGI:' prefix || n/a
|-
| 02 || DB_Object_Symbol || required || 1 || 3 || Adam21 || Column 2, no changes || n/a
|-
| 03 || DB_Object_Name || optional || 0 or 1 || 10 || a disintegrin and metallopeptidase domain 21 || Column 4, no changes || n/a
|-
| 04 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || n/a || n/a || n/a
|-
| 05 || DB_Object_Type || required || 1 || 12 || n/a || n/a || gene
|-
| 06 || Taxon || required || 1 || 13 || n/a || n/a || taxon:10090
|-
| 07 || Parent_Object_ID || optional || 0 or 1 || - || MGI:1861229 || Column 1, no changes || n/a
|-
| 08 || DB_Xref(s) || optional || 0 or greater || - || UniProtKB:Q9JI76 || Column 3, add 'UniProtKB:' as prefix (note there are some entries with no value in Column 3) || n/a
|-
| 09 || Gene_Product_Properties || optional || 0 or greater || - || n/a || n/a || n/a
|}
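
The MGI column mapping above could be sketched as follows. The function name `mgi_gpi_line` is hypothetical, and the sketch assumes each input row has already been split into its four columns (MGI ID, symbol, UniProt accession, name).

```python
def mgi_gpi_line(row):
    """Map one mgi_gpi_input row to a tab-delimited gpi line (nine columns)."""
    mgi_id, symbol, uniprot, name = (cell.strip() for cell in row[:4])
    # Column 1 loses its 'MGI:' prefix for DB_Object_ID but keeps it for Parent_Object_ID.
    bare_id = mgi_id[4:] if mgi_id.startswith("MGI:") else mgi_id
    fields = [
        bare_id,                                   # 01 DB_Object_ID
        symbol,                                    # 02 DB_Object_Symbol
        name,                                      # 03 DB_Object_Name
        "",                                        # 04 DB_Object_Synonym(s) (n/a)
        "gene",                                    # 05 DB_Object_Type (fixed value)
        "taxon:10090",                             # 06 Taxon (fixed value)
        mgi_id,                                    # 07 Parent_Object_ID
        f"UniProtKB:{uniprot}" if uniprot else "", # 08 DB_Xref(s), empty if Column 3 is blank
        "",                                        # 09 Gene_Product_Properties (n/a)
    ]
    return "\t".join(fields)


# Example row built from the values shown in the table.
print(mgi_gpi_line(["MGI:1861229", "Adam21", "Q9JI76",
                    "a disintegrin and metallopeptidase domain 21"]))
```

Rows with no value in Column 3 produce an empty DB_Xref(s) field rather than a bare 'UniProtKB:' prefix.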