Specifications for WB gpi file

From WormBaseWiki

Revision as of 15:20, 6 October 2016

gpi File

We will need to create a tab-delimited gpi file with each WormBase release.

Specifications Source

These specifications are based, in part, on the documentation in the GOC's go-annotation GitHub repository:

https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

and also on the content of files submitted here:

http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/gpad-gpi/submission/

File Name

wb_nematode.gpi.gz (all nematode species in WB)

Header

Header content:

!gpi-version: 1.2

!Project_name: WormBase

!WB_release: WS256

!Contact Email: help@wormbase.org

!URL: http://www.wormbase.org

!Date: 20161006
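
As an illustration, the header and file could be produced along the following lines. This is a minimal sketch, not the production script: the function names `gpi_header` and `write_gpi` are hypothetical, each header line is assumed to start with a single `!`, and the date is assumed to be written as YYYYMMDD as in the example above.

```python
import gzip
from datetime import date


def gpi_header(release, when=None):
    """Return the gpi header lines for a WormBase release such as 'WS256'."""
    when = when or date.today()
    return [
        "!gpi-version: 1.2",
        "!Project_name: WormBase",
        f"!WB_release: {release}",
        "!Contact Email: help@wormbase.org",
        "!URL: http://www.wormbase.org",
        f"!Date: {when:%Y%m%d}",  # YYYYMMDD, as in the header example
    ]


def write_gpi(path, release, rows, when=None):
    """Write the header plus tab-delimited rows to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as out:
        for line in gpi_header(release, when):
            out.write(line + "\n")
        for row in rows:
            # Each row is a list of column values, already in gpi column order.
            out.write("\t".join(row) + "\n")
```

A call such as `write_gpi("wb_nematode.gpi.gz", "WS256", rows)` would then produce the file named above, given `rows` as a list of column lists.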

Field Values

{|
! column !! name !! required? !! cardinality !! GAF column !! Example for WormBase (Gene OR Transcript OR Protein) !! Tag in AceDB model
|-
| 01 || DB || required || 1 || 1 || WB || n/a
|-
| 02 || DB_Object_ID || required || 1 || 2/17 || WBGene00006605 OR C15F1.3a OR WP:CE23546 || Gene OR Transcript OR Protein ID
|-
| 03 || DB_Object_Symbol || required || 1 || 3 || tra-2 OR tra-2 OR TRA-2 || Public_name in ?Gene model or capitalized version of Public_name in ?Gene model
|-
| 04 || DB_Object_Name || optional || 0 or 1 || 10 || n/a || n/a
|-
| 05 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || || Other_name in ?Gene model or capitalized version of Other_name in ?Gene model
|-
| 06 || DB_Object_Type || required || 1 || 12 || gene || n/a || n/a || gene
|-
| 07 || Taxon || required || 1 || 13 || taxon:6239 || n/a || n/a || taxon:6239
|-
| 08 || Parent_Object_ID || optional || 0 or 1 || - || WB:WBGene00000035 || WBGene ID from Column 1 of TM output prefaced with 'WB:' || n/a || n/a
|-
| 09 || DB_Xref(s) || optional || 0 or greater || - || - || UniProtKB:P38433 || n/a || for all, if value exists, add unique values from xrefs.txt as follows: Column 7 of xrefs.txt prefaced with 'CCD:' and Column 8 of xrefs.txt prefaced with "UniProtKB:"; if no values, then skip || n/a
|-
| 10 || Gene_Product_Properties || optional || 0 or greater || - || See Note 4 below || n/a || n/a || n/a || n/a
|}
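
To make the column order concrete, here is a hedged sketch of how one gene entry might be turned into the ten gpi columns. The function name `gpi_row` and its parameters are hypothetical. The sketch assumes that multi-valued fields (synonyms, xrefs) are joined with a pipe, as in the GPI 1.2 format, that unpopulated optional columns are left empty, and that Parent_Object_ID is empty for gene entries.

```python
def gpi_row(gene_id, public_name, other_names=(), xrefs=(), taxon="taxon:6239"):
    """Build the ten gpi columns for one gene entry, in table order."""

    def join_unique(values):
        # Drop empty values and duplicates while keeping the original order.
        return "|".join(dict.fromkeys(v for v in values if v))

    return [
        "WB",                      # 01 DB
        gene_id,                   # 02 DB_Object_ID
        public_name,               # 03 DB_Object_Symbol (Public_name)
        "",                        # 04 DB_Object_Name (not populated)
        join_unique(other_names),  # 05 DB_Object_Synonym(s) (Other_name)
        "gene",                    # 06 DB_Object_Type
        taxon,                     # 07 Taxon
        "",                        # 08 Parent_Object_ID (assumed empty for genes)
        join_unique(xrefs),        # 09 DB_Xref(s), already prefixed
        "",                        # 10 Gene_Product_Properties
    ]


# Illustrative call; the synonyms here are placeholders, not real WormBase data.
row = gpi_row("WBGene00006605", "tra-2", other_names=["syn-a", "syn-a", "syn-b"])
line = "\t".join(row)
```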

MGI

These specifications are based on the documentation on the GO wiki:

http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format#Final_format_.2809_Jan_2013.29_2

The input file is a comma-separated text file converted from an Excel spreadsheet sent to us by MGI.

The file is called mgi_gpi_input.txt and is located on mangolassi here: /home/acedb/kimberly/ccc/ccc_gpi/mgi/input_files

The file contains carriage return characters, which may need to be removed.

There are duplicate entries in the mgi_gpi_input file that we will need to remove before we create the gpi file.
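
One possible way to clean the input before conversion is sketched below. The function name `read_unique_rows` is hypothetical, and the sketch assumes that "duplicate" means rows whose cells are all identical after trimming whitespace. Carriage returns are normalized to plain newlines before parsing.

```python
import csv
import io


def read_unique_rows(text):
    """Parse comma-separated text, dropping carriage returns and duplicate rows."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    cleaned = text.replace("\r\n", "\n").replace("\r", "\n")
    seen = set()
    rows = []
    for row in csv.reader(io.StringIO(cleaned)):
        key = tuple(cell.strip() for cell in row)
        if not any(key) or key in seen:
            continue  # skip blank lines and rows already seen
        seen.add(key)
        rows.append(list(key))
    return rows
```

The first occurrence of each row is kept and the input order is preserved, so the output stays easy to compare against the original spreadsheet.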



{|
! column !! name !! required? !! cardinality !! GAF column !! Example for MGI !! Column in mgi_gpi_input !! Value if not in mgi_gpi_input (i.e. fixed value)
|-
| 01 || DB_Object_ID || required || 1 || 2/17 || 1861229 || Column 1, stripped of 'MGI:' prefix || n/a
|-
| 02 || DB_Object_Symbol || required || 1 || 3 || Adam21 || Column 2, no changes || n/a
|-
| 03 || DB_Object_Name || optional || 0 or 1 || 10 || a disintegrin and metallopeptidase domain 21 || Column 4, no changes || n/a
|-
| 04 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || n/a || n/a || n/a
|-
| 05 || DB_Object_Type || required || 1 || 12 || n/a || n/a || gene
|-
| 06 || Taxon || required || 1 || 13 || n/a || n/a || taxon:10090
|-
| 07 || Parent_Object_ID || optional || 0 or 1 || - || MGI:1861229 || Column 1, no changes || n/a
|-
| 08 || DB_Xref(s) || optional || 0 or greater || - || UniProtKB:Q9JI76 || Column 3, add 'UniProtKB:' as prefix (note there are some entries with no value in Column 3) || n/a
|-
| 09 || Gene_Product_Properties || optional || 0 or greater || - || n/a || n/a || n/a
|}
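
The mapping in the table above could be sketched as follows. The function name `mgi_to_gpi` is hypothetical. The sketch assumes that each input row holds Columns 1 to 4 in the order MGI ID, symbol, UniProt accession, name, and that columns marked n/a are written as empty strings.

```python
def mgi_to_gpi(row):
    """Map one mgi_gpi_input row (Columns 1-4) to the nine gpi columns."""
    # Pad short rows so a missing trailing column becomes an empty string.
    cells = [cell.strip() for cell in (list(row) + [""] * 4)[:4]]
    mgi_id, symbol, uniprot, name = cells
    bare_id = mgi_id[4:] if mgi_id.startswith("MGI:") else mgi_id
    return [
        bare_id,                                  # 01 DB_Object_ID, 'MGI:' stripped
        symbol,                                   # 02 DB_Object_Symbol
        name,                                     # 03 DB_Object_Name
        "",                                       # 04 DB_Object_Synonym(s)
        "gene",                                   # 05 DB_Object_Type (fixed)
        "taxon:10090",                            # 06 Taxon (fixed)
        mgi_id,                                   # 07 Parent_Object_ID, unchanged
        f"UniProtKB:{uniprot}" if uniprot else "",  # 08 DB_Xref(s), may be absent
        "",                                       # 09 Gene_Product_Properties
    ]


example = mgi_to_gpi(
    ["MGI:1861229", "Adam21", "Q9JI76", "a disintegrin and metallopeptidase domain 21"]
)
```

Entries with no value in Column 3 get an empty DB_Xref(s) field rather than a bare `UniProtKB:` prefix.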