Specifications for WB gpi file
gpi File
We will need to create a tab-delimited gpi file with each WormBase release.
Specifications Source
These specifications are based, in part, on the documentation in the GOC's go-annotation GitHub repository:
https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md
and also on the content of files submitted here:
http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/gpad-gpi/submission/
File Name
wb_nematode.gpi.gz (all nematode species in WB)
Header
Header content:
!gpi-version: 1.2
!Project_name: WormBase
!!WB_release: WS256
!Contact Email: help@wormbase.org
!URL: http://www.wormbase.org
!Date: 20161006
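As a rough sketch of how the header could be generated for each release, the Python below fills in the release name and date and writes a gzipped, tab-delimited file. The function names (gpi_header, write_gpi) are hypothetical, and the header lines are copied verbatim from the listing above; only the release and date values vary.

```python
import gzip
from datetime import date


def gpi_header(release, on):
    """Return the gpi header block for one WormBase release.

    The line prefixes are copied as-is from the header listing in this
    document (including the doubled '!' on the WB_release line).
    """
    lines = [
        "!gpi-version: 1.2",
        "!Project_name: WormBase",
        f"!!WB_release: {release}",
        "!Contact Email: help@wormbase.org",
        "!URL: http://www.wormbase.org",
        f"!Date: {on:%Y%m%d}",  # YYYYMMDD, as in the example header
    ]
    return "\n".join(lines) + "\n"


def write_gpi(path, header, rows):
    """Write the header followed by tab-delimited rows to a gzipped file.

    `rows` is an iterable of lists of strings, one list per gpi line.
    """
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as out:
        out.write(header)
        for row in rows:
            out.write("\t".join(row) + "\n")
```

For example, write_gpi("wb_nematode.gpi.gz", gpi_header("WS256", date(2016, 10, 6)), rows) would produce a file with the header shown above.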
Field Values
column | name | required? | cardinality | GAF column | Example for WormBase | Tag in AceDB ?Gene model or Column in Reference Protein Source File | | | |
---|---|---|---|---|---|---|---|---|---|
01 | DB | required | 1 | 1 | WB | n/a | | | |
02 | DB_Object_ID | required | 1 | 2/17 | WBGene00000035 | Genes_elegans, Column 1 of TM output | n/a | n/a | |
03 | DB_Object_Symbol | required | 1 | 3 | ace-1 | CGC_name, Column 2 of TM output; if no CGC_name then Sequence_name, Column 3 of TM output | n/a | n/a | |
04 | DB_Object_Name | optional | 0 or 1 | 10 | n/a | n/a | n/a | n/a | |
05 | DB_Object_Synonym(s) | optional | 0 or greater | 11 | | If CGC_name exists, then Sequence_name, Column 3 of TM output, AND Other_name, Column 4 of TM output; If no CGC_name, but Sequence_name, then Other_name, Column 4 of TM output | For all, also check Column 4 of xrefs.txt, if entry contains a number AND lower case letter after the first '.', strip number after second '.' if the latter exists, and add all resulting unique values (if no number AND lower case letter after the first '.', then we can skip this column in the xrefs file); for all, also check Column 5 of xrefs.txt, if value exists, add unique values prefaced with 'WP:'; if no values, then skip | n/a | |
06 | DB_Object_Type | required | 1 | 12 | gene | n/a | n/a | gene | |
07 | Taxon | required | 1 | 13 | taxon:6239 | n/a | n/a | taxon:6239 | |
08 | Parent_Object_ID | optional | 0 or 1 | - | WB:WBGene00000035 | WBGene ID from Column 1 of TM output prefaced with 'WB:' | n/a | n/a | |
09 | DB_Xref(s) | optional | 0 or greater | - | UniProtKB:P38433 | n/a | for all, if value exists, add unique values from xrefs.txt as follows: Column 7 of xrefs.txt prefaced with 'CCD:' and Column 8 of xrefs.txt prefaced with "UniProtKB:"; if no values, then skip | n/a | |
10 | Gene_Product_Properties | optional | 0 or greater | - | See Note 4 below | n/a | n/a | n/a | n/a |
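The symbol, synonym and xref rules in the table can be sketched in Python as follows. This is one reading of the rules, not a tested implementation. All function names are hypothetical. The regular expression is an interpretation of the Column 4 rule (digits plus a lower case letter after the first '.', with an optional trailing '.number' stripped). Treating an empty string or '.' as "no value" in xrefs.txt is an assumption about that file. Multiple values are joined with '|', as in the gpi 1.2 format.

```python
import re

# Interpretation of the Column 4 rule: "X.1a.2" -> "X.1a"; "X.1" is skipped.
ISOFORM = re.compile(r"^([^.]+\.\d+[a-z])(?:\.\d+)?$")
NO_VALUE = {"", "."}  # assumption: how xrefs.txt marks a missing value


def unique(values):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(values))


def symbol_and_synonyms(cgc_name, sequence_name, other_names):
    """Apply the TM output rules (Columns 2, 3 and 4)."""
    if cgc_name:
        synonyms = ([sequence_name] if sequence_name else []) + list(other_names)
        return cgc_name, synonyms
    return sequence_name, list(other_names)


def cell(row, number):
    """Return 1-based column `number` of a split xrefs.txt line, or ''."""
    value = row[number - 1] if len(row) >= number else ""
    return "" if value in NO_VALUE else value


def gpi_row(gene_id, cgc_name, sequence_name, other_names, xref_rows):
    """Build one tab-delimited gpi line for a gene.

    `xref_rows` holds the split xrefs.txt lines that belong to this gene.
    """
    symbol, synonyms = symbol_and_synonyms(cgc_name, sequence_name, other_names)
    db_xrefs = []
    for row in xref_rows:
        match = ISOFORM.match(cell(row, 4))
        if match:
            synonyms.append(match.group(1))
        if cell(row, 5):
            synonyms.append("WP:" + cell(row, 5))
        if cell(row, 7):
            db_xrefs.append("CCD:" + cell(row, 7))
        if cell(row, 8):
            db_xrefs.append("UniProtKB:" + cell(row, 8))
    fields = [
        "WB",                         # 01 DB
        gene_id,                      # 02 DB_Object_ID
        symbol,                       # 03 DB_Object_Symbol
        "",                           # 04 DB_Object_Name (n/a)
        "|".join(unique(synonyms)),   # 05 DB_Object_Synonym(s)
        "gene",                       # 06 DB_Object_Type
        "taxon:6239",                 # 07 Taxon (value given in the table)
        "WB:" + gene_id,              # 08 Parent_Object_ID
        "|".join(unique(db_xrefs)),   # 09 DB_Xref(s)
        "",                           # 10 Gene_Product_Properties
    ]
    return "\t".join(fields)
```

The taxon is hard-coded to the value in the table; since the file covers all nematode species in WB, a real script would presumably take it per species.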
MGI
These specifications are based on the documentation on the GO wiki:
The input file is a comma-separated text file converted from an Excel spreadsheet sent to us by MGI.
The file is named mgi_gpi_input.txt and is located on mangolassi at /home/acedb/kimberly/ccc/ccc_gpi/mgi/input_files. It contains carriage-return characters, which may need to be removed.
There are duplicate entries in the mgi_gpi_input file that we will need to remove before we create the gpi file.
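One possible way to handle both the carriage returns and the duplicates before building the gpi file is sketched below in Python. The function name read_mgi_rows is hypothetical, and the sketch assumes that "duplicate" means a row whose cells all match an earlier row.

```python
import csv
import io


def read_mgi_rows(text):
    """Parse the comma-separated MGI input and drop exact duplicate rows.

    Carriage returns (from the Excel export) are normalised to plain
    newlines first. Row order is preserved; blank lines are skipped.
    """
    cleaned = text.replace("\r\n", "\n").replace("\r", "\n")
    seen = set()
    rows = []
    for row in csv.reader(io.StringIO(cleaned)):
        key = tuple(value.strip() for value in row)
        if not any(key) or key in seen:
            continue
        seen.add(key)
        rows.append(list(key))
    return rows
```

The text would be read with something like open("mgi_gpi_input.txt", newline="").read(), so that the carriage returns reach the function unchanged.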
column | name | required? | cardinality | GAF column | Example for MGI | Column in mgi_gpi_input | Value if not in mgi_gpi_input (i.e. fixed value) |
---|---|---|---|---|---|---|---|
01 | DB_Object_ID | required | 1 | 2/17 | 1861229 | Column 1, stripped of 'MGI:' prefix | n/a |
02 | DB_Object_Symbol | required | 1 | 3 | Adam21 | Column 2, no changes | n/a |
03 | DB_Object_Name | optional | 0 or 1 | 10 | a disintegrin and metallopeptidase domain 21 | Column 4, no changes | n/a |
04 | DB_Object_Synonym(s) | optional | 0 or greater | 11 | n/a | n/a | n/a |
05 | DB_Object_Type | required | 1 | 12 | n/a | n/a | gene |
06 | Taxon | required | 1 | 13 | n/a | n/a | taxon:10090 |
07 | Parent_Object_ID | optional | 0 or 1 | - | MGI:1861229 | Column 1, no changes | n/a |
08 | DB_Xref(s) | optional | 0 or greater | - | UniProtKB:Q9JI76 | Column 3, add 'UniProtKB:' as prefix (note there are some entries with no value in Column 3) | n/a |
09 | Gene_Product_Properties | optional | 0 or greater | - | n/a | n/a | n/a |
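The column mapping in the table above could be applied to one parsed input row roughly as follows. This is a sketch in Python; the function name mgi_gpi_line is hypothetical, and the input row is assumed to be a list of four strings in the order of the mgi_gpi_input columns (ID, symbol, UniProt accession, name).

```python
def mgi_gpi_line(row):
    """Turn one mgi_gpi_input row into a tab-delimited gpi line."""
    # Pad short rows so missing trailing columns become empty strings.
    mgi_id, symbol, uniprot, name = (list(row) + [""] * 4)[:4]
    bare_id = mgi_id[len("MGI:"):] if mgi_id.startswith("MGI:") else mgi_id
    fields = [
        bare_id,                                      # 01 DB_Object_ID
        symbol,                                       # 02 DB_Object_Symbol
        name,                                         # 03 DB_Object_Name
        "",                                           # 04 DB_Object_Synonym(s)
        "gene",                                       # 05 DB_Object_Type (fixed)
        "taxon:10090",                                # 06 Taxon (fixed)
        mgi_id,                                       # 07 Parent_Object_ID
        "UniProtKB:" + uniprot if uniprot else "",    # 08 DB_Xref(s)
        "",                                           # 09 Gene_Product_Properties
    ]
    return "\t".join(fields)
```

Rows with no value in Column 3 get an empty DB_Xref(s) field rather than a bare 'UniProtKB:' prefix.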