Difference between revisions of "Specifications for WB gpi file"


Revision as of 17:13, 6 October 2016

gpi File

We will need to create a tab-delimited gpi file with each WormBase release.

Specifications Source

These specifications are based, in part, on the documentation on the GOC's go-annotation github repository:

https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

and also on the content of files submitted here:

http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/gpad-gpi/submission/

File Name

wb_nematode.gpi.gz (all nematode species in WB)

Header

Header content:

!gpi-version: 1.2

!Project_name: WormBase

!WB_release: WS256

!Contact Email: help@wormbase.org

!URL: http://www.wormbase.org

!Date: 20161006
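The header above could be generated per release along these lines. This is a minimal sketch, not the production build script: the function name `gpi_header` is hypothetical, and it assumes every header line starts with a single `!` and that the date is written as YYYYMMDD, as in the example.

```python
from datetime import date


def gpi_header(release: str, when: date) -> str:
    """Return the gpi header block as one string ending in a newline.

    Assumption: each header line begins with a single '!'.
    """
    lines = [
        "!gpi-version: 1.2",
        "!Project_name: WormBase",
        f"!WB_release: {release}",
        "!Contact Email: help@wormbase.org",
        "!URL: http://www.wormbase.org",
        f"!Date: {when:%Y%m%d}",  # YYYYMMDD, matching the example header
    ]
    return "\n".join(lines) + "\n"


print(gpi_header("WS256", date(2016, 10, 6)), end="")
```

Only the release name and the date change between builds, so they are the only parameters.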

Field Values

{|
! column
! name
! required?
! cardinality
! GAF column
! Example for WormBase (Gene OR Transcript OR Protein)
! Tag in AceDB model
! Comment
|-
| 01 || DB || required || 1 || 1 || WB || n/a || n/a
|-
| 02 || DB_Object_ID || required || 1 || 2/17 || WBGene00006796 OR F28F12.2a OR WP:CE21219 || Gene OR Transcript OR Protein ID || n/a
|-
| 03 || DB_Object_Symbol || required || 1 || 3 || unc-62 OR unc-62 OR UNC-62 || Public_name in ?Gene model or capitalized version of Public_name in ?Gene model || n/a
|-
| 04 || DB_Object_Name || optional || 0 or 1 || 10 || n/a || n/a || n/a
|-
| 05 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || ceh-25 OR ceh-25 OR CEH-25 (showing one, but we would include all Other_name entries) || Other_name in ?Gene model or capitalized version of Other_name in ?Gene model || This is a little different from what we currently put in the GAF, but I think the slightly different purpose of this file (Noctua, GO website searches, text mining) makes it beneficial to include the Other_name entries.
|-
| 06 || DB_Object_Type || required || 1 || 12 || gene OR transcript OR protein || See Comment || For transcripts, could we use the Method tag? Do all of the values in this tag correspond to SO terms? It seems the CV of SO terms would be good to use here, if possible.
|-
| 07 || Taxon || required || 1 || 13 || taxon:6239 || n/a || NCBI taxonomy ID for corresponding species of entity in Column 2.
|-
| 08 || Parent_Object_ID || optional || 0 or 1 || n/a || WB:WBGene00006796 || Gene ID || The WB gene ID will be the parent ID for each entry.
|-
| 09 || DB_Xref(s) || optional || 0 or greater || n/a || UniProtKB:Q9N5D6 OR UniProtKB:Q9N5D6-1 || For gene entries, see comment. For transcript entries, see comment. For protein entries, the UniProtIsoformAcc in ?Protein model. || '''WBGene''' entries will contain a DB_Xref to the UniProtKB Reference Proteome accession, where available. The Reference Proteome accessions are available in this file: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/WORM/goa_worm.gpi.gz. Unfortunately, the goa_worm file does not currently xref to WBGene IDs.

'''Transcript''' entries will contain a DB_Xref to RNAcentral accessions, where available.

'''Protein''' entries will contain a DB_Xref to the UniProtKB accession; this accession will be an isoform accession of a reviewed Swiss-Prot entry or a TrEMBL accession (e.g., Q9N5D6-1 or A0A0K3ASC5).
|-
| 10 || Gene_Product_Properties || optional || 0 or greater || n/a || n/a || n/a || n/a
|}
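As an illustration of the ten columns above, one row of the file could be assembled as follows. This is a sketch under stated assumptions: `GpiEntry` and `gpi_line` are hypothetical names, empty optional columns are written as empty strings, and multi-valued columns (synonyms, xrefs, properties) are joined with `|`, which is assumed here to follow the GPI 1.2 convention. The sample values are taken from the gene examples in the table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GpiEntry:
    """One gpi row; field names are illustrative, not AceDB tags."""
    db_object_id: str                                     # column 02
    symbol: str                                           # column 03
    object_type: str                                      # column 06
    taxon: str                                            # column 07
    db: str = "WB"                                        # column 01
    name: str = ""                                        # column 04
    synonyms: List[str] = field(default_factory=list)     # column 05
    parent_id: str = ""                                   # column 08
    xrefs: List[str] = field(default_factory=list)        # column 09
    properties: List[str] = field(default_factory=list)   # column 10


def gpi_line(e: GpiEntry) -> str:
    """Join the ten columns with tabs; multi-valued columns use '|' (assumed)."""
    columns = [
        e.db, e.db_object_id, e.symbol, e.name,
        "|".join(e.synonyms), e.object_type, e.taxon,
        e.parent_id, "|".join(e.xrefs), "|".join(e.properties),
    ]
    return "\t".join(columns)


gene = GpiEntry(
    db_object_id="WBGene00006796",
    symbol="unc-62",
    object_type="gene",
    taxon="taxon:6239",
    synonyms=["ceh-25"],
    parent_id="WB:WBGene00006796",
    xrefs=["UniProtKB:Q9N5D6"],
)
print(gpi_line(gene))
```

Transcript and protein rows would use the same function with the capitalized symbol and synonyms where the table calls for them.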

MGI

These specifications are based on the documentation on the GO wiki:

http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format#Final_format_.2809_Jan_2013.29_2

The input file is a comma-separated text file converted from an Excel spreadsheet sent to us by MGI.

The file is called mgi_gpi_input.txt. It contains carriage-return characters, which may need to be removed before processing.

and is located on mangolassi here: /home/acedb/kimberly/ccc/ccc_gpi/mgi/input_files

There are duplicate entries in the mgi_gpi_input file that we will need to remove before we create the gpi file.
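The clean-up steps mentioned above (carriage returns and duplicate entries) might be handled as in the sketch below. It assumes a duplicate is a row whose four cells all match an earlier row, and that input columns 1 to 4 are MGI ID, symbol, UniProtKB accession and name. The function name `unique_rows` is hypothetical, and the second sample row is a made-up placeholder, not real MGI data.

```python
import csv


def unique_rows(text):
    """Parse comma-separated text and return each distinct row once.

    str.splitlines() treats \r, \n and \r\n all as line breaks, so stray
    carriage returns from the Excel export never reach the csv parser.
    Rows are kept in first-seen order; blank lines are skipped.
    """
    seen = set()
    rows = []
    for row in csv.reader(text.splitlines()):
        key = tuple(cell.strip() for cell in row)
        if not any(key) or key in seen:
            continue
        seen.add(key)
        rows.append(key)
    return rows


sample = (
    "MGI:1861229,Adam21,Q9JI76,a disintegrin and metallopeptidase domain 21\r\n"
    "MGI:0000000,Placeholder1,,hypothetical placeholder row\r"
    "MGI:1861229,Adam21,Q9JI76,a disintegrin and metallopeptidase domain 21\r\n"
)
for row in unique_rows(sample):
    print(row)
```

If duplicates should instead be judged on the MGI ID alone, the `key` would need to be the first cell rather than the whole row.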



{|
! column
! name
! required?
! cardinality
! GAF column
! Example for MGI
! Column in mgi_gpi_input
! Value if not in mgi_gpi_input (i.e. fixed value)
|-
| 01 || DB_Object_ID || required || 1 || 2/17 || 1861229 || Column 1, stripped of 'MGI:' prefix || n/a
|-
| 02 || DB_Object_Symbol || required || 1 || 3 || Adam21 || Column 2, no changes || n/a
|-
| 03 || DB_Object_Name || optional || 0 or 1 || 10 || a disintegrin and metallopeptidase domain 21 || Column 4, no changes || n/a
|-
| 04 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || n/a || n/a || n/a
|-
| 05 || DB_Object_Type || required || 1 || 12 || n/a || n/a || gene
|-
| 06 || Taxon || required || 1 || 13 || n/a || n/a || taxon:10090
|-
| 07 || Parent_Object_ID || optional || 0 or 1 || - || MGI:1861229 || Column 1, no changes || n/a
|-
| 08 || DB_Xref(s) || optional || 0 or greater || - || UniProtKB:Q9JI76 || Column 3, add 'UniProtKB:' as prefix (note there are some entries with no value in Column 3) || n/a
|-
| 09 || Gene_Product_Properties || optional || 0 or greater || - || n/a || n/a || n/a
|}
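The column mapping in the MGI table can be sketched as a single function. The name `mgi_gpi_line` is hypothetical, and writing "n/a" columns as empty strings is an assumption; the fixed values and prefix handling follow the table.

```python
def mgi_gpi_line(row):
    """Build one tab-delimited MGI gpi line from an mgi_gpi_input row.

    `row` holds input columns 1-4: MGI ID, symbol, UniProtKB accession, name.
    """
    mgi_id, symbol, uniprot, name = (cell.strip() for cell in row[:4])
    object_id = mgi_id[len("MGI:"):] if mgi_id.startswith("MGI:") else mgi_id
    columns = [
        object_id,                                   # 01: Column 1 without 'MGI:'
        symbol,                                      # 02: Column 2, no changes
        name,                                        # 03: Column 4, no changes
        "",                                          # 04: synonyms, not supplied
        "gene",                                      # 05: fixed value
        "taxon:10090",                               # 06: fixed value
        mgi_id,                                      # 07: Column 1, no changes
        f"UniProtKB:{uniprot}" if uniprot else "",   # 08: Column 3 may be empty
        "",                                          # 09: properties, not supplied
    ]
    return "\t".join(columns)


example = ("MGI:1861229", "Adam21", "Q9JI76",
           "a disintegrin and metallopeptidase domain 21")
print(mgi_gpi_line(example))
```

Rows with no value in Column 3 produce an empty DB_Xref(s) column rather than a bare 'UniProtKB:' prefix.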