Specifications for WB gpi file
Revision as of 20:22, 2 April 2013
These specifications are based on the documentation on the GO wiki:
We will need to create a tab-delimited file with each WormBase release using the information in AceDB and the xrefs file generated for C. elegans that is available on the ftp site:
ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS236/species/c_elegans/
The file is named according to the release, e.g., c_elegans.WS236.xrefs.txt.gz
(Unfortunately there is no one AceDB object or file that has all of the information we need.)
Output will be sorted according to ascending WBGene ID and will contain a two-line header:
!gpi-version: 1.1
!namespace: WB
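The output layout described above (two header lines, then rows in ascending WBGene ID order) could be sketched roughly as follows. This is only an illustration: the function name `write_gpi` and the second sample row are made up for the example, and it assumes WBGene IDs are zero-padded so that a plain string sort gives ascending order.

```python
import io

# The two header lines given in this specification.
GPI_HEADER = ("!gpi-version: 1.1", "!namespace: WB")


def write_gpi(rows, out):
    """Write the header, then one tab-delimited line per row.

    rows: iterable of lists of column strings; column 1 (index 0) is the WBGene ID.
    """
    for line in GPI_HEADER:
        out.write(line + "\n")
    # Assumption: WBGene IDs are zero-padded, so string order equals numeric order.
    for row in sorted(rows, key=lambda r: r[0]):
        out.write("\t".join(row) + "\n")


rows = [
    ["WBGene00000035", "ace-1", "", "ACE1", "gene", "taxon:6239",
     "WB:WBGene00000035", "UniProtKB:P38433", ""],
    # Made-up placeholder row, only here to show the sorting.
    ["WBGene00000002", "example-1", "", "", "gene", "taxon:6239",
     "WB:WBGene00000002", "", ""],
]
buf = io.StringIO()
write_gpi(rows, buf)
print(buf.getvalue(), end="")
```

Empty optional columns are written as empty strings so that every line keeps the same number of tab-separated fields.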
For CCC curation form:
1) For each gene name identified in a Textpresso sentence, we will look in Column 2 and Column 4 of the gpi file for an exact match.
2) For each match, the form will then display, in the left-most box where gene name information is shown, the gene name mapped to the parent object ID in Column 7 and the UniProtKB ID in Column 8, e.g. ace-1:WB:WBGene00000035:UniProtKB:P38433.
-- These mappings are made static at the moment we get the sentence from Textpresso; dealing with changing mappings between gene names and DBIDs or UniProt IDs is beyond the scope of what we want to handle. The sentence can be re-run through Textpresso in the future with a then-current mapping and re-curated. (K&J)
-- Always show all sentences from Textpresso, even if no genes map to a DBID or a UniProt ID. If that means there are no genes at all in the gene select box, or there is curation to a gene not in the list, add a comment to the sentence in curation and add it to ptgo directly. (K&J)
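The lookup in steps 1 and 2 might look something like the sketch below. The helper names `build_index` and `display_labels` are hypothetical, not part of the curation form. It assumes synonyms in Column 4 are pipe-separated, that matching is case-sensitive, and that when a gene has several UniProtKB values in Column 8 they are shown exactly as they appear in the file; the specification does not say how multiple values should be displayed.

```python
def build_index(gpi_lines):
    """Map every symbol (Column 2) and synonym (Column 4) to (Column 7, Column 8)."""
    index = {}
    for line in gpi_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        cols += [""] * (9 - len(cols))  # pad missing trailing columns
        names = [cols[1]] + [s for s in cols[3].split("|") if s]
        for name in names:
            index.setdefault(name, []).append((cols[6], cols[7]))
    return index


def display_labels(gene_name, index):
    """Return one display string per exact match of gene_name."""
    labels = []
    for parent_id, xrefs in index.get(gene_name, []):
        parts = [gene_name]
        if parent_id:
            parts.append(parent_id)
        if xrefs:
            parts.append(xrefs)
        labels.append(":".join(parts))
    return labels


gpi_lines = [
    "!gpi-version: 1.1",
    "!namespace: WB",
    "WBGene00000035\tace-1\t\tACE1\tgene\ttaxon:6239\tWB:WBGene00000035\tUniProtKB:P38433\t",
]
index = build_index(gpi_lines)
print(display_labels("ace-1", index))
```

A gene name with no match returns an empty list; the sentence itself would still be shown, as required above.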
column | name | required? | cardinality | GAF column | Example for UniProt | Example for WormBase | Tag in AceDB ?Gene model | Column in xrefs file | Value if not in AceDB ?Gene model or xrefs file |
---|---|---|---|---|---|---|---|---|---|
01 | DB_Object_ID | required | 1 | 2/17 | Q4VCS5-1 | WBGene00000035 | WBGene ID | n/a | n/a |
02 | DB_Object_Symbol | required | 1 | 3 | AMOT | ace-1 | CGC_name; if no CGC_name then Sequence_name | n/a | n/a |
03 | DB_Object_Name | optional | 0 or 1 | 10 | Angiomotin | n/a | n/a | n/a | n/a |
04 | DB_Object_Synonym(s) | optional | 0 or greater | 11 | KIAA1071\|AMOT | ACE1 | Other_name; if the gene has a CGC_name, then also Sequence_name; also take Molecular_name values, but first strip the WP: prefix and any numbers after the second '.' in transcript names, so that only unique CE and transcript names are taken (e.g., WP:CE21219 becomes CE21219 and T28F12.2a.1 becomes T28F12.2a) | n/a | n/a |
05 | DB_Object_Type | required | 1 | 12 | protein | gene | n/a | n/a | gene |
06 | Taxon | required | 1 | 13 | taxon:9606 | taxon:6239 | n/a | n/a | taxon:6239 |
07 | Parent_Object_ID | optional | 0 or 1 | - | UniProtKB:Q4VCS5 | WB:WBGene00000035 | WBGene ID prefaced with WB: | n/a | n/a |
08 | DB_Xref(s) | optional | 0 or greater | - | - | UniProtKB:P38433 | n/a | Column 8, prefixed with UniProtKB: (note: there may be several values for a given gene ID) | n/a |
09 | Gene_Product_Properties | optional | 0 or greater | - | See Note 4 below | n/a | n/a | n/a | n/a |
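The Molecular_name clean-up described for Column 4 can be illustrated with a small sketch. The function names are hypothetical, and the code assumes that everything after the second '.' can be dropped, which matches the T28F12.2a.1 example but has not been checked against every kind of Molecular_name value.

```python
def normalize_molecular_name(value):
    """Strip a leading 'WP:' and anything after the second '.'."""
    if value.startswith("WP:"):
        value = value[len("WP:"):]
    parts = value.split(".")
    if len(parts) > 2:
        # Keep only the part up to the second '.', e.g. T28F12.2a.1 -> T28F12.2a
        value = ".".join(parts[:2])
    return value


def unique_synonyms(molecular_names):
    """Normalize each value and drop duplicates, keeping first-seen order."""
    seen = []
    for raw in molecular_names:
        name = normalize_molecular_name(raw)
        if name not in seen:
            seen.append(name)
    return seen


# T28F12.2a.2 is a made-up second transcript, used only to show de-duplication.
print(unique_synonyms(["WP:CE21219", "T28F12.2a.1", "T28F12.2a.2"]))
```

The resulting names would then be joined with the Other_name and Sequence_name values to fill Column 4.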