Specifications for WB gpi file
Revision as of 15:56, 3 February 2014
WormBase
These specifications are based on the documentation on the GO wiki:
With each WormBase release we will need to create a tab-delimited file from the information in AceDB and the C. elegans xrefs file, which is available on the FTP site:
ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS234/species/c_elegans/
The file is named according to the release, e.g., c_elegans.WS234.xrefs.txt.gz
Note that the xrefs file we use will have to be updated with each new WS release.
(Unfortunately there is no one AceDB object or file that has all of the information we need.)
The output will be sorted by ascending WBGene ID and will begin with a two-line header:
!gpi-version: 1.1
!namespace: WB
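The header and sort order described above can be sketched as follows. This is a minimal illustration, not the logic of generate_gpi.pl: the function name `write_gpi` and the representation of a row as a list of column strings are assumptions made for the example.

```python
import io

# The two header lines given in this specification.
GPI_HEADER = ("!gpi-version: 1.1", "!namespace: WB")


def wbgene_number(gene_id):
    """Return the numeric part of a WBGene ID, e.g. 35 for WBGene00000035."""
    return int(gene_id[len("WBGene"):])


def write_gpi(rows, out):
    """Write the header, then the rows sorted by ascending WBGene ID.

    rows: iterable of lists of column strings, with the WBGene ID in column 1
          (a hypothetical in-memory representation).
    out:  any writable text stream.
    """
    for line in GPI_HEADER:
        out.write(line + "\n")
    for row in sorted(rows, key=lambda r: wbgene_number(r[0])):
        out.write("\t".join(row) + "\n")


if __name__ == "__main__":
    buffer = io.StringIO()
    write_gpi([["WBGene00000036", "example"], ["WBGene00000035", "ace-1"]], buffer)
    print(buffer.getvalue(), end="")
```

Sorting on the numeric part of the ID rather than on the raw string keeps the order correct even if IDs of different widths ever appear.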
For CCC curation form:
1) For each gene name identified in a Textpresso sentence, we will look in Column 2 and Column 4 of the gpi file for an exact match.
2) For each match, the form will then display, in the left-most box where gene name information is shown, the gene name mapped to the parent object ID in Column 7 and the UniProtKB ID in Column 8, e.g. ace-1:WB:WBGene00000035:UniProtKB:P38433. These mappings are made static at the moment we get the sentence from Textpresso; dealing with changing mappings between gene names and DBIDs or UniProt IDs is beyond the scope of what we want to handle. The sentence can be run through Textpresso again in the future with a then-current mapping and re-curated. (K&J) Always show all sentences from Textpresso, even if no genes map to a DBID or a UniProt ID. If that means there are no genes at all in the gene select box, or there is curation to a gene not in the list, then add a comment to the sentence during curation and add it to ptgo directly. (K&J)
3) Note that the xrefs file is not sorted in order of WBGene ID, but the gpi file should be.
4) Where a column can hold more than one value, e.g. 'with', DB_Object_Synonym(s), DB_Xref(s), the values should be given as a pipe-separated list.
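Steps 1 and 2 can be illustrated with a small lookup sketch. The function names are hypothetical, and two points are assumptions rather than part of the specification: an "exact match" in Column 4 is taken to mean an exact match against any one of the pipe-separated synonyms, and when Column 8 holds several UniProtKB IDs only the first is shown.

```python
def build_name_index(gpi_rows):
    """Map each symbol (Column 2) and synonym (Column 4) to its gpi rows.

    gpi_rows: lists of the nine gpi column strings (index 0 is Column 1).
    """
    index = {}
    for row in gpi_rows:
        names = [row[1]] + [s for s in row[3].split("|") if s]
        # dict.fromkeys drops duplicates so a row is indexed once per name.
        for name in dict.fromkeys(names):
            index.setdefault(name, []).append(row)
    return index


def display_label(name, row):
    """Build a label such as ace-1:WB:WBGene00000035:UniProtKB:P38433."""
    parts = [name, row[6]]  # Column 7: parent object ID
    uniprot = [x for x in row[7].split("|") if x.startswith("UniProtKB:")]
    if uniprot:
        parts.append(uniprot[0])  # assumption: show only the first UniProtKB ID
    return ":".join(parts)


if __name__ == "__main__":
    ace1 = ["WBGene00000035", "ace-1", "", "ACE1", "gene", "taxon:6239",
            "WB:WBGene00000035", "UniProtKB:P38433", ""]
    index = build_name_index([ace1])
    for match in index.get("ace-1", []):
        print(display_label("ace-1", match))
```

A name with no entry in the index simply yields no labels; the sentence itself is still shown, as required in step 2.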
From Table Maker:
Col 1: Genes_elegans
Col 2: CGC_name
Col 3: Sequence_name
Col 4: Other_name
Col 5: Live - this value is fixed, i.e. Table Maker search is restricted to only returning information for live elegans genes
Col 6: Corresponding_pseudogene - Filter out (omit, remove, skip) genes that have a value for Corresponding_pseudogene; these genes will not be included in the final gpi output file, since they don't encode functional proteins or ncRNA molecules.
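Reading the Table Maker output and applying the Column 6 filter might look like the sketch below. It assumes the export is tab-delimited with the six columns in the order listed and an empty field where a value is absent; the actual export format should be checked before relying on this. `TMRow` and `read_tablemaker` are names invented for the example.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class TMRow(NamedTuple):
    gene_id: str        # Col 1: Genes_elegans
    cgc_name: str       # Col 2: CGC_name
    sequence_name: str  # Col 3: Sequence_name
    other_name: str     # Col 4: Other_name
    status: str         # Col 5: Live
    pseudogene: str     # Col 6: Corresponding_pseudogene


def read_tablemaker(lines: Iterable[str]) -> Iterator[TMRow]:
    """Yield one TMRow per line, skipping genes with a corresponding pseudogene."""
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:
            continue  # blank line
        # Pad short lines so a missing trailing column reads as empty.
        fields = (fields + [""] * 6)[:6]
        row = TMRow(*(f.strip() for f in fields))
        if row.pseudogene:
            continue  # omitted from the final gpi output
        yield row
```

For example, `list(read_tablemaker(open("ws234_tablemaker_info.txt")))` would return only the rows that should go on to the gpi file, given a file in the assumed layout.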
Use the Table Maker file and the xrefs.txt file to create the gpi file as outlined below.
The script is /home/azurebrd/public_html/cgi-bin/forms/ccc/generate_gpi.pl
It generates /home/azurebrd/public_html/cgi-bin/forms/ccc/ws234_gpi
It needs ws###_tablemaker_info.txt to generate ws###_gpi
It currently always uses c_elegans.WS236.xrefs.txt, so the script will need to be changed for future releases.
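One way to avoid a hard-coded xrefs file name is to derive every file name from the release number. The sketch below only follows the naming patterns quoted on this page (c_elegans.WS234.xrefs.txt.gz, ws###_tablemaker_info.txt, ws###_gpi and the FTP directory); the function `release_files` is hypothetical and is not part of generate_gpi.pl.

```python
def release_files(release):
    """Return the file names for one WormBase release, e.g. release=234."""
    ws = f"WS{release}"
    xrefs = f"c_elegans.{ws}.xrefs.txt.gz"
    base = f"ftp://ftp.sanger.ac.uk/pub/wormbase/releases/{ws}/species/c_elegans/"
    return {
        "xrefs": xrefs,                                      # compressed xrefs file
        "xrefs_url": base + xrefs,                           # FTP directory plus file name
        "tablemaker": f"ws{release}_tablemaker_info.txt",    # Table Maker input
        "gpi": f"ws{release}_gpi",                           # generated output
    }


if __name__ == "__main__":
    for key, value in release_files(234).items():
        print(f"{key}: {value}")
```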
column | name | required? | cardinality | GAF column | Example for UniProt | Example for WormBase | Tag in AceDB ?Gene model (Column in Table Maker (TM) output) | Column in xrefs file | Value if not in AceDB ?Gene model or xrefs file |
---|---|---|---|---|---|---|---|---|---|
01 | DB_Object_ID | required | 1 | 2/17 | Q4VCS5-1 | WBGene00000035 | Genes_elegans, Column 1 of TM output | n/a | n/a |
02 | DB_Object_Symbol | required | 1 | 3 | AMOT | ace-1 | CGC_name, Column 2 of TM output; if no CGC_name then Sequence_name, Column 3 of TM output | n/a | n/a |
03 | DB_Object_Name | optional | 0 or 1 | 10 | Angiomotin | n/a | n/a | n/a | n/a |
04 | DB_Object_Synonym(s) | optional | 0 or greater | 11 | KIAA1071\|AMOT | ACE1 | If CGC_name exists, then Sequence_name, Column 3 of TM output, AND Other_name, Column 4 of TM output; If no CGC_name, but Sequence_name, then Other_name, Column 4 of TM output | For all, also check Column 4 of xrefs.txt, if entry contains a number AND lower case letter after the first '.', strip number after second '.' if the latter exists, and add all resulting unique values (if no number AND lower case letter after the first '.', then we can skip this column in the xrefs file); for all, also check Column 5 of xrefs.txt, if value exists, add unique values prefaced with 'WP:'; if no values, then skip | n/a |
05 | DB_Object_Type | required | 1 | 12 | protein | gene | n/a | n/a | gene |
06 | Taxon | required | 1 | 13 | taxon:9606 | taxon:6239 | n/a | n/a | taxon:6239 |
07 | Parent_Object_ID | optional | 0 or 1 | - | UniProtKB:Q4VCS5 | WB:WBGene00000035 | WBGene ID from Column 1 of TM output prefaced with 'WB:' | n/a | n/a |
08 | DB_Xref(s) | optional | 0 or greater | - | - | UniProtKB:P38433 | n/a | for all, if value exists, add unique values from xrefs.txt as follows: Column 7 of xrefs.txt prefaced with 'CCD:' and Column 8 of xrefs.txt prefaced with 'UniProtKB:'; if no values, then skip | n/a |
09 | Gene_Product_Properties | optional | 0 or greater | - | See Note 4 below | n/a | n/a | n/a | n/a |
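The column rules in the table above can be combined into one row-building function, sketched here under several assumptions: the xrefs rows for a gene have already been grouped and split into column lists; an absent xrefs value is an empty string or '.'; and the Column 4 rule is read as "keep entries shaped like NAME.<digits><lowercase letter>, dropping a trailing .<digits>". The names `gpi_row`, `ISOFORM` and `MISSING` are invented for the example.

```python
import re

# NAME.<digits><lowercase letter>, optionally followed by .<digits> (stripped).
ISOFORM = re.compile(r"^([^.]+\.\d+[a-z])(?:\.\d+)?$")
MISSING = {"", "."}  # assumption: how an absent xrefs value is represented


def unique(values):
    """Drop missing values and duplicates, keeping first-seen order."""
    return list(dict.fromkeys(v for v in values if v not in MISSING))


def gpi_row(gene_id, cgc_name, sequence_name, other_name, xref_rows):
    """Build the nine gpi columns for one gene.

    xref_rows: lists of xrefs.txt columns for this gene (index 0 is Column 1).
    """
    symbol = cgc_name or sequence_name                 # Column 2
    synonyms = [sequence_name, other_name] if cgc_name else [other_name]
    xrefs = []
    for x in xref_rows:
        match = ISOFORM.match(x[3])                    # xrefs Column 4
        if match:
            synonyms.append(match.group(1))
        if x[4] not in MISSING:                        # xrefs Column 5
            synonyms.append("WP:" + x[4])
        if x[6] not in MISSING:                        # xrefs Column 7
            xrefs.append("CCD:" + x[6])
        if x[7] not in MISSING:                        # xrefs Column 8
            xrefs.append("UniProtKB:" + x[7])
    return [
        gene_id,                        # 01 DB_Object_ID
        symbol,                         # 02 DB_Object_Symbol
        "",                             # 03 DB_Object_Name (n/a)
        "|".join(unique(synonyms)),     # 04 DB_Object_Synonym(s)
        "gene",                         # 05 DB_Object_Type (fixed)
        "taxon:6239",                   # 06 Taxon (fixed)
        "WB:" + gene_id,                # 07 Parent_Object_ID
        "|".join(unique(xrefs)),        # 08 DB_Xref(s)
        "",                             # 09 Gene_Product_Properties (n/a)
    ]
```

Multi-valued columns are joined with a pipe, as required by note 4, and duplicates that arise from several xrefs lines per gene are removed before joining.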
MGI
These specifications are based on the documentation on the GO wiki:
The input file is a tab-delimited text file converted from an Excel spreadsheet sent to us by MGI.
The tab-delimited text file is called: mgi_gpi_input
and is located here:
column | name | required? | cardinality | GAF column | Example for MGI | Column in mgi_gpi_input | Value if not in mgi_gpi_input (i.e. fixed value) |
---|---|---|---|---|---|---|---|
01 | DB_Object_ID | required | 1 | 2/17 | 1930768 | Genes_elegans, Column 1 of TM output | n/a |
02 | DB_Object_Symbol | required | 1 | 3 | AMOT | CGC_name, Column 2 of TM output; if no CGC_name then Sequence_name, Column 3 of TM output | n/a |
03 | DB_Object_Name | optional | 0 or 1 | 10 | Angiomotin | n/a | n/a |
04 | DB_Object_Synonym(s) | optional | 0 or greater | 11 | KIAA1071\|AMOT | If CGC_name exists, then Sequence_name, Column 3 of TM output, AND Other_name, Column 4 of TM output; If no CGC_name, but Sequence_name, then Other_name, Column 4 of TM output | For all, also check Column 4 of xrefs.txt, if entry contains a number AND lower case letter after the first '.', strip number after second '.' if the latter exists, and add all resulting unique values (if no number AND lower case letter after the first '.', then we can skip this column in the xrefs file); for all, also check Column 5 of xrefs.txt, if value exists, add unique values prefaced with 'WP:'; if no values, then skip |
05 | DB_Object_Type | required | 1 | 12 | n/a | n/a | gene |
06 | Taxon | required | 1 | 13 | n/a | n/a | taxon:10090 |
07 | Parent_Object_ID | optional | 0 or 1 | - | MGI:1861229 | WBGene ID from Column 1 of TM output prefaced with 'WB:' | n/a |
08 | DB_Xref(s) | optional | 0 or greater | - | - | n/a | for all, if value exists, add unique values from xrefs.txt as follows: Column 7 of xrefs.txt prefaced with 'CCD:' and Column 8 of xrefs.txt prefaced with 'UniProtKB:'; if no values, then skip |
09 | Gene_Product_Properties | optional | 0 or greater | - | See Note 4 below | n/a | n/a |