Difference between revisions of "UniProtKB gpad to WormBase .ace"
From WormBaseWiki
Revision as of 16:24, 22 July 2015
- The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
- A new file is available every Monday.
- The file is located here: ftp://ftp.ebi.ac.uk/pub/contrib/goa/ and is named: gp_association.6239_wormbase.gz
- Download the file from the UniProtKB ftp link and put it on tazendra here (in the appropriate year and month directory):
  - /home/acedb/kimberly/citace_upload/go/gpad2ace
  - for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
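As a rough illustration of the download step, the Python sketch below builds the dated destination directory and fetches the file. The function names (`dated_dir`, `fetch_gpad`) are hypothetical, and the sketch assumes the directory is named for the year and month of the download, as in the 2015_February example. It is not part of the existing pipeline.

```python
import datetime
import os
import urllib.request

# URL and base directory are taken from the notes above.
GPAD_URL = "ftp://ftp.ebi.ac.uk/pub/contrib/goa/gp_association.6239_wormbase.gz"
BASE_DIR = "/home/acedb/kimberly/citace_upload/go/gpad2ace"

# Spelled out explicitly so the directory name does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def dated_dir(base_dir, day):
    """Return base_dir/YYYY_Month, e.g. .../gpad2ace/2015_February."""
    return os.path.join(base_dir, "%d_%s" % (day.year, MONTHS[day.month - 1]))


def fetch_gpad(base_dir=BASE_DIR, day=None, url=GPAD_URL):
    """Download the weekly gpad file into the dated directory (needs network access)."""
    day = day or datetime.date.today()
    target_dir = dated_dir(base_dir, day)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, target)
    return target
```

Calling `fetch_gpad()` on tazendra would place the file in the directory for the current month; pass `day=` to target a different month.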
- To convert the gpad file to a .ace file you'll need:
  - the gp2protein.wb file, which maps UniProtKB IDs to WBGene IDs
  - the go_gpad_parser.pl script
- The parsing script generates three files:
  - gpad_extra_column - a file that adds the WBGene ID as an extra column (a new column 2) to the gpad file
  - gpad_extra_column.err - a file that indicates:
    - which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
      - IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
        - Q9TZX4
      - IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
        - O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
        - O17541 - near perfect match to kin-24
        - O17542 - closest match is kin-26
        - O17543 - closest match is kin-26
        - O17544 - near perfect match to sid-3
        - O17545 - perfect alignment to kin-32
        - O17546 - perfect alignment to src-1
        - O17911 - ceh-25 = unc-62
      - IDs that correspond to duplicate entries for some protein isoforms - these need to be merged in UniProt (column 3)
        - O17547 - avr-15
        - O17548 - avr-15
        - O17911 - ceh-25 = unc-62
      - IDs that correspond to species other than C. elegans, used for ISS annotations or the occasional annotation extension (column 8 or 12)
        - Q8WXF0
    - which PMIDs don't map to a WBPaper ID
    - which annotation extensions can't be mapped to the model
  - gp_annotation.ace - the .ace file for upload to citace
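To make the extra-column step concrete, here is a minimal Python sketch of the idea. It is not the go_gpad_parser.pl code. The function name `add_wbgene_column` is hypothetical, the mapping is passed in as a plain dictionary rather than read from gp2protein.wb, and dropping unmapped rows from the output is an assumption, since the notes only say that such IDs are reported in the .err file. The PMID and annotation-extension checks are not covered.

```python
def add_wbgene_column(gpad_lines, uniprot_to_wbgene):
    """Insert the WBGene ID as a new column 2 and collect unmapped UniProtKB IDs.

    gpad_lines: iterable of tab-delimited gpad lines.
    uniprot_to_wbgene: dict of UniProtKB accession -> WBGene ID.
    Returns (rows, errors); rows carry the extra column, errors are messages.
    """
    rows, errors = [], []
    for line in gpad_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and gpad header/comment lines
        cols = line.split("\t")
        uniprot_id = cols[1]  # original column 2 (DB_Object_ID)
        wbgene = uniprot_to_wbgene.get(uniprot_id)
        if wbgene is None:
            # Assumption: unmapped rows are reported and left out of the output.
            errors.append("%s: no WBGene mapping (column 3)" % uniprot_id)
            continue
        # The accession moves from column 2 to column 3.
        rows.append("\t".join([cols[0], wbgene] + cols[1:]))
    return rows, errors
```

A call such as `add_wbgene_column(open("gp_association.6239_wormbase"), mapping)` would work on the unzipped file, with `mapping` built from gp2protein.wb by whatever parsing suits that file's layout.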
- The gpad file format is documented on the GOC wiki here:
  - http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format
- Note that because the parsing script inserts the WBGene ID as a new column 2, the column numbers reported in the .err output file are one higher than in the original gpad file for every column from the original column 2 onward
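The column shift can be expressed as a small lookup. The sketch below assumes the GPAD 1.1 column order described on the GOC wiki page; the names `GPAD_COLUMNS` and `err_column` are hypothetical and used only for illustration.

```python
# GPAD 1.1 columns, numbered from 1 (assumed from the GOC format page).
GPAD_COLUMNS = (
    "DB", "DB_Object_ID", "Qualifier", "GO ID", "DB:Reference",
    "Evidence code", "With/From", "Interacting taxon ID", "Date",
    "Assigned_by", "Annotation Extension", "Annotation Properties",
)


def err_column(gpad_column):
    """Map an original gpad column number to its number after the WBGene insert."""
    if not 1 <= gpad_column <= len(GPAD_COLUMNS):
        raise ValueError("gpad column out of range: %r" % gpad_column)
    # Column 1 is unchanged; everything after it moves one to the right.
    return gpad_column if gpad_column == 1 else gpad_column + 1
```

Under this layout, DB_Object_ID (column 2) is reported as column 3, With/From (column 7) as column 8, and Annotation Extension (column 11) as column 12, which matches the column numbers quoted for the .err examples.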
- Possible modifications to pipeline
  - Ask Tony whether the gpad file could include only annotations to UniProt entries that map to a C. elegans gene ID