UniProtKB gpad to WormBase .ace

Revision as of 16:29, 24 July 2015

  • The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
  • Download the file from the UniProtKB FTP link and place it on tazendra, in the appropriate year and month directory under:
    • /home/acedb/kimberly/citace_upload/go/gpad2ace
    • for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
  • To convert the gpad file to a .ace file you'll need:
    • gp2protein.wb file that maps UniProtKB IDs to WBGenes
    • go_gpad_parser.pl
  • The script generates three files:
    • gpad_extra_column - a copy of the gpad file with the WBGene ID added as an extra column (a new column 2)
    • gpad_extra_column.err - a file that indicates:
      • which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
        • IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
          • Q9TZX4
        • IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
          • O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
          • O17541 - near perfect match to kin-24
          • O17542 - closest match is kin-26
          • O17543 - closest match is kin-26
          • O17544 - near perfect match to sid-3
          • O17545 - perfect alignment to kin-32
          • O17546 - perfect alignment to src-1
          • O17911 - perfect alignment to ceh-25 = unc-62
          • O44076 - perfect alignment to cbp-1
          • O44385 - perfect alignment to tat-1 (repeat masked some spans)
          • O61265 - closest match is H19M22.3
        • IDs that correspond to duplicate entries for some full-length protein isoforms; these need to be merged in UniProt (column 3)
          • O45049 - lfe-2
          • O45050 - lfe-2
          • O45051 - lfe-2
        • IDs that correspond to full-length protein isoforms that differ between WB and UniProt by only one or a few amino acids, i.e. a very near match (column 3)
          • O17547 - avr-15
          • O17548 - avr-15
          • O45630 - ncx-2
          • O61304 - abts-1
        • IDs that correspond to non-elegans species, used for ISS annotations or the occasional annotation extension (column 8 or 12)
          • Q8WXF0
      • which PMIDs don't map to a WBPaper ID
      • which annotation extensions can't be mapped to the model
        • Need to add SO_term_relation Text ?SO_term to ?GO_annotation model
      • which UniProtKB IDs are not converted properly in the annotation extensions (column 12)
        • In the script, this corresponds to the code starting at ~line 100
          • This works:
    • gp_annotation.ace - the .ace file for upload to citace
  • Possible modifications to pipeline
    • Ask Tony whether the gpad file could include only annotations to UniProt entries that map to a C. elegans gene ID
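
The mapping step described above (adding the WBGene ID as a new column 2 and reporting UniProtKB IDs that have no match) can be sketched as follows. This is a minimal Python illustration, not the actual go_gpad_parser.pl code. It assumes that gp2protein.wb uses the common gp2protein layout of `WB:WBGene…<TAB>UniProtKB:…;UniProtKB:…` and that the gpad file is tab-delimited with the UniProtKB accession in column 2. The function names and the mapped example IDs are hypothetical.

```python
def load_gp2protein(lines):
    """Build {UniProtKB accession: WBGene ID} from gp2protein-style lines.

    Assumed layout (not confirmed by the source page):
    'WB:WBGene00000000<TAB>UniProtKB:P00000;UniProtKB:P00001'
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        gene, _, proteins = line.partition("\t")
        gene_id = gene.split(":", 1)[-1]  # drop the 'WB:' prefix
        for protein in proteins.split(";"):
            db, _, accession = protein.strip().partition(":")
            if db == "UniProtKB" and accession:
                mapping[accession] = gene_id
    return mapping


def add_wbgene_column(gpad_lines, mapping):
    """Insert the WBGene ID as a new column 2 of each gpad row.

    Returns (rows_with_gene, unmapped_accessions). Rows whose accession
    has no WBGene mapping are left out of the first list and their
    accession is reported in the second, similar in spirit to the
    .err file described above.
    """
    mapped_rows, unmapped = [], []
    for line in gpad_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue  # malformed row: no accession column
        accession = cols[1]
        gene_id = mapping.get(accession)
        if gene_id is None:
            unmapped.append(accession)
        else:
            # The accession moves from column 2 to column 3.
            mapped_rows.append("\t".join(cols[:1] + [gene_id] + cols[1:]))
    return mapped_rows, unmapped


# Made-up mapping entry; Q9TZX4 is listed above as an ID with no WBGene match.
example_map = load_gp2protein(["WB:WBGene00000000\tUniProtKB:P00000\n"])
example_rows, example_unmapped = add_wbgene_column(
    [
        "!gpa-version: 1.1\n",
        "UniProtKB\tP00000\tinvolved_in\tGO:0008150\n",
        "UniProtKB\tQ9TZX4\tenables\tGO:0003674\n",
    ],
    example_map,
)
```

Dropping unmapped rows from the first list also approximates the possible modification noted above, where only annotations to entries that map to a C. elegans gene ID would be kept.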