UniProtKB gpad to WormBase .ace

From WormBaseWiki
Revision as of 16:27, 24 July 2015 by Vanaukenk (talk | contribs)
  • The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
  • Download the file from the UniProtKB FTP link and place it on tazendra, in the appropriate year and month subdirectory of:
    • /home/acedb/kimberly/citace_upload/go/gpad2ace
    • for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
  • To convert the gpad file to a .ace file you'll need:
    • the gp2protein.wb file, which maps UniProtKB IDs to WBGene IDs
    • go_gpad_parser.pl
  • go_gpad_parser.pl generates three files:
    • gpad_extra_column - a copy of the gpad file with the WBGene ID inserted as an extra column (a new column 2)
    • gpad_extra_column.err - a file that indicates:
      • which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
        • IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
          • Q9TZX4
        • IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
          • O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
          • O17541 - near perfect match to kin-24
          • O17542 - closest match is kin-26
          • O17543 - closest match is kin-26
          • O17544 - near perfect match to sid-3
          • O17545 - perfect alignment to kin-32
          • O17546 - perfect alignment to src-1
          • O17911 - perfect alignment to ceh-25 = unc-62
          • O44076 - perfect alignment to cbp-1
          • O44385 - perfect alignment to tat-1 (repeat masked some spans)
          • O61265 - closest match is H19M22.3
        • IDs that correspond to duplicate entries for some full length protein isoforms - these need to be merged in UniProt (column 3)
          • O45049 - lfe-2
          • O45050 - lfe-2
          • O45051 - lfe-2
        • IDs that correspond to full length protein isoforms that only differ between WB and UniProt by one or a few amino acids, i.e. very near match (column 3)
          • O17547 - avr-15
          • O17548 - avr-15
          • O45630 - ncx-2
          • O61304 - abts-1
        • IDs that correspond to non-elegans species, used for ISS annotations or the occasional annotation extension (column 8 or 12)
          • Q8WXF0
      • which PMIDs don't map to a WBPaper ID
      • which annotation extensions can't be mapped to the model
        • To do: add SO_term_relation Text ?SO_term to the ?GO_annotation model
    • gp_annotation.ace - the .ace file for upload to citace
  • Possible modifications to pipeline
    • Ask Tony whether the gpad file could be restricted to annotations on UniProt entries that map to a C. elegans gene ID
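
The mapping step described above (inserting the WBGene ID as a new column 2 and reporting UniProtKB IDs that have no mapping) can be illustrated with a minimal Python sketch. This is not go_gpad_parser.pl itself, which is a Perl script whose internals are not shown here. The function names are hypothetical. The sketch assumes that each gp2protein.wb line has the form `WB:WBGene...<TAB>UniProtKB:ACC;UniProtKB:ACC`, that lines starting with `!` are headers, and that rows with an unmapped ID are left out of the output and only reported; the real script may handle any of these differently.

```python
def load_gp2protein(lines):
    """Build a UniProtKB accession -> WBGene ID map.

    Assumed line format (an assumption, not confirmed by the source):
    WB:WBGene00000001<TAB>UniProtKB:P00001;UniProtKB:P00002
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no protein cross-reference on this line
        gene = fields[0].split(":", 1)[-1]  # drop the "WB:" prefix
        for xref in fields[1].split(";"):
            db, _, acc = xref.strip().partition(":")
            if db == "UniProtKB" and acc:
                mapping[acc] = gene
    return mapping


def add_wbgene_column(gpad_lines, mapping):
    """Return (rows with WBGene ID as new column 2, unmapped accessions)."""
    rows, unmapped = [], []
    for line in gpad_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip gpad header lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed row
        acc = fields[1]  # original column 2: the UniProtKB accession
        gene = mapping.get(acc)
        if gene is None:
            # Assumption: unmapped rows are reported and not written out.
            if acc not in unmapped:
                unmapped.append(acc)
            continue
        # The accession moves to column 3; later columns shift right by one.
        rows.append("\t".join(fields[:1] + [gene] + fields[1:]))
    return rows, unmapped
```

With this layout the UniProtKB accession ends up in column 3 of the output, which matches the column numbers quoted in the error categories above. To run it on real files, pass open file handles for gp2protein.wb and the gpad file as the `lines` arguments.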