Difference between revisions of "UniProtKB gpad to WormBase .ace"

From WormBaseWiki

Revision as of 14:45, 20 August 2015

C. elegans Annotations

  • The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
  • Download the file from the UniProtKB ftp link and place it on tazendra in the appropriate year and month directory under:
    • /home/acedb/kimberly/citace_upload/go/gpad2ace
    • for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
  • To convert the gpad file to a .ace file you'll need:
    • the gp2protein.wb file, which maps UniProtKB IDs to WBGene IDs
    • the go_gpad_parser.pl script
  • go_gpad_parser.pl generates three files:
    • gpad_extra_column - a file that adds the WBGene ID as an extra column (a new column 2) to the gpad file
    • gpad_extra_column.err - a file that indicates:
      • which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
        • IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
          • Q9TZX4
        • IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
          • O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
          • O17541 - near perfect match to kin-24
          • O17542 - closest match is kin-26
          • O17543 - closest match is kin-26
          • O17544 - near perfect match to sid-3
          • O17545 - perfect alignment to kin-32
          • O17546 - perfect alignment to src-1
          • O17911 - perfect alignment to ceh-25 = unc-62
          • O44076 - perfect alignment to cbp-1
          • O44385 - perfect alignment to tat-1 (repeat masked some spans)
          • O61265 - closest match is H19M22.3
        • IDs that correspond to duplicate entries for some full length protein isoforms - these need to be merged in UniProt (column 3)
          • O45049 - lfe-2
          • O45050 - lfe-2
          • O45051 - lfe-2
        • IDs that correspond to full length protein isoforms that only differ between WB and UniProt by one or a few amino acids, i.e. very near match (column 3)
          • O17547 - avr-15
          • O17548 - avr-15
          • four isoform IDs - ncx-2
          • O45630 - ncx-2
          • O61304 - abts-1
          • G5EFT5 - hsf-1
          • Q6Q4G5 - hsf-1
        • IDs that correspond to non-elegans species, used for ISS annotations or the occasional annotation extension (column 8 or 12)
          • Q8WXF0
      • which PMIDs don't map to a WBPaper ID
      • which annotation extensions can't be mapped to the model
        • Need to add SO_term_relation Text ?SO_term to ?GO_annotation model
      • which UniProtKB IDs are not converted properly in the annotation extensions (column 12)
        • In the script, this corresponds to the code starting at ~line 100
          • This conversion works:
            • UniProtKB Q20616 involved_in GO:0010629 PMID:24621828 ECO:0000315 20150716 UniProt_GOA has_regulation_target(UniProtKB:Q23546) go_evidence=IMP
          • To give:
            • UniProtKB WBGene00009897 Q20616 involved_in GO:0010629 PMID:24621828 ECO:0000315 20150716 UniProt_GOA has_regulation_target(UniProtKB:WBGene00014046) go_evidence=IMP
          • This conversion works:
            • UniProtKB O44408 enables GO:0004674 PMID:23437011 ECO:0000314 20150129 WormBase has_direct_input(UniProtKB:G5EDK8) go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
          • To give:
            • UniProtKB WBGene00002187 O44408 enables GO:0004674 PMID:23437011 ECO:0000314 20150129 WormBase has_direct_input(UniProtKB:G5EDK8) go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
          • But when the Annotation Extension column has multiple, comma-separated values, the conversion doesn't work:
            • UniProtKB WBGene00000906 G5EFW7 involved_in GO:0097500 PMID:24646679 ECO:0000315 20140418 WormBase has_input(UniProtKB:Q18571),occurs_in(WBbt:0005671) go_evidence=IMP|id=2113710657|curator_name=Kimberly Van Auken
            • UniProtKB WBGene00020094 H2KZ22 involved_in GO:0031647 PMID:16378591 ECO:0000315 20130529 WormBase has_regulation_target(UniProtKB:Q17795),has_regulation_target(UniProtKB:Q8MQE6) go_evidence=IMP|id=2113492388|curator_name=Kimberly Van Auken
    • gp_annotation.ace - the .ace file for upload to citace
  • The gpad file format is documented on the GOC wiki.
  • Possible modifications to pipeline
    • Ask Tony to include WBGene IDs as DB_xref in the gpi files; this would cover the ID mappings if the gp2protein file is retired
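
The ID handling described above (adding the WBGene ID as a new column 2, and swapping UniProtKB accessions inside the annotation extensions) could be sketched in Python roughly as follows. This is an illustrative sketch only, not the code in go_gpad_parser.pl: the names UP2WB, convert_extensions and add_gene_column are hypothetical, the mapping holds only ID pairs that appear in the examples above, and the sketch assumes tab-separated gpad lines in which the annotation extension is the 11th column before the new column is inserted. Substituting every UniProtKB reference in the field, rather than only the first, is one possible way to cover multiple, comma-separated extensions.

```python
import re

# Hypothetical mapping; in the pipeline this would come from gp2protein.wb.
# Only pairs shown in the examples above are listed.
UP2WB = {
    "Q20616": "WBGene00009897",
    "Q23546": "WBGene00014046",
    "O44408": "WBGene00002187",
}

# A UniProtKB accession as it appears inside an extension,
# e.g. has_input(UniProtKB:Q18571). Isoform suffixes are not handled.
UNIPROT_REF = re.compile(r"UniProtKB:([A-Z0-9]+)")


def convert_extensions(ext_field, up2wb):
    """Swap every mapped UniProtKB accession in the extension field.

    Returns the converted field and a list of accessions with no mapping.
    The UniProtKB: prefix is kept, mirroring the converted example above.
    """
    unmapped = []

    def swap(match):
        accession = match.group(1)
        gene = up2wb.get(accession)
        if gene is None:
            unmapped.append(accession)
            return match.group(0)  # leave the reference untouched
        return "UniProtKB:" + gene

    # re.sub visits all matches, so comma-separated values are all covered.
    return UNIPROT_REF.sub(swap, ext_field), unmapped


def add_gene_column(line, up2wb):
    """Insert the WBGene ID as a new column 2 and convert the extensions.

    Returns the new tab-separated line and a list of error messages.
    """
    cols = line.rstrip("\n").split("\t")
    errors = []
    gene = up2wb.get(cols[1])
    if gene is None:
        errors.append("no WBGene ID for " + cols[1])
        gene = ""
    if len(cols) > 10:  # assumed position of the annotation extension
        cols[10], unmapped = convert_extensions(cols[10], up2wb)
        errors.extend("no WBGene ID for extension ID " + a for a in unmapped)
    cols.insert(1, gene)
    return "\t".join(cols), errors


# Constructed example combining extension values seen in the lines above.
example = "has_regulation_target(UniProtKB:Q23546),occurs_in(WBbt:0005671)"
print(convert_extensions(example, UP2WB))
```

Accessions without a mapping are left unchanged and returned to the caller, so they could be reported in the same way as the entries in gpad_extra_column.err.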

Other Core Species Annotations

  • Other Core Species to Include in .ace Upload:
    • C. briggsae
    • C. remanei
    • C. brenneri
    • C. japonica
    • P. pacificus
    • O. volvulus
    • B. malayi
    • S. ratti
  • Pipeline will be the same as for C. elegans
  • Will need a gp2protein file for these species
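
Because each additional species needs its own gp2protein file, loading the UniProtKB-to-WBGene mapping is a step that could be shared across species. The Python sketch below shows one way to do that; it is not part of go_gpad_parser.pl. The function name load_gp2protein is hypothetical, and the file layout is an assumption: two tab-separated columns, a gene ID such as WB:WBGene00009897 followed by one or more protein IDs separated by semicolons, with comment lines starting with "!". The actual files should be checked before relying on this.

```python
def load_gp2protein(lines):
    """Build a UniProtKB accession -> WBGene ID dictionary.

    Assumes each data line looks like:
        WB:WBGene00009897<TAB>UniProtKB:Q20616
    with optional further protein IDs separated by ';'.
    """
    up2wb = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # not a two-column mapping line
        gene = parts[0].split(":", 1)[-1]  # drop the "WB:" prefix
        for protein_id in parts[1].split(";"):
            db, _, accession = protein_id.partition(":")
            if db == "UniProtKB" and accession:
                up2wb[accession] = gene
    return up2wb


# Example using ID pairs from the C. elegans examples above.
sample = [
    "WB:WBGene00009897\tUniProtKB:Q20616",
    "WB:WBGene00002187\tUniProtKB:O44408",
]
print(load_gp2protein(sample))
```

Taking an iterable of lines rather than a file path means the same function would work on an open file handle for any species' file, for example `load_gp2protein(open(path))` with a path chosen per species.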