Difference between revisions of "UniProtKB gpad to WormBase .ace"

From WormBaseWiki
(Created page with "The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis. A new file is available every Monday. The f...")
 
 
(58 intermediate revisions by the same user not shown)
Line 1: Line 1:
The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
+
= SOPs for Getting Different Species GO Annotations into WB and GOC =
  
A new file is available every Monday.
+
==C. elegans Annotations==
 +
*The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
 +
**A new file is available every Monday.
 +
**The file is located here: ftp://ftp.ebi.ac.uk/pub/contrib/goa/ and is named: gp_association.6239_wormbase.gz
  
The file is located here:
+
*Create a directory named according to the current year and month here:
 +
**/home/acedb/kimberly/citace_upload/go/gpad2ace
 +
**for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
  
ftp://ftp.ebi.ac.uk/pub/contrib/goa/
+
*To convert the gpad file to a .ace file you'll need:
 +
**gp2protein.wb file that maps UniProtKB IDs to WBGenes
 +
**go_gpad_parser.pl
  
and is named: gp_association.6239_wormbase
+
*The go_gpad_parser.pl generates three files:
 +
**gpad_extra_column - a file that adds the WBGene ID as an extra column (a new column 2) to the gpad file
 +
**gpad_extra_column.err - a file that indicates:
 +
***which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
 +
****IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
 +
*****Q9TZX4
 +
****IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
 +
*****O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
 +
*****O17541 - near perfect match to kin-24
 +
*****O17542 - closest match is kin-26
 +
*****O17543 - closest match is kin-26
 +
*****O17544 - near perfect match to sid-3
 +
*****O17545 - perfect alignment to kin-32
 +
*****O17546 - perfect alignment to src-1
 +
*****O17911 - perfect alignment to ceh-25 = unc-62
 +
*****O44076 - perfect alignment to cbp-1
 +
*****O44385 - perfect alignment to tat-1 (repeat masked some spans)
 +
*****O61265 - closest match is H19M22.3
 +
****IDs that correspond to duplicate entries for some full length protein isoforms - these need to be merged in UniProt (column 3)
 +
*****O45049 - lfe-2
 +
*****O45050 - lfe-2
 +
*****O45051 - lfe-2
 +
****IDs that correspond to full length protein isoforms that only differ between WB and UniProt by one or a few amino acids, i.e. very near match (column 3)
 +
*****O17547 - avr-15
 +
*****O17548 - avr-15
 +
*****four isoform IDs - ncx-2
 +
*****O45630 - ncx-2
 +
*****O61304 - abts-1
 +
*****G5EFT5 - hsf-1
 +
*****Q6Q4G5 - hsf-1
 +
****IDs that correspond to non-elegans species, used for ISS annotations or the occasional annotation extension (column 8 or 12)
 +
*****Q8WXF0
 +
***which PMIDs don't map to a WBPaper ID
 +
***which annotation extensions can't be mapped to the model
 +
****Need to add SO_term_relation Text ?SO_term  to ?GO_annotation model
 +
***'''which UniProtKB IDs are not converted properly in the annotation extensions (column 12)'''
 +
****In the script, this corresponds to the code starting at ~line 100
 +
*****This conversion works:
 +
******UniProtKB  Q20616  involved_in GO:0010629  PMID:24621828  ECO:0000315        20150716    UniProt_GOA has_regulation_target(UniProtKB:Q23546) go_evidence=IMP
 +
*****To give:
 +
******UniProtKB  WBGene00009897  Q20616  involved_in GO:0010629  PMID:24621828  ECO:0000315        20150716    UniProt_GOA has_regulation_target(UniProtKB:WBGene00014046) go_evidence=IMP
 +
*****This conversion works:
 +
******UniProtKB  O44408  enables GO:0004674  PMID:23437011  ECO:0000314        20150129    WormBase    has_direct_input(UniProtKB:G5EDK8)  go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
 +
*****To give:
 +
******UniProtKB  WBGene00002187  O44408  enables GO:0004674  PMID:23437011  ECO:0000314        20150129    WormBase    has_direct_input(UniProtKB:G5EDK8)  go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
 +
*****But when the Annotation Extension column has multiple, comma-separated values, the conversion doesn't work:
 +
******UniProtKB  WBGene00000906  G5EFW7  involved_in GO:0097500  PMID:24646679  ECO:0000315        20140418    WormBase    has_input(UniProtKB:Q18571),occurs_in(WBbt:0005671) go_evidence=IMP|id=2113710657|curator_name=Kimberly Van Auken
 +
******UniProtKB  WBGene00020094  H2KZ22  involved_in GO:0031647  PMID:16378591  ECO:0000315        20130529    WormBase    has_regulation_target(UniProtKB:Q17795),has_regulation_target(UniProtKB:Q8MQE6) go_evidence=IMP|id=2113492388|curator_name=Kimberly Van Auken
 +
**gp_annotation.ace - the .ace file for upload to citace
 +
*The gpad file format is documented on the GOC wiki here:
 +
**http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format
 +
**Note that once the parsing script runs and we add an additional column, the column numbers shift one higher in the .err output file
 +
 
 +
*Possible modifications to pipeline
 +
**Ask Tony to include WBGene IDs as DB_xref in the gpi files; would cover ID mappings if gp2protein file gets retired
 +
 
 +
==Other Core Species Annotations==
 +
*Other Core Species to Include in .ace Upload:
 +
**C. briggsae
 +
**C. remanei
 +
**C. brenneri
 +
**C. japonica
 +
**P. pacificus
 +
**O. volvulus
 +
**B. malayi
 +
**S. ratti
 +
*Pipeline will be the same as for C. elegans, i.e. gpad -> .ace -> GAF
 +
*Will need a gp2protein file for these species (could be combined)
 +
 
 +
==Non-Core Species Annotations==
 +
*Non-Core Species to Include in GOC Submissions:
 +
**
 +
*Pipeline will need to append annotations to existing GAFs
 +
*Either convert gpad to GAF using gpi (note that gpad doesn't include taxon IDs) or download species-specific GAF from:  ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/
 +
*gpad is dumped out weekly; GAF is dumped out monthly

Latest revision as of 18:52, 3 November 2015

SOPs for Getting Different Species GO Annotations into WB and GOC

C. elegans Annotations

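  • The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
    • A new file is available every Monday.
    • The file is located here: ftp://ftp.ebi.ac.uk/pub/contrib/goa/ and is named: gp_association.6239_wormbase.gz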
  • Create a directory named according to the current year and month here:
    • /home/acedb/kimberly/citace_upload/go/gpad2ace
    • for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
  • To convert the gpad file to a .ace file you'll need:
    • gp2protein.wb file that maps UniProtKB IDs to WBGenes
    • go_gpad_parser.pl
  • The go_gpad_parser.pl generates three files:
    • gpad_extra_column - a file that adds the WBGene ID as an extra column (a new column 2) to the gpad file
    • gpad_extra_column.err - a file that indicates:
      • which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
        • IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
          • Q9TZX4
        • IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
          • O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
          • O17541 - near perfect match to kin-24
          • O17542 - closest match is kin-26
          • O17543 - closest match is kin-26
          • O17544 - near perfect match to sid-3
          • O17545 - perfect alignment to kin-32
          • O17546 - perfect alignment to src-1
          • O17911 - perfect alignment to ceh-25 = unc-62
          • O44076 - perfect alignment to cbp-1
          • O44385 - perfect alignment to tat-1 (repeat masked some spans)
          • O61265 - closest match is H19M22.3
        • IDs that correspond to duplicate entries for some full length protein isoforms - these need to be merged in UniProt (column 3)
          • O45049 - lfe-2
          • O45050 - lfe-2
          • O45051 - lfe-2
        • IDs that correspond to full length protein isoforms that only differ between WB and UniProt by one or a few amino acids, i.e. very near match (column 3)
          • O17547 - avr-15
          • O17548 - avr-15
          • four isoform IDs - ncx-2
          • O45630 - ncx-2
          • O61304 - abts-1
          • G5EFT5 - hsf-1
          • Q6Q4G5 - hsf-1
        • IDs that correspond to non-elegans species, used for ISS annotations or the occasional annotation extension (column 8 or 12)
          • Q8WXF0
      • which PMIDs don't map to a WBPaper ID
      • which annotation extensions can't be mapped to the model
        • Need to add SO_term_relation Text ?SO_term to ?GO_annotation model
      • which UniProtKB IDs are not converted properly in the annotation extensions (column 12)
        • In the script, this corresponds to the code starting at ~line 100
          • This conversion works:
            • UniProtKB Q20616 involved_in GO:0010629 PMID:24621828 ECO:0000315 20150716 UniProt_GOA has_regulation_target(UniProtKB:Q23546) go_evidence=IMP
          • To give:
            • UniProtKB WBGene00009897 Q20616 involved_in GO:0010629 PMID:24621828 ECO:0000315 20150716 UniProt_GOA has_regulation_target(UniProtKB:WBGene00014046) go_evidence=IMP
          • This conversion works:
            • UniProtKB O44408 enables GO:0004674 PMID:23437011 ECO:0000314 20150129 WormBase has_direct_input(UniProtKB:G5EDK8) go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
          • To give:
            • UniProtKB WBGene00002187 O44408 enables GO:0004674 PMID:23437011 ECO:0000314 20150129 WormBase has_direct_input(UniProtKB:G5EDK8) go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
          • But when the Annotation Extension column has multiple, comma-separated values, the conversion doesn't work:
            • UniProtKB WBGene00000906 G5EFW7 involved_in GO:0097500 PMID:24646679 ECO:0000315 20140418 WormBase has_input(UniProtKB:Q18571),occurs_in(WBbt:0005671) go_evidence=IMP|id=2113710657|curator_name=Kimberly Van Auken
            • UniProtKB WBGene00020094 H2KZ22 involved_in GO:0031647 PMID:16378591 ECO:0000315 20130529 WormBase has_regulation_target(UniProtKB:Q17795),has_regulation_target(UniProtKB:Q8MQE6) go_evidence=IMP|id=2113492388|curator_name=Kimberly Van Auken
    • gp_annotation.ace - the .ace file for upload to citace
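  • The gpad file format is documented on the GOC wiki here:
    • http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format
    • Note that once the parsing script runs and we add an additional column, the column numbers shift one higher in the .err output file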
  • Possible modifications to pipeline
    • Ask Tony to include WBGene IDs as DB_xref in the gpi files; would cover ID mappings if gp2protein file gets retired
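
The column insertion and annotation-extension conversion described above can be sketched as follows. This is a minimal Python illustration, not the go_gpad_parser.pl code: the function names are hypothetical, the UniProtKB-to-WBGene mapping is passed in as a plain dict rather than read from gp2protein.wb, and the tab-separated layout (annotation extension in column 11 of the incoming gpad line, column 12 once the WBGene column is added) is an assumption based on the examples above. Each comma-separated extension is converted on its own, which is one possible way to cover the multi-value case that currently fails.

```python
import re

# A UniProtKB accession inside an extension, e.g. has_input(UniProtKB:Q18571)
UNIPROT_REF = re.compile(r"UniProtKB:([\w-]+)")


def convert_extensions(field, uniprot_to_wbgene):
    """Swap UniProtKB accessions for WBGene IDs in every extension of a field.

    Returns the converted field and the list of accessions with no mapping.
    """
    unmapped = []

    def swap(match):
        gene = uniprot_to_wbgene.get(match.group(1))
        if gene is None:
            unmapped.append(match.group(1))  # candidate entry for the .err file
            return match.group(0)            # leave the reference unchanged
        return "UniProtKB:" + gene           # prefix kept, as in the first example

    groups = []
    for group in field.split("|"):           # assumption: '|' separates alternatives
        items = [UNIPROT_REF.sub(swap, item) for item in group.split(",")]
        groups.append(",".join(items))
    return "|".join(groups), unmapped


def add_gene_column(gpad_line, uniprot_to_wbgene, ext_index=10):
    """Insert the WBGene ID as a new column 2 and convert the extension column."""
    cols = gpad_line.rstrip("\n").split("\t")
    gene = uniprot_to_wbgene.get(cols[1])
    if gene is None:
        return None, [cols[1]]               # annotated object itself is unmapped
    unmapped = []
    if len(cols) > ext_index and cols[ext_index]:
        cols[ext_index], unmapped = convert_extensions(
            cols[ext_index], uniprot_to_wbgene
        )
    cols.insert(1, gene)                     # later columns shift one higher
    return "\t".join(cols), unmapped


# Both ID pairs are taken from the worked example above.
mapping = {"Q20616": "WBGene00009897", "Q23546": "WBGene00014046"}
example = "\t".join([
    "UniProtKB", "Q20616", "involved_in", "GO:0010629", "PMID:24621828",
    "ECO:0000315", "", "", "20150716", "UniProt_GOA",
    "has_regulation_target(UniProtKB:Q23546)", "go_evidence=IMP",
])
converted, missing = add_gene_column(example, mapping)
```

Accessions without a mapping are returned rather than dropped, so a caller could write them to an error report in the same spirit as gpad_extra_column.err.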

Other Core Species Annotations

  • Other Core Species to Include in .ace Upload:
    • C. briggsae
    • C. remanei
    • C. brenneri
    • C. japonica
    • P. pacificus
    • O. volvulus
    • B. malayi
    • S. ratti
  • Pipeline will be the same as for C. elegans, i.e. gpad -> .ace -> GAF
  • Will need a gp2protein file for these species (could be combined)

Non-Core Species Annotations

  • Non-Core Species to Include in GOC Submissions:
  • Pipeline will need to append annotations to existing GAFs
  • Either convert gpad to GAF using gpi (note that gpad doesn't include taxon IDs) or download species-specific GAF from: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/
  • gpad is dumped out weekly; GAF is dumped out monthly
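
Because gpad lines carry no taxon ID, a gpad-to-GAF conversion would have to look the taxon up in the gpi file. The sketch below shows only that lookup, in Python. The function names are hypothetical, and the column positions (object ID in gpi column 2, taxon in gpi column 7) are assumptions that should be checked against the GOC format page before use. The sample gpi row is invented for illustration: the symbol and name are placeholders, and the taxon reuses 6239 from the C. elegans file name.

```python
def load_taxa(gpi_lines, id_col=1, taxon_col=6):
    """Build an object ID -> taxon lookup from gpi lines (0-based column indexes)."""
    taxa = {}
    for line in gpi_lines:
        if line.startswith("!") or not line.strip():
            continue                          # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) > max(id_col, taxon_col):
            taxa[cols[id_col]] = cols[taxon_col]
    return taxa


def taxon_for(gpad_line, taxa, id_col=1):
    """Return the taxon for a gpad line's object ID, or None if it is unknown."""
    cols = gpad_line.rstrip("\n").split("\t")
    return taxa.get(cols[id_col])


# Invented sample data, for illustration only.
gpi_sample = [
    "!gpi-version: 1.1",
    "\t".join(["UniProtKB", "Q20616", "placeholder-symbol", "placeholder name",
               "", "protein", "taxon:6239", "", "", ""]),
]
taxa = load_taxa(gpi_sample)
gpad = "UniProtKB\tQ20616\tinvolved_in\tGO:0010629"
taxon = taxon_for(gpad, taxa)
```

A line whose object ID is missing from the gpi file yields None, which a real pipeline would probably want to report rather than write out as an incomplete GAF row.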