Difference between revisions of "UniProtKB gpad to WormBase .ace"
From WormBaseWiki
Latest revision as of 18:52, 3 November 2015
SOPs for Getting Different Species GO Annotations into WB and GOC
C. elegans Annotations
- The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
  - A new file is available every Monday.
  - The file is located here: ftp://ftp.ebi.ac.uk/pub/contrib/goa/ and is named: gp_association.6239_wormbase.gz
- Create a directory named according to the current year and month here:
  - /home/acedb/kimberly/citace_upload/go/gpad2ace
  - for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
- To convert the gpad file to a .ace file you'll need:
  - gp2protein.wb file that maps UniProtKB IDs to WBGenes
  - go_gpad_parser.pl
- The go_gpad_parser.pl script generates three files:
  - gpad_extra_column - a file that adds the WBGene ID as an extra column (a new column 2) to the gpad file
  - gpad_extra_column.err - a file that indicates:
    - which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
      - IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
        - Q9TZX4
      - IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
        - O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
        - O17541 - near perfect match to kin-24
        - O17542 - closest match is kin-26
        - O17543 - closest match is kin-26
        - O17544 - near perfect match to sid-3
        - O17545 - perfect alignment to kin-32
        - O17546 - perfect alignment to src-1
        - O17911 - perfect alignment to ceh-25 = unc-62
        - O44076 - perfect alignment to cbp-1
        - O44385 - perfect alignment to tat-1 (repeat masked some spans)
        - O61265 - closest match is H19M22.3
      - IDs that correspond to duplicate entries for some full length protein isoforms - these need to be merged in UniProt (column 3)
        - O45049 - lfe-2
        - O45050 - lfe-2
        - O45051 - lfe-2
      - IDs that correspond to full length protein isoforms that only differ between WB and UniProt by one or a few amino acids, i.e. very near match (column 3)
        - O17547 - avr-15
        - O17548 - avr-15
        - four isoform IDs - ncx-2
        - O45630 - ncx-2
        - O61304 - abts-1
        - G5EFT5 - hsf-1
        - Q6Q4G5 - hsf-1
      - IDs that correspond to non-elegans species, used for ISS annotations or the occasional annotation extension (column 8 or 12)
        - Q8WXF0
    - which PMIDs don't map to a WBPaper ID
    - which annotation extensions can't be mapped to the model
      - Need to add SO_term_relation Text ?SO_term to ?GO_annotation model
    - which UniProtKB IDs are not converted properly in the annotation extensions (column 12)
      - In the script, this corresponds to the code starting at ~line 100
        - This conversion works:
          - UniProtKB Q20616 involved_in GO:0010629 PMID:24621828 ECO:0000315 20150716 UniProt_GOA has_regulation_target(UniProtKB:Q23546) go_evidence=IMP
        - To give:
          - UniProtKB WBGene00009897 Q20616 involved_in GO:0010629 PMID:24621828 ECO:0000315 20150716 UniProt_GOA has_regulation_target(UniProtKB:WBGene00014046) go_evidence=IMP
        - This conversion works:
          - UniProtKB O44408 enables GO:0004674 PMID:23437011 ECO:0000314 20150129 WormBase has_direct_input(UniProtKB:G5EDK8) go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
        - To give:
          - UniProtKB WBGene00002187 O44408 enables GO:0004674 PMID:23437011 ECO:0000314 20150129 WormBase has_direct_input(UniProtKB:G5EDK8) go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
        - But when the Annotation Extension column has multiple, comma-separated values, the conversion doesn't work:
          - UniProtKB WBGene00000906 G5EFW7 involved_in GO:0097500 PMID:24646679 ECO:0000315 20140418 WormBase has_input(UniProtKB:Q18571),occurs_in(WBbt:0005671) go_evidence=IMP|id=2113710657|curator_name=Kimberly Van Auken
          - UniProtKB WBGene00020094 H2KZ22 involved_in GO:0031647 PMID:16378591 ECO:0000315 20130529 WormBase has_regulation_target(UniProtKB:Q17795),has_regulation_target(UniProtKB:Q8MQE6) go_evidence=IMP|id=2113492388|curator_name=Kimberly Van Auken
  - gp_annotation.ace - the .ace file for upload to citace
- The gpad file format is documented on the GOC wiki here:
  - http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format
  - Note that once the parsing script runs and we add an additional column, the column numbers shift one higher in the .err output file
- Possible modifications to pipeline
  - Ask Tony to include WBGene IDs as DB_xref in the gpi files; would cover ID mappings if gp2protein file gets retired
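
The multi-value Annotation Extension problem described above (column 12 of gpad_extra_column) could be approached along the lines of the following sketch. It is written in Python purely for illustration and is not the code in go_gpad_parser.pl. The function name convert_extension and the mapping dictionary are hypothetical. Following the converted example above, it keeps the UniProtKB: prefix and swaps only the accession. Accessions with no mapping are left unchanged and returned so they can be written to the .err file.

```python
import re

# Matches every "UniProtKB:<accession>" in an extension string, no matter how
# many comma-separated relations the field holds. The trailing \b means that
# already-converted values such as "UniProtKB:WBGene00014046" are not matched.
UNIPROT_REF = re.compile(r"UniProtKB:([A-Z0-9]+)\b")


def convert_extension(extension, uniprot_to_wbgene):
    """Replace each mapped UniProtKB accession with its WBGene ID.

    uniprot_to_wbgene is a hypothetical dict of accession -> WBGene ID,
    e.g. built from the gp2protein.wb file.
    Returns (converted_string, list_of_unmapped_accessions).
    """
    unmapped = []

    def swap(match):
        accession = match.group(1)
        gene_id = uniprot_to_wbgene.get(accession)
        if gene_id is None:
            # Leave the value as-is and report it (e.g. non-elegans IDs).
            unmapped.append(accession)
            return match.group(0)
        return "UniProtKB:" + gene_id

    return UNIPROT_REF.sub(swap, extension), unmapped


if __name__ == "__main__":
    # Only Q23546 -> WBGene00014046 is taken from the example above.
    mapping = {"Q23546": "WBGene00014046"}
    field = "has_regulation_target(UniProtKB:Q23546),occurs_in(WBbt:0005671)"
    print(convert_extension(field, mapping))
```

Because the substitution runs over the whole field rather than assuming a single relation, the same call covers one value, several comma-separated values, and fields that mix UniProtKB IDs with other identifiers such as WBbt terms.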
Other Core Species Annotations
- Other Core Species to Include in .ace Upload:
  - C. briggsae
  - C. remanei
  - C. brenneri
  - C. japonica
  - P. pacificus
  - O. volvulus
  - B. malayi
  - S. ratti
- Pipeline will be the same as for C. elegans, i.e. gpad -> .ace -> GAF
- Will need a gp2protein file for these species (could be combined)
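
As a rough illustration of how gp2protein files for several species could be combined into one lookup, here is a minimal Python sketch. It assumes, and this is an assumption rather than something stated on this page, that each gp2protein line has two tab-separated columns: a gene ID such as WB:WBGene00009897, then one or more semicolon-separated protein IDs such as UniProtKB:Q20616, with comment lines starting with "!". The function name load_gp2protein is hypothetical.

```python
def load_gp2protein(lines, mapping=None):
    """Add UniProtKB accession -> gene ID pairs from gp2protein lines.

    Assumed layout (not confirmed by this page): "<DB>:<gene ID>" in
    column 1, semicolon-separated protein IDs in column 2, "!" comments.
    Pass an existing mapping to merge several species into one dict.
    """
    mapping = {} if mapping is None else mapping
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # gene with no protein column
        gene_id = fields[0].split(":", 1)[-1]  # drop the "WB:" prefix
        for protein in fields[1].split(";"):
            if protein.startswith("UniProtKB:"):
                mapping[protein.split(":", 1)[1]] = gene_id
    return mapping


if __name__ == "__main__":
    # The two ID pairs below are taken from the examples on this page.
    elegans = [
        "!example header",
        "WB:WBGene00009897\tUniProtKB:Q20616",
        "WB:WBGene00002187\tUniProtKB:O44408",
    ]
    combined = load_gp2protein(elegans)
    # Further species files would be merged by passing `combined` back in:
    # combined = load_gp2protein(open(other_species_path), combined)
    print(combined)
```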
Non-Core Species Annotations
- Non-Core Species to Include in GOC Submissions:
- Pipeline will need to append annotations to existing GAFs
- Either convert gpad to GAF using gpi (note that gpad doesn't include taxon IDs) or download species-specific GAF from: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/
- gpad is dumped out weekly; GAF is dumped out monthly
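
Since gpad rows carry no taxon ID, the gpad-plus-gpi route needs a lookup from the gpi file. The sketch below shows only that lookup step in Python and is not a full GAF writer. The column positions are assumptions based on the GPAD/GPI 1.1 layout (object ID in column 2 of both files, taxon in column 7 of the gpi) and are exposed as parameters so they can be adjusted. The function names and the sample accession are hypothetical.

```python
def load_gpi_taxa(gpi_lines, id_col=1, taxon_col=6):
    """Map DB_Object_ID -> taxon from gpi lines (0-based column indexes).

    The default columns are assumptions based on GPI 1.1.
    """
    taxa = {}
    for line in gpi_lines:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > max(id_col, taxon_col):
            taxa[fields[id_col]] = fields[taxon_col]
    return taxa


def attach_taxon(gpad_lines, taxa, id_col=1):
    """Yield (gpad_fields, taxon) pairs; taxon is None if the ID is unknown."""
    for line in gpad_lines:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        yield fields, taxa.get(fields[id_col])


if __name__ == "__main__":
    # Hypothetical sample rows; A0A000 is not a real accession.
    gpi = ["!gpi-version: 1.1",
           "UniProtKB\tA0A000\tsym\tname\t\tprotein\ttaxon:6238\t\t\t"]
    gpad = ["!gpa-version: 1.1",
            "UniProtKB\tA0A000\tinvolved_in\tGO:0008150\tPMID:0\t"
            "ECO:0000315\t\t\t20150101\tWormBase\t\tgo_evidence=IMP"]
    for fields, taxon in attach_taxon(gpad, load_gpi_taxa(gpi)):
        print(fields[1], taxon)
```

Rows whose taxon comes back as None would need to be reported rather than written out, in the same spirit as the .err file used for C. elegans.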