Latest revision as of 18:52, 3 November 2015

SOPs for Getting Different Species GO Annotations into WB and GOC

C. elegans Annotations

The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
- A new file is available every Monday.
- The file is located here: ftp://ftp.ebi.ac.uk/pub/contrib/goa/ and is named: gp_association.6239_wormbase.gz

Create a directory named for the current year and month under the path below, and put the downloaded gpad file in it:
- /home/acedb/kimberly/citace_upload/go/gpad2ace
- for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
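
The download location and the year-and-month directory name can be derived programmatically. The following is a minimal Python sketch, not part of the existing pipeline: the function names `run_directory` and `download_gpad` are hypothetical, and the English month names are spelled out so the result does not depend on the machine's locale.

```python
import datetime
import os
import urllib.request

# URL and base path are taken from the text above.
GPAD_URL = "ftp://ftp.ebi.ac.uk/pub/contrib/goa/gp_association.6239_wormbase.gz"
BASE_DIR = "/home/acedb/kimberly/citace_upload/go/gpad2ace"

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def run_directory(day, base_dir=BASE_DIR):
    """Return the <base_dir>/<year>_<Month> path for the given date."""
    return "%s/%d_%s" % (base_dir, day.year, MONTHS[day.month - 1])


def download_gpad(day=None, base_dir=BASE_DIR, url=GPAD_URL):
    """Create the month directory if needed and download the gpad file into it."""
    day = day or datetime.date.today()
    target_dir = run_directory(day, base_dir)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, target)  # performs the FTP download
    return target
```

For a date in February 2015, `run_directory(datetime.date(2015, 2, 16))` gives the example path shown above.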

To convert the gpad file to a .ace file you'll need:
- gp2protein.wb file that maps UniProtKB IDs to WBGenes
- go_gpad_parser.pl
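
To illustrate the ID mapping step, here is a hedged Python sketch of reading a gp2protein file into a lookup table. It is not the code in go_gpad_parser.pl. It assumes the usual GO gp2protein layout: comment lines starting with `!`, then two tab-separated columns holding a `WB:WBGene...` identifier and one or more `UniProtKB:` accessions separated by semicolons. That layout should be checked against the actual gp2protein.wb file; the function name `load_gp2protein` is hypothetical.

```python
def load_gp2protein(lines):
    """Build a dict mapping UniProtKB accession -> WBGene ID."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # gene with no protein listed
        gene = fields[0].split(":", 1)[-1]  # "WB:WBGene00009897" -> "WBGene00009897"
        for protein in fields[1].split(";"):
            db, _, accession = protein.partition(":")
            if db == "UniProtKB" and accession:
                mapping[accession] = gene
    return mapping


# The two pairs below are the ones that appear in the conversion examples on this page.
sample = [
    "!illustrative header line\n",
    "WB:WBGene00009897\tUniProtKB:Q20616\n",
    "WB:WBGene00014046\tUniProtKB:Q23546\n",
]
acc_to_gene = load_gp2protein(sample)
```

Accessions missing from the resulting dictionary are the ones that would be reported in the .err file as not mapping to a WBGene ID.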

The go_gpad_parser.pl script generates three files:
- gpad_extra_column - a file that adds the WBGene ID as an extra column (a new column 2) to the gpad file
- gpad_extra_column.err - a file that indicates:
  - which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
    - IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
      - Q9TZX4
    - IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
      - O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
      - O17541 - near perfect match to kin-24
      - O17542 - closest match is kin-26
      - O17543 - closest match is kin-26
      - O17544 - near perfect match to sid-3
      - O17545 - perfect alignment to kin-32
      - O17546 - perfect alignment to src-1
      - O17911 - perfect alignment to ceh-25 = unc-62
      - O44076 - perfect alignment to cbp-1
      - O44385 - perfect alignment to tat-1 (repeat masked some spans)
      - O61265 - closest match is H19M22.3
    - IDs that correspond to duplicate entries for some full length protein isoforms - these need to be merged in UniProt (column 3)
      - O45049 - lfe-2
      - O45050 - lfe-2
      - O45051 - lfe-2
    - IDs that correspond to full length protein isoforms that only differ between WB and UniProt by one or a few amino acids, i.e. very near match (column 3)
      - O17547 - avr-15
      - O17548 - avr-15
      - four isoform IDs - ncx-2
      - O45630 - ncx-2
      - O61304 - abts-1
      - G5EFT5 - hsf-1
      - Q6Q4G5 - hsf-1
    - IDs that correspond to non-elegans species, used for ISS annotations or the occasional annotation extension (column 8 or 12)
      - Q8WXF0
  - which PMIDs don't map to a WBPaper ID
  - which annotation extensions can't be mapped to the model
    - Need to add SO_term_relation Text ?SO_term to ?GO_annotation model
  - which UniProtKB IDs are not converted properly in the annotation extensions (column 12)
    - In the script, this corresponds to the code starting at ~line 100
      - This conversion works:
        UniProtKB Q20616 involved_in GO:0010629 PMID:24621828 ECO:0000315 20150716 UniProt_GOA has_regulation_target(UniProtKB:Q23546) go_evidence=IMP
      - To give:
        UniProtKB WBGene00009897 Q20616 involved_in GO:0010629 PMID:24621828 ECO:0000315 20150716 UniProt_GOA has_regulation_target(UniProtKB:WBGene00014046) go_evidence=IMP
      - This conversion works:
        UniProtKB O44408 enables GO:0004674 PMID:23437011 ECO:0000314 20150129 WormBase has_direct_input(UniProtKB:G5EDK8) go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
      - To give:
        UniProtKB WBGene00002187 O44408 enables GO:0004674 PMID:23437011 ECO:0000314 20150129 WormBase has_direct_input(UniProtKB:G5EDK8) go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
      - But when the Annotation Extension column has multiple, comma-separated values it doesn't work:
        UniProtKB WBGene00000906 G5EFW7 involved_in GO:0097500 PMID:24646679 ECO:0000315 20140418 WormBase has_input(UniProtKB:Q18571),occurs_in(WBbt:0005671) go_evidence=IMP|id=2113710657|curator_name=Kimberly Van Auken
        
        UniProtKB WBGene00020094 H2KZ22 involved_in GO:0031647 PMID:16378591 ECO:0000315 20130529 WormBase has_regulation_target(UniProtKB:Q17795),has_regulation_target(UniProtKB:Q8MQE6) go_evidence=IMP|id=2113492388|curator_name=Kimberly Van Auken
- gp_annotation.ace - the .ace file for upload to citace
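
One possible way to handle annotation extensions with several comma-separated values is to substitute every `UniProtKB:` reference in the string instead of treating the column as a single value. The Python sketch below shows that idea only; it is not the code in go_gpad_parser.pl, and `convert_extension` is a hypothetical name. It keeps the `UniProtKB:` prefix and swaps only the accession, as in the working example above. The `WBGene_PLACEHOLDER_*` values are made-up stand-ins, not real gene IDs.

```python
import re

# Matches every UniProtKB accession inside an annotation extension string.
UNIPROT_REF = re.compile(r"UniProtKB:([A-Za-z0-9-]+)")


def convert_extension(extension, acc_to_gene):
    """Replace each mapped UniProtKB accession with its WBGene ID.

    Returns the converted string and the list of accessions with no mapping.
    """
    unmapped = []

    def swap(match):
        accession = match.group(1)
        gene = acc_to_gene.get(accession)
        if gene is None:
            unmapped.append(accession)
            return match.group(0)  # leave the reference untouched
        return "UniProtKB:" + gene

    return UNIPROT_REF.sub(swap, extension), unmapped


mapping = {
    "Q23546": "WBGene00014046",        # pair taken from the example above
    "Q17795": "WBGene_PLACEHOLDER_A",  # hypothetical value for illustration
    "Q8MQE6": "WBGene_PLACEHOLDER_B",  # hypothetical value for illustration
}
single = convert_extension("has_regulation_target(UniProtKB:Q23546)", mapping)
multi = convert_extension(
    "has_regulation_target(UniProtKB:Q17795),has_regulation_target(UniProtKB:Q8MQE6)",
    mapping,
)
mixed = convert_extension("has_input(UniProtKB:Q18571),occurs_in(WBbt:0005671)", mapping)
```

Because the substitution runs over the whole string, the number of comma-separated values does not matter. Non-UniProtKB entries such as `WBbt:0005671` pass through unchanged, and accessions without a mapping are returned separately so they can be written to the .err file.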

The gpad file format is documented on the GOC wiki here:
- http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format
- Note that once the parsing script runs and we add an additional column, the column numbers shift one higher in the .err output file
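
The extra column and the resulting shift in column numbers can be sketched as follows. This is an illustrative Python version, not the parser itself. The column names are the GPAD 1.1 names as I understand them and should be checked against the wiki page above. The function names are hypothetical, and leaving the new column empty for unmapped IDs is an assumption about the desired behaviour.

```python
# GPAD column names (1-based order); an assumption to verify against the format page.
GPAD_COLUMNS = [
    "DB", "DB_Object_ID", "Qualifier", "GO ID", "DB:Reference(s)",
    "Evidence code", "With (or) From", "Interacting taxon ID", "Date",
    "Assigned_by", "Annotation Extension", "Annotation Properties",
]


def add_wbgene_column(gpad_line, acc_to_gene):
    """Insert the WBGene ID as a new column 2 of a tab-separated gpad row."""
    fields = gpad_line.rstrip("\n").split("\t")
    gene = acc_to_gene.get(fields[1], "")  # empty when the ID has no mapping
    return "\t".join(fields[:1] + [gene] + fields[1:])


def shifted_column(gpad_column):
    """Column number in the extended file for a 1-based gpad column number."""
    return gpad_column if gpad_column == 1 else gpad_column + 1


row = ("UniProtKB\tQ20616\tinvolved_in\tGO:0010629\tPMID:24621828\tECO:0000315"
       "\t\t\t20150716\tUniProt_GOA\thas_regulation_target(UniProtKB:Q23546)"
       "\tgo_evidence=IMP")
extended = add_wbgene_column(row, {"Q20616": "WBGene00009897"})
```

With this numbering, gpad column 2 (the UniProtKB ID) becomes column 3, column 7 (With/From) becomes column 8, and column 11 (Annotation Extension) becomes column 12, which matches the column numbers quoted for the .err file above.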

Possible modifications to pipeline
- Ask Tony to include WBGene IDs as DB_xref in the gpi files; would cover ID mappings if gp2protein file gets retired

Other Core Species Annotations

Other Core Species to Include in .ace Upload:
- C. briggsae
- C. remanei
- C. brenneri
- C. japonica
- P. pacificus
- O. volvulus
- B. malayi
- S. ratti

Pipeline will be the same as for C. elegans, i.e. gpad -> .ace -> GAF
Will need a gp2protein file for these species (could be combined)

Non-Core Species Annotations

Non-Core Species to Include in GOC Submissions:
Pipeline will need to append annotations to existing GAFs
Either convert gpad to GAF using gpi (note that gpad doesn't include taxon IDs) or download species-specific GAF from: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/
The gpad file is dumped weekly; the GAF is dumped monthly
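
Since gpad rows carry no taxon ID, converting to GAF means looking up each gene product in the gpi file. The Python sketch below shows only that lookup, under stated assumptions: the gpi file is tab-separated with `!` comment lines, and the caller supplies the 0-based positions of the object ID and taxon columns, because these depend on the gpi version in use. The function names and the sample gpi row are illustrative, not taken from a real file.

```python
def load_taxa(gpi_lines, id_col, taxon_col):
    """Map gene product ID -> taxon, reading the given 0-based gpi columns."""
    taxa = {}
    for line in gpi_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > max(id_col, taxon_col):
            taxa[fields[id_col]] = fields[taxon_col]
    return taxa


def taxon_for(gpad_line, taxa):
    """Return the taxon for a gpad row (object ID is gpad column 2), or None."""
    fields = gpad_line.rstrip("\n").split("\t")
    return taxa.get(fields[1])


# Illustrative gpi row; the column positions passed below are assumptions.
gpi = ["!illustrative header\n", "UniProtKB\tQ20616\t\t\t\tprotein\ttaxon:6239\n"]
taxa = load_taxa(gpi, id_col=1, taxon_col=6)
taxon = taxon_for("UniProtKB\tQ20616\tinvolved_in\tGO:0010629", taxa)
```

Rows whose ID is missing from the gpi file return `None` and would need to be reported or skipped before writing the GAF.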

@@ Line 1: / Line 1: @@
+= SOPs for Getting Different Species GO Annotations into WB and GOC =
+==C. elegans Annotations==
 *The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
 **A new file is available every Monday.
 **The file is located here: ftp://ftp.ebi.ac.uk/pub/contrib/goa/ and is named: gp_association.6239_wormbase.gz
-*Download the file from the UniProtKB ftp link and put it on tazendra here (in the appropriate year and month directory):
+*Create a directory named according to the current year and month here:
 **/home/acedb/kimberly/citace_upload/go/gpad2ace
 **for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
@@ Line 11: / Line 14: @@
 **go_gpad_parser.pl
-*The go_gpad_parse.pl generates three files:
+*The go_gpad_parser.pl generates three files:
 **gpad_extra_column - a file that adds the WBGene ID as an extra column (a new column 2) to the gpad file
 **gpad_extra_column.err - a file that indicates:
-***which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file
+***which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
+****IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
+*****Q9TZX4
+****IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
+*****O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
+*****O17541 - near perfect match to kin-24
+*****O17542 - closest match is kin-26
+*****O17543 - closest match is kin-26
+*****O17544 - near perfect match to sid-3
+*****O17545 - perfect alignment to kin-32
+*****O17546 - perfect alignment to src-1
+*****O17911 - perfect alignment to ceh-25 = unc-62
+*****O44076 - perfect alignment to cbp-1
+*****O44385 - perfect alignment to tat-1 (repeat masked some spans)
+*****O61265 - closest match is H19M22.3
+****IDs that correspond to duplicate entries for some full length protein isoforms - these need to be merged in UniProt (column 3)
+*****O45049 - lfe-2
+*****O45050 - lfe-2
+*****O45051 - lfe-2
+****IDs that correspond to full length protein isoforms that only differ between WB and UniProt by one or a few amino acids, i.e. very near match (column 3)
+*****O17547 - avr-15
+*****O17548 - avr-15
+*****four isoform IDs - ncx-2
+*****O45630 - ncx-2
+*****O61304 - abts-1
+*****G5EFT5 - hsf-1
+*****Q6Q4G5 - hsf-1
+****IDs that correspond to non-elegans species, used for ISS annotations or the occasional annotation extension (column 8 or 12)
+*****Q8WXF0
 ***which PMIDs don't map to a WBPaper ID
 ***which annotation extensions can't be mapped to the model
+****Need to add SO_term_relation Text ?SO_term  to ?GO_annotation model
+***'''which UniProtKB IDs are not converted properly in the annotation extensions (column 12)'''
+****In the script, this corresponds to the code starting at ~line 100
+*****This conversion works:
+******UniProtKB   Q20616  involved_in GO:0010629  PMID:24621828   ECO:0000315         20150716    UniProt_GOA has_regulation_target(UniProtKB:Q23546) go_evidence=IMP
+*****To give:
+******UniProtKB   WBGene00009897  Q20616  involved_in GO:0010629  PMID:24621828   ECO:0000315         20150716    UniProt_GOA has_regulation_target(UniProtKB:WBGene00014046) go_evidence=IMP
+*****This conversion works:
+******UniProtKB   O44408  enables GO:0004674  PMID:23437011   ECO:0000314         20150129    WormBase    has_direct_input(UniProtKB:G5EDK8)  go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
+*****To give:
+******UniProtKB   WBGene00002187  O44408  enables GO:0004674  PMID:23437011   ECO:0000314         20150129    WormBase    has_direct_input(UniProtKB:G5EDK8)  go_evidence=IDA|id=2113794919|curator_name=Kimberly Van Auken
+*****But when the Annotation Extension column has multiple, comma-separated values it doesn't work:
+******UniProtKB   WBGene00000906  G5EFW7  involved_in GO:0097500  PMID:24646679   ECO:0000315         20140418    WormBase    has_input(UniProtKB:Q18571),occurs_in(WBbt:0005671) go_evidence=IMP|id=2113710657|curator_name=Kimberly Van Auken
+******UniProtKB   WBGene00020094   H2KZ22  involved_in GO:0031647  PMID:16378591   ECO:0000315         20130529    WormBase    has_regulation_target(UniProtKB:Q17795),has_regulation_target(UniProtKB:Q8MQE6) go_evidence=IMP|id=2113492388|curator_name=Kimberly Van Auken
 **gp_annotation.ace - the .ace file for upload to citace
+*The gpad file format is documented on the GOC wiki here:
-*The gpad file format is document on the GOC wiki here:
 **http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format
 **Note that once the parsing script runs and we add an additional column, the column numbers shift one higher in the .err output file
+*Possible modifications to pipeline
+**Ask Tony to include WBGene IDs as DB_xref in the gpi files; would cover ID mappings if gp2protein file gets retired
+==Other Core Species Annotations==
+*Other Core Species to Include in .ace Upload:
+**C. briggsae
+**C. remanei
+**C. brenneri
+**C. japonica
+**P. pacificus
+**O. volvulus
+**B. malayi
+**S. ratti
+*Pipeline will be the same as for C. elegans, i.e. gpad -> .ace -> GAF
+*Will need a gp2protein file for these species (could be combined)
+==Non-Core Species Annotations==
+*Non-Core Species to Include in GOC Submissions:
+**
+*Pipeline will need to append annotations to existing GAFs
+*Either convert gpad to GAF using gpi (note that gpad doesn't include taxon IDs) or download species-specific GAF from:  ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/
+*gpad is dumped out weekly; GAF is dumped out monthly

Difference between revisions of "UniProtKB gpad to WormBase .ace"

