Difference between revisions of "UniProtKB gpad to WormBase .ace"

Revision as of 16:29, 24 July 2015

The gpad file that contains all of the C. elegans annotations currently in Protein2GO is produced by UniProtKB on a weekly basis.
- A new file is available every Monday.
- The file is located here: ftp://ftp.ebi.ac.uk/pub/contrib/goa/ and is named: gp_association.6239_wormbase.gz

Download the file from the UniProtKB FTP link and put it on tazendra in the directory below (in the appropriate year and month subdirectory):
- /home/acedb/kimberly/citace_upload/go/gpad2ace
- for example: /home/acedb/kimberly/citace_upload/go/gpad2ace/2015_February
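The download and filing step above can be sketched in Python. This is only an illustration, not part of the existing pipeline: the function names are hypothetical, and the year_month directory naming is inferred from the single example above (2015_February).

```python
import os
import urllib.request
from datetime import date
from pathlib import PurePosixPath

# Locations taken from the notes above.
FTP_DIR = "ftp://ftp.ebi.ac.uk/pub/contrib/goa/"
GPAD_NAME = "gp_association.6239_wormbase.gz"
BASE_DIR = PurePosixPath("/home/acedb/kimberly/citace_upload/go/gpad2ace")

# Spelled out explicitly so the result does not depend on the system locale.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def gpad_url() -> str:
    """Full FTP URL of the weekly gpad file."""
    return FTP_DIR + GPAD_NAME


def target_dir(day: date) -> PurePosixPath:
    """Year-and-month directory, e.g. .../gpad2ace/2015_February (assumed pattern)."""
    return BASE_DIR / f"{day.year}_{MONTHS[day.month - 1]}"


def download(day: date) -> str:
    """Fetch the gpad file into the directory for `day`; returns the local path."""
    directory = str(target_dir(day))
    os.makedirs(directory, exist_ok=True)
    destination = os.path.join(directory, GPAD_NAME)
    urllib.request.urlretrieve(gpad_url(), destination)
    return destination
```

Calling download(date.today()) on tazendra would place the current week's file in the matching directory; nothing is fetched until that function is called.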

To convert the gpad file to a .ace file you'll need:
- the gp2protein.wb file, which maps UniProtKB IDs to WBGene IDs
- go_gpad_parser.pl

The go_gpad_parser.pl script generates three files:
- gpad_extra_column - a file that adds the WBGene ID as an extra column (a new column 2) to the gpad file
- gpad_extra_column.err - a file that indicates:
  - which UniProtKB IDs don't map to WBGene IDs in the gp2protein.wb file, for example:
    - IDs that correspond to a transposon or retrotransposon reverse transcriptase that is encoded in the C. elegans genome (column 3)
      - Q9TZX4
    - IDs that correspond to protein fragments based on translation of incomplete nucleotide submissions to EMBL/GenBank/DDBJ databases (column 3)
      - O17540 - doesn't match 100% to the C. elegans proteome; closest match is kin-20
      - O17541 - near perfect match to kin-24
      - O17542 - closest match is kin-26
      - O17543 - closest match is kin-26
      - O17544 - near perfect match to sid-3
      - O17545 - perfect alignment to kin-32
      - O17546 - perfect alignment to src-1
      - O17911 - perfect alignment to ceh-25 = unc-62
      - O44076 - perfect alignment to cbp-1
      - O44385 - perfect alignment to tat-1 (repeat masked some spans)
      - O61265 - closest match is H19M22.3
    - IDs that correspond to duplicate entries for some full length protein isoforms - these need to be merged in UniProt (column 3)
      - O45049 - lfe-2
      - O45050 - lfe-2
      - O45051 - lfe-2
    - IDs that correspond to full length protein isoforms that only differ between WB and UniProt by one or a few amino acids, i.e. very near match (column 3)
      - O17547 - avr-15
      - O17548 - avr-15
      - O45630 - ncx-2
      - O61304 - abts-1
    - IDs that correspond to non-elegans species, used for ISS annotations or the occasional annotation extension (column 8 or 12)
      - Q8WXF0
  - which PMIDs don't map to a WBPaper ID
  - which annotation extensions can't be mapped to the model
    - Need to add SO_term_relation Text ?SO_term to ?GO_annotation model
  - which UniProtKB IDs are not converted properly in the annotation extensions (column 12)
    - In the script, this corresponds to the code starting at ~line 100
      - This works:
- gp_annotation.ace - the .ace file for upload to citace
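As a rough illustration of the first two outputs, the sketch below adds the WBGene ID as a new column 2 and collects the UniProtKB IDs that have no mapping. It is not the actual parsing script (which is written in Perl). It assumes that gp2protein.wb lines look like `WB:WBGene...<TAB>UniProtKB:...;UniProtKB:...`, that header lines start with `!`, and that unmapped rows are left out of the extra-column output; the function names and the error message wording are hypothetical.

```python
def load_gp2protein(lines):
    """Return {UniProtKB accession: WBGene ID} from gp2protein-style lines.

    Assumed line format: 'WB:WBGene...<TAB>UniProtKB:ACC1;UniProtKB:ACC2'.
    """
    mapping = {}
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment and blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        gene_id = parts[0].split(":", 1)[-1]
        for protein in parts[1].split(";"):
            db, _, accession = protein.partition(":")
            if db == "UniProtKB" and accession:
                mapping[accession] = gene_id
    return mapping


def add_gene_column(gpad_lines, mapping):
    """Insert the WBGene ID as a new column 2; collect unmapped accessions."""
    rows, errors = [], []
    for line in gpad_lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        accession = cols[1]  # column 2 of the original gpad file
        gene_id = mapping.get(accession)
        if gene_id is None:
            errors.append(f"{accession} does not map to a WBGene ID")
            continue
        rows.append("\t".join([cols[0], gene_id] + cols[1:]))
    return rows, errors
```

The rows list corresponds to gpad_extra_column and the errors list to the first kind of entry in gpad_extra_column.err; the PMID and annotation extension checks are not covered here.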

The gpad file format is documented on the GOC wiki here:
- http://wiki.geneontology.org/index.php/Final_GPAD_and_GPI_file_format
- Note that because the parsing script inserts an additional column, column numbers from the original column 2 onward are one higher in the .err output file than in the format documentation
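The column shift can be written down as a small helper. This is a minimal sketch with a hypothetical function name; it assumes the added WBGene ID column is inserted as the new column 2, as described for gpad_extra_column.

```python
def err_column(gpad_column: int) -> int:
    """Map a column number in the original gpad file to the .err numbering.

    The WBGene ID is inserted as a new column 2, so column 1 is unchanged
    and every column from 2 onward moves one to the right.
    """
    if gpad_column < 1:
        raise ValueError("column numbers are 1-based")
    return gpad_column if gpad_column == 1 else gpad_column + 1
```

For example, err_column(11) gives 12, which is why an issue in the original column 11 is reported against column 12 in the .err file.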

Possible modifications to pipeline
- Ask Tony whether the gpad file could include only annotations to UniProt entries that map to a C. elegans gene ID

