Generating Initial GAF file for Upload to Postgres

Initial Round of Entering Phenotype2GO-Based Annotations into Postgres

  • The idea here is to generate a non-redundant set of Phenotype2GO annotations to enter into postgres. Non-redundant means that these annotations do not overlap with any existing manual annotations.
  • Step 1: Retrieve the phenotype2go.wsxxx.wb file from the ftp site (for now, we will use the file on the Sanger site: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/ONTOLOGY/phenotype2go.WS246.wb)
  • Step 2: Using the annotations in file phenotype2go.WS246.wb retrieved from Step 1, create a two-column table that contains the WBGene ID found in Column 2 and the PMID value found in Column 6.
  • Step 3: Take the gene association file from UniProt-GOA (called gene_association.goa_worm and found here: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/WORM/). For those lines of annotation with the value 'P' in Column 9, first map the Column 2 value (a UniProtKB ID) to a WBGene ID using the WB gp2protein file (gp2protein.wb), and then replace the UniProtKB ID in Column 2 with the corresponding WBGene ID.
  • Step 4: Output a list of any UniProtKB IDs that don't map to a WBGene ID, ideally restricted to unique values. I will update the gp2protein file, if needed.
  • Step 5: Repeat Step 3 with an updated gp2protein file, if needed.
  • Step 6: Create a second two-column table that contains the WBGene ID now found in Column 2 of the gene_association.goa_worm file and the PMID value found in Column 6 of the gene_association.goa_worm file.
  • Step 7: Compare the values in each of the two tables and then output two files of annotations from the initial file of Phenotype2GO annotations in phenotype2go.WS246.wb: 1) a file containing those lines of annotation that are redundant with the gene_association annotations (i.e., there is an exact match between the gene and the paper in each table), and 2) a file containing those annotations that are NOT present in the gene_association file (i.e., the gene-paper combination exists in the Phenotype2GO file but NOT in the gene_association file).
  • Step 8: We will use the file of non-redundant annotations (i.e., output file #2) as the source of annotations to upload to postgres gop_ OA tables.
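
Steps 2 through 7 could be sketched roughly as follows in Python. This is only an illustration under several assumptions, not the script actually used: all function names are hypothetical; both annotation files are assumed to be tab-delimited with '!' comment lines; Column 6 is assumed to hold pipe-separated references in which the PMID appears as a PMID:... token; and gp2protein.wb is assumed to have two tab-separated columns of the form WB:WBGene...<TAB>UniProtKB:ACC;UniProtKB:ACC.

```python
def read_rows(lines):
    """Yield tab-split rows, skipping blank lines and '!' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line and not line.startswith("!"):
            yield line.split("\t")


def pmids(reference_field):
    """Return the PMID tokens from a pipe-separated Column 6 value."""
    return [ref for ref in reference_field.split("|") if ref.startswith("PMID:")]


def load_gp2protein(lines):
    """Build a UniProtKB accession -> WBGene ID lookup (file layout is assumed)."""
    mapping = {}
    for row in read_rows(lines):
        if len(row) < 2:
            continue
        gene = row[0].split(":", 1)[-1]  # drop the assumed 'WB:' prefix
        for protein in row[1].split(";"):
            if protein.startswith("UniProtKB:"):
                mapping[protein.split(":", 1)[1]] = gene
    return mapping


def map_goa_rows(lines, mapping):
    """Steps 3-4: keep rows with 'P' in Column 9 and swap Column 2 to a WBGene ID.

    Returns the rewritten rows plus a sorted list of unique unmapped UniProtKB IDs.
    """
    mapped, unmapped = [], set()
    for row in read_rows(lines):
        if len(row) < 9 or row[8] != "P":
            continue
        gene = mapping.get(row[1])
        if gene is None:
            unmapped.add(row[1])
        else:
            mapped.append(row[:1] + [gene] + row[2:])
    return mapped, sorted(unmapped)


def gene_paper_pairs(rows):
    """Steps 2 and 6: the two-column table as a set of (WBGene ID, PMID) pairs."""
    return {(row[1], pmid) for row in rows if len(row) > 5 for pmid in pmids(row[5])}


def split_phenotype2go(lines, manual_pairs):
    """Step 7: split Phenotype2GO lines into (redundant, non_redundant) lists.

    A line is redundant if any of its gene-paper pairs is in manual_pairs.
    Lines without a PMID cannot match and end up in non_redundant.
    """
    redundant, non_redundant = [], []
    for row in read_rows(lines):
        row_pairs = gene_paper_pairs([row])
        target = redundant if row_pairs & manual_pairs else non_redundant
        target.append("\t".join(row))
    return redundant, non_redundant
```

In use, each file would be opened and passed in as an iterable of lines: load_gp2protein on gp2protein.wb, then map_goa_rows on gene_association.goa_worm, then split_phenotype2go on phenotype2go.WS246.wb together with gene_paper_pairs of the mapped rows. The second list returned by split_phenotype2go corresponds to output file #2. Using sets of pairs keeps the comparison to an exact gene-paper match, which is the filter described in Step 7.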

Notes

  • The assumption here is that if a paper-gene connection exists from manual annotation, the manual annotation will have covered all of the possible annotations from that paper. This may not always be 100% true but, at least for now, it is simpler to filter this way than to also check the actual term used and determine parent-child (i.e., more or less granular) relations to decide which annotations to keep. Typically, the manual annotations are more granular than the Phenotype2GO annotations anyway. We can always revisit this assumption, if needed.


Back to 20141022_-_Phenotype2GO_Pipeline