Adding new Phenotype2GO annotations to postgres

  • All Phenotype2GO annotations are generated anew with each WS build; however, we only need to add the annotations from the most recent build to the OA tables in postgres.
  • The original set of annotations added to postgres came from WS247.
    • This set of annotations was first compared to the GAF (note, not the GPAD) from UniProt to avoid entering redundant annotations from manual curation.
    • Once redundant annotations were removed, the unique file of entries, newGpaEntries, was parsed and data added to the GO OA tables using the script here: /home/postgres/work/pgpopulation/go/go_curation/20141106_kevin_godata/populate_gop_OA_pheno2go.pl
  • Subsequent annotations will be generated by comparing the live and staging versions of the phenotype2go.wb files found on the ftp site (e.g. ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS247/ONTOLOGY/phenotype2go.WS247.wb and ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS248/ONTOLOGY/phenotype2go.WS248.wb) and then parsing unique entries into postgres
    • Caveat: unique entries can result either from new annotations being added or from old annotations being removed. The curator therefore needs to check the newGpaEntries file at each build to confirm what is being added and, if needed, to delete from postgres any annotations that were removed during the WS build.
  • Next Steps, 2015-04-09:
    • New script: compares all columns, except for the date in column 14, in two phenotype2go.WSnnn.wb GAF files and outputs a new newGpaEntries file of all unique lines
    • /home/postgres/work/pgpopulation/go/go_curation/20150410_phenotype2go_build_comparison/phenotype2go_build_comparison.pl
    • The two GAF files for generating the WS249 upload are on mangolassi here: /home/acedb/kimberly/citace_upload/go/phenotype2go/new_annotation_entries/WS249_upload
    • If needed, curator will manually edit the newGpaEntries file (i.e. remove deleted entries)
    • Final newGpaEntries file will be used for postgres upload using the populate_gop_OA_pheno2go.pl script
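
The build comparison described above can be sketched in Python as follows. This is a minimal illustration only, not the phenotype2go_build_comparison.pl script itself: the function names are hypothetical, and it assumes tab-delimited GAF lines with header lines starting with "!". It reports added and removed lines separately, since a line unique to one build can be either.

```python
# Hypothetical sketch of a GAF build comparison; not the actual Perl script.
# Assumes tab-delimited GAF lines, with the date in column 14 (1-based).

def comparison_key(line, ignored_columns=(14,)):
    """Return the line's fields as a tuple, minus the ignored 1-based columns."""
    fields = line.rstrip("\n").split("\t")
    return tuple(f for i, f in enumerate(fields, start=1)
                 if i not in ignored_columns)

def load_keys(lines, ignored_columns=(14,)):
    """Map each comparison key to the first full line that produced it."""
    keys = {}
    for line in lines:
        # Skip GAF header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        keys.setdefault(comparison_key(line, ignored_columns), line.rstrip("\n"))
    return keys

def compare_builds(old_lines, new_lines, ignored_columns=(14,)):
    """Return (added, removed): lines only in the new build, lines only in the old."""
    old = load_keys(old_lines, ignored_columns)
    new = load_keys(new_lines, ignored_columns)
    added = [line for key, line in new.items() if key not in old]
    removed = [line for key, line in old.items() if key not in new]
    return added, removed
```

Passing the lines of two phenotype2go.WSnnn.wb files to compare_builds would give the curator the candidate additions and the candidate deletions as two separate lists to review before any upload.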
  • 2015-04-20:
    • The new script from 2015-04-09 identified over 1000 lines that were different between WS247 and WS248, which was about an order of magnitude higher than expected.
    • One main difference seems to be that some RNAi experiments that map to one gene in WS248 (and also in WS246) map to two genes in WS247. Confirmed with Hinxton that WS246 and WS248 have the correct data.
    • So WS247 seems to have an erroneous number of RNAi mappings, and WS248 would be a better baseline (starting point) for Phenotype2GO annotations in postgres.
    • Plan: remove WS247 Phenotype2GO annotations from postgres and re-populate with Phenotype2GO annotations from WS248.
    • Will need to re-run scripts with new data
    • Create new directory: /home/postgres/work/pgpopulation/go/go_curation/20150416_initial_phenotype2go
    • Files needed in this directory: gene_association.goa_worm, gp2protein.wb, phenotype2go.WS248.wb
      • Files are currently on mangolassi here: /home/acedb/kimberly/citace_upload/go/phenotype2go/new_annotation_entries/WS249_upload/WS248_import
    • Scripts needed in this directory: parse_kevin_godata.pl and populate_gop_OA_pheno2go.pl
      • Scripts are currently on mangolassi here: /home/postgres/work/pgpopulation/go/go_curation/20141106_kevin_godata/
    • parse_kevin_godata.pl will create three new files, one of which, newGpaEntries, will be used to re-populate the OA
  • 2015-07-14:
    • Compared WS248 and WS249 phenotype2GO files for uploading new annotations to postgres
    • The comparison script ignores the DB_Object_Symbol, DB_Object_Synonym, and Date columns
    • NEW 2 entries can be populated into GO OA tables
    • Script to populate GO OA tables from entire Phenotype2GO file on ftp site is here: populate_gop_OA_pheno2go.pl
    • Note that for this we will need to modify populate_gop_OA_pheno2go.pl so that it only looks at NEW 2 lines and also ignores NEW 2 lines that have WB:WBPersonNNN in the Reference column.
    • Maybe best to clone and rename the script - something like populate_gop_OA_pheno2go_comparison.pl
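
The filtering proposed for the cloned populate script could look roughly like the Python sketch below. The exact layout of the comparison output is not given here, so this assumes (hypothetically) that each line starts with a label such as "NEW 2", followed by a tab and then the GAF line, with the Reference in GAF column 6. The function name is also hypothetical.

```python
import re

# Matches person references such as WB:WBPerson1234 in the Reference column.
PERSON_REF = re.compile(r"WB:WBPerson\d+")

def select_new2_lines(lines, label="NEW 2"):
    """Keep GAF lines labelled `label` whose Reference is not a WBPerson.

    Assumption: each input line is "<label>\t<GAF line>"; the real
    comparison output may be laid out differently.
    """
    selected = []
    for line in lines:
        prefix, sep, gaf = line.rstrip("\n").partition("\t")
        if not sep or prefix.strip() != label:
            continue  # not a NEW 2 line
        fields = gaf.split("\t")
        if len(fields) < 6:
            continue  # too short to hold a Reference column
        if PERSON_REF.search(fields[5]):
            continue  # Reference column (column 6) cites a WBPerson
        selected.append(gaf)
    return selected
```

Only the lines returned by select_new2_lines would then be passed on for population into the GO OA tables.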
  • 2015-07-20:
    • There is about 28% overlap between what is new in the Phenotype2GO file and what is already curated in Protein2GO.
    • Because we would then be adding duplicate annotations to postgres, I think it's better to wipe out the old annotations and use the new file and associated script that takes into account any duplicates before populating postgres.
    • Made new directory on mangolassi and added relevant files:
      • /home/acedb/kimberly/citace_upload/go/phenotype2go/new_annotation_entries/WS250_upload/WS249_import has the following three files that are needed by the script here: /home/postgres/work/pgpopulation/go/go_curation/20141106_kevin_godata/parse_kevin_godata.pl:
      • gene_association.goa_worm
      • gp2protein.wb
      • phenotype2go.WS249.wb
  • 2015-07-27:
    • Procedure for WS250 upload
      • Parse the WS249 phenotype2go GAF available on the WB ftp site
        • Key features of that parsing script:
          • Need to update the script to point to the input file for the correct build (the my $ptgfile variable)
          • Use the most up-to-date gp2protein.wb file for replacing UniProtKB identifiers with WBGene IDs
          • Compares the phenotype2go GAF with the monthly GAF from UniProt and removes, from the phenotype2go upload file, annotations to papers that have already been manually curated for BP (biological process)
          • Final output is a GAF stripped of entries that duplicate manual curation
      • Populate the GO OA tables with the WS249 data
        • Key features of that script:
          • Will populate postgres tables with data from the newGpaEntries file
          • Need to delete previous upload's phenotype2go annotations if replacing annotations
            • Need to determine what the timestamp range was for the previous upload
            • Starting range: SELECT * FROM gop_curator WHERE gop_curator = 'WBPerson3111' ORDER BY gop_timestamp LIMIT 2;
            • Ending range: SELECT * FROM gop_curator WHERE gop_curator = 'WBPerson3111' ORDER BY gop_timestamp DESC LIMIT 2;
            • Make sure the number of rows within that range corresponds to the number found by querying the OA: SELECT COUNT(*) FROM gop_curator WHERE gop_timestamp > '2015-04-21 15:20' AND gop_timestamp < '2015-04-21 16:20';
          • Before uncommenting lines 75-78, run the script to check that there aren't any major errors or problems (e.g., one of the file formats changed and there's now no data for a required field)
          • In the populate script, update the timestamps for the Delete commands at the end of the script, then uncomment lines 75-78 when running the script. Pad the timestamps by a minute or so on either end to make sure the range is covered completely.
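
As a small aid for the timestamp step above, the padded range can be computed rather than worked out by hand. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes the first and last gop_timestamp values of the previous upload have already been read from the starting-range and ending-range queries. It only builds the COUNT query text shown above; the Delete commands themselves live in the populate script and are not reproduced here.

```python
from datetime import datetime, timedelta

def padded_window(first, last, pad=timedelta(minutes=1)):
    """Widen the [first, last] timestamp range by `pad` on either end."""
    if last < first:
        raise ValueError("last timestamp is earlier than first timestamp")
    return first - pad, last + pad

def count_query(start, end):
    """Build the COUNT query for the padded range (same form as the check above)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return ("SELECT COUNT(*) FROM gop_curator "
            f"WHERE gop_timestamp > '{start.strftime(fmt)}' "
            f"AND gop_timestamp < '{end.strftime(fmt)}';")

if __name__ == "__main__":
    # Hypothetical first/last timestamps from a previous upload.
    start, end = padded_window(datetime(2015, 4, 21, 15, 21, 30),
                               datetime(2015, 4, 21, 16, 18, 5))
    print(count_query(start, end))
```

The count returned by the generated query should match the number of Phenotype2GO rows from the previous upload before the same range is used for the Delete commands.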