Difference between revisions of "20141022 - Phenotype2GO Pipeline"

From WormBaseWiki
 

Latest revision as of 17:54, 19 February 2015

This page begins to summarize how the Phenotype2GO-based GO annotations will be handled for future WormBase builds and GO uploads.

Source of Annotations

  • Phenotype2GO-based annotations are generated as part of the WB build based upon the presence of a phenotype associated with a perturbation of gene activity.
  • The perturbation may be an RNAi experiment or a variation.
  • If the perturbation is an RNAi experiment, we will only associate the GO annotations with phenotypes resulting from disruption of the primary target. Secondary targets will not be used for Phenotype2GO annotations.
  • All papers will be included in the pipeline (see below for some details on this).
  • It will be up to Caltech to integrate the annotations into their curation database and perform any necessary editing and QC there.
  • Phenotype2GO annotations displayed in WB, and sent to the GOC, will come from the Caltech .ace file uploaded with each build.
  • Questions Still to Address:
    • Annotations still attached to pseudogenes?
    • Annotations when the RNAi experiment is actually testing for a genetic interaction. To distinguish these, can we include the ?Interaction ID in the With/From column along with the WBRNAi and WBPhenotype IDs?

Source of Phenotype2GO Mappings

  • All mappings between WB Phenotype Ontology and Gene Ontology terms are supplied during the build in a separate .ace file submitted by Ranjana.
  • No mappings are present in the ?Phenotype objects submitted separately during the build, as this would have the potential to overwrite mappings coming from the GO curators.
  • One change we will make to the mappings is to remove GO terms that qualify regulation as positive or negative. When such a term is assigned on the basis of a genetic interaction experiment, we sometimes get the directionality wrong and therefore assign the wrong GO term.
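
The removal of direction-qualified regulation terms described above could be sketched roughly as follows. This is only an illustration: the dictionary layout, the function name drop_directional_terms, and all IDs and term names are placeholders rather than the actual structure of the mapping .ace file, and the sketch assumes that such terms can be recognized by the "positive regulation of" / "negative regulation of" prefix of the GO term name.

```python
# Prefixes that mark a GO term as qualifying the direction of regulation.
# Matching on the term name is an assumption made for this sketch.
DIRECTIONAL_PREFIXES = ("positive regulation of", "negative regulation of")


def drop_directional_terms(mappings):
    """Return a copy of `mappings` without direction-qualified regulation terms.

    `mappings` is a hypothetical dict: phenotype ID -> list of (GO ID, GO term name).
    Phenotypes left with no GO terms are omitted from the result.
    """
    filtered = {}
    for phenotype_id, go_terms in mappings.items():
        kept = [
            (go_id, name)
            for go_id, name in go_terms
            if not name.lower().startswith(DIRECTIONAL_PREFIXES)
        ]
        if kept:
            filtered[phenotype_id] = kept
    return filtered


# Placeholder IDs and term names, for illustration only.
example = {
    "WBPhenotype:EXAMPLE1": [
        ("GO:EXAMPLE_A", "positive regulation of example process"),
        ("GO:EXAMPLE_B", "regulation of example process"),
    ],
    "WBPhenotype:EXAMPLE2": [
        ("GO:EXAMPLE_C", "negative regulation of example process"),
    ],
}
print(drop_directional_terms(example))
```

The sketch simply drops the qualified terms, as the bullet above states; whether a phenotype that loses all of its mappings should instead fall back to a more general term is a separate decision that the sketch does not make.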

GAFs

  • The GAF generated as part of the WB build will contain all of the manual annotations, all Phenotype2GO-based annotations, and all sequence-based IEA annotations, i.e. InterPro2GO.
  • The annotation date in the WB GAF (Column 14) will reflect the value of the Date_last_updated tag in the GO_annotation model when that information exists. Otherwise, it will reflect the date on which the GAF file was written as part of the build process.
  • For InterPro2GO-based annotations, the new reference will be GO_REF:0000002 (see http://www.geneontology.org/cgi-bin/references.cgi). This will replace the current use of: PMID:12520011|PMID:12654719
  • The Phenotype2GO annotations will now use the IEA evidence code to reflect that they are not manually generated. When an annotation is manually reviewed, the evidence code will be upgraded to IMP (or whatever is most suitable).
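
The date and evidence-code rules above can be summarized in a small sketch. The function names and arguments are hypothetical and do not come from the build scripts; the only external fact relied on is that the GAF date column uses the YYYYMMDD format.

```python
from datetime import date


def gaf_date(date_last_updated, gaf_written_on):
    """Value for GAF Column 14.

    Uses Date_last_updated when it exists, otherwise the date on which
    the GAF file was written. GAF dates are formatted as YYYYMMDD.
    """
    chosen = date_last_updated if date_last_updated is not None else gaf_written_on
    return chosen.strftime("%Y%m%d")


def phenotype2go_evidence_code(manually_reviewed, reviewed_code="IMP"):
    """Evidence code for a Phenotype2GO annotation.

    Unreviewed annotations use IEA; reviewed ones are upgraded to IMP
    or to whatever code the curator judges most suitable.
    """
    return reviewed_code if manually_reviewed else "IEA"


# Hypothetical dates, for illustration only.
print(gaf_date(date(2014, 10, 22), date(2014, 11, 8)))  # Date_last_updated present
print(gaf_date(None, date(2014, 11, 8)))                # falls back to the write date
print(phenotype2go_evidence_code(False))
print(phenotype2go_evidence_code(True))
```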

Papers

  • For now, all papers will be included.
  • One outstanding issue, though, is what to do with RNAi experiments that actually describe genetic interactions. For manual annotation, these would use the IGI evidence code and the curator would need to indicate the interacting gene in the With/From column. The interacting genes do not appear to be listed directly in the ?RNAi object; instead, they are found in the listed strain or in the corresponding ?Interaction object.
  • For example, the Lehner et al., 2006 paper (WBPaper00027756) has a large number of RNAi experiments, but these are all in the background of a genetic variation. Thus, the phenotypes are believed to be the result of an interaction.
  • For future builds, could we include the ?Interaction ID in the With/From column as a way to distinguish these RNAi experiments from others that are not intended to assess genetic interactions?
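
If the ?Interaction ID were added to the With/From column as proposed, downstream scripts could separate the two kinds of RNAi experiment along these lines. This is a sketch under several assumptions: the pipe separator, the "WBInteraction" ID prefix used for detection, the function names, and the placeholder IDs are all illustrative and have not been checked against the actual build output.

```python
def with_from_value(rnai_id, phenotype_id, interaction_id=None):
    """Build a With/From value from the RNAi and Phenotype IDs.

    The Interaction ID is appended only when the RNAi experiment is linked
    to an ?Interaction object. The "|" separator is an assumption.
    """
    parts = [rnai_id, phenotype_id]
    if interaction_id:
        parts.append(interaction_id)
    return "|".join(parts)


def describes_interaction(with_from):
    """True if the With/From value carries an Interaction ID."""
    return any(part.startswith("WBInteraction") for part in with_from.split("|"))


# Placeholder IDs, for illustration only.
plain = with_from_value("WBRNAi_EXAMPLE", "WBPhenotype:EXAMPLE")
interaction = with_from_value(
    "WBRNAi_EXAMPLE", "WBPhenotype:EXAMPLE", "WBInteraction_EXAMPLE"
)
print(plain, describes_interaction(plain))
print(interaction, describes_interaction(interaction))
```

Rows flagged this way could then be held back for review as possible IGI annotations instead of being uploaded with the rest.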

Uploading Annotations into Postgres

  • Annotations generated by the Phenotype2GO pipeline will be uploaded into the postgres database at Caltech so they will be visible in the OA curation tool for review and editing.
  • Generating Initial GAF file for Upload to Postgres
  • Mapping the GAF to GO OA tables
  • Populating the OA and New Dumping Script
  • Subsequent Phenotype2GO Uploads to Postgres
  • Caltech will need to devise an intelligent diff that will allow us to only upload those annotations that are new for any given build.
  • Ideally, manual updates to annotations would be made in Protein2GO so that all of our manual annotations are in one place.
  • Probably the best approach would be to compare the latest build GAF with the postgres GAF and the Protein2GO GPAD, and to upload only those annotations for which we don't already have the combination of a gene, a paper, and a P (biological process) annotation. I don't think we can include the GO term in this diff, since manual review is quite likely to result in a more granular GO annotation.
  • One complicating factor is what happens when the curator wants to remove an automated annotation from the annotation set. To handle these cases, we will create a new table, gop_falsepositive, in the GO OA. Instead of deleting an annotation, we will toggle its False Positive flag, and flagged annotations will be left out of subsequent uploads so that we don't keep re-uploading the same false positives. False Positive annotations will not be included in the .ace file and thus will not go to the GOC; they will be used only for deciding which annotations to upload to postgres after each WB build.
  • Also, there are papers where we probably don't want the GO annotations at all. We will need to keep this list at Caltech, but we also need to define clearly why we aren't including the annotations from a given paper, e.g., more than a certain threshold of its Phenotype2GO annotations have been found to be incorrect.
  • We also need to upload to postgres the variation Phenotype2GO annotations generated when Jolene was still at Caltech. This may not be necessary if these annotations are in the phenotype2go file from the build.
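
The diff outlined in this list might look roughly like the sketch below. It is not the Caltech implementation: the row layout (dicts with gene, paper, go_id and aspect keys), the function names, and the choice to match false positives on gene, paper and GO term are assumptions made for illustration, and all IDs are placeholders.

```python
def make_row(gene, paper, go_id, aspect="P"):
    """Hypothetical minimal annotation record used by this sketch."""
    return {"gene": gene, "paper": paper, "go_id": go_id, "aspect": aspect}


def annotations_to_upload(build_rows, postgres_rows, protein2go_rows,
                          false_positive_rows, excluded_papers=frozenset()):
    """Select the build annotations that are new for this upload.

    A build annotation is skipped when
      * its paper is on the excluded-paper list, or
      * postgres or Protein2GO already holds a P annotation for the same
        gene and paper (the GO term is deliberately not compared), or
      * the same gene/paper/GO term was previously flagged False Positive.
    """
    def pair(row):
        return (row["gene"], row["paper"])

    def triple(row):
        return (row["gene"], row["paper"], row["go_id"])

    covered = {pair(r) for r in postgres_rows if r["aspect"] == "P"}
    covered |= {pair(r) for r in protein2go_rows if r["aspect"] == "P"}
    rejected = {triple(r) for r in false_positive_rows}

    return [
        r for r in build_rows
        if r["paper"] not in excluded_papers
        and pair(r) not in covered
        and triple(r) not in rejected
    ]


# Placeholder data, for illustration only.
build = [
    make_row("geneA", "paper1", "GO:1"),  # already curated in postgres
    make_row("geneB", "paper1", "GO:2"),  # previously flagged False Positive
    make_row("geneC", "paper2", "GO:3"),  # new
    make_row("geneD", "paper3", "GO:4"),  # paper is excluded
]
postgres = [make_row("geneA", "paper1", "GO:9")]
protein2go = []
false_positives = [make_row("geneB", "paper1", "GO:2")]

new_rows = annotations_to_upload(
    build, postgres, protein2go, false_positives, excluded_papers={"paper3"}
)
print(new_rows)
```

Comparing on gene and paper alone follows the reasoning above that manual review often replaces the automated GO term with a more granular one, so a term-level comparison would re-upload annotations that have already been curated.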

Timeline

  • WS246 release date (estimated) on ftp site: 2014-11-08
  • Upload annotations to postgres at Caltech: completed by 2014-11-30
  • .ace file from postgres ready for upload: completed by 2014-12-?

Bigger Philosophical Issues

  • Annotating to the process term versus annotating to the regulation of that process. This issue isn't unique to the Phenotype2GO-based annotations, but is generally true of all BP annotations. It may be helped by the addition of a new explicit qualifier for GO, 'implicated_in', which is proposed to mean the gene product is somehow implicated in 'that process OR regulation of that process' but currently there is insufficient information to distinguish between the two roles.
  • Creating more accurate annotations. One idea here is to combine a cellular-level process annotation with the Phenotype2GO-based organismal annotation to generate annotations like "microtubule-based transport involved in axon guidance". Again, this isn't unique to Phenotype2GO annotations, but is also true for manual annotations. Addition of more granular evidence codes in GO should help to capture this line of curation thought more accurately, e.g. an evidence code that indicates the annotation is based on both a mutant phenotype and the molecular identity of the gene product.

Back to Gene Ontology