Difference between revisions of "UniProt Paper - Gene - Data Type"

From WormBaseWiki
 
(24 intermediate revisions by the same user not shown)
Latest revision as of 12:45, 21 May 2019

2019 Pipeline

Objective: supply a regularly updated file to UniProt that lists:

WBGene WBPaperID PMID Category


The Categories would be gene-specific and we will supply information for:


Expression

GO

Phenotype

PPI (Protein-Protein Interaction)

Disease

Sequence

The new pipeline proposal is to retrieve the information from the WormBase FTP site, which is updated with each release.

ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release


From: ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY

General principles:

  1. Ignore lines with a NOT qualifier
  2. Convert WB_REFs to PMIDs (open question; will check with Ceci)

Expression:

  • File: anatomy_association.WSnnn.wb
  • File: development_association.WSnnn.wb
  • Convert WBPaper ids into PMID
  • Relevant columns: 2 and 6
  • Remove redundant gene-paper associations
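
The column selection and de-duplication steps above can be sketched as follows. This is a minimal sketch, not the pipeline's actual code: it assumes the association files are tab-delimited, that comment lines start with `!`, and that column 6 holds the reference string. The function name `gene_paper_pairs` is hypothetical, and the conversion of WBPaper ids to PMIDs is left out.

```python
def gene_paper_pairs(lines):
    """Return unique (gene, reference) pairs from columns 2 and 6.

    Assumes tab-delimited lines; lines starting with '!' are treated
    as comments (an assumption about the file header).
    """
    seen = set()
    pairs = []
    for line in lines:
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # skip blank or malformed lines
        pair = (cols[1], cols[5])  # column 2 = gene, column 6 = reference
        if pair not in seen:       # drop redundant gene-paper associations
            seen.add(pair)
            pairs.append(pair)
    return pairs
```

Because the same gene-paper pair can appear once per anatomy or life-stage term, keeping a `seen` set reduces the output to one line per pair while preserving file order.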

Disease:

  • File: disease_association.WSnnn.wb
  • Convert WBPaper ids into PMID
  • Ignore lines with 'IEA' in column 7
  • Relevant columns: 2 and 6
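
The 'IEA' filter above could look like the sketch below, assuming the disease file is tab-delimited with the evidence code in column 7 and `!` marking comment lines. The helper name `keep_disease_line` is hypothetical.

```python
def keep_disease_line(line):
    """Return True if a disease association line should be kept.

    Lines are dropped when they are comments, have fewer than 7
    columns, or carry the evidence code 'IEA' in column 7.
    """
    if line.startswith("!"):
        return False
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 7:
        return False
    return cols[6] != "IEA"
```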

GO:

  • File: gene_association.WSnnn.wb.c_elegans
  • Ignore lines with 'NOT' in column 4
  • Only use annotation lines that include a PMID, but ignore lines with WB_REF:WBPaper00046480|PMID:21873635
  • Relevant columns: 2 and 6
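
The three GO rules above can be combined in one small function. This is a sketch under assumptions: the file is tab-delimited, multiple references in column 6 are separated by `|`, and the excluded reference is matched as the exact column value quoted above. The name `go_gene_pmid` is hypothetical.

```python
# Reference string to skip, copied from the rule above.
EXCLUDED_REFERENCE = "WB_REF:WBPaper00046480|PMID:21873635"


def go_gene_pmid(line):
    """Return (gene, pmid) for a usable GO annotation line, else None."""
    if line.startswith("!"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 6:
        return None
    if "NOT" in cols[3].split("|"):   # column 4 qualifier
        return None
    if cols[5] == EXCLUDED_REFERENCE:  # column 6 reference
        return None
    for ref in cols[5].split("|"):
        if ref.startswith("PMID:"):
            return cols[1], ref[len("PMID:"):]
    return None  # no PMID on this line
```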

Phenotype:

  • File: phenotype_association.WSnnn.wb
  • Ignore lines with 'NOT' in column 4
  • Convert WBPaper ids into PMID
  • Relevant columns: 2 and 6
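
For the WBPaper-to-PMID conversion, one possible approach is a lookup table built separately. The sketch below assumes such a mapping is available as a dict (`paper_to_pmid` is hypothetical; how it is produced is not covered here) and that WBPaper ids have the form `WBPaper` plus eight digits.

```python
import re

# WBPaper ids as they appear in the reference column, e.g. WBPaper00003680.
WBPAPER_ID = re.compile(r"WBPaper\d{8}")


def phenotype_gene_pmid(line, paper_to_pmid):
    """Return a list of (gene, wbpaper, pmid) for one phenotype line.

    Lines with 'NOT' in column 4 yield an empty list. The pmid is None
    when the paper has no entry in the assumed lookup dict.
    """
    if line.startswith("!"):
        return []
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 6 or "NOT" in cols[3].split("|"):
        return []
    return [
        (cols[1], paper, paper_to_pmid.get(paper))
        for paper in WBPAPER_ID.findall(cols[5])
    ]
```

Returning `None` for unmapped papers keeps them visible, so papers without a PMID can be reviewed rather than silently dropped.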

From: ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/

PPI:

  • Problem with this file: papers are listed as text, not IDs
  • Check with Chris G. about another file?
  • May need to dump from postgres
  • Need type of interaction (physical AND protein-protein), reference, and each gene listed

From: ?? need to email Hinxton

Sequence:

    ?Variation -> Affects -> Gene
               -> Type_of_mutation -> Paper_evidence

2015 Pipeline

Original file generated for UniProt:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/uniprot.cgi

What we currently supply:

WBGene WBPaperID PMID

Not sure how this is generated; I believe this was done before my time as paper curator.


The proposed update to the file would add the data types curated for a given gene.

We would need to add:

WBGene WBPaperID PMID Category


The Categories would be gene-specific and we will supply information for:

GO

PPI (Protein-Protein Interaction)

Phenotype

Disease

Expression

Sequence


An example:

    WBGene00003508  WBPaper00003680  pmid10517638  GO;Phenotype;Disease;Expression 
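
A line like the example above could be assembled by grouping per-category records on the gene-paper pair. This is a minimal sketch: the function name `merge_categories`, the tab separator, and the category ordering (taken from the list above) are assumptions, not a confirmed file specification.

```python
from collections import defaultdict

# Order in which categories are written, following the list above.
CATEGORY_ORDER = ["GO", "PPI", "Phenotype", "Disease", "Expression", "Sequence"]


def merge_categories(records):
    """Collapse (gene, paper, pmid, category) records into output lines.

    Each (gene, paper, pmid) combination produces one tab-separated
    line whose last field is a ';'-joined list of categories.
    """
    merged = defaultdict(set)
    for gene, paper, pmid, category in records:
        merged[(gene, paper, pmid)].add(category)
    lines = []
    for key in sorted(merged):
        gene, paper, pmid = key
        cats = [c for c in CATEGORY_ORDER if c in merged[key]]
        lines.append("\t".join([gene, paper, "pmid" + pmid, ";".join(cats)]))
    return lines
```

Feeding in four records for WBGene00003508 and WBPaper00003680 (GO, Phenotype, Disease, Expression) gives a single line with the same fields as the example.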



Strategy: there are several possible strategies, and it is not yet clear which is best.

Is it easiest to get everything from WS, or from a mixture of WS and postgres?

Some things, like GO, RNAi and Variation Phenotypes, need to come from WS.

Possibilities:

1) Start with Paper object and then trace the information in the objects xref'ed in the Refers_to tag - this works for everything but Disease

2) Look at each object in each relevant class - this seems computationally very intensive


Relevant tags in the different object models:


GO:

    ?GO_annotation -> Gene
                   -> Reference


PPI:

    ?Interaction -> Interaction_type Physical
                 -> Interactor_overlapping_gene
                 -> Paper


Phenotype:

    ?RNAi -> Inhibits -> Gene
          -> Phenotype (Only Phenotype Observed, doesn't matter what the Phenotype is)
          -> Reference
    ?Variation -> Affects -> Gene
               -> Phenotype (Only Phenotype Observed, doesn't matter what the Phenotype is)
               -> Reference


Expression:

    ?Expr_pattern -> Expression_of -> Gene
                  -> Reference


Sequence:

    ?Variation -> Affects -> Gene
               -> Type_of_mutation -> Paper_evidence
            

Disease:

    ?Gene -> Disease_info -> Experimental_model -> Evidence -> Paper_evidence
          -> Disease_info -> Disease_relevance -> Evidence -> Paper_evidence

Disease (alternative):

    Alternatively, use the OA tables for:
        Experimental Model For, with Paper for Exp Mod
        Disease relevance, with Paper for Disease Rel
    Will need to get the OA table names