UniProt Paper - Gene - Data Type
Latest revision as of 12:45, 21 May 2019
2019 Pipeline
Objective: supply a regularly updated file to UniProt that lists:
WBGene WBPaperID PMID Category
The Categories would be gene-specific and we will supply information for:
Expression
GO
Phenotype
PPI (Protein-Protein Interaction)
Disease
Sequence
The new pipeline proposal is to retrieve the information from the WormBase FTP site, which is updated with each release.
ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release
From: ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY
General principles:
- ignore lines with NOT qualifier
- convert WB_REFs to PMIDs ?? will check with Ceci
Expression:
- File: anatomy_association.WSnnn.wb
- File: development_association.WSnnn.wb (e.g. development_association.WS270.wb)
- Convert WBPaper ids into PMID
- Relevant columns: 2 and 6
- Remove redundant gene-paper associations
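The Expression steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the association files are tab-delimited, that header/comment lines start with "!", and that column 6 holds references such as WB_REF:WBPaper00046480 (possibly several, separated by "|"). The function name gene_paper_pairs is hypothetical.

```python
import re

# Assumption: WBPaper identifiers look like "WBPaper" followed by digits.
WBPAPER = re.compile(r"WBPaper\d+")


def gene_paper_pairs(lines):
    """Return sorted, de-duplicated (gene, WBPaper) pairs from columns 2 and 6."""
    pairs = set()
    for line in lines:
        # Skip blank lines and header/comment lines (assumed to start with "!").
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue
        gene = cols[1]  # column 2: gene identifier
        # Column 6: reference field; collect every WBPaper ID it mentions.
        for paper in WBPAPER.findall(cols[5]):
            pairs.add((gene, paper))
    return sorted(pairs)
```

Running the same function over both the anatomy and development files and taking the union of the results would remove gene-paper associations that are repeated across terms or across the two files.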
Disease:
- File: disease_association.WSnnn.wb
- Convert WBPaper ids into PMID
- Ignore lines with 'IEA' in column 7
- Relevant columns: 2 and 6
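A possible sketch of the Disease steps, under the same assumptions about the file layout (tab-delimited, "!" comment lines, references in column 6 prefixed with WB_REF:). The WBPaper-to-PMID lookup is passed in as a plain dictionary; where that mapping actually comes from is not specified in these notes, so paper_to_pmid and the function name disease_rows are hypothetical.

```python
def disease_rows(lines, paper_to_pmid):
    """Return (gene, WBPaper, PMID-or-None) tuples, skipping IEA lines."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Column 7 is the evidence code; electronically inferred lines are ignored.
        if len(cols) < 7 or cols[6] == "IEA":
            continue
        for ref in cols[5].split("|"):
            if ref.startswith("WB_REF:"):
                paper = ref[len("WB_REF:"):]
                # Papers without a known PMID are kept, with None as the PMID.
                rows.append((cols[1], paper, paper_to_pmid.get(paper)))
    return rows
```

Keeping unmapped papers with a None PMID, rather than dropping them, makes it easy to list which WBPaper IDs still need a PMID.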
GO:
- File: gene_association.WSnnn.wb.c_elegans
- Ignore lines with 'NOT' in column 4
- Only use annotation lines that include a PMID, but ignore lines with WB_REF:WBPaper00046480|PMID:21873635
- Relevant columns: 2 and 6
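The GO filters above could look roughly like this. It is a sketch under assumptions: the file is tab-delimited, column 4 may hold several "|"-separated qualifiers, and the excluded reference is matched as the exact column 6 string quoted above. The function name go_gene_pmid_pairs is hypothetical.

```python
# The reference string to ignore, copied from the rule above.
EXCLUDED_REF = "WB_REF:WBPaper00046480|PMID:21873635"


def go_gene_pmid_pairs(lines):
    """Return sorted unique (gene, PMID) pairs from usable GO annotation lines."""
    pairs = set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue
        # Column 4: qualifier; skip negated annotations.
        if "NOT" in cols[3].split("|"):
            continue
        # Column 6: skip the excluded reference (exact match is an assumption).
        if cols[5] == EXCLUDED_REF:
            continue
        # Only references that carry a PMID produce output.
        for ref in cols[5].split("|"):
            if ref.startswith("PMID:"):
                pairs.add((cols[1], ref[len("PMID:"):]))
    return sorted(pairs)
```

Lines whose reference field has only a WB_REF and no PMID yield nothing here, which matches the rule of using only annotation lines that include a PMID.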
Phenotype:
- File: phenotype_association.WSnnn.wb
- Ignore lines with 'NOT' in column 4
- Convert WBPaper ids into PMID
- Relevant columns: 2 and 6
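From: ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/
PPI: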
- Problem with this file: papers are listed as text, not IDs
- Check with Chris G. about another file?
- May need to dump from postgres
- Need type of interaction (physical AND proteinprotein), reference, and each gene listed
From: ?? need to email Hinxton
Sequence:
?Variation -> Affects -> Gene
           -> Type_of_mutation -> Paper_evidence
2015 Pipeline
Original file generated for UniProt:
http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/uniprot.cgi
What we currently supply:
WBGene WBPaperID PMID
Not sure how this is generated; I believe this was done before my time as paper curator.
The proposed update to the file will add the data types curated for a given gene.
We would need to add:
WBGene WBPaperID PMID Category
The Categories would be gene-specific and we will supply information for:
GO
PPI (Protein-Protein Interaction)
Phenotype
Disease
Expression
Sequence
An example:
WBGene00003508 WBPaper00003680 pmid10517638 GO;Phenotype;Disease;Expression
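One way to assemble lines like the example above is to collect the categories found for each gene-paper pair and join them with ";". This is only a sketch: the tab delimiter, the fixed category order, and the names format_rows and paper_to_pmid are assumptions, not part of the existing pipeline.

```python
from collections import defaultdict

# Assumed output order, following the category list above.
CATEGORY_ORDER = ["GO", "PPI", "Phenotype", "Disease", "Expression", "Sequence"]


def format_rows(records, paper_to_pmid):
    """Merge (gene, paper, category) records into one output line per gene-paper pair."""
    categories = defaultdict(set)
    for gene, paper, category in records:
        categories[(gene, paper)].add(category)

    rows = []
    for gene, paper in sorted(categories):
        # Papers without a known PMID get an empty PMID column.
        pmid = paper_to_pmid.get(paper, "")
        ordered = [c for c in CATEGORY_ORDER if c in categories[(gene, paper)]]
        rows.append("\t".join([gene, paper, pmid, ";".join(ordered)]))
    return rows
```

Using a set per gene-paper pair means a category is listed once even when several annotations of that type come from the same paper.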
Strategy: there are several possible strategies; it is not yet clear which is best.
Is it easiest to get everything from WS, or from a mixture of WS and postgres?
Some data types, such as GO and the RNAi and Variation Phenotypes, need to come from WS.
Possibilities:
1) Start with Paper object and then trace the information in the objects xref'ed in the Refers_to tag - this works for everything but Disease
2) Look at each object in each relevant class - this seems computationally very intensive
Relevant tags in the different object models:
GO:
?GO_annotation -> Gene -> Reference
PPI:
?Interaction -> Interaction_type Physical -> Interactor_overlapping_gene -> Paper
Phenotype:
?RNAi -> Inhibits -> Gene
      -> Phenotype (Only Phenotype Observed, doesn't matter what the Phenotype is)
      -> Reference
?Variation -> Affects -> Gene
           -> Phenotype (Only Phenotype Observed, doesn't matter what the Phenotype is)
           -> Reference
Expression:
?Expr_pattern -> Expression_of -> Gene -> Reference
Sequence:
?Variation -> Affects -> Gene
           -> Type_of_mutation -> Paper_evidence
Disease:
?Gene -> Disease_info -> Experimental_model -> Evidence -> Paper_evidence
      -> Disease_info -> Disease_relevance -> Evidence -> Paper_evidence
Disease (alternative):
Alternatively, use the OA tables for "Experimental Model For" and "Paper for Exp Mod", and for "Disease relevance" and "Paper for Disease Rel".
Will need to get the OA table names.