UniProt Paper - Gene - Data Type
2019 Pipeline
Objective: supply a regularly updated file to UniProt that lists:
WBGene WBPaperID PMID Category
The Categories would be gene-specific and we will supply information for:
Expression
GO
Phenotype
PPI (Protein-Protein Interaction)
Disease
Sequence
The new pipeline proposal is to retrieve the information from the WormBase FTP site, which is updated with each release.
ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release
From: ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY
General principles:
- ignore lines with NOT qualifier
- convert WB_REFs to PMIDs (to be confirmed with Ceci)
Expression:
- File: anatomy_association.WSnnn.wb
- File: development_association.WSnnn.wb
- Convert WBPaper ids into PMID
- Relevant columns: 2 and 6
- Remove redundant gene-paper associations
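The Expression steps above can be sketched in Python. This is only a minimal sketch, assuming the association files are tab-delimited with comment lines starting with `!` (GAF-style) and that column 6 may hold several `|`-separated references. The function name `gene_paper_pairs` is hypothetical and not part of any existing WormBase script. Conversion of WBPaper ids into PMIDs is not covered here.

```python
def gene_paper_pairs(lines):
    """Return sorted, de-duplicated (gene, reference) pairs.

    Assumes tab-delimited lines where column 2 is the gene ID and
    column 6 is the reference field (1-based column numbers).
    """
    pairs = set()
    for line in lines:
        # Skip blank lines and header/comment lines (assumed to start with "!").
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # malformed line: not enough columns
        gene = cols[1]
        # Column 6 may list more than one reference, separated by "|".
        for ref in cols[5].split("|"):
            if ref:
                pairs.add((gene, ref))
    # A set removes redundant gene-paper associations; sort for stable output.
    return sorted(pairs)
```

Passing the lines of both the anatomy and development files through the same call would merge them into one non-redundant list.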
Disease:
- File: disease_association.WSnnn.wb
- Convert WBPaper ids into PMID
- Ignore lines with 'IEA' in column 7
- Relevant columns: 2 and 6
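The IEA filter for the disease file could look roughly like the sketch below. It assumes a tab-delimited file with `!` comment lines and an exact `IEA` value in column 7. The helper name `drop_iea` and the `evidence_col` parameter are made up for illustration.

```python
def drop_iea(lines, evidence_col=7):
    """Yield association lines whose evidence code is not IEA.

    evidence_col is the 1-based column holding the evidence code
    (column 7 according to the notes above).
    """
    for line in lines:
        # Skip blank lines and comment lines (assumed to start with "!").
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Drop the line only when the evidence column exists and equals "IEA".
        if len(cols) >= evidence_col and cols[evidence_col - 1].strip() == "IEA":
            continue
        yield line
```

The surviving lines can then be reduced to columns 2 and 6 in the same way as for the other data types.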
GO:
- File: gene_association.WSnnn.wb.c_elegans
- Ignore lines with 'NOT' in column 4
- Only use annotation lines that include a PMID, but ignore lines with WB_REF:WBPaper00046480|PMID:21873635
- Relevant columns: 2 and 6
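A possible Python sketch of the GO rules above follows. It assumes a tab-delimited file with `!` comment lines, a `|`-separated qualifier in column 4 and `|`-separated references in column 6. It interprets "ignore lines with WB_REF:WBPaper00046480|PMID:21873635" as skipping any line that cites either identifier, which is an assumption. The names `go_gene_pmid_pairs` and `EXCLUDED_REFS` are hypothetical.

```python
# References to ignore, taken from the GO rule in the notes above.
EXCLUDED_REFS = {"WB_REF:WBPaper00046480", "PMID:21873635"}


def go_gene_pmid_pairs(lines):
    """Return sorted, unique (gene, PMID) pairs from GO annotation lines."""
    pairs = set()
    for line in lines:
        # Skip blank lines and comment lines (assumed to start with "!").
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # malformed line
        gene, qualifier, refs = cols[1], cols[3], cols[5].split("|")
        # Ignore negated annotations (NOT may be combined with other qualifiers).
        if "NOT" in qualifier.split("|"):
            continue
        # Ignore lines citing the excluded reference.
        if EXCLUDED_REFS.intersection(refs):
            continue
        # Keep only references that are PMIDs; lines without one add nothing.
        for ref in refs:
            if ref.startswith("PMID:"):
                pairs.add((gene, ref))
    return sorted(pairs)
```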
Phenotype:
- File: phenotype_association.WSnnn.wb
- Ignore lines with 'NOT' in column 4
- Convert WBPaper ids into PMID
- Relevant columns: 2 and 6
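The "Convert WBPaper ids into PMID" step recurs for several data types. The notes do not say where the mapping comes from, so the sketch below simply assumes a two-column lookup (WBPaper id, PMID) is available from somewhere. `load_paper_to_pmid` and `convert_reference` are hypothetical helper names.

```python
import re

# WBPaper identifiers look like "WBPaper00003680".
WBPAPER_RE = re.compile(r"WBPaper\d+")


def load_paper_to_pmid(rows):
    """Build a WBPaper -> PMID dict from whitespace-delimited two-column rows.

    The existence and layout of such a lookup table is an assumption.
    """
    mapping = {}
    for row in rows:
        parts = row.split()
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def convert_reference(ref, paper_to_pmid):
    """Return (WBPaper id, PMID or None) for a reference such as
    'WB_REF:WBPaper00003680', or None if it contains no WBPaper id."""
    match = WBPAPER_RE.search(ref)
    if match is None:
        return None
    paper = match.group(0)
    # Papers without a known PMID are returned with None so the caller
    # can decide whether to drop or report them.
    return paper, paper_to_pmid.get(paper)
```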
PPI:
- Problem with this file: papers are listed as text, not IDs
- Check with Chris G. about another file?
- May need to dump from postgres
- Need type of interaction (physical AND proteinprotein), reference, and each gene listed
From: ?? need to email Hinxton
Sequence:
?Variation -> Affects -> Gene -> Type_of_mutation -> Paper_evidence
2015 Pipeline
Original file generated for UniProt:
http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/uniprot.cgi
What we currently supply:
WBGene WBPaperID PMID
Not sure how this is generated; I believe this was done before my time as paper curator.
Proposed updates to file will now include data types curated for a given gene.
We would need to add:
WBGene WBPaperID PMID Category
The Categories would be gene-specific and we will supply information for:
GO
PPI (Protein-Protein Interaction)
Phenotype
Disease
Expression
Sequence
An example:
WBGene00003508 WBPaper00003680 pmid10517638 GO;Phenotype;Disease;Expression
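To illustrate how per-category results might be merged into lines like the example above, here is a minimal Python sketch. It assumes each upstream step yields (gene, paper, pmid, category) tuples of strings. The category order and the single-space separator are assumptions chosen to reproduce the example line, and `format_rows` is a hypothetical name.

```python
from collections import defaultdict

# Assumed output order for categories; it matches the example line above.
CATEGORY_ORDER = ["GO", "PPI", "Phenotype", "Disease", "Expression", "Sequence"]


def format_rows(records, sep=" "):
    """Merge (gene, paper, pmid, category) records into one row per
    gene-paper pair, with categories joined by ';'."""
    merged = defaultdict(set)
    for gene, paper, pmid, category in records:
        merged[(gene, paper, pmid)].add(category)
    rows = []
    for key in sorted(merged):
        # Categories not listed in CATEGORY_ORDER are silently left out.
        cats = [c for c in CATEGORY_ORDER if c in merged[key]]
        rows.append(sep.join(key) + sep + ";".join(cats))
    return rows


if __name__ == "__main__":
    example = [
        ("WBGene00003508", "WBPaper00003680", "pmid10517638", c)
        for c in ("Expression", "GO", "Disease", "Phenotype")
    ]
    for row in format_rows(example):
        print(row)
```

Collecting categories in a set means the same gene-paper-category combination can arrive more than once without producing duplicate entries.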
Strategy: there are several possible strategies, and it is not yet clear which is best.
Is it easiest to get everything from WS, or from a mixture of WS and postgres?
Some data types, such as GO, RNAi and Variation Phenotypes, need to come from WS.
Possibilities:
1) Start with Paper object and then trace the information in the objects xref'ed in the Refers_to tag - this works for everything but Disease
2) Look at each object in each relevant class - this seems computationally very intensive
Relevant tags in the different object models:
GO:
?GO_annotation -> Gene -> Reference
PPI:
?Interaction -> Interaction_type Physical -> Interactor_overlapping_gene -> Paper
Phenotype:
?RNAi -> Inhibits -> Gene -> Phenotype (Only Phenotype Observed, doesn't matter what the Phenotype is) -> Reference
?Variation -> Affects -> Gene -> Phenotype (Only Phenotype Observed, doesn't matter what the Phenotype is) -> Reference
Expression:
?Expr_pattern -> Expression_of -> Gene -> Reference
Sequence:
?Variation -> Affects -> Gene -> Type_of_mutation -> Paper_evidence
Disease:
?Gene -> Disease_info -> Experimental_model -> Evidence -> Paper_evidence
?Gene -> Disease_info -> Disease_relevance -> Evidence -> Paper_evidence
Alternatively, use the OA tables for:
- Experimental Model For, and Paper for Exp Mod
- Disease relevance, and Paper for Disease Rel
Will need to get the OA table names.