UniProt Paper - Gene - Data Type

From WormBaseWiki

2019 Pipeline

Objective: supply a regularly updated file to UniProt that lists:

WBGene WBPaperID PMID Category


The categories are gene-specific; we will supply information for:


Expression

GO

Phenotype

PPI (Protein-Protein Interaction)

Disease

Sequence

The new pipeline proposal is to retrieve information from the WormBase FTP site, which is updated with each release.

ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release


From: ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY

General principles:

  1. Ignore lines with a NOT qualifier
  2. Convert WB_REFs to PMIDs (open question; will check with Ceci)

Expression:

  • File: anatomy_association.WSnnn.wb
  • File: development_association.WSnnn.wb
  • Convert WBPaper ids into PMID
  • Relevant columns: 2 and 6
  • Remove redundant gene-paper associations
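
The extraction and de-duplication steps above can be sketched as follows. This is a minimal sketch, not the pipeline's actual code. It assumes the association files are tab-delimited, that comment lines start with `!`, and that column 6 holds pipe-separated references of the form `WB_REF:WBPaperNNNNNNNN` (the form that appears in the GO exclusion rule on this page). The function name `gene_paper_pairs` is hypothetical.

```python
from typing import Iterable, Set, Tuple


def gene_paper_pairs(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect unique (WBGene, WBPaper) pairs from columns 2 and 6."""
    pairs: Set[Tuple[str, str]] = set()
    for line in lines:
        # Assumption: blank lines and '!' comment lines carry no annotation.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # malformed line; skip rather than guess
        gene, references = cols[1], cols[5]
        # Assumption: several references may be joined with '|'.
        for ref in references.split("|"):
            if ref.startswith("WB_REF:"):
                # A set drops redundant gene-paper associations.
                pairs.add((gene, ref[len("WB_REF:"):]))
    return pairs
```

Running the same function over both the anatomy and development files and taking the union of the two sets would give one Expression set per release. The WBPaper-to-PMID conversion is a separate step.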

Disease:

  • File: disease_association.WSnnn.wb
  • Convert WBPaper ids into PMID
  • Ignore lines with 'IEA' in column 7
  • Relevant columns: 2 and 6
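
A minimal sketch of the Disease filter, under the same assumptions about layout (tab-delimited, `!` comment lines, `WB_REF:` references in column 6). It also assumes that column 7 holds a single evidence code, so that "IEA in column 7" can be tested by exact match. The function name is hypothetical.

```python
from typing import Iterable, Set, Tuple


def disease_gene_paper_pairs(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect unique (WBGene, WBPaper) pairs, skipping IEA annotations."""
    pairs: Set[Tuple[str, str]] = set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # need at least columns 2, 6 and 7
        # Assumption: column 7 is one evidence code such as IEA.
        if cols[6].strip() == "IEA":
            continue
        for ref in cols[5].split("|"):
            if ref.startswith("WB_REF:"):
                pairs.add((cols[1], ref[len("WB_REF:"):]))
    return pairs
```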

GO:

  • File: gene_association.WSnnn.wb.c_elegans
  • Ignore lines with 'NOT' in column 4
  • Only use annotation lines that include a PMID, but ignore lines with WB_REF:WBPaper00046480|PMID:21873635
  • Relevant columns: 2 and 6
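
The GO rules above could be sketched as below. This is an illustration only, assuming a tab-delimited file with `!` comment lines, a pipe-separated qualifier in column 4, and pipe-separated references in column 6. The excluded reference is identified here by its PMID part, which is an interpretation of the rule rather than something this page specifies. The function and constant names are hypothetical.

```python
from typing import Iterable, Set, Tuple

# PMID from the reference that the GO rule says to ignore.
EXCLUDED_PMID = "21873635"


def go_gene_pmid_pairs(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect unique (WBGene, PMID) pairs from GO annotation lines."""
    pairs: Set[Tuple[str, str]] = set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue
        gene, qualifier, references = cols[1], cols[3], cols[5]
        # Assumption: qualifiers may be combined with '|', e.g. NOT|other.
        if "NOT" in qualifier.split("|"):
            continue
        pmids = [r[len("PMID:"):] for r in references.split("|")
                 if r.startswith("PMID:")]
        # Lines without a PMID, or citing the excluded paper, are dropped.
        if not pmids or EXCLUDED_PMID in pmids:
            continue
        for pmid in pmids:
            pairs.add((gene, pmid))
    return pairs
```

Because these lines already carry a PMID, this sketch returns PMIDs directly instead of WBPaper IDs. The WBPaper ID would still be needed for the final output file.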

Phenotype:

  • File: phenotype_association.WSnnn.wb
  • Ignore lines with 'NOT' in column 4
  • Convert WBPaper ids into PMID
  • Relevant columns: 2 and 6

From: ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/

PPI:

  • Problem with this file: papers are listed as text, not IDs
  • Check with Chris G. about another file?
  • May need to dump from postgres
  • Need type of interaction (physical AND proteinprotein), reference, and each gene listed

From: source not yet known; need to email Hinxton

Sequence:

    ?Variation -> Affects -> Gene
               -> Type_of_mutation -> Paper_evidence

2015 Pipeline

Original file generated for UniProt:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/uniprot.cgi

What we currently supply:

WBGene WBPaperID PMID

Not sure how this is generated; I believe this was done before my time as paper curator.


The proposed update to the file adds the data types curated for a given gene.

We would need to add:

WBGene WBPaperID PMID Category


The categories are gene-specific; we will supply information for:

GO

PPI (Protein-Protein Interaction)

Phenotype

Disease

Expression

Sequence


An example:

    WBGene00003508  WBPaper00003680  pmid10517638  GO;Phenotype;Disease;Expression 
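
A line like the example above could be assembled from per-category gene-paper sets roughly as follows. This is a sketch under several assumptions: fields are tab-separated, categories are joined with `;` in the order GO, PPI, Phenotype, Disease, Expression, Sequence, the PMID is written with a `pmid` prefix as in the example, and papers without a PMID are skipped. The names `build_rows` and `pmid_by_paper` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

CATEGORY_ORDER = ["GO", "PPI", "Phenotype", "Disease", "Expression", "Sequence"]


def build_rows(
    pairs_by_category: Dict[str, Set[Tuple[str, str]]],
    pmid_by_paper: Dict[str, str],
) -> List[str]:
    """Merge per-category (WBGene, WBPaper) pairs into output lines."""
    categories = defaultdict(set)
    for category, pairs in pairs_by_category.items():
        for gene, paper in pairs:
            categories[(gene, paper)].add(category)
    rows = []
    for gene, paper in sorted(categories):
        pmid = pmid_by_paper.get(paper)
        if pmid is None:
            continue  # assumption: papers without a PMID are not reported
        ordered = [c for c in CATEGORY_ORDER if c in categories[(gene, paper)]]
        rows.append("\t".join([gene, paper, "pmid" + pmid, ";".join(ordered)]))
    return rows


# Usage with the gene, paper and PMID from the example line above.
pair = ("WBGene00003508", "WBPaper00003680")
rows = build_rows(
    {"GO": {pair}, "Phenotype": {pair}, "Disease": {pair}, "Expression": {pair}},
    {"WBPaper00003680": "10517638"},
)
print(rows[0])
```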



Strategy: several strategies are possible; it is not yet clear which is best.

Is it easiest to get everything from WS, or from a mixture of WS and postgres?

Some things, like GO, RNAi and Variation Phenotypes, need to come from WS.

Possibilities:

1) Start with the Paper object and then trace the information in the objects cross-referenced in the Refers_to tag. This works for everything but Disease.

2) Look at each object in each relevant class. This seems computationally very intensive.


Relevant tags in the different object models:


GO:

    ?GO_annotation -> Gene
                   -> Reference


PPI:

    ?Interaction -> Interaction_type Physical
                 -> Interactor_overlapping_gene
                 -> Paper


Phenotype:

    ?RNAi -> Inhibits -> Gene
          -> Phenotype (Only Phenotype Observed, doesn't matter what the Phenotype is)
          -> Reference
    ?Variation -> Affects -> Gene
               -> Phenotype (Only Phenotype Observed, doesn't matter what the Phenotype is)
               -> Reference


Expression:

    ?Expr_pattern -> Expression_of -> Gene
                  -> Reference


Sequence:

    ?Variation -> Affects -> Gene
               -> Type_of_mutation -> Paper_evidence
            

Disease:

    ?Gene -> Disease_info -> Experimental_model -> Evidence -> Paper_evidence
          -> Disease_info -> Disease_relevance -> Evidence -> Paper_evidence

Disease (alternative):

    Alternatively, use the OA tables for:
        Experimental Model For and Paper for Exp Mod
        Disease relevance and Paper for Disease Rel
    Will need to get the OA table names