Automated descriptions for C. briggsae

From WormBaseWiki
Revision as of 19:05, 4 December 2014 by Rkishore (talk | contribs)

Location of project-related files on Textpresso

http://textpresso-dev.caltech.edu/concise_descriptions/

Location of the concise description files for C. elegans:

  • For viewing the latest dump:

http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace

  • Script: /home/postgres/work/citace_upload/concise/dump_concise.pl
  • File location: /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace

Semantic categories in a Concise Description for C. briggsae

1. Orthology/Similarity to C. elegans and human
2. Processes
3. Molecular Function
4. Sub-cellular localization (cell component)
5. C. elegans description (experimental data only)

Source files for homology data

Use WS246 files:

1. Best BlastP hits file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/c_briggsae.PRJNA10731.WS246.best_blastp_hits.txt.gz

2. Mapping of briggsae genes to proteins: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/c_briggsae.PRJNA10731.WS246.xrefs.txt.gz

3. Orthologs file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/annotation/c_briggsae.PRJNA10731.WS246.orthologs.txt.gz

4. Biomart Ensembl file for human protein names: File:Biomart query.pdf

  • Contact for orthology files: Michael Paulini

Source file for Process, Molecular Function and Sub-cellular localization (cell component) data

  • Process:
    • Need data from these rows:
      • where column 9 has value 'P' (Process)
      • column 2: DB_Object ID, e.g., WBGene00000307
      • column 3: DB_Object symbol, e.g., Cbr-bli-4
      • column 5: GO ID, e.g., GO:0006508
      • column 6: DB:Reference, e.g., PMID:12062106; take all pipe-separated references
      • column 7: Evidence code, e.g., IEA
      • column 8: 'With (or) From', e.g., INTERPRO:IPR000209
  • Molecular Function:
    • Need data from these rows:
      • where column 9 has value 'F' (Molecular Function)
      • column 2: DB_Object ID, e.g., WBGene00000307
      • column 3: DB_Object symbol, e.g., Cbr-bli-4
      • column 5: GO ID, e.g., GO:0004252
      • column 6: DB:Reference, e.g., PMID:12520011; take all pipe-separated references
      • column 7: Evidence code, e.g., IEA
      • column 8: 'With (or) From', e.g., INTERPRO:IPR000209
  • Sub-cellular localization (cell component):
    • Need data from these rows:
      • where column 9 has value 'C' (Cellular Component)
      • column 2: DB_Object ID, e.g., WBGene00000324
      • column 3: DB_Object symbol, e.g., Cbr-exp-2
      • column 5: GO ID, e.g., GO:0008076
      • column 6: DB:Reference, e.g., PMID:12520011; take all pipe-separated references
      • column 7: Evidence code, e.g., IEA
      • column 8: 'With (or) From', e.g., INTERPRO:IPR000209
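
The column selection above could be sketched in Python roughly as follows. This is only an illustration, not the project script: it assumes the gene association file is tab-delimited with an empty qualifier in column 4 and that comment lines start with '!'. The function name read_go_annotations and the dictionary keys are made up for this example.

```python
import csv
import io
from collections import defaultdict

# Column 9 values mapped to the three description categories.
ASPECTS = {"P": "process", "F": "function", "C": "component"}


def read_go_annotations(handle):
    """Group annotation rows by gene ID (column 2) and aspect (column 9)."""
    by_gene = defaultdict(lambda: defaultdict(list))
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("!") or len(row) < 9:
            continue  # skip comment lines and rows that are too short
        aspect = row[8]
        if aspect not in ASPECTS:
            continue
        by_gene[row[1]][ASPECTS[aspect]].append({
            "symbol": row[2],                    # column 3
            "go_id": row[4],                     # column 5
            "references": row[5].split("|"),     # column 6, pipe-separated
            "evidence": row[6],                  # column 7
            "with_from": row[7].split("|") if row[7] else [],  # column 8
        })
    return by_gene


# Two rows for Cbr-aak-1, laid out under the tab-delimited assumption above.
SAMPLE = (
    "!comment line\n"
    "WB\tWBGene00029106\tCbr-aak-1\t\tGO:0004672\tGO_REF:0000002\tIEA\t"
    "InterPro:IPR000719|InterPro:IPR002290\tF\n"
    "WB\tWBGene00029106\tCbr-aak-1\t\tGO:0006468\tGO_REF:0000002\tIEA\t"
    "InterPro:IPR000719|InterPro:IPR002290\tP\n"
)
annotations = read_go_annotations(io.StringIO(SAMPLE))
print(annotations["WBGene00029106"]["process"][0]["go_id"])  # GO:0006468
```

Grouping by gene first keeps all three aspects of one gene together, which is how the description sentence consumes them.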

Template and Rules for a C. briggsae gene description

Order of sentences:

  • Orthology
  • Process
  • Function/identity
  • Component

Rules for sentence construction: Orthology

Template of orthology sentence

<briggsae gene> encodes an ortholog of C. elegans <elegans gene> and human <human gene>;

Input data files

1. Best BlastP hits file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/c_briggsae.PRJNA10731.WS246.best_blastp_hits.txt.gz

2. Mapping of briggsae genes to proteins: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/c_briggsae.PRJNA10731.WS246.xrefs.txt.gz

3. Orthologs file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/annotation/c_briggsae.PRJNA10731.WS246.orthologs.txt.gz

Finding the elegans ortholog for a C. briggsae gene:

  • Example 1: Cbr-aak-1 (CBG0691 WBGene00029106)

    • Using the xrefs file for C. briggsae, look up the protein for the given C. briggsae gene, which is CBP07510
    • Using the best_blastp_hits file, CBP07510 best matches the elegans protein WP:CE39167
    • Using the xrefs file for C. elegans, CE39167 xrefs to WBGene00019801 aak-1
    • Using the best_blastp_hits file, CBP07510 best matches the human protein ENSEMBL:ENSP00000380317
    • Using the biomart.query file, translate the human protein ID ENSP00000380317 to 'protein kinase, AMP-activated, alpha 1 catalytic subunit' using the 3rd comma-separated value, 'Description'
    • Using the biomart.query file, fetch the HGNC symbol: PRKAA1 (5th comma-separated value)

Orthology sentence for Cbr-aak-1: Cbr-aak-1 encodes an ortholog of C. elegans aak-1 and human protein kinase, AMP-activated, alpha 1 catalytic subunit (HGNC:PRKAA1);

  • Example 2: CBG00001 (WBGene00023521, protein: CBP37911)
    • Using the xrefs file for C. briggsae, look up the protein for the given C. briggsae gene, which is CBP37911
    • Using the best_blastp_hits file, CBP37911 best matches the elegans protein WP:CE31585
    • Using the xrefs file for C. elegans, CE31585 xrefs to T04A6.1 WBGene00020200
    • Using the best_blastp_hits file, CBP37911 has no best match to any human protein (ENSEMBL).
    • Using orthologs.txt, there is no 'Homo sapiens' line with an ENSEMBL protein
    • Checking the elegans file c_elegans.PRJNA13758.WS246.orthologs.txt for a Homo sapiens protein also finds none, so there is no human orthology.

Orthology sentence for CBG00001: CBG00001 encodes an ortholog of C. elegans T04A6.1;
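
The lookup chain in the two examples can be sketched as below. The dictionaries are hypothetical stand-ins for the parsed xrefs, best_blastp_hits and biomart.query files (their real layouts are not reproduced here), and the function names are invented for illustration. The sketch assumes the human hit is looked up for the same briggsae protein as the elegans hit, and it does not model the extra orthologs.txt checks from Example 2.

```python
# Hypothetical in-memory tables holding only the values quoted in the examples.
BRIGGSAE_XREFS = {"WBGene00029106": "CBP07510", "WBGene00023521": "CBP37911"}
BEST_HITS = {
    "CBP07510": {"WP": "CE39167", "ENSEMBL": "ENSP00000380317"},
    "CBP37911": {"WP": "CE31585"},  # no human best hit
}
ELEGANS_XREFS = {"CE39167": "aak-1", "CE31585": "T04A6.1"}
HUMAN_NAMES = {
    "ENSP00000380317": (
        "protein kinase, AMP-activated, alpha 1 catalytic subunit",
        "PRKAA1",
    ),
}


def find_orthologs(gene_id):
    """Follow gene -> briggsae protein -> best hits -> elegans gene / human name."""
    protein = BRIGGSAE_XREFS[gene_id]
    hits = BEST_HITS.get(protein, {})
    elegans = ELEGANS_XREFS.get(hits.get("WP"))
    human = HUMAN_NAMES.get(hits.get("ENSEMBL"))
    return elegans, human


def orthology_sentence(briggsae_name, elegans=None, human=None):
    """Fill the orthology template; omit whichever partner is missing."""
    partners = []
    if elegans:
        partners.append(f"C. elegans {elegans}")
    if human:
        description, symbol = human
        partners.append(f"human {description} (HGNC:{symbol})")
    if not partners:
        return ""
    return f"{briggsae_name} encodes an ortholog of " + " and ".join(partners) + ";"


print(orthology_sentence("CBG00001", *find_orthologs("WBGene00023521")))
# CBG00001 encodes an ortholog of C. elegans T04A6.1;
```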

Human protein description rules (developed for elegans)

  • Rule 1: If the words 'family member' occur in the description with words both before and after them, then ignore 'family member'
    • Examples:
    • 'aldehyde dehydrogenase 8 family member a1' becomes 'aldehyde dehydrogenase 8a1'
    • 'aldehyde dehydrogenase 9 family member a1' becomes 'aldehyde dehydrogenase 9a1'

Jul 2, 2014:

  • Rule 2: If the words 'human Uncharacterized protein' occur, ignore this homology
    • Examples:
    • ctg-1 encodes an ortholog of human Uncharacterized protein;
    • mtp-18 encodes an ortholog of human Uncharacterized protein;
  • Rule 3: If 2 or more of these words occur: 'family', 'subfamily', 'group', 'member', 'polypeptide' or 'class', ignore them and resolve as in these examples:
    • olfactory receptor, family 56, subfamily B, member 1 becomes olfactory receptor 56B1
    • potassium intermediate/small conductance calcium-activated channel, subfamily N, member 2 becomes potassium intermediate/small conductance calcium-activated channel N2
    • potassium inwardly-rectifying channel, subfamily J, member 12 becomes potassium inwardly-rectifying channel J12
    • nuclear receptor subfamily 3, group C, member 2 becomes nuclear receptor 3C2
    • nuclear receptor subfamily 5, group A, member 2 becomes nuclear receptor 5A2
    • nuclear receptor subfamily 1, group H, member 4 becomes nuclear receptor 1H4
    • cytochrome P450, family 3, subfamily A, polypeptide 5 becomes cytochrome P450 3A5
    • cytochrome P450, family 21, subfamily A, polypeptide 2 becomes cytochrome P450 21A2
    • UDP glycosyltransferase 3 family, polypeptide A1 becomes UDP glycosyltransferase 3A1
    • mannosidase, alpha, class 1B, member 1 becomes mannosidase, alpha, 1B1
    • phosphatidylinositol glycan anchor biosynthesis, class V stays as is, because the word 'class' occurs by itself.
    • scavenger receptor class B, member 2 becomes scavenger receptor B2
  • Rule 4: If the word 'homolog' co-occurs with a species name ('Drosophila', 'S. cerevisiae' or 'yeast') inside brackets, ignore 'homolog' and move the species name, without the brackets, in front of 'human', joined by 'and'.
    • salvador homolog 1 (Drosophila) becomes Drosophila and human salvador 1
    • SSU72 RNA polymerase II CTD phosphatase homolog (S. cerevisiae) becomes S. cerevisiae and human SSU72 RNA polymerase II CTD phosphatase
    • translocase of outer mitochondrial membrane 22 homolog (yeast) becomes yeast and human translocase of outer mitochondrial membrane 22
    • vacuolar protein sorting 4 homolog B (S. cerevisiae) becomes S. cerevisiae and human vacuolar protein sorting 4
    • unconventional SNARE in the ER 1 homolog (S. cerevisiae) becomes S. cerevisiae and human unconventional SNARE in the ER 1
  • Rule 5: If the description field (that is, the human protein name) contains '(C. elegans)', the name refers to an elegans gene, which makes the description circular; ignore the description field and use the HGNC symbol instead (accession number lookup to symbol required).
    • Examples:
      • Data: WBGene00017948,mth-1,ENSEMBL:ENSP00000407190,ENSG00000166979,ENST00000435323,eva-1 homolog C (C. elegans) [Source:HGNC Symbol;Acc:13239],EVA1C,EVA1C,EVA1C-005,EVA1C,EVA 1 HOMOLOG C PRECURSOR FAM176C
      • Sentence: mth-1 encodes an ortholog of human EVA1C (HGNC:EVA1C);.
      • Data:WBGene00004895,smu-1,ENSEMBL:ENSP00000380336,ENSG00000122692,ENST00000397149,smu-1 suppressor of mec-8 and unc-52 homolog (C. elegans) [Source:HGNC Symbol;Acc:18247],SMU1,SMU1,SMU1-001,SMU1,WD40 REPEAT CONTAINING SMU1 SMU 1 SUPPRESSOR OF MEC 8 AND UNC 52 HOMOLOG
      • Sentence: smu-1 encodes an ortholog of human SMU1 (HGNC:SMU1);.
      • Data: WBGene00044079,tag-241,ENSEMBL:ENSP00000287482,ENSG00000156876,ENST00000287482,spindle assembly 6 homolog (C. elegans) [Source:HGNC Symbol;Acc:25403],SASS6,SASS6,SASS6-001,SASS6,SPINDLE ASSEMBLY ABNORMAL 6 HOMOLOG
      • Sentence: tag-241 encodes an ortholog of human SASS6 (HGNC:SASS6);.
  • Rule 6: Ignore the text string '<numeral>kDa' at the end of terms, together with the comma that precedes it, including when it appears within parentheses.
    • cleavage and polyadenylation specific factor 4, 30kDa becomes cleavage and polyadenylation specific factor 4
    • cleavage stimulation factor, 3' pre-RNA, subunit 1, 50kDa becomes cleavage stimulation factor, 3' pre-RNA, subunit 1
    • nucleoporin 153kDa becomes nucleoporin
  • Rule 7: Grab the HGNC symbol and place it at the end of the sentence, within parentheses, as ', (HGNC symbol:<HGNC symbol>)'
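
One possible way to encode the rules above is sketched here. It is an approximation under stated assumptions, not the production logic: the function names are invented, the regular expressions are guesses fitted to the listed examples, and the '(HGNC:<symbol>)' suffix follows the example sentences. Rule 1 is handled as a special case of Rule 3. The sketch always drops the comma before the joined designator, so the mannosidase example would come out as 'mannosidase, alpha 1B1', and it keeps a letter that follows 'homolog'.

```python
import re

KEYWORDS = {"family", "subfamily", "group", "member", "polypeptide", "class"}
# Short values such as '56', '1B', 'B' or 'A1' that get joined together.
DESIGNATOR = re.compile(r"(?:\d+[A-Za-z]?|[A-Za-z]\d*)")
KDA = re.compile(r",?\s*\(?\d+(?:\.\d+)?\s*kDa\)?")
HOMOLOG = re.compile(
    r"^(.*?)\s+homolog\b(.*?)\s*\((Drosophila|S\. cerevisiae|yeast)\)$"
)
SOURCE_TAG = re.compile(r"\s*\[Source:[^\]]*\]\s*$")


def collapse_designators(name):
    """Rules 1 and 3: drop the keyword words and join their values."""
    tokens = name.split()
    bare = [t.rstrip(",") for t in tokens]
    hits = [i for i, t in enumerate(bare) if t.lower() in KEYWORDS]
    if len(hits) < 2:
        return name  # a lone keyword such as 'class V' stays as is
    start = hits[0]
    if start > 0 and DESIGNATOR.fullmatch(tokens[start - 1]):
        start -= 1  # value written before its keyword, e.g. '3 family'
    values = [t for t in bare[start:] if t.lower() not in KEYWORDS]
    if start == 0 or not values or not all(DESIGNATOR.fullmatch(v) for v in values):
        return name  # unfamiliar shape: leave untouched rather than guess
    head = " ".join(tokens[:start]).rstrip(",")
    return f"{head} {''.join(values)}"


def human_phrase(description, symbol):
    """Return the 'human ...' part of the sentence, or None under Rule 2."""
    text = SOURCE_TAG.sub("", description.strip())
    if "uncharacterized protein" in text.lower():
        return None                                  # Rule 2
    if re.search(r"\(C\.\s*elegans\)", text):
        return f"human {symbol} (HGNC:{symbol})"     # Rule 5
    text = KDA.sub("", text)                         # Rule 6
    species = None
    match = HOMOLOG.match(text)
    if match:                                        # Rule 4
        text = match.group(1) + match.group(2)
        species = match.group(3)
    text = collapse_designators(text)                # Rules 1 and 3
    prefix = f"{species} and human" if species else "human"
    return f"{prefix} {text} (HGNC:{symbol})"        # Rule 7


print(human_phrase("eva-1 homolog C (C. elegans) [Source:HGNC Symbol;Acc:13239]", "EVA1C"))
# human EVA1C (HGNC:EVA1C)
print(collapse_designators("cytochrome P450, family 3, subfamily A, polypeptide 5"))
# cytochrome P450 3A5
```

Rule 5 is checked before the other rewrites so that a circular name is never partially edited; anything the designator logic does not recognise is returned unchanged.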

Rules for sentence construction: Process, Function and Component

For C. briggsae, all process, component and function GO terms currently carry the IEA evidence code and are based on InterPro domains.

Template for sentence for process, function and sub-cellular expression (cell component):

  • Only experimental data from the C. elegans orthologous gene will be involved in the briggsae description.

Based on protein domain information, <C. briggsae gene> is involved in <process term1>, <process term2> and <process term3>, is predicted to have <molecular function term1> and <molecular function term2>, and is predicted to be expressed in <component 1>; the C. elegans gene <orthologous elegans gene> is involved in <process1>, <process2> and <process3>, exhibits <activity1> and <activity2> and is expressed in <cell component>.

Input data file

1. C. briggsae gene association file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/ONTOLOGY/gene_association.WS246.wb.c_briggsae

  • Example 1: Cbr-aak-1 (CBG0691 WBGene00029106). Data from the above file:

  • WB WBGene00029106 Cbr-aak-1 GO:0004672 GO_REF:0000002 IEA InterPro:IPR000719|InterPro:IPR002290 F
  • WB WBGene00029106 Cbr-aak-1 GO:0005524 GO_REF:0000002 IEA InterPro:IPR000719|InterPro:IPR002290 F
  • WB WBGene00029106 Cbr-aak-1 GO:0006468 GO_REF:0000002 IEA InterPro:IPR000719|InterPro:IPR002290 P
  • WB WBGene00029106 Cbr-aak-1 GO:0016772 GO_REF:0000002 IEA InterPro:IPR011009 F
  • GO:0004672 = protein kinase activity
  • GO:0005524 = ATP binding
  • GO:0016772 = transferase activity, transferring phosphorus-containing groups (gets dropped because it is a parent).
  • GO:0006468 = protein phosphorylation
  • Sentence for Process, Function and Component

Based on protein domain information, Cbr-aak-1 is involved in protein phosphorylation and is predicted to have protein kinase and ATP binding activities; in C. elegans, aak-1 <build the sentence for elegans aak-1 using only experimental (non-IEA and non-ISS evidence codes) process, component and function GO terms>.

  • So full description for Cbr-aak-1 reads as:

Cbr-aak-1 encodes an ortholog of C. elegans aak-1 and human protein kinase, AMP-activated, alpha 1 catalytic subunit (HGNC:PRKAA1); based on protein domain information, Cbr-aak-1 is involved in protein phosphorylation and is predicted to have protein kinase and ATP binding activities; in C. elegans, aak-1 <build the sentence for elegans aak-1 using only experimental (non-IEA and non-ISS evidence codes) process, component and function GO terms>.
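
The briggsae half of this sentence could be assembled along these lines. The sketch makes several assumptions: the ancestor table is a hand-written stand-in for a real GO ontology lookup, the function names are invented, a trailing ' activity' is folded into one shared 'activity/activities' noun as in the example, and the comma after 'information' follows the template. The elegans half is left out because it depends on the elegans rules.

```python
# GO term names quoted in the example above.
GO_NAMES = {
    "GO:0004672": "protein kinase activity",
    "GO:0005524": "ATP binding",
    "GO:0016772": "transferase activity, transferring phosphorus-containing groups",
    "GO:0006468": "protein phosphorylation",
}
# Hypothetical ancestor table; a real run would derive it from the ontology.
ANCESTORS = {"GO:0004672": {"GO:0016772"}}


def drop_parents(go_ids):
    """Remove any term that is an ancestor of another term in the list."""
    redundant = set()
    for go_id in go_ids:
        redundant |= ANCESTORS.get(go_id, set())
    return [go_id for go_id in go_ids if go_id not in redundant]


def join_terms(items):
    """'a', 'a and b', 'a, b and c'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def function_phrase(names):
    """Share one 'activity/activities' noun across all function terms."""
    suffix = " activity"
    stems = [n[: -len(suffix)] if n.endswith(suffix) else n for n in names]
    noun = "activity" if len(stems) == 1 else "activities"
    return f"{join_terms(stems)} {noun}"


def domain_sentence(gene, processes, functions, components):
    """Build the 'based on protein domain information' clause for one gene."""
    clauses = []
    if processes:
        clauses.append("is involved in " + join_terms(processes))
    if functions:
        clauses.append("is predicted to have " + function_phrase(functions))
    if components:
        clauses.append("is predicted to be expressed in " + join_terms(components))
    if not clauses:
        return ""
    if len(clauses) == 3:
        body = f"{clauses[0]}, {clauses[1]}, and {clauses[2]}"
    else:
        body = " and ".join(clauses)
    return f"based on protein domain information, {gene} {body}"


functions = [GO_NAMES[g] for g in drop_parents(["GO:0004672", "GO:0005524", "GO:0016772"])]
processes = [GO_NAMES["GO:0006468"]]
print(domain_sentence("Cbr-aak-1", processes, functions, []))
```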


All rules developed for elegans process, function and sub-cellular localization (cell component) apply to briggsae as well.

Tab-delimited file for OA insert

Note change: A new column 'Species' needs to be added to the tab-delimited file to denote species

  • Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description, Inferred_automatically text, Species
  • Format: tab-delimited file; when a field holds multiple values, separate them with commas
  • The 'Species' column values need to be exactly as in any OA: Caenorhabditis elegans, Caenorhabditis briggsae, Pristionchus pacificus, etc. (to see the full postgres species list, go to http://tazendra.caltech.edu/~postgres/cgi-bin/oa/ontology_annotator.cgi, go to the disease OA and click on the 'Species' field drop-down).
  • Date is the last date on which the script was run to generate the automated descriptions (e.g., 2014-12-01)
  • File will be placed on textpresso-dev to be picked up by a cron job by JC
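
A row in this layout might be produced as in the sketch below. The dictionary keys mirror the column order listed above but are otherwise invented, the species set contains only the three names mentioned here (the full list lives in the OA), and no header line is written because none is specified.

```python
COLUMNS = [
    "WBGene",
    "Date",
    "Paper",
    "Accession_evidence",
    "Automated_concise_description",
    "Inferred_automatically",
    "Species",
]
# Partial list taken from this page; the OA drop-down is the authority.
KNOWN_SPECIES = {
    "Caenorhabditis elegans",
    "Caenorhabditis briggsae",
    "Pristionchus pacificus",
}


def format_row(record):
    """Render one record as a tab-delimited line, comma-joining multi-values."""
    if record["Species"] not in KNOWN_SPECIES:
        raise ValueError(f"unexpected species: {record['Species']!r}")
    fields = []
    for column in COLUMNS:
        value = record.get(column, "")
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        # A stray tab inside a value would shift every later column.
        fields.append(str(value).replace("\t", " "))
    return "\t".join(fields)


row = format_row({
    "WBGene": "WBGene00023521",
    "Date": "2014-12-01",
    "Paper": [],
    "Accession_evidence": [],
    "Automated_concise_description":
        "CBG00001 encodes an ortholog of C. elegans T04A6.1;",
    "Inferred_automatically": "",
    "Species": "Caenorhabditis briggsae",
})
print(row.split("\t")[0], row.split("\t")[-1])
```

Checking the species string before writing catches spelling drift early, since the values must match the OA exactly.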

Changes to the Concise Descriptions OA in order to store descriptions for multiple species

  • A new 'Species' field drop-down will be added to the OA after the WBGene field (between 'WBGene' and 'Curator' fields)
  • When 'new line' is hit, the 'Species' field will not auto-fill; the curator needs to choose the species

Dumping rules for the different types of descriptions in the OA

  • If a Concise_description exists for a gene, do not dump the Automated_description (Provisional_description always gets dumped)
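
The dumping rule can be written as a small decision function. This is a sketch only: the record is a hypothetical dictionary keyed by description type, and it assumes that an existing Concise_description is itself dumped, which the rule implies but does not state.

```python
def descriptions_to_dump(record):
    """Return the description types to dump for one gene, in dump order."""
    dumped = []
    if record.get("Concise_description"):
        # A manual concise description suppresses the automated one.
        dumped.append("Concise_description")
    elif record.get("Automated_description"):
        dumped.append("Automated_description")
    if record.get("Provisional_description"):
        dumped.append("Provisional_description")  # always dumped when present
    return dumped


print(descriptions_to_dump({
    "Concise_description": "manual text",
    "Automated_description": "automated text",
    "Provisional_description": "provisional text",
}))
# ['Concise_description', 'Provisional_description']
```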