Difference between revisions of "Automated descriptions for C. briggsae"

From WormBaseWiki
Line 152:
- ====Template for sentence for Process, Function and Cell component terms:====
+ ====Template for sentence for process, function and sub-cellular expression (cell component):====

Revision as of 23:34, 4 November 2014

Location of project-related files on Textpresso

http://textpresso-dev.caltech.edu/concise_descriptions/

Location of the concise description files for C. elegans:

  • For viewing the latest dump:

http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace

  • Script: /home/postgres/work/citace_upload/concise/dump_concise.pl
  • File location: /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace

Semantic categories in a Concise Description for C. briggsae

1. Orthology/Similarity to C. elegans and human
2. Processes
3. Molecular Function
4. Sub-cellular localization (Cell component)

Source files for homology data

Use WS246 files:

1. Best BlastP hits file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/c_briggsae.PRJNA10731.WS246.best_blastp_hits.txt.gz

2. Mapping of briggsae genes to proteins: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/c_briggsae.PRJNA10731.WS246.xrefs.txt.gz

3. Orthologs file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/annotation/c_briggsae.PRJNA10731.WS246.orthologs.txt.gz

4. Biomart Ensembl file for human protein names: File:Biomart query.pdf

  • Contact for orthology files: Michael Paulini

Source file for Process, Molecular Function and Sub-cellular localization (cell component) data

  • Process:
    • Need data from these rows:
      • where column 9 has value 'P' (Process)
      • column 2: DB_Object ID, e.g., WBGene00000307
      • column 3: DB_Object symbol, e.g., Cbr-bli-4
      • column 5: GO ID, e.g., GO:0006508
      • column 6: DB:Reference, e.g., PMID:12062106; take all pipe-separated references
      • column 7: Evidence code, e.g., IEA
      • column 8: 'With (or) From', e.g., INTERPRO:IPR000209
  • Molecular Function:
    • Need data from these rows:
      • where column 9 has value 'F' (Molecular Function)
      • column 2: DB_Object ID, e.g., WBGene00000307
      • column 3: DB_Object symbol, e.g., Cbr-bli-4
      • column 5: GO ID, e.g., GO:0004252
      • column 6: DB:Reference, e.g., PMID:12520011; take all pipe-separated references
      • column 7: Evidence code, e.g., IEA
      • column 8: 'With (or) From', e.g., INTERPRO:IPR000209
  • Sub-cellular localization (cell component):
    • Need data from these rows:
      • where column 9 has value 'C' (Cellular Component)
      • column 2: DB_Object ID, e.g., WBGene00000324
      • column 3: DB_Object symbol, e.g., Cbr-exp-2
      • column 5: GO ID, e.g., GO:0008076
      • column 6: DB:Reference, e.g., PMID:12520011; take all pipe-separated references
      • column 7: Evidence code, e.g., IEA
      • column 8: 'With (or) From', e.g., INTERPRO:IPR000209
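The column selection above can be sketched in Python as follows. This is only an illustration, not the project's actual script: it assumes the gene association file is tab-separated with comment lines starting with '!', and the function and key names (parse_gaf_line, group_by_gene, "gene_id" and so on) are hypothetical.

```python
from collections import defaultdict

# Column 9 values mapped to the three categories used in the description.
ASPECT_NAMES = {"P": "process", "F": "function", "C": "component"}


def parse_gaf_line(line):
    """Return the columns listed above as a dict, or None for lines to skip."""
    if line.startswith("!"):          # comment/header line (assumed convention)
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9 or cols[8] not in ASPECT_NAMES:
        return None
    return {
        "gene_id": cols[1],                 # column 2: DB_Object ID
        "symbol": cols[2],                  # column 3: DB_Object symbol
        "go_id": cols[4],                   # column 5: GO ID
        "references": cols[5].split("|"),   # column 6: all pipe-separated references
        "evidence": cols[6],                # column 7: evidence code
        "with_from": cols[7],               # column 8: With (or) From
        "aspect": ASPECT_NAMES[cols[8]],    # column 9: P, F or C
    }


def group_by_gene(lines):
    """Collect parsed rows per gene ID and per category."""
    grouped = defaultdict(lambda: defaultdict(list))
    for line in lines:
        rec = parse_gaf_line(line)
        if rec is not None:
            grouped[rec["gene_id"]][rec["aspect"]].append(rec)
    return grouped
```

Calling group_by_gene on an open file handle yields, for each WBGene ID, separate lists of process, function and component rows that can then feed the sentence templates.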

Template and Rules for a C. briggsae gene description

Order of sentences:

  • Orthology
  • Process
  • Function/identity
  • Component

Rules for sentence construction: Orthology

Template of orthology sentence

<briggsae gene> encodes an ortholog of C. elegans <elegans gene> and human <human gene>;

Input data files

1. Best BlastP hits file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/c_briggsae.PRJNA10731.WS246.best_blastp_hits.txt.gz

2. Mapping of briggsae genes to proteins: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/c_briggsae.PRJNA10731.WS246.xrefs.txt.gz

3. Orthologs file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_briggsae/PRJNA10731/annotation/c_briggsae.PRJNA10731.WS246.orthologs.txt.gz

Finding the elegans ortholog for a C. briggsae gene:

  • Example 1: Cbr-aak-1 (CBG0691 WBGene00029106)
    • Using the xrefs file for C. briggsae, look up the protein for the given C. briggsae gene, which is CBP07510
    • Using the best_blastp_hits file, CBP07510 best matches the elegans protein WP:CE39167
    • Using the xrefs file for C. elegans, CE39167 xrefs to WBGene00019801 aak-1
    • Using the best_blastp_hits file, CBP07510 best matches the human protein ENSEMBL:ENSP00000380317
    • Using the biomart.query file, translate the human protein ID ENSP00000380317 to 'protein kinase, AMP-activated, alpha 1 catalytic subunit' using the 3rd comma-separated value, 'Description'
    • Using the biomart.query file, fetch the HGNC symbol PRKAA1 (5th comma-separated value)

Orthology sentence for Cbr-aak-1: Cbr-aak-1 encodes an ortholog of C. elegans aak-1 and human protein kinase, AMP-activated, alpha 1 catalytic subunit (HGNC:PRKAA1);

  • Example 2: CBG00001 (WBGene00023521, protein: CBP37911)
    • Using the xrefs file for C. briggsae, look up the protein for the given C. briggsae gene, which is CBP37911
    • Using the best_blastp_hits file, CBP37911 best matches the elegans protein WP:CE31585
    • Using the xrefs file for C. elegans, CE31585 xrefs to T04A6.1 WBGene00020200
    • Using the best_blastp_hits file, CBP37911 has no best match to any human protein (ENSEMBL).
    • Using orthologs.txt, there is no 'Homo sapiens' line with an ENSEMBL protein for this gene.
    • Using the C. elegans file c_elegans.PRJNA13758.WS246.orthologs.txt, check for a Homo sapiens protein for the elegans ortholog; none is found, so no human orthology is stated.

Orthology sentence for CBG00001: CBG00001 encodes an ortholog of C. elegans T04A6.1;
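The lookup chain in the two examples can be sketched as below. This is a minimal illustration that assumes the xrefs, best_blastp_hits and biomart files have already been loaded into plain dictionaries; the dictionary layouts and the function names (find_orthologs, orthology_sentence) are hypothetical, and the parsing of the actual files is not shown.

```python
def find_orthologs(gene_id, gene_to_protein, best_hits,
                   elegans_protein_to_gene, human_names):
    """Follow gene -> protein -> best BlastP hits -> elegans gene / human name.

    gene_to_protein:          briggsae gene ID -> briggsae protein ID
    best_hits:                briggsae protein ID -> {"elegans": "WP:...", "human": "ENSEMBL:..."}
    elegans_protein_to_gene:  elegans protein ID -> elegans gene name
    human_names:              ENSP ID -> (description, HGNC symbol)
    """
    protein = gene_to_protein[gene_id]
    hits = best_hits.get(protein, {})

    elegans_gene = None
    if hits.get("elegans"):
        # Drop the database prefix, e.g. "WP:CE39167" -> "CE39167".
        elegans_gene = elegans_protein_to_gene.get(hits["elegans"].split(":")[-1])

    human = None
    if hits.get("human"):
        human = human_names.get(hits["human"].split(":")[-1])
    return elegans_gene, human


def orthology_sentence(gene_name, elegans_gene, human):
    """Fill the orthology template; returns None if no ortholog was found."""
    parts = []
    if elegans_gene:
        parts.append(f"C. elegans {elegans_gene}")
    if human:
        description, symbol = human
        parts.append(f"human {description} (HGNC:{symbol})")
    if not parts:
        return None
    return f"{gene_name} encodes an ortholog of " + " and ".join(parts) + ";"
```

With the values from Example 2, where no human hit exists, the sentence reduces to the C. elegans part only.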

Human protein description rules (developed for elegans)

  • Rule 1: If the words 'family member' occur in the description with words both before and after them, drop 'family member' and join the terms on either side
    • Examples:
    • 'aldehyde dehydrogenase 8 family member a1' becomes 'aldehyde dehydrogenase 8a1'
    • 'aldehyde dehydrogenase 9 family member a1' becomes 'aldehyde dehydrogenase 9a1'
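Rule 1 can be expressed as a single regular-expression substitution. The sketch below is one possible reading of the rule, and the function name drop_family_member is hypothetical.

```python
import re


def drop_family_member(description):
    """Remove 'family member' and join the terms on either side of it."""
    # Requires a non-space character on both sides, so a description that
    # merely ends with 'family member' is left unchanged.
    return re.sub(r"(\S)\s+family member\s+(\S)", r"\1\2", description)
```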

Jul 2, 2014:

  • Rule 2: If the words 'human Uncharacterized protein' occur, ignore this homology
    • Examples:
    • ctg-1 encodes an ortholog of human Uncharacterized protein;
    • mtp-18 encodes an ortholog of human Uncharacterized protein;
  • Rule 3: If 2 or more of these words occur: 'family', 'subfamily', 'group', 'member', 'polypeptide' or 'class', ignore them and resolve as in these examples:
    • olfactory receptor, family 56, subfamily B, member 1 becomes olfactory receptor 56B1
    • potassium intermediate/small conductance calcium-activated channel, subfamily N, member 2 becomes potassium intermediate/small conductance calcium-activated channel N2
    • potassium inwardly-rectifying channel, subfamily J, member 12 becomes potassium inwardly-rectifying channel J12
    • nuclear receptor subfamily 3, group C, member 2 becomes nuclear receptor 3C2
    • nuclear receptor subfamily 5, group A, member 2 becomes nuclear receptor 5A2
    • nuclear receptor subfamily 1, group H, member 4 becomes nuclear receptor 1H4
    • cytochrome P450, family 3, subfamily A, polypeptide 5 becomes cytochrome P450 3A5
    • cytochrome P450, family 21, subfamily A, polypeptide 2 becomes cytochrome P450 21A2
    • UDP glycosyltransferase 3 family, polypeptide A1 becomes UDP glycosyltransferase 3A1
    • mannosidase, alpha, class 1B, member 1 becomes mannosidase, alpha, 1B1
    • phosphatidylinositol glycan anchor biosynthesis, class V stays as is, because the word 'class' occurs by itself.
    • scavenger receptor class B, member 2 becomes scavenger receptor B2
  • Rule 4: If the word 'homolog' co-occurs with a species name ('Drosophila', 'S. cerevisiae' or 'yeast') inside brackets, ignore 'homolog' and move the species name, without the brackets, to the front.
    • salvador homolog 1 (Drosophila) becomes Drosophila and human salvador 1
    • SSU72 RNA polymerase II CTD phosphatase homolog (S. cerevisiae) becomes S. cerevisiae and human SSU72 RNA polymerase II CTD phosphatase
    • translocase of outer mitochondrial membrane 22 homolog (yeast) becomes yeast and human translocase of outer mitochondrial membrane 22
    • vacuolar protein sorting 4 homolog B (S. cerevisiae) becomes S. cerevisiae and human vacuolar protein sorting 4 B
    • unconventional SNARE in the ER 1 homolog (S. cerevisiae) becomes S. cerevisiae and human unconventional SNARE in the ER 1
  • Rule 5: If the description field (the human protein name) contains '(C. elegans)', it refers back to an elegans gene and is circular; ignore the description field and use the HGNC symbol instead (accession number lookup to symbol required).
    • Examples:
      • Data: WBGene00017948,mth-1,ENSEMBL:ENSP00000407190,ENSG00000166979,ENST00000435323,eva-1 homolog C (C. elegans) [Source:HGNC Symbol;Acc:13239],EVA1C,EVA1C,EVA1C-005,EVA1C,EVA 1 HOMOLOG C PRECURSOR FAM176C
      • Sentence: mth-1 encodes an ortholog of human EVA1C (HGNC:EVA1C);
      • Data: WBGene00004895,smu-1,ENSEMBL:ENSP00000380336,ENSG00000122692,ENST00000397149,smu-1 suppressor of mec-8 and unc-52 homolog (C. elegans) [Source:HGNC Symbol;Acc:18247],SMU1,SMU1,SMU1-001,SMU1,WD40 REPEAT CONTAINING SMU1 SMU 1 SUPPRESSOR OF MEC 8 AND UNC 52 HOMOLOG
      • Sentence: smu-1 encodes an ortholog of human SMU1 (HGNC:SMU1);
      • Data: WBGene00044079,tag-241,ENSEMBL:ENSP00000287482,ENSG00000156876,ENST00000287482,spindle assembly 6 homolog (C. elegans) [Source:HGNC Symbol;Acc:25403],SASS6,SASS6,SASS6-001,SASS6,SPINDLE ASSEMBLY ABNORMAL 6 HOMOLOG
      • Sentence: tag-241 encodes an ortholog of human SASS6 (HGNC:SASS6);
  • Rule 6: Ignore the text string '<numeral>kDa' at the end of terms, together with the comma before it, including when it appears within parentheses.
    • cleavage and polyadenylation specific factor 4, 30kDa becomes cleavage and polyadenylation specific factor 4
    • cleavage stimulation factor, 3' pre-RNA, subunit 1, 50kDa becomes cleavage stimulation factor, 3' pre-RNA, subunit 1
    • nucleoporin 153kDa becomes nucleoporin
  • Rule 7: Grab the HGNC symbol and place it at the end of the sentence, within parentheses, as '(HGNC:<HGNC symbol>)'
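Rules 2 to 7 can be chained into one clean-up step, sketched below. This is an illustrative reading of the rules rather than the production code: all function names are hypothetical, the order in which the rules are applied is an assumption, and the '(HGNC:<symbol>)' suffix follows the form used in the example sentences. One known difference: the sketch drops the comma before the first classifier word, so the mannosidase example would come out as 'mannosidase, alpha 1B1'.

```python
import re

CLASSIFIERS = {"family", "subfamily", "group", "member", "polypeptide", "class"}
HOMOLOG_RE = re.compile(
    r"^(.*?)\s+homolog\b(.*?)\s*\((Drosophila|S\. cerevisiae|yeast)\)\s*$")


def strip_kda(desc):
    """Rule 6: drop a trailing '<numeral>kDa', its comma and any parentheses."""
    return re.sub(r",?\s*\(?\d+kDa\)?\s*$", "", desc)


def move_species(desc):
    """Rule 4: return (species, name without 'homolog'), or (None, desc)."""
    m = HOMOLOG_RE.match(desc)
    if not m:
        return None, desc
    return m.group(3), (m.group(1) + m.group(2)).strip()


def compact_classifiers(desc):
    """Rule 3: with 2+ classifier words, drop them and join their values."""
    words = desc.split()
    idx = [i for i, w in enumerate(words) if w.strip(",").lower() in CLASSIFIERS]
    if len(idx) < 2:
        return desc                      # e.g. 'class V' on its own stays as is
    first = idx[0]
    # A bare number right before the first classifier ('3 family') is part of the code.
    if first > 0 and words[first - 1].isdigit():
        first -= 1
    head = " ".join(words[:first]).rstrip(",")
    code = "".join(w.strip(",") for w in words[first:]
                   if w.strip(",").lower() not in CLASSIFIERS)
    return f"{head} {code}"


def human_phrase(description, symbol):
    """Return the human part of the orthology sentence, or None (Rule 2)."""
    if "uncharacterized protein" in description.lower():      # Rule 2
        return None
    if "(C. elegans)" in description or "(C.elegans)" in description:  # Rule 5
        return f"human {symbol} (HGNC:{symbol})"
    desc = strip_kda(description)                              # Rule 6
    species, desc = move_species(desc)                         # Rule 4
    desc = compact_classifiers(desc)                           # Rule 3
    prefix = f"{species} and human" if species else "human"
    return f"{prefix} {desc} (HGNC:{symbol})"                  # Rule 7
```

Keeping each rule in its own small function makes it easy to test a rule against the examples listed for it and to change the order of application if the rules interact differently in practice.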

Rules for sentence construction: Process, Function and Component

For C. briggsae, currently all process, component and function GO terms are IEA-evidence code, based on INTERPRO domains.

Template for sentence for process, function and sub-cellular expression (cell component):

  • Only experimental data from the C. elegans orthologous gene will be included in the briggsae description.

Based on protein domain information, <C. briggsae gene> is involved in <process term1>, <process term2> and <process term3>, is predicted to have <molecular function term1> and <molecular function term2>, and is predicted to be expressed in <component 1>; the C. elegans gene <orthologous elegans gene> is involved in <process1>, <process2> and <process3>, exhibits <activity1> and <activity2> and is expressed in <cell component>.
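The template above could be filled in along the following lines. This is a sketch only: the function names are hypothetical, and skipping a clause when a category has no terms is an assumption that the template itself does not state.

```python
def join_list(items, last_sep=" and "):
    """Join items as 'a, b and c'; last_sep sets the final separator."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + last_sep + items[-1]


def go_sentence(gene, processes, functions, components,
                elegans_gene=None, el_processes=(), el_functions=(),
                el_components=()):
    """Build the process/function/component sentence for a C. briggsae gene."""
    predicted = []
    if processes:
        predicted.append("is involved in " + join_list(processes))
    if functions:
        predicted.append("is predicted to have " + join_list(functions))
    if components:
        predicted.append("is predicted to be expressed in " + join_list(components))
    if not predicted:
        return None
    sentence = (f"Based on protein domain information, {gene} "
                + join_list(predicted, ", and "))

    # Second half: experimental data from the C. elegans ortholog, if any.
    experimental = []
    if el_processes:
        experimental.append("is involved in " + join_list(el_processes))
    if el_functions:
        experimental.append("exhibits " + join_list(el_functions))
    if el_components:
        experimental.append("is expressed in " + join_list(el_components))
    if elegans_gene and experimental:
        sentence += f"; the C. elegans gene {elegans_gene} " + join_list(experimental)
    return sentence + "."
```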

Input data file

C. briggsae gene association file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/ONTOLOGY/gene_association.WS246.wb.c_briggsae

All rules developed for elegans process, function and sub-cellular localization (cell component) apply to briggsae as well.