Difference between revisions of "Concise Descriptions"
Revision as of 18:16, 16 August 2011
- 1 Concise Description Curation
- 1.1 Concise Descriptions for Genes Involved in Core Biological Processes
- 1.2 Genes with Recent Publications and No Concise Description
- 1.3 Keeping Gene Names Up-to-Date
- 1.4 Building an Ontology Annotator interface for concise description curation
- 1.5 OA Testing
Concise Description Curation
Concise Descriptions for Genes Involved in Core Biological Processes
We can also annotate genes that are related by virtue of being involved in core biological processes, e.g. transcriptional regulation, cell cycle, etc. There are some annotations for individual genes involved in these processes, but the processes themselves were never annotated in a systematic fashion, so it will be good to fill in the remaining gaps.
Also, although these genes are involved in core processes, some of them have not been extensively studied in C. elegans, so there aren't many associated papers to read. However, by virtue of sequence conservation their function can be annotated in C. elegans, and the resulting annotations will be useful for propagation to other nematode genomes.
WormBook chapters will be a good source of gene lists for these annotations; links to select chapters that contain lists of relevant genes are listed below:
- Translation: Translation Mechanisms
- Transcription: Transcription Mechanisms
- Transcription Factors: Transcriptional Regulation - Kimberly
- Kinases: Protein Kinases
Genes with Recent Publications and No Concise Description
The list below contains genes for which a paper has recently been published but for which we do not yet have a concise description. The paper in parentheses after the gene name is the recent paper that prompted adding the gene to the list. It may not be the only relevant publication, so we may need to check other papers to write a complete description.
If you'd like to work on one of these genes, feel free to check it out on the concise description check-out form and remove it from the list here.
Please also feel free to add genes to the list if you come across them in the context of other curation but can't write a concise description right away. Someone else may be able to pick up that gene and write the description in the meantime.
- pix-1 (see WBPaper00038193)
- git-1 (see WBPaper00038193)
- sepa-1 (see WBPaper00033110)
- maco-1 (see WBPaper00038258)
- tfg-1 (see WBPaper00038310)
- sec-13 and other sec- secretion pathway genes (see WBPaper00038310)
- gcy-28 (see WBPaper00038243)
- pptr-1 (see WBPaper00032946)
- alr-1 (see WBPaper00038207)
- crtc-1 (see WBPaper00038172)
- snf-12 (see WBPaper00038424)
- lin-54 (see WBPaper00028753)
- lin-53 (see WBPaper00028573)
- cnt-2 (see WBPaper00038432)
- orai-1 (see WBPaper00028994)
- arp2/3 complex (arx 1-7), (WBPaper00005843, WBPaper00039779, WBPaper00035433, WBPaper00005843, WBPaper00035279)
- phy-1 (dpy-18)
- 1. Gene : emb-1 (WBGene00001255)
- 2. Gene : emb-1 (K10D2.4; apc-16; WBGene00019630)
- (crl2) LPR-1
not yet completed
Keeping Gene Names Up-to-Date
Concise descriptions typically begin with the name of the gene, either its CGC name, e.g. unc-7, or its sequence name, e.g. Y24F12A.2. Sometimes the CGC name changes, but more frequently, the gene with a sequence name acquires a CGC name due to more intensive study or characterization.
- To keep the names up-to-date in the concise descriptions, there are two emails to check:
- Check each email from the firstname.lastname@example.org and pay particular attention to the emails with subject heading NAMEDB: CGC added to WBGenennnnnnnn. These are typically the cases where a CGC name is added to a gene that previously was known only by its sequence name. If we've written a concise description for this gene using the sequence name, then I update the description to now use the CGC name using the concise_description_new_cgi.
- Mary Ann Tuli and Jonathan Hodgkin send an email, on which they cc Cecilia Nakamura and me, confirming all of the updated persons, labs, and gene names as recorded by the CGC. In the body of the email is a list of new gene names and assignments that I also check. Note that there is some redundancy here: any new gene name mentioned in the updates email should also be added to the name server and thus come through as a NAMEDB email. In my experience, though, a little redundancy is not always a bad thing and helps to keep things from falling through the cracks. As above, I make any necessary gene name changes using the concise_description_new_cgi.
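The subject-line check described above could be scripted roughly as follows. This is only a sketch: it assumes the subject has exactly the form "NAMEDB: CGC added to WBGene" followed by eight digits, and the function name is hypothetical, not part of any existing WormBase tool.

```python
import re

# Assumed subject format, taken from the description above:
# "NAMEDB: CGC added to WBGenennnnnnnn" (eight digits after WBGene).
NAMEDB_SUBJECT = re.compile(r"NAMEDB: CGC added to (WBGene\d{8})\b")


def gene_id_from_subject(subject):
    """Return the WBGene ID named in a NAMEDB subject line, or None."""
    match = NAMEDB_SUBJECT.search(subject)
    return match.group(1) if match else None


# Example: only the first subject is a CGC-name notification.
subjects = [
    "NAMEDB: CGC added to WBGene00019630",
    "Weekly meeting reminder",
]
flagged = [gene_id_from_subject(s) for s in subjects]
```

The returned IDs would still need to be compared by hand against the genes that already have a concise description written under a sequence name.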
Building an Ontology Annotator interface for concise description curation
- Curation starts by querying for a gene. The WBGene field is an autocomplete (not a dropdown); its term info is populated from the gene info in the gin_ tables, not from an OBO file.
- Querying for a gene brings up data from postgres or tells you there are no matches. To make a new annotation, click the New button to generate curator, date, pgid, and curator history.
- Order and description of fields for the Concise Description OA (name of postgres table in italics):
- Field 1: WBGene con_wbgene: This is an ontology with finite values, has term information
- Field 2: Curator con_curator: This is a drop-down with no term information. It can currently hold only a single value; for parsing old data it may need to hold multiple values (see testing below). Since the values are limited, they are hard-coded into postgres. A person must be in the curator list for their name to show here. Cecilia created a WBPerson for unknown curators, Unknown Curator (WBPerson13481), so we can still populate this field when we don't know who the original curator was.
- Field 3: Curator History con_curhistory: This is an ontology with term information. It does not have its own obo tables; it uses the history table for the curator field (con_curator_hst) in postgres and stores the PGID of the annotation. The field allows editing but should not be edited: changing the value would mean that anyone else clicking on term information sees a different value.
- Field 4: Description_type con_desctype : This is a dropdown, has limited values, so hard-coded in postgres, has no term information, values are:
- Field 5: Description Text con_desctext : This field is a big text box, it expands when selected. How is the size (specifically horizontal width) of this box determined? It will contain the most information in the editor, but can't be expanded horizontally? This is hardcoded generically for all configurations, I'll have to change this later on -- J
Fixed - 07/31/2011.
- Field 6: Paper con_paper: This is a multi-ontology, has limited values, works off the paper tables in postgres, has term information.
- Field 7: Accession Evidence con_accession: This is a text field, values are separated by commas.
- Field 8: Comment con_comment: bigtext -- This is a place to store internal comments.
- Field 9: Last Updated con_lastupdate: This is a text field. When 'New' is clicked it is auto-populated with the date, truncated at seconds. It can be manually edited.
- Field 10: PGID no table: postgres Id of the annotation row, cannot be edited.
- Field 11: No dump con_nodump: This is a toggle field, when clicked on, it turns bright red, indicating that this annotation row will not be dumped in the .ace file for upload.
- Field 12: WBPerson_evidence con_person : This is a multiontology, limited values, comes from Person tables in postgres, has term information (Person Name).
- Field 13: Expr_pattern_evidence con_exprtext : text field populated using the format Exprnnnn where n is a number; these values correspond to the WormBase ID of the Expr_pattern object; individual entries are comma-separated.
- Field 14: RNAi_evidence con_rnai : text field populated using the format WBRNAinnnnnnnn where n is a number; these values correspond to the WormBase ID of the RNAi experiment; individual entries are comma-separated
- Field 15: Gene_regulation_evidence con_genereg : text field populated using the name of the gene regulation object in WormBase; individual entries are comma-separated
- Field 16: Microarray_results_evidence con_microarray : text field populated using the name of the microarray results object in WormBase; individual entries are comma-separated
Mapping Other Evidences to Fields in the OA
- Accession Evidence = any entries WP:CEnnnnn or CEnnnnn where n is a number. Also: UniProt:P11586
- WBPerson_evidence = any evidence preceded by WBPerson
- Expr_pattern_evidence = any evidence in the form Exprnnnn where n is a number
- RNAi_evidence = any evidence in the form WBRNAinnnnnnnn where n is a number
- Gene_regulation_evidence = cgc6432_F47G4.3
- Microarray_results_evidence = SMD_K07E3.3
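The mapping above can be sketched as a small pattern table. The first four patterns follow the formats stated in the list. The last two are assumptions generalised from the single example given for each (a "cgc" number prefix for gene regulation objects, an "SMD_" prefix for microarray results) and would need checking against real object names. The function name is hypothetical.

```python
import re

# (postgres table, pattern) pairs, tried in order.
# The con_genereg and con_microarray patterns are guesses based on one
# example each (cgc6432_F47G4.3 and SMD_K07E3.3), not a documented format.
EVIDENCE_PATTERNS = [
    ("con_accession", re.compile(r"^(?:(?:WP:)?CE\d+|UniProt:\w+)$")),
    ("con_person", re.compile(r"^WBPerson\d+$")),
    ("con_exprtext", re.compile(r"^Expr\d+$")),
    ("con_rnai", re.compile(r"^WBRNAi\d{8}$")),
    ("con_genereg", re.compile(r"^cgc\d+_\S+$")),
    ("con_microarray", re.compile(r"^SMD_\S+$")),
]


def field_for_evidence(evidence):
    """Return the OA table an evidence string maps to, or None if unmapped."""
    value = evidence.strip()
    for field, pattern in EVIDENCE_PATTERNS:
        if pattern.match(value):
            return field
    return None


# Example: sort a mixed evidence list into per-field buckets.
buckets = {}
for item in ["WP:CE12345", "WBPerson13481", "Expr1234", "SMD_K07E3.3"]:
    buckets.setdefault(field_for_evidence(item), []).append(item)
```

Returning None for unmatched strings leaves them visible for manual review instead of forcing them into a field.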
Check Data Constraints
Each pgid must have:
- Last Updated
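A check for this constraint might look like the sketch below. It assumes the annotation rows have already been loaded into a dictionary keyed by pgid; that in-memory layout and the function name are hypothetical, and only the con_lastupdate table name comes from the field list above.

```python
def pgids_missing_last_updated(rows):
    """Return the sorted pgids whose Last Updated value is absent or blank.

    rows maps pgid -> {table name: value}; this layout is an assumption
    made for illustration, not how the OA stores data.
    """
    return sorted(
        pgid
        for pgid, fields in rows.items()
        if not (fields.get("con_lastupdate") or "").strip()
    )


rows = {
    1: {"con_lastupdate": "2011-08-16 18:16:00"},
    2: {"con_lastupdate": ""},
    3: {},
}
missing = pgids_missing_last_updated(rows)
```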
Dumper currently in sandbox at : /home/postgres/work/pgpopulation/concise_description/20110722_newOA/dump_concise.pl
outputfile is called 'ace'
Please check it out / test in acedb, and then we need to document it
Testing .ace dumper
- Tags are being dumped alphabetically instead of Concise, then Provisional, then Biological_process, etc. If we want a specific order (like we currently have) we will need to specify this in the dumping script? -kimberly
- Check output of Date_last_updated in Evidence - WormBase model uses DateType, but .ace output includes time of day. -kimberly
- August 9th, Changes to the dumper:
We will use the Database and Accession_number tags instead of Accession_evidence for Human_disease_relevance descriptions, so we will need to make a change to the .ace dumper. This is based on my (Ranjana) e-mail exchange with Paul Davis on 08/09/2011.
Question to Paul: So for the .ace file could I do this (taking cup-5 as an example):
Gene : "WBGene00000846"
Human_disease_relevance "The cup-5 gene encodes an ortholog of the human mucolipin 1 gene which is mutated in Mucolipidosis type IV; human mucolipin 1 functions as a pH-modulated non-selective cation channel." Database "OMIM" Accession_number "605248"
So my questions for the .ace file are:
-- OMIM database information already exists in WormBase, so the acronym "OMIM" is fine?
-- If so, then I don't have to include the URL tag and its value "http://www.ncbi.nlm.nih.gov/omim" for every entry?
-- Do the database mappings between OMIM accession numbers and OMIM web pages already exist in WormBase? If so then I don't have to include the URL_constructor tag for each accession, right?
-- Otherwise for every accession I would have to include the URL_constructor? So for this accession number it would be "http://omim.org/entry/605248"
Paul's answer: Yes, you are right, the OMIM database object is populated in the build ... not very well (it contains the old NCBI data), but we can address this issue. As far as the postgres to .ace dump goes, what you have in the example is fine; you don't need anything else.
The url constructor is contained in the ?Database object so we don't need any additional .ace from your end.
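A minimal sketch of how the dumper could emit an entry in the form agreed above. The actual dumper is a Perl script; Python is used here only for illustration. The function name is hypothetical, and two details are assumptions: that the object header sits on its own line with all tags on the following line, and that embedded double quotes are escaped with a backslash.

```python
def human_disease_ace(gene_id, description, database, accession):
    """Format one Human_disease_relevance entry as .ace text.

    The line layout and the backslash escaping of quotes are assumptions;
    neither is confirmed by the e-mail exchange above.
    """

    def quote(value):
        return '"' + str(value).replace('"', '\\"') + '"'

    header = "Gene : " + quote(gene_id)
    tag_line = " ".join(
        [
            "Human_disease_relevance",
            quote(description),
            "Database",
            quote(database),
            "Accession_number",
            quote(accession),
        ]
    )
    return header + "\n" + tag_line + "\n"


# The cup-5 example from the question to Paul (description shortened here).
entry = human_disease_ace(
    "WBGene00000846",
    "The cup-5 gene encodes an ortholog of the human mucolipin 1 gene.",
    "OMIM",
    "605248",
)
```

No URL or URL_constructor tag is written, in line with Paul's answer that the ?Database object already carries the URL constructor.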
- Curator history showing all curators?
When there are multiple curators attached to an annotation, are they all still associated with it? Check WBGene00000035 for an example. In the old form, both Carol and Tuco are associated (same time stamp), but in the new form I only see Tuco. Alternatively, should the Curator field allow for more than one value in the concise OA?
I see. Well, we could try to put both curators, but it would be random which one would be the "last" curator, which would be the only one that would show in the curator field, while both would show in the (current, hopefully will be replaced) curator history 'ontology' field, and the .ace file. We could make it a multivalue curator; I'm still not sure of all the ramifications of that, but we could. One partial problem is that we would then have all the curators in two fields. It would be tempting to get rid of the other field, but the other field is necessary to keep track of all the timestamps. I'm not really sure which way would be best for this field.
I think the main issue for me is being able to query the form using a curator name and be confident that I'm really getting all of the entries attached to that curator. If the curator field was a multivalue field, then when there is more than one curator with the same time stamp attached, we could query that field with a curator name, and know we were getting all of their associated entries. --Kimberly
- Is Curator history queryable by name?
No, all it holds is a pgid. If you want this queryable this should be a text field that can be autopopulated when a new field gets created, but has to be manually edited to say whatever you want on future edits. This would really be better in making the OA more intuitive, you can see the data in the dataTable, and it would be queryable. Unless you edit a lot of entries, it's probably the best solution, just think of it as another Last_Updated field.
I think the main issue here is that we don't want to accidentally change or somehow muck up the curator_history table. If curator could be multivalue and I could query the form for curator using that field, then I'm fine with leaving the curator_history table display as it is in the current version of the form, i.e. a pgid. --Kimberly
- For descriptions attached to invalid/dead WBGene IDs, can we still display the WBGene ID in the form?
If the genes exist in postgres (/ the nameserver) the WBGene IDs will show. Right now a WBGene fails to show only if there's no entry for it in gin_wbgene in postgres.
Still waiting on an answer from Sanger. 08/01/2011 --Kimberly
- Is there a way to indicate in the form that these annotations are attached to now invalid IDs?
If you mean that the WBGenes are invalid, then no, you'd have to look at each wbgene and look at the term info. The dumper could have some sort of check, but it pretty much already had that before, and I thought those errors were getting ignored anyway.
So we couldn't make a row of annotation attached to an invalid gene, pink or something :-)? This is fine - the dumper was set up to comment out any annotations attached to invalid gene IDs and the list of errors could be used to see which genes might be in need of an update. --Kimberly
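Three of the dumper points raised in the testing notes above (explicit tag order, date-only Date_last_updated, and commenting out annotations on invalid gene IDs) can be sketched together. The actual dumper is a Perl script; this Python version is illustrative only. The tag names are the ones used in the note, the timestamp format "YYYY-MM-DD HH:MM:SS" is assumed, "//" is used as the .ace comment marker, and all function names and the line layout are hypothetical.

```python
from datetime import datetime

# Order named in the testing note; unlisted tags follow alphabetically.
TAG_ORDER = ["Concise", "Provisional", "Biological_process"]


def tag_sort_key(tag):
    """Sort listed tags by their position in TAG_ORDER, others by name."""
    if tag in TAG_ORDER:
        return (0, TAG_ORDER.index(tag), "")
    return (1, 0, tag)


def date_only(timestamp):
    """Drop the time of day from an assumed 'YYYY-MM-DD HH:MM:SS' string."""
    return datetime.fromisoformat(timestamp).date().isoformat()


def dump_gene(gene_id, descriptions, valid_genes):
    """Return (.ace text, error message or None) for one gene.

    descriptions maps tag -> (text, last_updated); valid_genes is a set of
    WBGene IDs considered valid. Entries for other IDs are commented out.
    """
    lines = ['Gene : "%s"' % gene_id]
    for tag in sorted(descriptions, key=tag_sort_key):
        text, last_updated = descriptions[tag]
        lines.append(
            '%s "%s" Date_last_updated "%s"' % (tag, text, date_only(last_updated))
        )
    if gene_id in valid_genes:
        return "\n".join(lines) + "\n", None
    commented = "\n".join("// " + line for line in lines) + "\n"
    return commented, "invalid gene ID: " + gene_id
```

Collecting the returned error messages gives the list of genes that may be in need of an update, as described above.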