Difference between revisions of "Concise Descriptions"

From WormBaseWiki
Jump to navigationJump to search
Line 774: Line 774:
 
     Bm33 encodes a pepsin inhibitor; Bm33 elicits an antibody response in hosts infected with Brugia malayi and the presence of host
 
     Bm33 encodes a pepsin inhibitor; Bm33 elicits an antibody response in hosts infected with Brugia malayi and the presence of host
 
     antibodies may be used for detection of infection.
 
     antibodies may be used for detection of infection.
 +
 +
    Bm14 encodes a member of the hsp90 family of heat shock proteins; Bm14 elicits an antibody response in hosts infected with Brugia
 +
    malayi and the presence of host antibodies may be used for detection of infection.
  
 
===Scripted Concise Descriptions===
 
===Scripted Concise Descriptions===

Revision as of 21:36, 18 January 2013

Concise Description Curation

citace upload scripts

Papers

Note that the papers.ace file is large and it takes ~30 minutes or more to dump the file.

papers cronjob is on the acedb account : 0 2 * * thu /home/postgres/work/citace_upload/papers/wrapper.pl

It creates a file at :

 /home/postgres/work/citace_upload/papers/out/papers.ace.<date>

and symlinks it to :

 /home/postgres/public_html/cgi-bin/data/papers.ace

so you can see it on the web at :

 http://tazendra.caltech.edu/~postgres/cgi-bin/data/papers.ace

While the cronjob runs every thursday, the wrapper only dumps on days that are 20something or 30something.

If you ever need to run it on a different week, try :

 /home/postgres/work/citace_upload/papers/dumpPapAce.pl > /home/postgres/work/citace_upload/papers/out/papers.ace.<date>
 rm /home/postgres/public_html/cgi-bin/data/papers.ace
 ln -s /home/postgres/work/citace_upload/papers/out/papers.ace.<date> /home/postgres/public_html/cgi-bin/data/papers.ace

and then you can pick it up from spica by ssh-ing into it, cd to the directory, remove the existing file, and :

 wget "http://tazendra.caltech.edu/~postgres/cgi-bin/data/papers.ace"


Concise descriptions

concise cronjob is on the acedb account : 0 2 * * thu /home/acedb/kimberly/citace_upload/concise/wrapper.pl

It creates a file at :

 /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace

which you can see on the web at :

 http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace

If you need to run it manually, on tazendra run :

 /home/acedb/kimberly/citace_upload/concise/wrapper.pl

Then on spica, go to the directory, remove the file, and :

 wget "http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace"

Concise Descriptions for Genes Involved in Core Biological Processes

We can also annotate genes that are related by virtue of being involved in core biological processes, e.g. transcriptional regulation, cell cycle, etc. There are some annotations for individual genes involved in these processes, but the processes themselves were never annotated in a systematic fashion, so it will be good to fill in the remaining gaps.

Also, although these genes are involved in core processes, some of these genes have not been extensively studied in C. elegans, so there aren't many associated papers to read. However, by virtue of sequence conservation their function can be annotated in elegans, and subsequent annotations will be useful for propagation to other nematode genomes.

WormBook chapters will be a good source of gene lists for these annotations; links to select chapters that contain lists of relevant genes are listed below:

Translation:Translation Mechanisms: [1]

Transcription: Transcription Mechanisms: [2]

Transcription Factors: Transcriptional Regulation [3] - Kimberly

Kinases: Protein Kinases [4]

Genes with Recent Publications and No Concise Description

The list below contains genes for which a paper has recently been published but for which we do not yet have a concise description. The paper in parentheses after the gene name is the recent paper that prompted adding the gene to the list, but may not be the only relevant publication, so we may need to check other papers for writing a complete description.

If you'd like to work on one of these genes, feel free to check it out on the concise description check-out form and remove it from the list here.

http://tazendra.caltech.edu/~postgres/cgi-bin/concise_description_checkout.cgi

Please also feel free to add genes to the list if you come across them in the context of other curation, but can't write a concise description right away. Someone else may be able to pick up that gene and write the description in the mean time.

  • snap-29:WBPaper00038394
  • aipl-1 WBPaper00038403
  • unc-78 WBPaper00038403
  • acbp (1-7) WBPaper00038378
  • adt-2 WBPaper00038069
  • bed-3 WBPaper00035599
  • bet-1 WBPaper00036007
  • ceh-51 WBPaper00034727
  • cir-1 and mog-3 WBPaper00036278
  • dex-1 and dyf-7 WBPaper00033050
  • drag-1 WBPaper00036372
  • egrh-1 WBPaper00037082
  • enu-3 WBPaper00038105
  • dmd-3 WBPaper00031953 WBPaper00038232 fln-1 WBPaper00037051
  • fmi-1 WBPaper00037648
  • let-765 WBPaper00036029
  • magi-1WBPaper00034656, WBPaper00037755, WBPaper00032939
  • moag-4 WBPaper00037072
  • mog-2 WBPaper00001883
  • pat-12 WBPaper00037850
  • pph-6 WBPaper00035559
  • pptr-1 WBPaper00032946
  • pxn-2 WBPaper00037647
  • sam-10 WBPaper00037849
  • spon-1 WBPaper00032007
  • tbx-35 WBPaper00027745, WBPaper00034727
  • vab-23 WBPaper00035263
  • vang-1 WBPaper00035600
  • isy-27 WBPaper00038390
  • pix-1 (see WBPaper00038193)
  • git-1 (see WBPaper00038193)
  • sepa-1 (see WBPaper00033110)
  • maco-1 (see WBPaper00038258)
  • tfg-1 (see WBPaper00038310)
  • sec-13 and other sec- secretion pathway genes (see WBPaper00038310)
  • gcy-28 (see WBPaper00038243)
  • pptr-1 (see WBPaper00032946)
  • alr-1 (see WBPaper00038207)
  • crtc-1 (see WBPaper00038172)
  • snf-12 (see WBPaper00038424)
  • lin-54 (see WBPaper00028753)
  • lin-53 (see WBPaper00028573)
  • cnt-2 (see WBPaper00038432)
  • orai-1 (see WBPaper00028994)
  • arp2/3 complex (arx 1-7), (WBPaper00005843, WBPaper00039779, WBPaper00035433, WBPaper00005843, WBPaper00035279)
  • phy-1 (dpy-18), phy-2, phy-3, phy-4 (WBPaper00031524, WBPaper00039988, WBPaper00004203)
  • SPR-5 (WBPaper00033101, WBPaper00005525, WBPaper00005564)
  • mpk-1 (WBPaper00031318, WBPaper00003177, WBPaper00028788, WBPaper00001869, WBPaper00001870, WBPaper00038448)
  • emb-1 (WBGene00039858)
  • sas-5 (WBPaper00024227, WBPaper00038324)
  • puf-8 (WBPaper00040262, WBPaper00028346)
  • sas-6 (WBPaper00024914, WBPaper00035594) not yet uploaded
  • let-504 (WBPaper00040589) not yet uploaded
  • lpr-1 (WBPaper00032973) not yet uploaded

COMPLETED

  • mid-1 - can't find in database
  • zyg-1 - written 2005, needs update?
  • mes-2 - written 2006, needs update?
  • utx-1 - written, 2007, needs update?
  • daf-36 - written 2007, needs update?
  • klf-3 - written 2009, needs update?
  • ver-1, ver-2, ver-3, and ver-4 -- see WBPaper00005430 to start, check Textpresso for additional relevant papers
  • xnd-1 - see WBPaper00037683
  • ric-7 - see WBPaper00040694 (also brand new)

not yet completed

Keeping Gene Names Up-to-Date

Concise descriptions typically begin with the name of the gene, either its CGC name, e.g. unc-7, or its sequence name, e.g. Y24F12A.2. Sometimes the CGC name changes, but more frequently, the gene with a sequence name acquires a CGC name due to more intensive study or characterization.

  • To keep the names up-to-date in the concise descriptions, there are two emails to check:
    • Check each email from the webserver@sanger.ac.uk and pay particular attention to the emails with subject heading NAMEDB: CGC added to WBGenennnnnnnn. These are typically the cases where a CGC name is added to a gene that previously was known only by its sequence name. If we've written a concise description for this gene using the sequence name, then I update the description to now use the CGC name using the concise_description_new_cgi.
    • Mary Ann Tuli and Jonathan Hodgkin send an email, on which they cc Cecilia Nakamura and me, confirming all of the updated persons, labs, and gene names as recorded by the CGC. In the body of the email is a list of new gene names and assignments that I also check. Note that there is some redundancy here: any new gene name mentioned in the updates email should also be added to the name server and thus come through as a NAMEDB email. In my experience, though, a little redundancy is not always a bad thing and helps to keep things from falling through the cracks. As above, I make any necessary gene name changes using the concise_description_new_cgi.

Human disease relevance tag

C. elegans has been established as the simplest animal model available for human disease. We have made orthologs of human disease genes, a priority list for curation. In the past, information about the relevance to human disease was to a large extent based on sequence comparison of the related human gene and it's elegans ortholog. This resulted in a scripted description for the elegans gene, that mentioned the human ortholog, the disease caused when the human gene is mutated, and the related OMIM identifiers. We have now begun a more detailed updating of these descriptions, and the writing of new ones, based on the published literature, where elegans is used to model the biology of a disease and/or the orthologous elegans gene function is studied in detail. We have developed a method to identify and flag these papers using a Textpresso (www.textpresso.org) keyword and category based search.

A new database tag 'Human disease gene relevance' was introduced so that we can highlight and make this data available for use by both the elegans as well as the non-elegans biomedical researcher.

This tag will be used to:

--hold OMIM data that is currently in concise descriptions

--highlight how studies in elegans has contributed to understanding of human disease


Database mappings between OMIM accession numbers and OMIM web pages already exist in WormBase. Otherwise for every accession we would have to include the URL_constructor? So for this accession number it would be "http://omim.org/entry/605248"

Paul Davis: Yes you are right the OMIM database object is populated in the build....not very well(contains the old ncbi data), but we can address this issue. As far as the postgress to .ace dump, what you have in the example is fine, you don't need anything else. The url constructor is contained in the ?Database object so we don't need any additional .ace from your end. The only real question we have to resolve is that omim.org use a ID system where they have #*^. characters prefixing their very unique string IDs. The special characters aren't any good for us and their web team don't even use them for their website for this exact reason.

As the Accession/ID is just a numerical string this causes us a problem as there are other databases out there who also adopt this nomenclature and so you get collision between the data sets (in acedb). We would like to propose adding OMIM: to the ID/accession but this messes with the url constructor.

Todd, could you handle this at the website end to convert OMIM:605248 -> http://omim.org/entry/605248 (need to confirm this around the time data is uploaded).

There are lots of other ways that OMIM and Human disease could be more fully populated in WormBase (Disease classes etc.) but for your specific curation task, I think this is what we need to do.

Building an Ontology Annotator interface for concise description curation

  • Curation starts by querying for a gene, wbgene is an autocomplete (not a dropdown), the term info is populated from the gene info from the gin_ tables, not an OBO file.
  • Querying for a gene brings up data from postgres or tells you there are no matches. To make a new annotation, click the New button to generate curator, date, pgid, and curator history.

Order and description of fields for the Concise Description OA

(name of postgres table in italics):

  • Field 1: WBGene con_wbgene: This is an ontology with finite values, has term information
  • Field 2: Curator con_curator: This is a drop-down, can have only a single value for parsing old data, this may need to be able to have multiple values, see testing below, has no term information. Since these have linited values, they are hard-coded into postgres. A person must be in the curator list in order for their name to show here. Cecilia created Unknown Curator WBPerson13481 a WBPerson for unknown curator, so we can still populate this field when we don't know who the original curator was.
  • Field 3: Curator History con_curhistory: This is an ontology, does not have its own obo tables, uses the history table for the curator field (con_curator_hst)in postgres, has term information, stores the PGID of the annotation. Allows editing, but should not be edited, because then you would change the value and if someone else were to click on term information they would see a different value.
  • Field 4: Description_type con_desctype : This is a dropdown, has limited values, so hard-coded in postgres, has no term information, values are:
    • Concise_description
    • Human_disease_relevance
    • Provisional_description
    • Sequence_features
    • Functional_pathway
    • Functional_physical_interaction
    • Biological_process
    • Molecular_function
    • Expression
    • Other_description
  • Field 5: Description Text con_desctext : It's always a textarea (expanded bigtext box) and never shrinks to input html element type.

Technical : Bigtext fields have an input field and a textarea field,toggling between each other when focusing the input field. The OA used to update values only when the input field blurred (or a toggle was clicked), now it also updates when a textarea type field has its textarea blurred. Just the way the code was organized, it was easeir to make a new type of field 'textarea' than my first attempt to make the 'bigtext' type have a 'noresize' toggle, from the way that bigtext fields did not have a blurring listener on the input. In retrospect, maybe it could have worked the other way, but it seems more complicated right now. If we ever rewrite the code again (I hope not) we can switch it to work the other way.

  • Field 6: Paper con_paper: This is an multi-ontology, has limited values, works off the paper tables in postgres, has term information.
  • Field 7: Accession Evidence con_accession: This is a text field, values are separated by commas.
  • Field 8: Comment con_comment: bigtext -- This is a place to store internal comments.
  • Field 9: Last Updated con_lastupdate: This is a text field, when 'New' is clicked, it is auto-populated, date is truncated at seconds, can be manually edited. Note: The data here reflects whatever was in the 'Last verifed' table which matches the 'Update Last_Verified Timestamp on the concise description curation form. If the curator forgot to check this on the old form then one might see an older date even when a later date exists in curator history on the OA.
  • Field 10: PGID no table: postgres Id of the annotation row, cannot be edited.
  • Field 11: No dump con_nodump: This is a toggle field, when clicked on, it turns bright red, indicating that this annotation row will not be dumped in the .ace file for upload.
  • Field 12: WBPerson_evidence con_person : This is a multiontology, limited values, comes from Person tables in postgres, has term information (Person Name).
  • Field 13: Expr_pattern_evidence con_exprtext : text field populated using the format Exprnnnn where n is a number; these values correspond to the WormBase ID of the Expr_pattern object; individual entries are comma-separated.
  • Field 14: RNAi_evidence con_rnai : text field populated using the format WBRNAinnnnnnnn where n is a number; these values correspond to the WormBase ID of the RNAi experiment; individual entries are comma-separated
  • Field 15: Gene_regulation_evidence con_genereg - text field populated using the name of the gene regulation object in WormBase; individual entries are comma-separated
  • Field 16: Microarray_results_evidence con_microarray - text field populated using the name of the microarray results object in WormBase; individual entries are comma-separated

Mapping Other Evidences to Fields in the OA

  • Accession Evidence = any entries WP:CEnnnnn or CEnnnnn where n is a number. Also: UniProt:P11586
  • WBPerson_evidence = any evidence preceded by WBPerson
  • Expr_pattern_evidence = any evidence in the form Exprnnnn where n is a number
  • RNAi_evidence = any evidence in the form WBRNAinnnnnnnn where n is a number
  • Gene_regulation_evidence = cgc6432_F47G4.3
  • Microarray_results_evidence = SMD_K07E3.3

Check Data Constraints

Each pgid must have:

  • WBGene
  • Curator
  • Description_type
  • Description_text
  • Last Updated

.ace Dumper

Dumper currently at : /home/postgres/work/citace_upload/concise/dump_concise.pl


Tags are being dumped alphabetically instead of Concise_description, then Provisional_description, then Human_disease_relevance, then Sequence_features, etc. Here is a suggested specific order for dumping all of the tags in the Structured_description group:

The new order (this would be consistent with the order in which we dump them currently, plus the addition of the Human_disease_relevance tag):done, will email code changes -- J

Concise_description

Human_disease_relevance

Provisional_description

Sequence_features

Functional_pathway

Functional_physical_interaction

Biological_process

Molecular_function

Expression

Other_description

  • Check output of Date_last_updated in Evidence - WormBase model uses DateType, but .ace output includes time of day. -kimberly oooh, I see, the data was parsed in with times, so I should reparse this just down to the date only ? -- J
    • For the CD OA, I am fine with still having the timestamp, e.g. 2011-08-31 16:26:30, in the Last Updated field of the editor. But since the Date_last_updated evidence needs to be DateType in WB/ACeDB, we should just use the format, e.g. 2011-08-31, for dumping. --kimberly done, will email about changes to code -- J
  • September, 2011-Changes to the dumper with regard to the Human_disease_relevance descriptions:

We will continue to use Accession_evidence to hold OMIM IDs, but at the suggestion of Paul D, curators will add a prefix to important database accessions that they want an external link for (taken from previous proposal solution) eg. OMIM:605248 could be placed in the #Evidence hash under the Accession_evidence Tag.

Juancarlos would then have to introduce an addition to his postgress2ace parser whereby if the above formatting is seen in the Accession_evidence field his code would also pull this data out into the Database line as discussed previously.

So the .ace file in addition to the Accession_evidence would have the Database information, eg., taking cup-5 as an example:

Gene : "WBGene00000846"

Human_disease_relevance "The cup-5 gene encodes an ortholog of the human mucolipin 1 gene which is mutated in Mucolipidosis type IV; human mucolipin 1 functions as a pH-modulated non-selective cation channel."

Database "OMIM" Accession_number "605248"

Changes required to the .ace dumper, Nov.17th, 2011, following M. Paulini's instructions:

When the Human_disease_relevance description has Accession_evidence field of the OA filled in as 'OMIM:605248' the dumper should dump this extra line:

Database "OMIM" "Accession_number" "OMIM:605248"

Also the accession_evidence line should be dumped as: Accession_evidence "OMIM" "OMIM:605248"

Example: So cup-5 should look like:

Gene : "WBGene00000846"

Database "OMIM" "Accession_number" "OMIM:605248"

Provisional_description "The cup-5 gene encodes an ortholog of the human mucolipin 1 gene; cup-5 is required for viability, endo-lysosomal transport and the normal degradation of lysosomes; cup-5 mutants also show increased cell death, which might be a secondary consequence of lysosomal dysfunction; the inactivation of mrp-4 which is a ABCC transporter rescues most of the defects in cup-5 mutants, suggesting that the accumulation of ABC transporters in the absence of cup-5, may contribute to the lysosomal defects; cup-5 localizes to lysosomes." Accession_evidence "OMIM" "OMIM:605248"

Documentation for the .ace dumper:

In a nutshell, the dumper reads data from the postgres tables (history instead of data table for curator)

Has mappings of tables to evidence tags :

$eviToTag{paper} = 'Paper_evidence';

$eviToTag{person} = 'Person_evidence';

$eviToTag{accession} = 'Accession_evidence';

$eviToTag{exprtext} = 'Expr_pattern_evidence';

$eviToTag{rnai} = 'RNAi_evidence';

$eviToTag{genereg} = 'Gene_regulation_evidence';

$eviToTag{microarray} = 'Microarray_results_evidence';

$eviToTag{lastupdate} = 'Date_last_updated';

$eviToTag{curator} = 'Curator_confirmed';


- Gets all joinkeys for each wbgene

- Skips entries with no dump OR without description type OR without description text.

- Replaces line breaks with spaces

- Splits evidence on comma, stripping leading and trailing doublequotes and spaces

- Creates a line of tag<tab>""

- Skips if tag is Provisional_description I'm not sure what this part means. --kimberly

- If tag is Concise_description makes it Provisional_description

- For each evidence it creates a line of tag<tab>""<evitag>"<evi>"

Detailed documentation of .ace dumper:

Location of script on Tazendra:

/home/postgres/work/citace_upload/concise/dump_concise.pl


Location of script on the sand box(Curators cannot run script from this directory, see below):

/home/postgres/work/pgpopulation/concise_description/20110722_newOA/dump_concise.pl

Syntax to run the dumper:

./dump_concise.pl > concise.ace

For testing the dumper, go to the following directory on the sand-box:

/home/acedb/ranjana/concise_testing, where it can be run by curators for testing purposes only.

 1 #!/usr/bin/perl -w
 2 
 3 # dump .ace file for concise description based on con_ tables.  2011 07 27
 4 
 5 use strict;
 6 use diagnostics;
 7 use DBI; # module for connecting to postgres
 8 use Encode qw( from_to is_utf8 ); #for converting weird characters to normal UTF-8
 9 
10 my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n"; #connecting to testdb database - 
   testdb is the name of the postgres working database 
11 
12 my %data; # defines data hash to read in all the data into (or from?) postgres; this is faster than querying each data table one-
   by-one
13 
14 my @tables = qw( wbgene desctype desctext paper accession lastupdate nodump person exprtext rnai genereg microarray );
   # array, list of all the relevant tables in postgres, qw is the quote word function in Perl
15 foreach my $table (@tables) { # for each of the tables, 
16   my $result = $dbh->prepare( "SELECT * FROM con_$table" ); # prepares the query for each of the tables
17   $result->execute() or die "Cannot prepare statement: $DBI::errstr\n"; # executes the above query 
18   while (my @row = $result->fetchrow) { $data{$table}{$row[0]} = $row[1]; } # stores data into data hash by table and postgres ID
19 } # foreach my $table (@tables)
20 
21 my $result = $dbh->prepare( "SELECT * FROM con_curator_hst" ); # query the history table for the curator field; this will allow 
   us to append all curators who worked on the annotation as evidence
22 $result->execute() or die "Cannot prepare statement: $DBI::errstr\n"; # executes the above query
23 while (my @row = $result->fetchrow) { $data{curator}{$row[0]}{$row[1]}++; } # fetch the table, postgres ID and each of the  
   curators
24 
25 my @evi_types = qw( paper person accession exprtext rnai genereg microarray lastupdate );
26 my %eviToTag; # defining the mappings of evidence to tag hash
27 $eviToTag{paper}        = 'Paper_evidence';
28 $eviToTag{person}       = 'Person_evidence';
29 $eviToTag{accession}    = 'Accession_evidence';
30 $eviToTag{exprtext}     = 'Expr_pattern_evidence';
31 $eviToTag{rnai}         = 'RNAi_evidence';
32 $eviToTag{genereg}      = 'Gene_regulation_evidence';
33 $eviToTag{microarray}   = 'Microarray_results_evidence';
34 $eviToTag{lastupdate}   = 'Date_last_updated';
35 $eviToTag{curator}      = 'Curator_confirmed';
36


37 my %ace; #this creates an ace hash that allows us to dump all the data as a single unit/paragraph instead of multiple lines
38 foreach my $joinkey (sort keys %{ $data{wbgene} }){ # using WBGene we are going to get all the data,using PGids
39   next if ($data{nodump}{$joinkey});                    # skip no dump
40   next unless ($data{desctype}{$joinkey});              # skip without type
41   next unless ($data{desctext}{$joinkey});              # skip without text
42   my $text    = $data{desctext}{$joinkey};              # storing the description text into text variable
43   if ($text =~ m/\n/) { $text =~ s/\n/ /g; }            # replace newlines (line breaks) with spaces
44   my $tag     = $data{desctype}{$joinkey};              #
45   my $wbgene  = $data{wbgene}{$joinkey};                # getting description type into the variable tag and wbgene into the      
     variable wbgene
46   my %evi;                                              # store all the evidences
47   foreach my $evi_table (@evi_types) {                  # go through all of the evidence types in line 25       
48     next unless ($data{$evi_table}{$joinkey});          # Skip if no data in evidences
49     my (@data) = split/,/, $data{$evi_table}{$joinkey}; # separate on commas and put in an array called data,
50     foreach my $data (@data) {                          # look at each of the values in the list
51       if ($data =~ m/^\"/) { $data =~ s/^\"//; }        # get rid of the beginning and ending double quotes around the values
52       if ($data =~ m/\"$/) { $data =~ s/\"$//; }
53       if ($data =~ m/^\s+/) { $data =~ s/^\s+//; }      # get rid of multiple spaces 
54       if ($data =~ m/\s+$/) { $data =~ s/\s+$//; }      # replace newlines (line breaks) with spaces
55       if ($data =~ m/\n/) { $data =~ s/\n/ /g; }        # storing the data in the evi hash, stripped of all the above
56       if ($data) { $evi{$evi_table}{$data}++; }
57     } # foreach my $data (@data)                        
58   } # foreach my $evi_table (@evi_types)
59   foreach my $curator (sort keys %{ $data{curator}{$joinkey} }) { $evi{curator}{$curator}++; } # stores the curator data in the 
     evi hash
60


61   if ( ($tag eq 'Concise_description') || ($tag eq 'Provisional_description') ) { $ace{$wbgene}{"$tag\t\"$text\""}++; } 
     # if the line is Concise_description or Provisional_description, only for these two tags make lines without evidences
62   next if ($tag eq 'Provisional_description');          
     # if the tag is Provisional_description (second+extra sentences), then dump only tag, no evidence; Note that second+extra 
     sentences was a box on the old CD form that held additional information that was not referenced.  It'd be good to move this 
     information to a more informative tag and add a reference wherever applicable, but this will take quite a bit of work.
63   if ($tag eq 'Concise_description') { $tag = 'Provisional_description'; }      
     # Concise data have evidence under Provisional_description
64   foreach my $eviType (sort keys %evi) { # looping through each of the evidences
65     my $subtag = $eviToTag{$eviType};    
     # make an ace sub-tag variable like Paper_evidence etc by using the evi to tag mapping earlier on
66     foreach my $evi (sort keys %{ $evi{$eviType} }) { # loop through each of the evidences in lines 
67       $ace{$wbgene}{"$tag\t\"$text\"\t$subtag\t\"$evi\""}++; # put the evidences into the ace hash, i.e adding data to the ace 
     hash
68     } # foreach my $evi (sort keys %{ $evi{$eviType} }) 
69   } # foreach my $eviType (sort keys %evi)
70 } # foreach my $joinkey (sort { $data{wbgene}{$a} <=> $data{wbgene}{$b} } keys %{ $data{wbgene} })
71 
72 foreach my $wbgene (sort keys %ace) { # loop thrugh each WBGene to get the data
73   print "Gene : \"$wbgene\"\n";# print the gene object header
74   foreach my $line (sort keys %{ $ace{$wbgene} }) { print "$line\n"; } 
# loop through all of the lines, 61-63,  by wbgene and print the  lines 
75   print "\n"; # print the lines
76 } # foreach my $wbgene (sort keys %ace)
77 
78 


79 # size of bigtext
80 # # wbgene00000001 has provisional stuff that was overwritten, erase NULL in parsing
81 # # nodump 
82 # # get no_curator curator into OA
83 # # put no_curator curators into person_evidence
84 # # lastupdate for non-concise/humandisease based on latest timestamp from desctext in car_ tables
85 
86 # # constraints
87 # # wbgene, curator, desctype, desctext, lastupdate
88 
89 
90 # try to group by wbgene
91 # skip no dump rows
92 # concise      to Concise_description       without evidence
93 # concise      to Provisional_description   with    evidence
94 # provisional  to Provisional_description   without evidence
95 # humandisease to Human_disease_relevance   with    evidence
96 # others too   to whatever                  with    evidence
97 
98 
99 __END__
100 
101 


102 my $result = $dbh->prepare( "SELECT * FROM two_comment WHERE two_comment ~ ?" );
103 $result->execute() or die "Cannot prepare statement: $DBI::errstr\n"; 
104 while (my @row = $result->fetchrow) {
105   if ($row[0]) { 
106     $row[0] =~ s/^M//g;
107     $row[1] =~ s/^M//g;
108     $row[2] =~ s/^M//g;
109     print "$row[0]\t$row[1]\t$row[2]\n";
110   } # if ($row[0])
111 } # while (@row = $result->fetchrow)
112 
113 __END__
114


115 my $result = $dbh->prepare( 'SELECT * FROM two_comment WHERE two_comment ~ ?' );
116 $result->execute('elegans') or die "Cannot prepare statement: $DBI::errstr\n"; 
117 
118 $result->execute("doesn't") or die "Cannot prepare statement: $DBI::errstr\n"; 
119 my $var = "doesn't";
120 $result->execute($var) or die "Cannot prepare statement: $DBI::errstr\n"; 
121 
122 my $data = 'data';
123 unless (is_utf8($data)) { from_to($data, "iso-8859-1", "utf8"); }
124 
125 my $result = $dbh->do( "DELETE FROM friend WHERE firstname = 'bl\"ah'" );
126 (also do for INSERT and UPDATE if don't have a variable to interpolate with ? )
127 
128 can cache prepared SELECTs with $dbh->prepare_cached( &c. );
129 
130 if ($result->rows == 0) { print "No names matched.\n\n"; }      # test if no return
131 
132 $result->finish;        # allow reinitializing of statement handle (done with query)
133 $dbh->disconnect;       # disconnect from DB
134 
135 http://209.85.173.132/search?q=cache:5CFTbTlhBGMJ:www.perl.com/pub/1999/10/DBI.html+dbi+prepare+execute&cd=4&hl=en&ct=clnk&gl=us
136 
137 interval stuff : 
138 SELECT * FROM afp_passwd WHERE joinkey NOT IN (SELECT joinkey FROM afp_lasttouched) AND joinkey NOT IN (SELECT joinkey FROM cfp_curator)   
AND afp_timestamp < CURRENT_TIMESTAMP - interval '21 days' AND afp_timestamp > CURRENT_TIMESTAMP - interval '28 days';
139 
140 casting stuff to substring on other types :
141 SELECT * FROM afp_passwd WHERE CAST (afp_timestamp AS TEXT) ~ '2009-05-14';
142 
143 to concatenate string to query result :
144   SELECT 'WBPaper' || joinkey FROM pap_identifier WHERE pap_identifier ~ 'pmid';
145 to get :
146   SELECT DISTINCT(gop_paper_evidence) FROM gop_paper_evidence WHERE gop_paper_evidence NOT IN (SELECT 'WBPaper' || joinkey FROM 
pap_identifier WHERE pap_identifier ~ 'pmid') AND gop_paper_evidence != ;
147

Documenting Curator History

  • Curator history showing all curators?

When there are multiple curators attached to an annotation, are they all still associated with it? Check WBGene00000035 for an example. In the old form, both Carol and Tuco are associated (same time stamp), but in the new form I only see Tuco. Alternatively, should the Curator field allow for more than one value in the concise OA? I see. Well, we could try to put both curators, but it would be random which one would be the "last" curator, which would be the only one that would show in the curator field, while both would show in the (current, hopefully will be replaced) curator history 'ontology' field, and the .ace file. We could make it a multivalue curator, I'm still not sure of all the ramifications of that, but we could. One partial problem with that is we would then have all the curators in two fields, which would be temping to get rid of the other field, but the other field is necessary to keep track of all the timestamps. I'm not really sure which way would be best as far as this field.

I think the main issue for me is being able to query the form using a curator name and be confident that I'm really getting all of the entries attached to that curator. If the curator field was a multivalue field, then when there is more than one curator with the same time stamp attached, we could query that field with a curator name, and know we were getting all of their associated entries. --Kimberly

  • Is Curator history queryable by name?

No, all it holds is a pgid. If you want this queryable this should be a text field that can be autopopulated when a new field gets created, but has to be manually edited to say whatever you want on future edits. This would really be better in making the OA more intuitive, you can see the data in the dataTable, and it would be queryable. Unless you edit a lot of entries, it's probably the best solution, just think of it as another Last_Updated field.

I think the main issue here is that we don't want to accidentally change or somehow muck up the curator_history table. If curator could be multivalue and I could query the form for curator using that field, then I'm fine with leaving the curator_history table display as it is in the current version of the form, i.e. a pgid. --Kimberly

How to deal with invalid genes

  • For descriptions attached to invalid/dead WBGene IDs, can we still display the WBGene ID in the form?

If the genes exist in postgres (/ the nameserver) the WBGeneIDs will show, right now the wbgenes don't show only if there's no entry in gin_wbgene in postgres.

Still waiting on an answer from Sanger. 08/01/2011 --Kimberly


  • Is there a way to indicate in the form that these annotations are attached to now invalid IDs?

If you mean that the WBGenes are invalid, then no, you'd have to look at each wbgene and look at the term info. The dumper could have some sort of check, but it pretty much already had that before, and I thought those errors were getting ignored anyway.

So we couldn't make a row of annotation attached to an invalid gene, pink or something :-)? This is fine - the dumper was set up to comment out any annotations attached to invalid gene IDs and the list of errors could be used to see which genes might be in need of an update. --Kimberly

Scripts related to concise descriptions

  • For creating the OA tables :

/home/postgres/work/pgpopulation/concise_description/20110722_newOA/create_con_tables.pl

  • For populating data based on old tables:

/home/postgres/work/pgpopulation/concise_description/20110722_newOA/populate_con_tables.pl

  • New dump_concise.pl is at:

/home/postgres/work/citace_upload/concise/dump_concise.pl

  • Changed wrapper.pl to use new dumper :

/home/postgres/work/citace_upload/concise/wrapper.pl

  • Symlinked wrapper to acedb account :

/home/acedb/ranjana/citace_upload/concise/wrapper.pl

Files still go to the same place.

Changed wrapper to look at any evidence that matches OMIM:[0-9]+ and put the numbers in the .ace file as Database "OMIM" Accession_number "<numbers>"

Template-based curation and changing the OA interface

Features of the template:

  • Provides pre-composed sentences.
  • Based on language most-often used to write a concise description.
  • Pre-composed sentence is incomplete and has blanks to fill in key facts about the gene.
  • Curators to add additional sentences, additional sentences allowed only at the bottom.
  • Will have fields for separate paper references to each sentence (fact), useful internally for

information-tracking, updating, etc (not mandatory)

  • Script will concatenate sentences to build a concise description, based on some rules.
  • Final concise description can be viewed in Term information panel?
  • Hand-editing of the resulting concise description is not allowed.


Advantages:

  • Key facts that need to be captured are pre-determined, act as cues.
  • Has more structure
  • Sentences about gene facts stored in Postgres may be useful for other purposes
  • Ability to attach separate paper references to each sentence (fact), useful internally for

information-tracking, updating, etc (not mandatory)

  • Easier updating


Future features: Ability to display/incorporate results from Textpresso-based text mining and other sources.

Key facts captured in a concise description:

  • Molecular identity
  • Orthology
  • Process/Pathway
  • Genetic interactions
  • Physical interactions
  • Function/Molecular Activity
  • Tissue Expression
  • Sub cellular localization

Working with the template:

Curation starts by picking a gene (as done now in the OA). Concise description template may look like:


Molecular identity
<Gene> encodes .....;

Orthology/Similarity
<Gene> is (drop down with values:orthologous,similar) to .....;

Process/Pathway
<Gene> is (drop-down with values:
required,functions,regulates,is involved in,is part of) .....;
(Sentence needs to be cloned if required)

Genetic interaction with respect to Process or Pathway
<Gene> interacts genetically with (drop-down with ontology:WBGenes) with 
 respect to ........;
 
Physical interaction
<Protein> physically interacts (drop-down with values: invitro, invivo) with 
(drop-down with proteins);

Molecular Function
<Protein> has (drop down with ontology:GO molecular function) or ………  activity;

Tissue Expression
<Gene/Protein> is expressed in …………and expression in ......is (drop-down 
with values: positively, negatively) regulated by .......;

Cellular Component/sub-cellular localization 
<Protein> is localized to (drop-down with ontology:GO cellular component) 
or .......and expression in .....is (drop-down with values:
positively, negatively) regulated by (drop-down with proteins)


Using the template: Example 1, clp-3
clp-3 encodes an atypical  calpain, which is a calcium regulated thiol protease.
clp-3 is orthologous to the human calpains, that include CAPN1, CAPN11 and CAPN12.
<Gene> is involved in …………
<Gene> functions to regulate………….
<Gene> interacts genetically with ………
<Protein> interacts physically with <other protein> in ……..assay.
<Protein> has been demonstrated to have ………  activity with <target >
<Gene> is expressed in ………….	(tissue-level expression)
<Gene> expression in ………….is regulated by……………
<Gene> is localized to …………..	(sub-cellular localization)
<Gene> localization  to…………is regulated by……………
This sentence added by curator (not precomposed):
Reducing the function of clp-3 via RNAi did not yield any obvious phenotypes, 
possibly due to the    functional redundancy among clp genes in C. elegans.

Script builds:

clp-3 encodes an atypical  calpain, which is a calcium regulated thiol protease; 
clp-3 is orthologous to the human calpains, that include CAPN1, CAPN11 and CAPN12; 
reducing the  function of clp-3 via RNAi did not yield any obvious phenotypes, possibly 
due to functional redundancy among clp genes in C. elegans.


Using the template: Example 2, faah-1
faah-1 encodes a putative fatty acid amide hydrolase enzyme.
faah-1 is orthologous to human fatty acid amide hydrolase (FAAH).
faah-1 is involved in regulating N-acylethanolamines (NAEs), which are 
lipid-derived signalling molecules, reduced levels of which, under dietary 
restriction, result in increased lifespan.
faah-1 functions most likely to hydrolyse N-acylethanolamines (NAEs), as worms 
overexpressing   faah-1  had reduced levels of NAEs like eicosapentaenoyl ethanolamide (EPEA) 
and arachidonoyl  ethanolamide (AEA).
faah-1 is also required for normal growth and development.
<Gene> interacts genetically with ………
<Protein> interacts physically with <other protein> in ……..assay.
<Protein> has been demonstrated to have ………  activity with <target >
<Gene> is expressed in …………	(tissue-level expression)
<Gene> expression in …………is regulated by……………
<Gene> is localized to …………..	(sub-cellular localization)
<Gene> localization  to…………is regulated by……………
Script builds:
faah-1 encodes a putative fatty acid amide hydrolase enzyme and is orthologous to human fatty 
acid amide hydrolase (FAAH); faah-1 is involved in regulating 
N-acylethanolamines (NAEs), which are lipid-derived signalling molecules, reduced 
levels of which, under dietary restriction, result in increased lifespan; faah-1 functions 
to possibly hydrolyse N-acylethanolamines (NAEs) like eicosapentaenoyl ethanolamide (EPEA)  
and  arachidonoyl ethanolamide (AEA), as worms overexpressing faah-1 had 
reduced levels of NAEs; faah-1 is also required for normal growth and development.
Using the template: Example 3, teg-1
teg-1 encodes a protein with similarity to mammalian CD2BP2 and Saccharomyces cerevisiae Lin1p; 
<Gene> is (drop down with values:orthologous,similar) to .....;
teg-1 is involved in regulating the stem cell proliferation versus differentiation decision in the germ line; 
<Gene> interacts genetically with (drop-down with ontology:WBGenes) with respect to ........;
TEG-1 physically interacts in vivo with UAF-1, the C. elegans homolog of the U2AF65 splicing factor; 
<Protein> has been demonstrated to have ………  activity with <target >
TEG-1 is expressed in the germline and somatic cells of the gonad and in the intestine; 
<Protein> is localized to (drop-down with ontology:GO cellular component) 
or .......and expression in .....is (drop-down with values:
positively, negatively) regulated by (drop-down with proteins)
in the gonad, TEG-1 is enriched in the nucleoplasm with fainter levels seen in the cytoplasm of the gonad and intestine;
within the gonad, TEG-1 nucleoplasmic expression is highest in oocytes, the somatic Distal Tip Cell (DTC), and sheath cells.
Script builds:
teg-1 encodes a protein with similarity to mammalian CD2BP2 and Saccharomyces cerevisiae Lin1p; 
teg-1 is involved in regulating the stem cell proliferation versus differentiation decision in the germ line; 
TEG-1 physically interacts in vivo with UAF-1, the C.   elegans homolog of the U2AF65 splicing factor; TEG-1 is 
expressed in the germline and somatic cells of the gonad and in the intestine; in the gonad, TEG-1 is enriched 
in the nucleoplasm with fainter levels seen in the cytoplasm of the gonad and intestine; within the gonad, TEG-1 
nucleoplasmic expression is highest in oocytes, the somatic Distal Tip Cell (DTC), and sheath cells.


Existing description:
elt-3 gene encodes a GATA transcription factor; during embryogenesis, ELT-3 appears to act downstream of ELT-1, also a GATA   
transcription factor, in a redundant pathway controlling hypodermal cell differentiation; ELT-3 is also required for positive  
regulation of transcription of nlp-29, which encodes an antimicrobial peptide, in response to fungal infection and gdph-1 in response 
to salt stress; in addition, ELT-3 may also play a role in regulating adult lifespan; ELT-3 is expressed in all of the major 
hypodermal cells except the lateral seam cells and expression in the hypodermis is positively regulated by ELT-1 and negatively 
regulated by ELT-5 and ELT-6; ELT-3 localizes to the nucleus.

Concise descriptions for other nematode species

Caenorhabditis briggsae

Pristionchus pacificus

Brugia malayi

  • Name
    • Synonyms - how are these being collected?
  • Molecular identity
    • Normal/nematode function
  • Biological process
    • Normal/nematode function
    • Mutant phenotypes
    • Metabolic pathway
  • Expression
    • Life stage
    • Microarray results
  • Interactions
    • Physical interactions - intra- and inter-species
    • Genetic interactions - intra- and inter-species
  • Disease role
    • Antigenicity
    • Effects on disease progression


Each statement will need to be supported by either a reference or sequence analysis (how stringent do we want to be with sequence analysis - require anything from the author or just double-check domains/motifs/homology in WB?)

Perhaps users could select a check box and if they select experimental, they could be required to submit a reference identifier.

Use a form similar to the old CD form where users could enter data into a box and then ask for an additional box, if needed.

A few sample Brugia concise descriptions:

    AsnRS encodes an asparaginyl-tRNA synthetase; in Brugia, AsnRS is predicted to function in protein synthesis; AsnRS physically 
    interacts with mammalian IL8 receptors; when expressed in a murine colitis model, AsnRS demonstrates an anti-inflammatory role by
    inducing resolution of cellular infiltration in the colonic mucosa, a CD8(+) cellular response, and a rise in IL10 production 
    from CD3+, LPS- and CPG-stimulated splenic cells.
    Bm33 encodes a pepsin inhibitor; Bm33 elicits an antibody response in hosts infected with Brugia malayi and the presence of host
    antibodies may be used for detection of infection.
    Bm14 encodes a member of the hsp90 family of heat shock proteins; Bm14 elicits an antibody response in hosts infected with Brugia
    malayi and the presence of host antibodies may be used for detection of infection.

Scripted Concise Descriptions

  • GO terms (most granular?) - if scripted
  • Other resources - UniProt, Genbank - if scripted


Back to Caltech documentation