Revision as of 19:18, 9 June 2011
Expression Pattern
Tags currently used in Expr_pattern objects (based on WS221):
Laboratory Expr_pattern Pattern Life_stage Gene Antibody Subcellular_localization GO_term Western Transgene Protein_description In_Situ Author Anatomy_term Reporter_gene Picture Date Reference Expressed_in Antibody_info Protein Northern Clone Cell RT_PCR Strain Remark MovieURL Pseudogene Curated_by Sequence
Types of fields Juancarlos can implement:
* text : text
* bigtext : text box expanded
* dropdown : few values
* ontology : controlled vocabulary
* multiontology / multidropdown : allows multiple values
* toggle : on / off
OA interface
OA editor label -- postgres table name -- type of table and description. -- J
I will prepare a .ace template for dumping soon, if this is what you mean. E.g. Reference -> exp_paper D
I meant that that's the format of each field below, but adding the .ace tag somewhere in the pattern is probably good -- J
Dumper
Juancarlos, I fixed the 2 objects with the pipe. Remember to change the dumper accordingly.
Tab1
- Pgdbid -- no table -- postgres database ID, generates automatically upon entry.
- Expr_pattern -- Expr_pattern : "exp_name" -- text -- The Expression Pattern ID should be generated when creating a new object: take the highest Expr_pattern ID and increase it by one.
When making a new row, the OA looks at all entries in exp_name that begin with "Expr", captures the numbers, finds the highest number, adds 1 to it, puts 'Expr' in front, and uses that as the new name. So be aware that if you manually enter 'Expr9999998', it will skip to 'Expr9999999' when you click it. -- J
- Reference -- Reference "exp_paper" -- ontology on paper WBPaperID
Daniela: add wish list for term info. Juancarlos, I am still thinking about what I would like to see displayed. Probably not much, but it will be clear later on.
multiontology? I think all other configs only have a single ontology for Paper -- J
single ontology is fine D
- Gene -- Gene "exp_gene" -- ontology on genes WBGeneID - show WBID, locus, and synonym in term info as in GO OA
- Anatomy -- Anatomy_term "exp_anatomy" exp_qualifier "exp_qualifiertext" -- multiontology. Daniela will associate different Anatomy-qualifier-qualifier_text in different OA rows, so some Expr objects will have multiple rows / multiple pgids. When querying by any of these fields, if editing a different field, the curator should query by Expr to make sure all pgids for that object have that other field edited.
- Qualifier -- exp_qualifier -- dropdown -- Certain / Uncertain / Partial
- Qualifier Text -- exp_qualifiertext -- bigtext
- GO_term -- GO_term "exp_goid" -- multiontology of GO_Term like gop_goid.
- Subcellular_localization -- Subcellular_localization "exp_subcellloc"-- bigtext, details on subcellular localization.
- Life_stage -- Life_stage "exp_lifestage" Convert the life stage IDs into names from the obo_name_lifestage -- multiontology like in the phenotype OA and picture OA
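The Expr_pattern naming rule described in this tab (find the highest existing 'Expr' number and add one) could be sketched roughly as follows. This is only an illustration, not the OA's actual code: the helper name next_expr_name is hypothetical, and the names are passed in as a plain list instead of being read from exp_name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the OA): given the existing exp_name values,
# return the next name, i.e. 'Expr' followed by the highest number found plus one.
sub next_expr_name {
    my @names = @_;
    my $highest = 0;
    foreach my $name (@names) {
        next unless $name =~ m/^Expr(\d+)/;     # only names that begin with Expr and a number
        $highest = $1 if $1 > $highest;         # numeric comparison, so Expr10 beats Expr9
    }
    return 'Expr' . ($highest + 1);
}

# A manually entered Expr9999998 pushes the next generated name to Expr9999999.
print next_expr_name('Expr12', 'Expr9999998', 'Expr300'), "\n";
```

Names that do not start with 'Expr' are simply ignored in this sketch; whether the OA treats Marker objects the same way is not stated here.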
Juancarlos parsed the .ace dump from WS226:
- 5518 anatomy_term lines without a #Qualifier at all, in expr_no_qualifier
- 2703 anatomy_term lines with #qualifier and extra text, in expr_data_with_extra_anatomy
- expr_data_with_extra_anatomy_categorized: 796 unique text-expr linked to various anat_terms in expr_data_with_extra_anatomy. For example, look at "Expressed iin ventral male specific muscles.", which has a unique Expr linked to multiple anat_terms; or "1 neuron", linked to multiple different expr / anat_term
Tab2
- Type -- exp_exprtype -- multidropdown, select from: Antibody, Reporter_gene, In_situ, RT_PCR, Northern, Western, but this is not possible because we have text associated with those values.
For J: it would be ideal to have a dropdown and, once we choose from the dropdown, a text box associated with it.
Daniela: when adding text in the Antibody_text field, also click Antibody in the multidropdown.
When populating this field from the .ace file, always add the type, whether or not there is text.
- Antibody_Text -- Antibody "exp_antibodytext" -- bigtext. This tag was used 462 times and has text associated -> not possible just to toggle
Call this Antibody Text or Antibody Type or Antibody Method so that the antibody objects below can be just 'Antibody'? -- J
good idea. Antibody text is fine D
- Reporter_gene_Text -- Reporter_gene "exp_reportergene" -- bigtext. This tag was used 7273 times and has been used twice for the same object! -> We need a separator between lines. We will add lots of text and it would be good to have that text split into parts. Details on reporter gene construct. Multiline
Not sure what you mean by multiline. If you mean the .ace file should have the tags multiple times (yes), we'd have to decide what the separator would be, you'd type the separator manually, and we'd have the dumper split on it -- J
yes, I think the way to go is to add a separator manually D
okay, we've pretty much always used | so just use that to separate entries, and let me know when we write the dumper to split on | and print out data in different tags. -- J
great, I'll use the pipe to separate entries. I put a note at the end of the wiki as a reminder for you when you write the dumper.
Likewise maybe Reporter Gene Text if you think there will ever be a Reporter Gene field holding WBGenes -- J
fine here too to have reporter gene text D
Well, it's up to you, the table and label are as in the beginning of this line, but if you think that there'll be an ontology field of genes with a similar label, then we should change it -- J
- In_Situ -- In_Situ "exp_insitu" -- bigtext. This tag was used 434 times and always has text -> not possible just to toggle
- RT_PCR -- RT_PCR "exp_rtpcr" -- bigtext. This tag was used 165 times and has text associated -> not possible just to toggle
- Northern -- Northern "exp_northern" -- bigtext. This tag was used 347 times and has text or just the Northern label -> not possible just to toggle
- Western -- Western "exp_western" -- bigtext. This tag was used 19 times and always has text -> not possible just to toggle
all those above are the values of "type", right?
right D
From the Reporter_gene description, does this mean you need to add text to this dropdown? Do you want a "type" dropdown and a "type text" bigtext?
yes, it would be great to be able to select one of the above with a dropdown and, once selected, have a bigtext box next to it D.
Well, we can have a Type multidropdown and a Type_text bigtext, but each of the types you pick in the multidropdown won't be associated with anything specific in the big block of bigtext. If you wanted to have associations, you'd have to pick RT_PCR and Antibody (for example) in the multidropdown, then in the bigtext you'd have to type RT_PCR <some rtpcr text> | Antibody <some antibody text>, using the pipe ( | ) as a divider to separate the different things. At this point there's no point in having a multidropdown because you're typing everything in the bigtext field anyway. If you want to do things this way, add a "Type_text" bigtext field. I would instead suggest that if you want a tag + text associated with each other, you get rid of "Type" and make a lot of toggle_text fields, one for each of the types, then you could just click the toggle and type the text. We should probably talk about this in person since I'm not sure how you were originally picturing it working - J
We will talk in person, but both your suggestions would work. Suggestion 1 is to have a "Type_text" bigtext field and suggestion 2 is to click the toggle and type the text. The final thing I want to have dumped in the .ace file is e.g. Northern "text" or In_situ "text". As long as we achieve that, it does not make any difference :) D
We will have a multidropdown on the values above AND we will have bigtext fields for each of the values above. D&J decided this on March 21
- Picture -- exp_picture -- Multiontology on Picture. We will remove this tag: Picture objects will be created in the Picture OA and XREF'd to Expr_pattern. They will not be entered here. Removed from OA -- J We removed Pictures from Expr_pattern as they are XREF'd to it.
- Picture flag -- exp_pictureflag -- toggle notify picture person with a cronjob every 2 weeks. We keep this even if we remove the Picture tag
notify picture person with a cronjob when there is a new picture to curate
Notify how? -- J
I put that note for myself, but in the long run it would be good to have a way to notify other curators when there is a new object they should curate. For the Expr_pattern OA this applies to Picture, Transgene and Antibody D
It's still unclear to me how curators should get notified that there's a new value. We should probably talk about this. If this is something that "would be nice, but isn't important" but is still necessary for this field to exist, then okay, we don't have to talk about it. But if it turns out that we set it up in a way that won't work, I'm not going to want to talk about it after all the code's done and rewrite the code. Of course, we're not doing anything yet, we're just talking about how we will do this _eventually_, so there's no huge rush to talk about it -- J.
Again, it will be best to talk in person about it, but I think we could set up something like a "New object cgi" so that once I see a new antibody that needs to be annotated, I fill in a field and it generates a form that keeps track of the new objects that need attention. Hard to explain in written form. Karen showed me something similar for phenotypes, e.g. http://tazendra.caltech.edu/~postgres/cgi-bin/new_objects.cgi. We will see it 'live' the first time you stop by.
We set up flagging for Picture, Antibody and Transgene, and the persons responsible for those data types are notified with a cronjob every 2 weeks. We will see with time if the cronjob should be more frequent.
- Antibody_info -- Antibody_info "exp_antibody" -- multiontology on antibodies
- Antibody flag -- exp_antibodyflag -- toggle -> notify antibody person with a cronjob every 2 weeks
- Pattern -- Pattern "exp_pattern" -- bigtext, details on tissue distribution. Multiline
- Remark -- Remark "exp_remark" -- bigtext, if any comments required. Multiline
- Transgene -- Transgene "exp_transgene" -- multiontology on transgenes.
- Transgene flag -- exp_transgeneflag -- toggle -> notify transgene person with a cronjob every 2 weeks
- Curator -- exp_curator -- Multiontology on people
- No dump -- exp_nodump -- Toggle Expr_pattern objects not to dump. If an Expr_pattern object is flagged as no dump, don't dump any data for that pgid, nor any other pgid that corresponds to the Expr_pattern object. (Read all exp_nodump + exp_name into a hash of Expr_patterns to not-dump.)
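The "No dump" rule above (a flag on one pgid suppresses every pgid of the same Expr_pattern object) could be sketched like this. It is a minimal illustration under stated assumptions: the two hashes are made-up stand-ins for rows of exp_name and exp_nodump keyed by pgid, and names_not_to_dump is a hypothetical helper, not a function of the dumper module.

```perl
use strict;
use warnings;

# Made-up stand-ins for the exp_name and exp_nodump tables: pgid => value.
my %exp_name   = ( 101 => 'Expr1', 102 => 'Expr1', 103 => 'Expr2' );
my %exp_nodump = ( 102 => 'NO DUMP' );

# Hypothetical helper: build the hash of Expr_pattern names that must not be dumped.
sub names_not_to_dump {
    my ($name_of, $nodump_of) = @_;
    my %skip;
    foreach my $pgid (keys %$nodump_of) {
        next unless $nodump_of->{$pgid};          # flag not set for this pgid
        next unless defined $name_of->{$pgid};    # pgid has no object name
        $skip{ $name_of->{$pgid} }++;
    }
    return %skip;
}

my %skip = names_not_to_dump(\%exp_name, \%exp_nodump);
foreach my $pgid (sort { $a <=> $b } keys %exp_name) {
    my $name = $exp_name{$pgid};
    next if $skip{$name};       # skips pgid 101 as well as 102, since both belong to Expr1
    print "$pgid\t$name\n";     # only pgid 103 (Expr2) is printed
}
```

Keying the skip list on the object name rather than on the pgid is what makes the flag apply to all rows of the same object.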
Tab3
- Protein_description -- Protein_description "exp_protein" -- text (30 objects)
- Clone -- Clone "exp_clone" -- multiontology on clones (341 objects) (when the OA is in place, discuss the clone class with Chris). Is there a better place to get clones than http://textpresso-dev.caltech.edu/gsa/worm/known_entities/Clone?
Clone and Strain lists could be taken from spica, from /home/citpub/arun/wb_entities/known_entities
None of these have any Term Info (nor synonyms). If you need either of those you'd have to query WS for it. Karen probably knows how, she does it for other objects -- J
ok, I don't think I'll need term info, and I need it mainly to parse old data which have a clone attached, so for now it is fine as it is D
ok, I'll change the parser to read these. After reading below it's unclear how I should change the parser for strain / clone, we should talk -- J
Karen will generate a file with clone objects and term information, J will have to update some scripts when the file is ready -- J
These lists are kept updated daily with what is in acedb using the following cronjob:
06 23 * * * cd /home/citpub/arun/wb_entities; ./01getEntities.pl
- Strain -- Strain "exp_strain" -- multiontology on strains (812 objects). Is there a better place to take the strain list than http://textpresso-dev.caltech.edu/gsa/worm/known_entities/Strain?
Note that Strain is a text field under transgene and phenotype -- J
I see. We can leave it as text for now D
Well, it's up to you. If you want Clones to be an autocomplete, I'll have to change the parser anyway, so we can make strains a multidropdown. I'm just letting you know that it's text for other OA configs, so maybe you should talk with them about changing their data, or what they want to show in Term Info -- J
Karen will generate a file with strain objects and term information, J will have to update some scripts when the file is ready -- J
- Sequence -- Sequence "exp_sequence" -- text (13 objects): F54E2 2x (clone), R05D8 2x (clone), Y38B5A (clone), "Z28375" -C "EMBL Z28375" (sequence), "Z28376" -C "EMBL Z28376" (sequence), "Z28377" -C "EMBL Z28377" (sequence), R11H6 (clone), Y40H4A (clone), U14525, C47G2 (clone), Z32673 (sequence). We should consolidate these objects with Clones or Genes
not sure what this consolidation means -- J.
We will keep these objects as text in the beginning (this is to parse in old Expr_pattern data), but Wen and I have to find a way to get rid of this category in the long run and merge the Sequence with the clone, when possible. D
okay, we're not working on this OA for a while yet, so if that gives you time to clean up this data, that'd be good. Otherwise we can do it down the line -- J
for the moment we will leave the text field and see how it goes. If I manage to clean it up before, I'll tell you
- MovieURL -- MovieURL "exp_movieurl" -- (32 objects) text ? -- J yes D
- Laboratory -- Laboratory "exp_laboratory" -- ontology (17 objects)
There's an ontology of laboratories used for 3 OAs, if you want to use that. The labs are not updated though, so if you want to use "new" labs, text is fine -- J
great, then we can use the laboratory ontology D
I've changed the type to ontology, if you want multiontology, go ahead and change it -- J
ontology is good. No values with multiple labs found so far.
Phenotype calls this 'Laboratory', but Antibody and Transgene call this 'Location', do you know why that is? It would be nice to name all the tables the same; I don't recall who the curators are for antibody and transgene. -- J
you are right :). In the expr_pattern model it is Laboratory and honestly I like it much better than Location, which is too ambiguous... Leave it as laboratory D
Great =) I'd like to talk to the antibody and transgene people to rename the postgres table -- J
notes: In the future we will get rid of the following tags: CDS, Sequence, Pseudogene and Protein and we will propose a model change for that. We will also get rid of Protein_description, Cell, Cell_group.
Daniela think if you want to have Author, Date and Curated_by removed from OA and stored in a separate file that will be read automatically whenever dumping the OA data. This will make OA faster as there will be less fields. Yes, we will keep Author, Date, and Curated_by in a separate file. D 052411
obsolete fields
- Cell -- exp_cell -- text (26 objects) -> when this is live, consolidate these objects with the Anatomy_term field
We have no autocomplete on Cell, you'd have to create a list of objects / term info -- J
ok, leave it as text now, I will have to go and see those objects one by one and consolidate with anatomy terms. D
We don't have to leave it as text, if you can come up with a list of Cell objects, the way you have strain and clone. You can talk to Karen about how she generates data from acedb or aql queries or something -- J
we will not populate the Cell field at all. Daniela: manually add terms associated with cell to the Anatomy term field (file with mapping is cells under Files from Wen) done DR 06062011 or:
Expr_pattern : "Expr7477"
Cell "P3.p" Certain
Cell "P4.p" Certain
Cell "P5.p" Certain
Cell "P6.p" Certain
Cell "P7.p" Certain
Cell "P8.p" Certain
done DR 06062011
Expr_pattern : "Expr7595"
Cell "CANL" Uncertain
Cell "CANR" Uncertain
done DR 06062011
Expr_pattern : "Expr7605"
Cell "M4" Certain
done DR 06062011
Expr_pattern : "Expr7632"
Cell "AVG" Certain
Cell "M5" Certain
Cell "PVT" Certain
Cell "PVCL" Uncertain
Cell "PVCR" Uncertain
Cell "PVNL" Uncertain
Cell "PVNR" Uncertain
Cell "PVQL" Uncertain
Cell "PVQR" Uncertain
done DR 06062011
Expr_pattern : "Expr7691"
Cell "P3.p" Certain
Cell "P4.p" Certain
Cell "P5.p" Certain
Cell "P6.p" Certain
Cell "P7.p" Certain
Cell "P8.p" Certain
done DR 06062011
Expr_pattern : "Expr8715"
Cell "M.dlpa" Certain
Cell "M.drpa" Certain
done DR 06062011
- Anatomy certain -- exp_certain -- multiontology. Controlled vocabulary found here: https://github.com/raymond91125/Wao/raw/master/WBbt.obo (same as in Picture OA). We need to have 3 different Anatomy term boxes, one for the Partial, one for the Certain and one for the Uncertain qualifiers. We also have to think about how to implement the text options for each anatomy term. Normally the text would go with the Partial qualifiers.
What does the 3 boxes thing mean? -- J
We should discuss this matter in person, because for each anatomy term that I enter I should also add a qualifier. One box is called Certain, one is called Uncertain, and the other Partial. The meaning is related to the expression pattern. E.g. for expression in a subset of neurons I will click Partial. For expression in the K cell I will click Certain. For possible expression in the pharynx I will click Uncertain. D
It sounds like you want 2 fields, this one + a dropdown field with 3 values, we'll talk in person / skype -- J
- Anatomy Partial -- exp_partial -- multiontology.
- Anatomy Uncertain -- exp_uncertain -- multiontology.
- Anatomy no qualifier-- exp_noqualifier -- multiontology. We added this field because when we parsed the old expr_pattern data (WS226) 5518 anatomy_term lines did not have a #Qualifier.
The following fields will not be imported into the OA but will remain in Citace Minus: Author, Date and Curated_by (also discussed with Wen). Attention: when deleting Expr_pattern objects from Citace Minus, be sure not to delete those.
- Author -- exp_author -- text, separate authors by pipe
What is this the author of? Should it just be 'Person'? -- J
yes, you are right :) but in the model it is listed as Author and I guess it was used for author submissions (large scale). Let's leave it as Author and put a multiontology on people.
If in the model it's listed as Author to ?Person objects, that's kind of bad because it doesn't make sense. If in the model it's listed as Author to ?Author objects, then we can't enter people because we'd be entering WBPerson objects into the ?Author class -- J
- Date -- exp_date -- text (2617 objects)
- Curated_by -- exp_curatedby -- text (6228 objects)
Not Curator, meaning WBPerson? The curator field is required already, but this is a different thing? -- J.
no, this is a legacy thing, the values are only Hinxton and Caltech. Wen would like to get rid of it eventually, but for the moment we are keeping it there D
ah, ok -- J
Tags used only once that should be fixed
- Expressed_in - text 1 entry. No info attached to this term. Left out DR 06062011
- Protein - text 1 entry could be put in Protein_description. Expr1941 done DR 06062011
- Pseudogene - text (1 object) Expr111 done DR 06062011
- Homol_homol tag is used in Chronograms -> we will not include Chronograms in the OA.
Daniela will enter them in the remarks as there is only 1 entry per tag. done DR 06062011. Discussed it with Wen, May 9th. D.
Daniela needs to write down which ones these are.
Juancarlos, there is a bunch of objects that have a Strain attached in the Remarks instead of in the Strain tag. It would be good to consolidate them.
Comments for Parsing ExprCitace226 into OA
Parsing files in /home/postgres/work/pgpopulation/exp_exprpattern. Many entries for Anatomy_term don't have one of Certain/Partial/Uncertain. We leave them without the qualifier.
Chronogram tags
Right_priority Localizome Show_up_strand GFF_source Width Picture Reporter_gene Reference Gene Allow_misalign GFF_feature Transgene Homol_homol Remark Strain Colour Curated_by
the script to get the tags (e.g. from ExprWS221.ace or from Chronograms.ace) was written by Yuling, is called get_tags.pl and is located under desktop/Varia_protocols/get_tags
We will not include Chronograms in Expr_OA anyway as they are one time large scale exp.
Notes
When J writes the dumper for Reporter_gene, remember to split on | for bigtext fields and print out the data in different tags. Daniela: check if this applies to other entries. Checked, applies to bigtext fields.
Daniela, that makes sense for most bigtext fields, but does it make sense for Qualifier Text? If you have multiple Qualifier Text values, you'd group them in different OA rows to match the multiontology Qualifiers, right? None of the current data has pipes. -- J
Juancarlos, let's leave the Qualifier text without pipes --D
k
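The pipe convention in this note could be sketched as below: one bigtext value holding several entries separated by | becomes one .ace line per entry. This is only an illustration of the idea; pipe_to_ace_lines is a hypothetical helper name and the sample text is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one pipe-separated bigtext value into one .ace line per entry.
sub pipe_to_ace_lines {
    my ($tag, $data) = @_;
    my @lines;
    foreach my $value (split /\|/, $data) {
        $value =~ s/^\s+//;             # trim leading whitespace
        $value =~ s/\s+$//;             # trim trailing whitespace
        next if $value eq '';           # ignore empty pieces, e.g. from a doubled pipe
        $value =~ s/"/\\"/g;            # escape double quotes inside the text
        push @lines, "$tag\t\"$value\"";
    }
    return @lines;
}

# Two entries typed into one Reporter_gene bigtext field give two Reporter_gene lines.
print "$_\n" foreach pipe_to_ace_lines('Reporter_gene', 'first construct text | second construct text');
```

Qualifier Text would not go through such a split, following the decision recorded in this note.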
to fix manually
* INVALID DATA antibody [WBPaper00032450]:capg-1 Expr8708
* INVALID DATA antibody [cgc3002]:beta-filagenin Expr1442
* INVALID DATA antibody [cgc4387]:hsp-16.2 Expr1117
* INVALID DATA antibody [cgc6057]:daf-21 Expr2687
- INVALID DATA goid GO:0000141 Expr3919 Done DR06062011
- INVALID DATA goid GO:0008221 Expr7871 Done DR06062011
- INVALID DATA transgene Is001 Expr2646 Done DR06062011
- INVALID DATA transgene Is007 Expr2646 Done DR06062011
- INVALID DATA transgene leals30 Expr9151 Done DR06062011
- INVALID DATA transgene pZMI.1In1 Expr725 Done DR06062011
- INVALID DATA transgene pZMI.1In2 Expr725 Done DR06062011
Need to correct the expression pattern transgene name
- Is001 -> WBPaper00006024_Is001 for Expr2646 WBPaper00006024 Done DR06062011
- Is007 -> WBPaper00006024_Is007 for Expr2646 WBPaper00006024 Done DR06062011
- pZMI.1In1 -> WBPaper00002501_In1 for Expr725 WBPaper00002501 Done DR06062011
- pZMI.1In2 -> WBPaper00002501_In2 for Expr725 WBPaper00002501 Done DR06062011
- Add leals30 Expr9151 WBPaper00037728 Done DR06062011
Need to correct the expression pattern GO name
- GO:0000141 is now GO:0032432 Done DR06062011
- GO:0008221 is now GO:0016529 Done DR06062011
There was a list of Anatomy term objects with invalid IDs. this is the mapping for the new ids:
- Old ID New ID
- WBbt000:6748 WBbt:0006748
- WBbt:0003852 WBbt:0003851
- WBbt:0004397 WBbt:0008116
- WBbt:0004398 WBbt:0008111
- WBbt:0004401 WBbt:0004392
- WBbt:0004459 WBbt:0003664
- WBbt:0004514 WBbt:0008052
- WBbt:0004515 WBbt:0008050
- WBbt:0004717 WBbt:0008046
- WBbt:0004718 WBbt:0008051
- WBbt:0004719 WBbt:0008049
- WBbt:0004720 WBbt:0008047
- WBbt:0004721 WBbt:0008045
- WBbt:0004722 WBbt:0008044
- WBbt:0005099 WBbt:0005830
- WBbt:0005211 WBbt:0005801
- WBbt:0005228 WBbt:0005214
- WBbt:0005323 WBbt:0005831
- WBbt:0005814 WBbt:0006909
- WBbt:6789 WBbt:0006789
all OK
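If the old-to-new anatomy mapping above were put into the parser rather than fixed by hand, it could look roughly like this. A sketch only: remap_anatomy_id is a hypothetical helper, and just three pairs from the list are copied in for brevity.

```perl
use strict;
use warnings;

# A few old => new pairs copied from the mapping above; the rest would be added the same way.
my %anatomy_remap = (
    'WBbt000:6748' => 'WBbt:0006748',
    'WBbt:0003852' => 'WBbt:0003851',
    'WBbt:6789'    => 'WBbt:0006789',
);

# Hypothetical helper: return the current ID for an anatomy term, leaving unlisted IDs untouched.
sub remap_anatomy_id {
    my ($id) = @_;
    return exists $anatomy_remap{$id} ? $anatomy_remap{$id} : $id;
}

foreach my $id ('WBbt:6789', 'WBbt:0005830') {
    print "$id -> ", remap_anatomy_id($id), "\n";
}
```

The same lookup shape would fit the transgene and GO corrections listed earlier, with their own old => new pairs.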
Dumper
Module located here: /home/postgres/work/citace_upload/expr_pattern/get_expr_pattern_ace.pm
Script that calls the module located here: /home/postgres/work/citace_upload/expr_pattern/use_package.pl*
use lib qw( /home/postgres/work/citace_upload/expr_pattern );   # tells perl where to look for the module
use get_expr_pattern_ace;                                       # tells perl to use the module

my $outfile = 'expr_pattern.ace';
my $errfile = 'err.out';                                        # we did not set any rule for errors yet

open (OUT, ">$outfile") or die "Cannot create $outfile : $!\n";
open (ERR, ">$errfile") or die "Cannot create $errfile : $!\n";

my ($all_entry, $err_text) = &getExprPattern('all');            # uses the module to get all the Expr_pattern objects

print OUT "$all_entry\n";                                       # prints everything into the output expr_pattern file
if ($err_text) { print ERR "$err_text\n"; }                     # prints errors into the output error file

close (OUT) or die "Cannot close $outfile : $!";
close (ERR) or die "Cannot close $errfile : $!";
Module
package get_expr_pattern_ace;           # name of the package
require Exporter;                       # exports so that other perl scripts can use it
our @ISA = qw(Exporter);
our @EXPORT = qw( getExprPattern ); # we are only exporting the getExprPattern subroutine
our $VERSION = 1.00;
use strict;
use diagnostics;
use DBI;
my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n"; # connect to postgres and the testDB database
my $result;
my %theHash; # where all the data are going to be stored
my @tables = qw( name paper gene anatomy qualifier qualifiertext goid subcellloc lifestage exprtype antibodytext reportergene insitu rtpcr northern western antibody pattern remark transgene curator nodump protein clone strain sequence movieurl laboratory );   # all the tables that have data
my @maintables = qw( paper gene anatomy goid subcellloc lifestage exprtype antibodytext reportergene insitu rtpcr northern western antibody pattern remark transgene protein clone strain sequence movieurl laboratory );   # tables that have .ace tags
my $all_entry = '';                     # where all the .ace data is going to go
my $err_text = '';                      # where all the error data is going to go
my %nameToIDs;                          # maps the expr_object id to PGID       # type -> name -> ids -> count
my %ids;                                # list of PGIDs
my %pipeSplit;                          # tables that need to split on pipes
$pipeSplit{subcellloc}++;
$pipeSplit{antibodytext}++;
$pipeSplit{reportergene}++;
$pipeSplit{insitu}++;
$pipeSplit{rtpcr}++;
$pipeSplit{northern}++;
$pipeSplit{western}++;
$pipeSplit{pattern}++;
$pipeSplit{remark}++;
my %tableToTag;                         # mapping table to the .ace tag
$tableToTag{paper} = 'Reference';
$tableToTag{gene} = 'Gene';
$tableToTag{anatomy} = 'Anatomy_term';
$tableToTag{goid} = 'GO_term';
$tableToTag{subcellloc} = 'Subcellular_localization';
$tableToTag{lifestage} = 'Life_stage';
$tableToTag{exprtype} = 'Special';
$tableToTag{antibodytext} = 'Antibody';
$tableToTag{reportergene} = 'Reporter_gene';
$tableToTag{insitu} = 'In_situ';
$tableToTag{rtpcr} = 'RT_PCR';
$tableToTag{northern} = 'Northern';
$tableToTag{western} = 'Western';
$tableToTag{antibody} = 'Antibody_info';
$tableToTag{pattern} = 'Pattern';
$tableToTag{remark} = 'Remark';
$tableToTag{transgene} = 'Transgene';
$tableToTag{protein} = 'Protein_description';
$tableToTag{clone} = 'Clone';
$tableToTag{strain} = 'Strain';
$tableToTag{sequence} = 'Sequence';
$tableToTag{movieurl} = 'MovieURL';
$tableToTag{laboratory} = 'Laboratory';
my %ontologyIdToName; # mappings for ids to names (only for life stage)
1;
sub getExprPattern {
  my ($flag) = shift;                   # can be 'all' or the name of an Expr_pattern object
  &populateOntIdToName();               # call the subroutine at the bottom of the page for life stage name mapping
  if ( $flag eq 'all' ) {
    $result = $dbh->prepare( "SELECT * FROM exp_name ; " );                             # get all entries for type
  } else {
    $result = $dbh->prepare( "SELECT * FROM exp_name WHERE exp_name = '$flag' ;" );     # get all entries for type of object name
  }
  $result->execute();
  while (my @row = $result->fetchrow) {
    $theHash{object}{$row[0]} = $row[1];
    $nameToIDs{object}{$row[1]}{$row[0]}++;
    $ids{$row[0]}++;
  }
  my $ids = '';
  my $qualifier = '';
  if ($flag ne 'all') {
    $ids = join"','", sort keys %ids;
    $qualifier = "WHERE joinkey IN ('$ids')";
  }
  foreach my $table (@tables) {
    $result = $dbh->prepare( "SELECT * FROM exp_$table $qualifier;" );  # get data for table with qualifier (or not if not)
    $result->execute();
    while (my @row = $result->fetchrow) { $theHash{$table}{$row[0]} = $row[1]; }
  } # foreach my $table (@tables)

  foreach my $name (sort keys %{ $nameToIDs{object} }) {
    my $entry = '';
    my $has_data;
    $entry .= "\nExpr_pattern : \"$name\"\n";
    my %cur_entry;                      # lines for this object, as hash keys so duplicates collapse
    foreach my $joinkey (sort {$a<=>$b} keys %{ $nameToIDs{object}{$name} }) {
      next if ($theHash{nodump}{$joinkey});             # skip pgids flagged as no dump
      foreach my $table (@maintables) {
        next unless ($tableToTag{$table});
        my $tag = $tableToTag{$table};
        if ($table eq 'anatomy') {                      # anatomy term + optional qualifier + optional qualifier text
          my %e1 = &getData($table, $joinkey);
          my %e2 = &getData('qualifier', $joinkey);
          my %e3 = &getData('qualifiertext', $joinkey);
          my $l2_exists = 0;
          my $l3_exists = 0;
          foreach my $e1 (sort keys %e1) {
            foreach my $e2 (sort keys %e2) {
              foreach my $e3 (sort keys %e3) {
                $cur_entry{"$tag\t\"$e1\" $e2 \"$e3\"\n"}++;
                $l3_exists++;
              }
              unless ($l3_exists) { $cur_entry{"$tag\t\"$e1\" $e2\n"}++; $l2_exists++; }
            }
            unless ( ($l2_exists) || ($l3_exists) ) { $cur_entry{"$tag\t\"$e1\"\n"}++; }
          }
        }
        elsif ($table eq 'exprtype') {                  # the type value itself is the tag, printed without text
          my %entries = &getData($table, $joinkey);
          foreach my $entry (sort keys %entries) { $cur_entry{"$entry\n"}++; }
        }
        else {
          my %entries = &getData($table, $joinkey);
          foreach my $entry (sort keys %entries) { $cur_entry{"$tag\t\"$entry\"\n"}++; }
        }
      } # foreach my $table (@maintables)
    } # foreach my $joinkey (sort {$a<=>$b} keys %{ $nameToIDs{object}{$name} })
    foreach my $line (sort keys %cur_entry) { $entry .= $line; $has_data++; }
    if ($has_data) { $all_entry .= $entry; }
  } # foreach my $name (sort keys %{ $nameToIDs{object} })
  return( $all_entry, $err_text );
} # sub getExprPattern
sub getData { # get hash of values in this table
  my ($table, $joinkey) = @_;
  my %entries;
  if ($theHash{$table}{$joinkey}) {
    my $data = $theHash{$table}{$joinkey};
    if ($data =~ m/^\"/) { $data =~ s/^\"//; }          # strip leading quote
    if ($data =~ m/\"$/) { $data =~ s/\"$//; }          # strip trailing quote
    if ($data =~ m/\r/) { $data =~ s/\r//g; }           # strip carriage returns
    if ($data =~ m/\n/) { $data =~ s/\n/ /g; }          # newlines become spaces
    if ($data =~ m/^\s+/) { $data =~ s/^\s+//g; }
    if ($data =~ m/\s+$/) { $data =~ s/\s+$//g; }
    my @data;
    if ($data =~ m/\",\"/) { @data = split/\",\"/, $data; }             # multivalue fields are stored as "a","b"
      elsif ($pipeSplit{$table}) { @data = split/\|/, $data; }          # bigtext fields split on pipes
      else { push @data, $data; }
    foreach my $value (@data) {
      if ($value =~ m/\"/) { $value =~ s/\"/\\\"/g; }                   # escape quotes for .ace
      if ($value =~ m/^\s+/) { $value =~ s/^\s+//g; }
      if ($value =~ m/\s+$/) { $value =~ s/\s+$//g; }
      if ($table eq 'lifestage') {                                      # convert life stage ids to lifestage names. 2011 05 13
        if ($ontologyIdToName{$table}{$value}) { $value = $ontologyIdToName{$table}{$value}; } }
      if ($value) { $entries{$value}++; }
    }
  }
  return %entries;
} # sub getData
sub populateOntIdToName {               # reads from obo_name_lifestage to get the mappings from life_stage id to name
  $result = $dbh->prepare( "SELECT * FROM obo_name_lifestage;" );
  $result->execute();
  while (my @row = $result->fetchrow) { $ontologyIdToName{'lifestage'}{$row[0]} = $row[1]; }
} # sub populateOntIdToName
Data parsing
File that was used for parsing is the WS226 dump and is located here: /home/postgres/work/pgpopulation/exp_exprpattern/ExprWS226.ace
There are 1802 objects without any Anatomy_term. I'm assuming this is okay -- J Yes, it is --D
What do we do with Marker objects ? Treat them the same as Expr_pattern objects ? -- J yes, treat the same --D
Life_stage in the obo class has WBls:####### IDs, but the data has lifestage names. Is this bad data? The OA only supports IDs (see phenotype, generegulation, picture OA): can we convert the life stage names into WBls:#######?
I asked Wen about this and she is fine with it --D
Changed the parser to convert from name to ID, but still waiting until we talk to Karen
- WS gene reg obj : http://wormbase.org/db/misc/etree?name=WBPaper00036764_lin-28.b;class=Gene_regulation
- WS expr pat obj : http://wormbase.org/db/misc/etree?name=Expr2201;class=Expr_pattern
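The name-to-ID conversion discussed above is the reverse of what the dumper does with obo_name_lifestage, so one way to picture it is to invert the id => name hash. This is a sketch and not the parser's actual code: the IDs and names below are invented placeholders, lifestage_name_to_id is a hypothetical helper, and it assumes each name maps to exactly one ID.

```perl
use strict;
use warnings;

# Invented id => name pairs standing in for rows of obo_name_lifestage.
my %id_to_name = (
    'WBls:0000001' => 'example stage A',
    'WBls:0000002' => 'example stage B',
);

# Invert the hash; this is only safe when no two IDs share the same name.
my %name_to_id = reverse %id_to_name;

# Hypothetical helper: return the WBls ID for a life stage name, or undef when the name is unknown.
sub lifestage_name_to_id {
    my ($name) = @_;
    return $name_to_id{$name};
}

foreach my $name ('example stage B', 'not a real stage') {
    my $id = lifestage_name_to_id($name);
    print "$name -> ", (defined $id ? $id : 'INVALID DATA'), "\n";
}
```

Names with no match would be the ones to collect for manual fixing, in the same spirit as the invalid_ontology_values file mentioned below.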
/home/postgres/work/pgpopulation/exp_exprpattern/invalid_ontology_values has many other objects that don't fit the ontologies. It would be best to either fix them in citace and redump, or to get mappings of bad-to-good values and put them in the parser. This was run on the sandbox, so if any values are real, the sandbox might not have all the values. -- J
I see, there are many objects with an invalid format for different classes. I will figure out what the problem was for each of them and get back to you --D.
- 20 Anatomy terms having old ids -> Daniela generated mapping with new IDs.
- 2 invalid objects for GO -> alerted Ranjana, waiting for answer
- 5 Antibody objects -> alerted Xiaodong, 2 fixed, 3 waiting for Wen's answer (did she create the objects already or should we generate new ones?).
- 37 transgene objects -> alerted Karen
Strain and Clone don't have ontologies yet, once we have those we'll see if any data is bad -- J ok --D
Only looking at WBPictureID pictures. If we need to dump both ways, it will get conversions from the WBPictureID's name. -- J
I am not sure I get this.. D
we talked about it