Difference between revisions of "Expression Pattern"

From WormBaseWiki
Line 1,770: Line 1,770:
 
http://www.sanger.ac.uk/sanger/Worm_NameServer
 
  
to generate an ID to use in the OA:
+
check if the new variation already exists by clicking on find variation.
 +
 
 +
If it does:
 +
generate an ID in the OA:
 
http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/generic.cgi?action=TempVariationObo
 
 +
putting in the public name and the ID
 +
 +
if it doesn't:
 +
on the name server click on 'request a new variation ID'
 +
put in the public name and the paper -with additional info
  
 +
then go to the OA as above and generate an ID
  
 
== Expression tables ==
 

Revision as of 20:47, 24 May 2016

Expression Pattern

This is the current model (WS252)



?Expr_pattern Expression_of Gene ?Gene XREF Expr_pattern #Evidence
                            Reflects_endogenous_expression_of ?Gene
                            CDS ?CDS XREF Expr_pattern               // for coding genes
                            Sequence ?Sequence XREF Expr_pattern     // for clones???
                            Pseudogene ?Pseudogene XREF Expr_pattern // [030801 krb]
                            Clone ?Clone XREF Expr_pattern
                            Protein ?Protein XREF Expr_pattern
                            Protein_description Text   // information for Expr_patterns with unknown antigens [031105 krb]
              Homol Homol_homol ?Homol_data XREF Expr_homol ?Method Float Int UNIQUE Int Int UNIQUE Int #Homol_info //Expr_pattern mapping [060427 ar2]
              Expression_data Life_stage ?Life_stage XREF Expr_pattern #Qualifier
                              Anatomy_term ?Anatomy_term XREF Expr_pattern #Qualifier
                              GO_term ?GO_term XREF Expr_pattern #GR_condition
                              Not_in_Life_stage ?Life_stage #Qualifier
                              Not_in_Anatomy_term ?Anatomy_term #Qualifier
                              Not_in_GO_term ?GO_term #GR_condition 
              Subcellular_localization ?Text
              Type Reporter_gene ?Text
                   In_situ Text
                   Antibody ?Text
                   Northern Text
                   Western Text
                   RT_PCR Text
                   RNASeq ?Analysis
                   Localizome ?Text
                   Microarray ?Microarray_experiment
                   Tiling_array ?Analysis
                   EPIC ?Text 
                   Cis_regulatory_element Text
              Expression_cluster ?Expression_cluster XREF Expr_pattern //added for localizome
              Microarray_results ?Microarray_results XREF Expr_Pattern
              Pattern ?Text
              Picture ?Picture XREF Expr_pattern
              MovieURL Text //Added by wen for link to movie URLs. 
              Movie ?Movie XREF Expr_pattern  //Added by Wen to curate Expr_pattern video
              Species UNIQUE ?Species
              Remark ?Text #Evidence
              DB_info ?Database ?Database_field Text
              Experiment Laboratory ?Laboratory
                         Author ?Author 
                         Date UNIQUE DateType
                         Strain UNIQUE ?Strain
              Reference ?Paper XREF Expr_pattern
              Transgene ?Transgene XREF Expr_pattern
              Variation ?Variation XREF Expr_pattern
              Construct ?Construct XREF Expression_pattern
              Associated_feature ?Feature XREF Associated_with_expression_pattern #Evidence
              Antibody_info ?Antibody XREF Expr_pattern // This applies to both Western & Antibody staining
                                                        // added [031120 krb]
              Curated_by UNIQUE Text // Hinxton (HX) or Caltech (CIT) Sylvia [010927 krb]
              Historical_gene ?Gene Text


//Qualifier hash will be used for Expr_pattern curation to specify the reliability of data.
                                                                                                 
#Qualifier Certain Text
           Uncertain Text             //For faint or variable expression
           Partial Text               //For expression of unidentified cell in a cell group
           Anatomy_term ?Anatomy_term //combines life stage with anatomy term in expr pattern annotation
           Life_stage  ?Life_stage    //combines life stage with anatomy term in expr pattern annotation
                                                                                                 


Tags currently used in Expr_pattern objects (based on WS221):

Laboratory Expr_pattern Pattern Life_stage Gene Antibody Subcellular_localization GO_term Western Transgene Protein_description In_Situ Author Anatomy_term Reporter_gene Picture Date Reference Expressed_in Antibody_info Protein Northern Clone Cell RT_PCR Strain Remark MovieURL Pseudogene Curated_by Sequence

Types of fields Juancarlos can implement:

   * text : text
   * bigtext : text box expanded
   * dropdown : few values
   * ontology : controlled vocabulary 
   * multiontology / multidropdown : allows multiple values
   * toggle : on / off

Genes with expression

To check the number of genes that have expression objects, run this script on tazendra:

/home/postgres/work/get_stuff/for_daniela/20140715_exp_gene_distinct/get_exp_gene_distinct.pl

  • 5575 as of August 2014
  • 5734 as of January 2016

WS248 numbers

For expression in WS248 we have 19052 objects coming from Yanai elegans, 68097 from Yanai other species, 10545 from Miller tiling arrays and 13877 manually curated, for a total of 111571.

For pictures in WS248 we have 19052 objects coming from Yanai elegans, 68097 from Yanai other species and 13912 manually curated, for a total of 101061.

These are the statistics from citace that Wen pulled on May 11th 2015, showing the changes from WS243 to WS248:

find Antibody: 2525 --> 2785, 260 added.
find Anatomy_term: 6839 --> 6842, 3 added.
find Anatomy_function: 598 --> 924, 326 added.
find DO_term: 6350 --> 6571, 221 added.
find Expr_pattern: 42979 --> 111571, 68592 added.
find Picture: 32636 --> 101061, 68425 added.
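The totals and the "added" figures above can be re-derived from the individual counts. The following is only a small arithmetic check; the dictionary and function names are made up for illustration and the numbers are copied from the text above.

```python
# Per-source object counts for WS248, copied from the text above.
expr_sources = {
    "Yanai elegans": 19052,
    "Yanai other species": 68097,
    "Miller tiling arrays": 10545,
    "manually curated": 13877,
}
picture_sources = {
    "Yanai elegans": 19052,
    "Yanai other species": 68097,
    "manually curated": 13912,
}

# (WS243 count, WS248 count) per class, copied from the citace statistics.
class_counts = {
    "Antibody": (2525, 2785),
    "Anatomy_term": (6839, 6842),
    "Anatomy_function": (598, 924),
    "DO_term": (6350, 6571),
    "Expr_pattern": (42979, 111571),
    "Picture": (32636, 101061),
}


def added(counts):
    """Return {class name: WS248 count minus WS243 count}."""
    return {name: new - old for name, (old, new) in counts.items()}


expr_total = sum(expr_sources.values())
picture_total = sum(picture_sources.values())
deltas = added(class_counts)

for name, delta in deltas.items():
    print(f"find {name}: {delta} added")
```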

OA interface

OA editor label -- postgres table name -- type of table and description. -- J

i will prepare a .ace template for dumping soon if this is what you mean. E.g. Reference -> exp_paper D

I meant that that's the format of each field below, but adding the .ace Tag somewhere in the pattern is probably good -- J

Dumper

In February 2015 we added the qualifier life stage field so that we can capture an anatomy term and a life stage associated with each other. We added a qualifier_lifestage field and modified the dumper so that whenever there is an anatomy term and a life stage in the qualifier life stage field it will dump:

Expr_pattern : "Expr12000"
Anatomy_term	"WBbt:0004575" Life_stage "WBls:0000264"
Life_stage "WBls:0000264" Anatomy_term	"WBbt:0004575" 

We also set it up in a way that if only the qualifier life stage is filled in, it will still be dumped. This is because data entered in the life stage field of the micropublication form go into exp_qualifierls.
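A minimal sketch of the dumping rule described above, written in Python for illustration only (it is not the actual dumper code). The function name and arguments are hypothetical. Two things are assumptions: that every anatomy term in a row is paired with every qualifier life stage in that row, and that a qualifier life stage without anatomy is dumped as a bare Life_stage line.

```python
def dump_qualifier_ls(expr_name, anatomy_ids, qualifier_ls_ids):
    """Return the .ace lines for one OA row as a list of strings."""
    lines = [f'Expr_pattern : "{expr_name}"']
    if anatomy_ids:
        # Anatomy and qualifier life stage both present: dump the pair
        # in both directions, as in the Expr12000 example above.
        for anat in anatomy_ids:
            for ls in qualifier_ls_ids:
                lines.append(f'Anatomy_term\t"{anat}" Life_stage "{ls}"')
                lines.append(f'Life_stage "{ls}" Anatomy_term\t"{anat}"')
    else:
        # Assumption: with no anatomy term, the qualifier life stage is
        # dumped on its own as a plain Life_stage line.
        for ls in qualifier_ls_ids:
            lines.append(f'Life_stage "{ls}"')
    return lines


print("\n".join(dump_qualifier_ls("Expr12000", ["WBbt:0004575"], ["WBls:0000264"])))
```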

Tab1

  • Pgdbid -- no table -- postgres database ID, generates automatically upon entry.
  • Expr_pattern -- Expr_pattern : "exp_name" -- text -- Expression Pattern ID should be generated when creating a new object. Take the highest Expr_patternID and increase by one. When making a new row, the OA looks at all entries in exp_name that begin with "Expr", then captures the numbers, finds the highest number, adds 1 to it, puts 'Expr' in front, and uses that as the new name. So be aware that if you manually enter 'Expr9999998' it will skip to 'Expr9999999' when you click it. -- J
  • Reference -- Reference "exp_paper" -- multiontology on paper WBPaperID - Daniela add wish list for term info. Juancarlos I am still thinking on what I would like to see displayed. Probably not much but it will be clear later on. multiontology ? I think all other configs only have a single ontology for Paper -- J single ontology is fine D. We changed it back to multiontology as there were Expr objects with multiple papers associated. A query for that is: testdb=> SELECT * FROM exp_paper WHERE exp_paper ~ ','; and the result:
302     | "WBPaper00001926","WBPaper00001469"                   | 2011-05-31 11:37:14.153562-07
5478    | "WBPaper00002573","WBPaper00002922"                   | 2011-05-31 12:04:27.053611-07
5479    | "WBPaper00001785","WBPaper00002922"                   | 2011-05-31 12:04:27.284807-07
5501    | "WBPaper00003285","WBPaper00001812"                   | 2011-05-31 12:04:34.649968-07
5502    | "WBPaper00003285","WBPaper00001812"                   | 2011-05-31 12:04:34.893549-07
5557    | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:04:50.273589-07
5558    | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:04:50.51613-07
5559    | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:04:50.724796-07
5689    | "WBPaper00002573","WBPaper00002922"                   | 2011-05-31 12:05:33.692997-07
5707    | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:05:42.104837-07
8260    | "WBPaper00031556","WBPaper00032077"                   | 2011-05-31 12:18:44.374796-07
  • Gene -- Gene "exp_gene" -- multiontology on genes WBGeneID - show WBID, locus, and synonym in term info as in GO OA
  • Endogenous -- "exp_endogenous" toggle tag in ace file Reflects_endogenous_expression_of
  • Rel Anatomy -- "exp_relanatomy" dropdown on part_of
  • Anatomy -- Anatomy_term "exp_anatomy" exp_qualifier "exp_qualifiertext" -- multiontology. Daniela will associate different Anatomy-qualifier-qualifier_text in different OA rows, so some Expr objects will have multiple rows / multiple pgids. When querying by any of these fields, if editing a different field, the curator should query by Expr to make sure all pgids for that object have that other field edited.
  • Qualifier -- exp_qualifier -- dropdown -- Certain / Uncertain / Partial / NOT (NOT is not dumped for now. Added Feb 2015 to capture negative expression)
  • Qualifier Text -- exp_qualifiertext -- bigtext
  • GO_term -- GO_term "exp_goid" -- multiontology of GO_Term like gop_goid.
  • Subcellular_localization -- Subcellular_localization "exp_subcellloc"-- bigtext, details on subcellular localization.
  • Rel LS -- "exp_rellifestage" dropdown on part_of and happens_during
  • Life_stage -- Life_stage "exp_lifestage" Convert the life stage IDs into names from the obo_name_lifestage -- multiontology like in the phenotype OA and picture OA
  • Species -- "exp_species"
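The way a new Expr_pattern name is picked (see the Expr_pattern field above) can be sketched as follows. This is an illustration in Python, not the OA code; the function name is hypothetical, and starting at Expr1 when no name exists yet is an assumption.

```python
import re


def next_expr_name(existing_names):
    """Return 'Expr' + (highest existing number + 1)."""
    numbers = []
    for name in existing_names:
        # Only entries that begin with "Expr" followed by digits count.
        match = re.match(r"Expr(\d+)", name)
        if match:
            numbers.append(int(match.group(1)))
    # Assumption: with no existing entries, numbering starts at 1.
    return f"Expr{max(numbers, default=0) + 1}"


# A manually entered high number pushes the next automatic name past it.
print(next_expr_name(["Expr12000", "Expr9999998"]))
```

This also shows why a manually entered 'Expr9999998' makes the next new row 'Expr9999999': only the highest number matters, not how many objects exist.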

On Nov 3rd 2014 we added 4 fields that will not be dumped yet but will be used to aid granular curation (relations were also implemented in Feb 2015):

  • Qualifier LS -> multiontology on life stages -> exp_qualifierls dependent_on
  • GR Anatomy -> multiontology on anatomy terms -> exp_granatomy
  • GR LS -> multiontology on life stages -> exp_grlifestage
  • Rel Cell Cycle -- "exp_relcellcycle" dropdown on part_of, independent_of, happens_during, dependent_on
  • GR Cell Cycle -> multiontology on GO -> exp_grcellcycle

Juancarlos parsed .ace dump from WS226: 5518 anatomy_term lines without a #Qualifier at all in expr_no_qualifier

2703 anatomy_term lines with #qualifier and extra text in expr_data_with_extra_anatomy. expr_data_with_extra_anatomy_categorized 796 unique text-expr linked to various anat_terms in expr_data_with_extra_anatomy for example, look at "Expressed iin ventral male specific muscles." which has a unique Expr to multiple anat_terms ; or "1 neuron" linked to multiple different expr / anat_term
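For illustration, this is roughly how Anatomy_term lines with and without a #Qualifier could be separated when reading a .ace dump. It is a sketch in Python, not the parser that was actually used; the function name is hypothetical, and it assumes that each Anatomy_term line has the form: tag, quoted ID, then optionally Certain / Uncertain / Partial and text.

```python
QUALIFIERS = ("Certain", "Uncertain", "Partial")


def split_anatomy_lines(ace_text):
    """Return (lines with a qualifier, lines without a qualifier)."""
    with_qualifier, without_qualifier = [], []
    for line in ace_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "Anatomy_term":
            continue  # not an Anatomy_term line
        # fields[1] is the quoted ID; a qualifier, if any, comes right after.
        if len(fields) > 2 and fields[2] in QUALIFIERS:
            with_qualifier.append(line)
        else:
            without_qualifier.append(line)
    return with_qualifier, without_qualifier


sample = "\n".join([
    'Expr_pattern : "Expr12000"',
    'Anatomy_term "WBbt:0004575" Certain',
    'Anatomy_term "WBbt:0006748"',
    'Anatomy_term "WBbt:0003851" Partial "1 neuron"',
    'Life_stage "WBls:0000264"',
])
with_q, without_q = split_anatomy_lines(sample)
print(len(with_q), "with qualifier,", len(without_q), "without")
```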

Tab2

  • Type -- exp_exprtype -- multidropdown select from: Antibody, Reporter_gene, In_situ, RT_PCR, Northern, Western but this is not possible because we have text associated to those values. For J, would be ideal to have a dropdown and once we choose from the dropdown we should have a text box associated with it. When populating this field from .ace file, always add whether or not there is text.
  • Antibody_Text -- Antibody "exp_antibodytext" -- bigtext " this tag was used 462 times and has text associated -> not possible just to toggle Call this Antibody Text or Antibody Type or Antibody Method so that the antibody objects below can be just 'Antibody' ? -- J good idea. Antibody text is fine D
  • Reporter_gene_Text -- Reporter_gene "exp_reportergene" -- bigtext " this tag was used 7273 times and has been used twice for the same object! -> We need a separator between lines. We will add lots of text and would be good to have that text split into parts. Details on reporter gene construct. Multiline Not sure what you mean by multiline, if you mean the .ace file should have the tags multiple times (yes)we'd have to decide what the separator would be, you'd type the separator manually, and we'd have the dumper split on it -- J yes, I thnk this is the way to go is to add a separator manually D okay, we've pretty much always used | so just use that to separate entries, and let me know when we write the dumper to split on | and print out data in different tags. -- J great, I'll use the pipe to separate entries. I put a note at the end of the wiki a reminder for you when you write the dumper. Likewise maybe Reporter Gene Text if you think there will ever be a Reporter Gene field holding WBGenes -- J fine here too to have reporter gene text D Well, it's up to you, the table and label are as in the beginning of this line, but if you think that there'll be an ontology field of genes with a similar label, then we should change it -- J
  • In_Situ -- In_Situ "exp_insitu" -- bigtext " this tag was used 434 times and has always text -> not possible just to toggle
  • RT_PCR -- RT_PCR "exp_rtpcr" -- bigtext " this tag was used 165 times has text associated -> not possible just to toggle
  • Northern -- Northern "exp_northern" -- bigtext " this tag was used 347 times and has text or just Northern label -> not possible just to toggle
  • Western -- Western "exp_western" -- bigtext " this tag was used 19 times and has always text -> not possible just to toggle

all those above are the values of "type" right ? right D From the Reporter_gene description, does this mean you need to add text to this dropdown ? Do you want a "type" dropdown and a "type text" bigtext ? yes, would be great to be able to select one of the above with a dropdown and, once selected have a bigtext box next to it D. Well, we can have a Type multidropdown, and a Type_text bigtext, but each of the types you pick in the multidropdown won't be associated with anything specific in the big block of bigtext. If you wanted to have associations, you'd have to pick RT_PCR and Antibody (for example) in the multidropdown then in the bigtext you'd have to type RT_PCR <some rtpcr text> | Antibody <some antibody text> using the pipe ( | ) as a divider to separate the different things. At this point there's no point in having a multidropdown because you're typing everything in the bigtext field anyway. If you want to do things this way, add a "Type_text" bigtext field. I would instead suggest that if you want a tag + text associated with each other, you get rid of "Type" and make a lot of toggle_text fields, one for each of the types, then you could just click the toggle and type the text. We should probably talk about this in person since I'm not sure how you were originally picturing it working - J We will talk in person but both your suggestions would work. Suggestion 1 to have "Type_text" bigtext field and suggestion 2 to click the toggle and type the text. The final thing I want to have dumped in the .ace file is e.g. Northern "text" or In_situ "text". as long as we achieve that it does not make any difference :) D

We will have a multidropdown on the values above AND we will have bigtext fields for each of the values above. D&J decided this on March 21


  • Picture -- exp_picture -- Multiontology on Picture. We will remove this tag: Picture objects will be created in Picture OA and XREF to Expr_pattern. They will not be entered here. Removed from OA -- J We removed Pictures from Expr_pattern as they are XREF'd to it
  • Picture flag -- exp_pictureflag -- toggle notify picture person with a cronjob every 2 weeks. We keep this even if we remove the Picture tag

Notify picture person with a cronjob when there is a new picture to curate.

Notify how ? -- J

I put that note for myself but in the long run it would be good to have a way to notify other curators when there is a new object they should curate. For the Expr_pattern OA this applies to Picture, transgene and antibody D

It's still unclear to me how curators should get notified that there's a new value. We should probably talk about this. If this is something that "would be nice, but isn't important" but is still necessary for this field to exist, then okay, we don't have to talk about it. But if it turns out that we set it up in a way that won't work, I'm not going to want to talk about it after all the code's done and rewrite the code. Of course, we're not doing anything yet, we're just talking about how we will do this _eventually_ so there's no huge rush to talk about it -- J.

Again, it will be best to talk in person about it but I think we could set up something like a "New object cgi" so that once I see a new antibody that needs to be annotated I fill in a field and it generates a form that keeps track of the new objects that need attention. Hard to explain in a written form. Karen showed me something similar for phenotypes e.g. http://tazendra.caltech.edu/~postgres/cgi-bin/new_objects.cgi. We will see it 'live' the first time you stop by.

We set up flagging for Picture, Antibody and Transgene and the persons responsible for those data types are notified with a cronjob every 2 weeks. We will see with time if the cronjob should be more frequent.

  • Antibody_info -- Antibody_info "exp_antibody" -- multiontology on antibodies
  • Antibody flag -- exp_antibodyflag -- toggle -> notify antibody person with a cronjob every 2 weeks
  • Pattern -- Pattern "exp_pattern" -- bigtext, details on tissue distribution. Multiline
  • Remark -- Remark "exp_remark" -- bigtext, if any comments required. Multiline
  • Transgene -- Transgene "exp_transgene" -- multiontology on transgenes.
  • Construct -- Construct "exp_construct" -- multiontology on constructs.
  • Transgene flag -- exp_transgeneflag -- toggle -> notify transgene person with a cronjob every 2 weeks
  • Sequence_feature -- exp_seqfeature -- multiontology on Features (WBsfIDs)
  • Curator -- exp_curator -- Multiontology on people
  • No dump -- exp_nodump -- Toggle Expr_pattern objects not to dump. If an Expr_pattern object is flagged as no dump, don't dump any data for that pgid, nor any other pgid that corresponds to the Expr_pattern object. (Read all exp_nodump + exp_name into a hash of Expr_patterns to not-dump.)
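The No dump rule above (skip every pgid of an Expr_pattern object as soon as one of its pgids is flagged) can be sketched like this. The Python below is illustrative only; the function name and the shape of the inputs (a pgid -> name mapping for exp_name and a set of flagged pgids for exp_nodump) are assumptions, not the dumper's real data structures.

```python
def pgids_to_dump(exp_name, exp_nodump):
    """Return the sorted pgids whose Expr_pattern object may be dumped.

    exp_name:   {pgid: Expr_pattern name}
    exp_nodump: set of pgids where the No dump toggle is on
    """
    # Collect the Expr_pattern names that must not be dumped at all.
    blocked_names = {exp_name[pgid] for pgid in exp_nodump if pgid in exp_name}
    # Keep only pgids whose object name is not blocked.
    return sorted(pgid for pgid, name in exp_name.items()
                  if name not in blocked_names)


# Expr7477 has two rows; flagging one of them blocks both.
rows = {101: "Expr7477", 102: "Expr7477", 103: "Expr7595"}
print(pgids_to_dump(rows, {102}))
```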

Tab3

  • Protein_description -- Protein_description "exp_protein" -- text (30 objects)

NB: for clone and strain we need to change the multiontology into free text because strain and clone lists are not maintained. 01.11.2012.

Clone and Strain lists could be taken from spica from /home/citpub/arun/wb_entities/known_entities

None of these have any Term Info (nor synonyms). If you need either of those you'd have to query WS for it, Karen probably knows how, she does it for other objects -- J

ok, I don't think I'll need a term info and I need it mainly to parse old data which have a clone attached. so for now it is fine as it is D

ok, I'll change the parser to read these. After reading below it's unclear how I should change the parser for strain / clone, we should talk -- J

Karen will generate a file with clone objects and term information, J will have to update some scripts when the file is ready -- J

These lists are kept updated with what is in acedb daily using the following cronjob: 06 23 * * * cd /home/citpub/arun/wb_entities; ./01getEntities.pl

In July 2014 there was a change in Hinxton that affected the clone list. From Juancarlos:
Thanks Paul D.  I've switched the script to look at clones2.ace.gz
instead of  clones.ace.gz  and the data seems to have read in fine.  
Karen / Daniela, can you check that everything you want in there is
there (meaning in Term Info, I imagine all the objects are there).

The script is not in a repo, but I've symlinked it so it shows here 
http://tazendra.caltech.edu/~postgres/out/geneace/nightly_geneace.pl
so it will always be the current version there.  The best thing though
would probably be to look at the wiki, to see what the script is
supposed to do, and I don't know where or if there is a wiki for it.
Karen, do we have one ?

Clone and Strain lists could be taken from spica from /home/citpub/arun/wb_entities/known_entities. These lists are kept updated with what is in acedb daily using the following cronjob: 06 23 * * * cd /home/citpub/arun/wb_entities; ./01getEntities.pl

Note that Strain is a text field under transgene and phenotype -- J

I see. We can leave it as text for now D

Well, it's up to you, if you want Clones to be an autocomplete, I'll have to change the parser anyway, so we can make strains a multidropdown. I'm just letting you know that it's text for other OA configs, so maybe you should talk with them about changing their data, or what they want to show in Term Info -- J

Karen will generate a file with strain objects and term information, J will have to update some scripts when the file is ready -- J


  • Sequence -- Sequence "exp_sequence" -- text (13 objects) F54E2 2x (clone), R05D8 2x (clone), Y38B5A (clone), "Z28375" -C "EMBL Z28375" (sequence), "Z28376" -C "EMBL Z28376" (sequence), "Z28377" -C "EMBL Z28377" (sequence), R11H6 (clone), Y40H4A (clone), U14525, C47G2 (clone), Z32673 (sequence). We should consolidate these objects with Clones or Genes not sure what this consolidation means -- J. We will keep these objects as text in the beginning (this is to parse into old Expr_pattern data) but Wen and I have to find a way to get rid of this category in the long run and merge the Sequence with the clone, when possible. D okay, we're not working on this OA for a while yet, so if that gives you time to clean up this data, that'd be good. otherwise we can do it down the line -- J for the moment we will leave the text field and see how it goes. If I manage to clean it up before I'll tell you
  • MovieURL -- MovieURL "exp_movieurl" -- (32 objects) text ? -- J yes D
  • Laboratory -- Laboratory "exp_laboratory" -- ontology (17 objects) There's an ontology of laboratories used for 3 OAs, if you want to use that. The labs are not updated though, so if you want to use "new" labs, text is fine -- J great, then we can use the laboratory ontology D I've changed the type to ontology, if you want multiontology, go ahead and change it -- J ontology is good. No values with multiple labs found so far. Phenotype calls this 'Laboratory', but Antibody and Transgene call this 'Location', do you know why that is ? It would be nice to name them all the tables the same ; I don't recall who the curators are for antibody and transgene. -- J you are right :). In the expr_pattern model is Laboratory and honestly I like it much better than location -which is too ambiguous... Leave it as laboratory D Great =) I'd like to talk to antibody and transgene people to rename the postgres table -- J
  • Variation -- Variation "exp_variation". Multiontology on variations

Tab4

  • Contact -- exp_contact -- ontology on persons
  • e-mail -- exp_email -- small text
  • Co-authors -- exp_coaut -- multiontology on persons
  • Micropublication -- exp_micropublication -- toggle
  • Funding -- exp_funding -- bigtext

Microarray_results

The field Microarray_results has been added to the model for WS247. This will allow mapping to gene for other species (remanei, briggsae, japonica) coming from the Yanai study. Hinxton is mapping Microarray_results to Gene on the fly.

Make sure in the future not to add Microarray_results to C. elegans expression objects, to avoid overwriting any curated Gene references - DR 01-07-2015


Model changes to consider for the future

In the future we will get rid of the following tags: CDS, Sequence, Pseudogene and Protein, and we will propose a model change for that. We will also get rid of Protein_description, Cell, Cell_group. Cell and Cell_group tags deleted for WS250 -DR 06102015

We should add smFISH (single molecule Fluorescent In Situ Hybridization).

Daniela, think about whether you want to have Author, Date and Curated_by removed from the OA and stored in a separate file that will be read automatically whenever dumping the OA data. This will make the OA faster as there will be fewer fields. Yes, we will keep Author, Date, and Curated_by in a separate file. D 052411

In July 2014 we discussed removing the Curated_by data. We generated a -D file for Citace minus and deposited it on CitaceMinus for the WS245 upload.

/Users/danielaraciti/Desktop/Expression_pattern curation/Curated_by/Curated_by.ace.edited

Also, in the OA all the objects that had a Curated_by HX tag are now assigned to Sylvia Martinelli, the one who historically set the Curated_by tag in the model. The list of pgids that were changed is here: /Users/danielaraciti/Desktop/Expression_pattern curation/Curated_by/Curated_by_hinxton.rtf

obsolete fields

  • Cell -- exp_cell -- text (26 objects)-> when this is live consolidate these objects with the Anatomy_term field We have no autocomplete on Cell, you'd have to create a list of objects / term info -- J ok leave it as text now, I will have to go to see those objects one by one and consolidate with anatomy terms. D We don't have to leave it as text, if you can come up with a list of Cell objects, the way you have strain and clone. You can talk to Karen about how she generates data from acedb or aql queries or something -- J we will not populate the Cell field at all. Daniela add manually terms associated with cell to Anatomy term field (file with mapping is cells under Files from Wen) done DR 06062011 or:

Expr_pattern : "Expr7477"
Cell "P3.p" Certain
Cell "P4.p" Certain
Cell "P5.p" Certain
Cell "P6.p" Certain
Cell "P7.p" Certain
Cell "P8.p" Certain

done DR 06062011

Expr_pattern : "Expr7595"
Cell "CANL" Uncertain
Cell "CANR" Uncertain

done DR 06062011

Expr_pattern : "Expr7605"
Cell "M4" Certain

done DR 06062011

Expr_pattern : "Expr7632"
Cell "AVG" Certain
Cell "M5" Certain
Cell "PVT" Certain
Cell "PVCL" Uncertain
Cell "PVCR" Uncertain
Cell "PVNL" Uncertain
Cell "PVNR" Uncertain
Cell "PVQL" Uncertain
Cell "PVQR" Uncertain

done DR 06062011

Expr_pattern : "Expr7691"
Cell "P3.p" Certain
Cell "P4.p" Certain
Cell "P5.p" Certain
Cell "P6.p" Certain
Cell "P7.p" Certain
Cell "P8.p" Certain

done DR 06062011

Expr_pattern : "Expr8715"
Cell "M.dlpa" Certain
Cell "M.drpa" Certain

done DR 06062011

  • Anatomy certain -- exp_certain -- multiontology. Controlled vocabulary found here: https://github.com/raymond91125/Wao/raw/master/WBbt.obo (same as in Picture OA). We need to have 3 different Anatomy term boxes, one for the Partial, one for the Certain and one for the Uncertain Qualifiers. We also have to think about how to implement the text options for each anatomy term. Normally the text would go with the Partial qualifiers. What does the 3 boxes thing mean ? -- J We should discuss this matter in person because for each anatomy term that I will enter I should also add a qualifier. One box is called Certain, one is called Uncertain, and the other Partial. The meaning is related to the expression pattern: e.g. for expression in a subset of neurons I will click Partial. For expression in the K cell I will click Certain. For possible expression in the Pharynx I will click Uncertain. D It sounds like you want 2 fields, this one + a dropdown field with 3 values, we'll talk in person / skype -- J
  • Anatomy Partial -- exp_partial -- multiontology.
  • Anatomy Uncertain -- exp_uncertain -- multiontology.
  • Anatomy no qualifier-- exp_noqualifier -- multiontology. We added this field because when we parsed the old expr_pattern data (WS226) 5518 anatomy_term lines did not have a #Qualifier.

The following fields will not be imported into the OA but will remain in Citace Minus: Author, Date and Curated_by (also discussed with Wen). Attention: when deleting Expr_pattern objects from Citace Minus, be sure not to delete those.

  • Author -- exp_author -- Text separate authors by pipe What is this the author of ? Should it just be 'Person' ? -- J yes you are right :) but in the model is listed as author and I guess it was used for author submissions (large scale). Let's leave it as Author and put a multiontology on people. If in the model it's listed as Author to ?Person objects, that's kind of bad because it doesn't make sense. If in the model it's listed as Author to ?Author objects, then we can't enter people because we'd be entering WBPerson objects into the ?Author class -- J
  • Date -- exp_date -- text (2617 objects)
  • Curated_by -- exp_curatedby -- text (6228 objects) Not Curator, meaning WBPerson ? The curator field is required already, but this is a different thing ? -- J. no, this is a legacy thing, the values are only Hinxton and Caltech. Wen would like to get rid of it eventually but for the moment we are keeping it there D ah, ok -- J

Tags used only once that should be fixed

  • Expressed_in - text 1 entry. No info attached to this term. Left out DR 06062011
  • Protein - text 1 entry could be put in Protein_description. Expr1941 done DR 06062011
  • Pseudogene - text (1 object) Expr111 done DR 06062011
  • Homol_homol tag is used in Chronograms -> we will not include Chronograms in the OA.

Daniela will enter them in the remarks as there is only 1 entry per tag. done DR 06062011. Discussed it with Wen -May 9th. D. Daniela need to write which are Juancarlos, there is a bunch of objects that have a Strain attached in the Remarks instead of in the Strain tag. Would be good to consolidate them

Comments for Parsing ExprCitace226 into OA

Parsing files in /home/postgres/work/pgpopulation/exp_exprpattern. Many entries for Anatomy_term don't have one of Certain/Partial/Uncertain. We leave them without the qualifier.

Chronogram tags

Right_priority Localizome Show_up_strand GFF_source Width Picture Reporter_gene Reference Gene Allow_misalign GFF_feature Transgene Homol_homol Remark Strain Colour Curated_by


The script to get the tags (e.g. from ExprWS221.ace or from Chronograms.ace) was written by Yuling, is called get_tags.pl and is located under desktop/Varia_protocols/get_tags
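The contents of get_tags.pl are not reproduced on this page, so the following is not Yuling's script. It is only a Python sketch of how a tag list like the ones above could be collected, assuming the usual .ace layout: objects separated by blank lines, a first line of the form Class : "name", // comment lines, and the tag as the first word of every other line.

```python
import re


def get_tags(ace_text):
    """Return the distinct tags in an .ace dump, in order of first use."""
    tags = []
    # Objects are assumed to be separated by blank lines.
    for block in re.split(r"\n\s*\n", ace_text.strip()):
        lines = [ln for ln in block.splitlines()
                 if ln.strip() and not ln.lstrip().startswith("//")]
        # The first remaining line is the 'Class : "name"' header; skip it.
        for line in lines[1:]:
            tag = line.split()[0]
            if tag not in tags:
                tags.append(tag)
    return tags


sample = (
    'Expr_pattern : "Expr7605"\n'
    'Gene "WBGene00000001"\n'
    'Pattern "some text"\n'
    '\n'
    'Expr_pattern : "Expr7595"\n'
    '// a comment line\n'
    'Gene "WBGene00000002"\n'
    'Remark "some remark"\n'
)
print(" ".join(get_tags(sample)))
```

The gene IDs in the sample are placeholders, not the genes actually attached to those objects.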

We will not include Chronograms in Expr_OA anyway as they are one time large scale exp. We have 2084 chronograms

Notes

when J will write the dumper for Reporter_gene remember to split on | for bigtext fields and print out data in different tags. Daniela check if this applies to other entries. checked, applies to bigtext fields. Daniela, that makes sense for most bigtext fields, but does it make sense for Qualifier Text ? If you have multiple Qualifier Text values, you'd group it in different OA rows to match the multiontology Qualifiers, right ? None of the current data has pipes. -- J Juancarlos, let's leave the Qualifier text without pipes--D k
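As a reminder of what the split on | should produce, here is a small sketch in Python (the real dumper is not shown on this page, and the function name is hypothetical). It assumes each piece becomes its own tag line with the text in double quotes; escaping of quotes inside the text is left out.

```python
def dump_bigtext(tag, value):
    """Split a bigtext value on '|' and return one .ace line per piece."""
    pieces = [piece.strip() for piece in value.split("|")]
    # Empty pieces (e.g. from a trailing pipe) are skipped.
    return [f'{tag} "{piece}"' for piece in pieces if piece]


for line in dump_bigtext("Reporter_gene", "first construct text | second construct text"):
    print(line)
```

Qualifier Text is not passed through this function, since it was agreed above to leave it without pipes.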

to fix manually

  • INVALID DATA antibody [WBPaper00032450]:capg-1 Expr8708
  • INVALID DATA antibody [cgc3002]:beta-filagenin Expr1442
  • INVALID DATA antibody [cgc4387]:hsp-16.2 Expr1117
  • INVALID DATA antibody [cgc6057]:daf-21 Expr2687

  • INVALID DATA goid GO:0000141 Expr3919 Done DR06062011
  • INVALID DATA goid GO:0008221 Expr7871 Done DR06062011
  • INVALID DATA transgene Is001 Expr2646 Done DR06062011
  • INVALID DATA transgene Is007 Expr2646 Done DR06062011
  • INVALID DATA transgene leals30 Expr9151 Done DR06062011
  • INVALID DATA transgene pZMI.1In1 Expr725 Done DR06062011
  • INVALID DATA transgene pZMI.1In2 Expr725 Done DR06062011

Need to correct the expression pattern transgene name

  • Is001 -> WBPaper00006024_Is001 for Expr2646 WBPaper00006024 Done DR06062011
  • Is007 -> WBPaper00006024_Is007 for Expr2646 WBPaper00006024 Done DR06062011
  • pZMI.1In1 -> WBPaper00002501_In1 for Expr725 WBPaper00002501 Done DR06062011
  • pZMI.1In2 -> WBPaper00002501_In2 for Expr725 WBPaper00002501 Done DR06062011
  • Add leals30 Expr9151 WBPaper00037728 Done DR06062011


Need to correct the expression pattern GO name

  • GO:0000141 is now GO:0032432 Done DR06062011
  • GO:0008221 is now GO:0016529 Done DR06062011

There was a list of Anatomy term objects with invalid IDs. this is the mapping for the new ids:

  • Old ID New ID
  • WBbt000:6748 WBbt:0006748
  • WBbt:0003852 WBbt:0003851
  • WBbt:0004397 WBbt:0008116
  • WBbt:0004398 WBbt:0008111
  • WBbt:0004401 WBbt:0004392
  • WBbt:0004459 WBbt:0003664
  • WBbt:0004514 WBbt:0008052
  • WBbt:0004515 WBbt:0008050
  • WBbt:0004717 WBbt:0008046
  • WBbt:0004718 WBbt:0008051
  • WBbt:0004719 WBbt:0008049
  • WBbt:0004720 WBbt:0008047
  • WBbt:0004721 WBbt:0008045
  • WBbt:0004722 WBbt:0008044
  • WBbt:0005099 WBbt:0005830
  • WBbt:0005211 WBbt:0005801
  • WBbt:0005228 WBbt:0005214
  • WBbt:0005323 WBbt:0005831
  • WBbt:0005814 WBbt:0006909
  • WBbt:6789 WBbt:0006789
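A mapping like the one above could be applied to a .ace dump along these lines. This is a sketch under assumptions: remap_anatomy is a hypothetical helper, only three pairs from the list are loaded, and the sample entry is made up.

```perl
use strict;
use warnings;

# Three pairs copied from the list above; the rest would be added the same way.
my %old_to_new = (
  'WBbt000:6748' => 'WBbt:0006748',
  'WBbt:0003852' => 'WBbt:0003851',
  'WBbt:6789'    => 'WBbt:0006789',
);

# Hypothetical helper: rewrite the ID on every Anatomy_term line of a .ace
# text, leaving IDs that are not in the mapping untouched.
sub remap_anatomy {
  my ($ace) = @_;
  $ace =~ s{^(Anatomy_term\s+")([^"]+)"}{ $1 . (exists $old_to_new{$2} ? $old_to_new{$2} : $2) . '"' }gme;
  return $ace;
}

# Made-up entry for illustration.
my $sample = "Expr_pattern : \"Expr1\"\nAnatomy_term\t\"WBbt:6789\" Certain\nAnatomy_term\t\"WBbt:0005772\"\n";
print remap_anatomy($sample);
```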

all OK

Importing the large scale Expression_pattern left on Citace Minus into OA

File is on tazendra WS232LargeScaleExpr.ace

-D file for the import generated by Juancarlos /home/postgres/work/pgpopulation/exp_exprpattern/20120502_largescale/DashDWS232LargeScaleExpr.ace

There were only "Date" data, no Curated_by or Author. We kept the "Date" values on Citace Minus as we did for the previous import. The other field we ignored was Picture, but we did not keep pictures in Citace Minus as we get them via the Picture OA.

-D file deposited in CitaceMinus Data_for_Citace_minus/Data_from_Daniela on May 9th 2012


Serial numbers for large scale imports

Itai Yanai WBPaper00041190 C elegans (Expr starting with 101 and 102)
         Expression Expr1010178           to     Expr1029229
         Picture    WBPicture0001011201   to     WBPicture0001030252  

David Miller Wormviz (Expr starting with 103 and 104)
         Expression Expr1030000           to     Expr1040545
         No pictures associated to the study

Itai Yanai WBPaper00041190 Other species (briggsae, japonica, remanei. They never transferred brenneri) (Expr starting with 105 till 111)
         Expression Expr1050000          to     Expr1118096
         Picture    WBPicture0001030253   to     WBPicture0001098349 
Gap in numbering expression objects and pictures (Expr1118097 till Expr1142791, WBPicture0001098350 till WBPicture0001123044) 
to leave the slot for the missing brenneri data, in case they will submit

Itai Yanai WBPaper00046121
         Expression Expr1142792          to     Expr1163308
         Picture    WBPicture0001123045  to WBPicture0001143561
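The sizes of the ranges above can be checked with a few lines of Perl. This is a small sketch; range_size is a hypothetical helper that only assumes the IDs end in a number.

```perl
use strict;
use warnings;

# Number of IDs in an inclusive range such as Expr1010178 .. Expr1029229.
sub range_size {
  my ($first, $last) = @_;
  my ($lo) = $first =~ /(\d+)$/ or die "no number in $first\n";
  my ($hi) = $last  =~ /(\d+)$/ or die "no number in $last\n";
  return $hi - $lo + 1;
}

print range_size('Expr1010178', 'Expr1029229'), "\n";                   # 19052 (Yanai, C. elegans)
print range_size('Expr1030000', 'Expr1040545'), "\n";                   # 10546 (Miller)
print range_size('Expr1050000', 'Expr1118096'), "\n";                   # 68097 (Yanai, other species)
print range_size('Expr1118097', 'Expr1142791'), "\n";                   # 24695 (slots left for brenneri)
print range_size('WBPicture0001123045', 'WBPicture0001143561'), "\n";   # 20517 (Yanai 2015)
```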

Itai Yanai large scale import -WBPaper00041190

In order to display pictures of the expression time course we needed to generate expression objects. The objects (Expression and Picture) will be deleted once Wen has finished curating the microarray data for all species described in the paper and once we have a way to generate images of expression on the fly, with data retrieved directly from SPELL.

For now Daniela and Juancarlos have generated 2 .ace files, one for pictures and one for expression. The files are on CitaceMinus and are called expr_pattern_Yanai.ace and pictures_Yanai.ace

Expression pattern and Picture objects were given high numbers so that, when the new display system is in place, they can be deleted without affecting anything in OA.

Expression objects go from Expr1010178 to Expr1029229

Picture objects go from WBPicture0001011201 to WBPicture0001030252

there are 19052 objects for each class.

Files are also located here /Users/danielaraciti/Desktop/Citace_upload/Citace Minus Yanai

On December 5th 2014 pictures for the other species (briggsae, remanei and japonica) were added too

Expr from Expr1050000
Pictures from WBPicture0001030253 on

Additional info on Yanai_Instructions_other_species2 on Lario

total number of objects: 20294 briggsae+ 21908 japonica+ 25895 remanei = 68097

  • the objects will go in WS247

Hinxton will generate the WBGene name on the fly according to Microarray_results


TOTAL Yanai import elegans + other species: 87149

Itai Yanai 2015 large import -WBPaper00046121

Files here /Users/danielaraciti/Desktop/Expression_pattern curation/Large scale/yanai_2015

  • Uploaded 20,517 objects for expression and pictures
  • Expr from Expr1142792 till Expr1163308
  • Pictures from WBPicture0001123045 till WBPicture0001143561
  • NB: there is a gap in numbering expression objects (Expr1118097 till Expr1142791). This is because Yanai's lab has not submitted the brenneri pictures yet. We inquired a few times but they were never transferred. We left the numbers available for the future. The brenneri.ace files to be transferred to spica once they submit the images are located here (Lario):
  • /Users/danielaraciti/Desktop/brenneri

TransgeneOme import

We are going to import expression data (Images, constructs, and annotations) from the TransgeneOme project -Sarov et al., Cell, 2012. WBPaper00041419.

TransgeneOme import

David Miller tiling arrays import -WBPaper00037950

We want to add links to Wormviz for each gene in order to display graphic expression profiling from tiling arrays. We are going to request a model change for Expr_pattern, adding a DB_INFO tag.

DB_INFO ?Database ?Database_field Text

We will also request the inclusion of Microarray and Tiling Array for Type


Type	Reporter_gene ?Text
			In_situ Text
			Antibody ?Text
			Northern Text // Wen [krb 030425]
                        Western Text  // Wen 
                        RT_PCR Text   // Wen
			Localizome ?Text //added by Wen
                        Microarray ?Microarray_experiment  // Daniela
                        Tiling_array ?Analysis// Daniela

This way it will be easier to filter out Yanai's and Miller's data for display in a separate widget, possibly called 'Expression profiling graphs'. Model change requested on 10-09-2013. Daniela and Juancarlos have generated a .ace file that was tested and read fine in acedb. More info on how the file was generated here: /home/acedb/draciti/Expr_pattern/Miller_import. Please note that the script generates "Tiling Array" in the Type. The actual tag is Tiling_array. D changed it manually in the .ace file. Since it was suggested during the model approval process to add ?Analysis for the tiling arrays, D modified the file and replaced it on CitaceMinus. The file is also stored here on Lario: /Users/danielaraciti/Desktop/Citace_upload/Citace Minus Miller/WBPaper00037950.ace

What the file looks like

Database : "Wormviz"
Name    "Wormviz"
URL     "http:\/\/www.vanderbilt.edu\/wormdoc\/wormmap\/Welcome.html"
URL_constructor "http:\/\/jsp.weigelworld.org\/wormviz\/tileviz.jsp?experiment=wormviz&normalization=absolute&probesetcsv=%s"

Expr_pattern : "Expr1030000"
Gene	"WBGene00000001"
Pattern	"Tiling arrays expression graphs"
Reference	"WBPaper00037950"
Tiling_array
DB_INFO	"Wormviz" "id" "WBGene00000001"

Expr_pattern : "Expr1030001"
Gene	"WBGene00000002"
Pattern	"Tiling arrays expression graphs"
Reference	"WBPaper00037950"
Tiling_array
DB_INFO	"Wormviz" "id" "WBGene00000002"

Object names from Expr1030000 to Expr1040545. 10,546 objects
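Entries of this shape could be produced with a loop like the one below. This is a sketch, not the script under Miller_import (which is not shown here): miller_entries is a hypothetical helper, and it writes the Tiling_array tag directly rather than "Tiling Array".

```perl
use strict;
use warnings;

# Hypothetical helper: build one Expr_pattern entry per gene, numbering
# the objects consecutively from $start.
sub miller_entries {
  my ($start, @genes) = @_;
  my $ace = '';
  my $n = $start;
  foreach my $gene (@genes) {
    $ace .= "Expr_pattern : \"Expr$n\"\n";
    $ace .= "Gene\t\"$gene\"\n";
    $ace .= "Pattern\t\"Tiling arrays expression graphs\"\n";
    $ace .= "Reference\t\"WBPaper00037950\"\n";
    $ace .= "Tiling_array\n";                          # the model tag, not "Tiling Array"
    $ace .= "DB_INFO\t\"Wormviz\" \"id\" \"$gene\"\n\n";
    $n++;
  }
  return $ace;
}

print miller_entries(1030000, 'WBGene00000001', 'WBGene00000002');
```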

EPIC detailed

Cell/time specific expression data have been generated by Wen /Users/danielaraciti/Desktop/Expression_pattern curation/Large scale/Murray/epic.ace

the folder contains also the digitized sulston tree, the files that John Murray sent with positive/negative calls and the lifestage.ace containing all the new life stages

the EPIC.ace file was uploaded on CitaceMinus for WS246

Deleting files from Citace Minus

After parsing the WS226 data into OA we dumped a .ace file for generating a -D file to delete objects from Citace Minus. To the file were added manually all the invalid objects found while parsing the data (e.g. old anatomy term IDs, old GO terms, invalid transgenes and antibody objects) See list in Data to fix manually in this wiki.
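The step from a dumped .ace file to a -D file can be sketched as below. This assumes one tag per line and no multi-line text values; ace_to_dash_d is a hypothetical helper, not the script that was actually used.

```perl
use strict;
use warnings;

# Hypothetical helper: turn .ace text into a -D file by prefixing every
# tag line with -D, leaving object headers and blank lines alone.
sub ace_to_dash_d {
  my ($ace) = @_;
  my @out;
  foreach my $line (split /\n/, $ace, -1) {
    if ($line =~ /^\s*$/ || $line =~ /^\w+ : "/) { push @out, $line; }   # blank line or object header
    else { push @out, "-D $line"; }                                      # tag line to delete
  }
  return join "\n", @out;
}

print ace_to_dash_d("Expr_pattern : \"Expr52\"\nSequence\t\"R11H6\"\nSequence\t\"Y40H4A\"\n");
```

Invalid objects found while parsing would still have to be appended by hand, as described above.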

Expression-paper association

For papers curated:

find Expr_pattern; follow Reference

For genes related:

find Expr_pattern; follow Gene


Dumper

The Sequence field did not dump correctly, e.g. Expr_pattern : "Expr980" Sequence "R05D8|F54E2". Need to fix it. Fixed 06162011


Module located here: /home/postgres/work/citace_upload/expr_pattern/get_expr_pattern_ace.pm

Script that calls the module located here: /home/postgres/work/citace_upload/expr_pattern/use_package.pl*

use lib qw( /home/postgres/work/citace_upload/expr_pattern );   # this line tells perl where to look for the module
use get_expr_pattern_ace;                                       # tells perl to use the module

my $outfile = 'expr_pattern.ace';
my $errfile = 'err.out';                                        # we did not set any rule for errors yet

open (OUT, ">$outfile") or die "Cannot create $outfile : $!\n";
open (ERR, ">$errfile") or die "Cannot create $errfile : $!\n";

my ($all_entry, $err_text) = &getExprPattern('all');            # uses the module to get all the Expr_pattern objects

print OUT "$all_entry\n";                                       # prints everything into the output expr_pattern file
if ($err_text) { print ERR "$err_text\n"; }                     # prints errors into the output error file

close (OUT) or die "Cannot close $outfile : $!";
close (ERR) or die "Cannot close $errfile : $!";


Module

package get_expr_pattern_ace;                  # name of the package
require Exporter;                              # exports so that other perl scripts can use it

our @ISA     = qw(Exporter);
our @EXPORT  = qw( getExprPattern );           # we are only exporting the getExprPattern subroutine
our $VERSION = 1.00;

use strict;
use diagnostics;
use DBI;

my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n";   # connect to postgres and the testdb database
my $result;
my %theHash;                                   # where all the data are going to be stored
my @tables = qw( name paper gene endogenous anatomy qualifier qualifiertext qualifierls goid subcellloc lifestage exprtype antibodytext reportergene insitu rtpcr northern western antibody pattern remark transgene construct curator nodump protein clone strain seqfeature sequence movieurl laboratory variation species );   # all the tables that have data
my @maintables = qw( paper gene anatomy goid subcellloc lifestage qualifierls exprtype antibodytext reportergene insitu rtpcr northern western antibody pattern remark transgene construct protein clone strain seqfeature sequence movieurl laboratory variation species );   # tables that have .ace tags

my $all_entry = '';                            # where all the .ace data is going to go
my $err_text = '';                             # where all the error data is going to go
my %nameToIDs;                                 # maps the expr_object id to PGID     # type -> name -> ids -> count
my %ids;                                       # list of PGIDs
my %pipeSplit;                                 # tables that need to split on pipes
$pipeSplit{subcellloc}++;
$pipeSplit{antibodytext}++;
$pipeSplit{reportergene}++;
$pipeSplit{insitu}++;
$pipeSplit{rtpcr}++;
$pipeSplit{northern}++;
$pipeSplit{western}++;
$pipeSplit{pattern}++;
$pipeSplit{remark}++;
$pipeSplit{sequence}++;
my %tableToTag;                                # mapping table to the .ace tag
$tableToTag{paper}         = 'Reference';
$tableToTag{gene}          = 'Gene';
$tableToTag{anatomy}       = 'Anatomy_term';
$tableToTag{qualifierls}   = 'Life_stage';
$tableToTag{goid}          = 'GO_term';
$tableToTag{subcellloc}    = 'Subcellular_localization';
$tableToTag{lifestage}     = 'Life_stage';
$tableToTag{exprtype}      = 'Special';
$tableToTag{antibodytext}  = 'Antibody';
$tableToTag{reportergene}  = 'Reporter_gene';
$tableToTag{insitu}        = 'In_situ';
$tableToTag{rtpcr}         = 'RT_PCR';
$tableToTag{northern}      = 'Northern';
$tableToTag{western}       = 'Western';
$tableToTag{antibody}      = 'Antibody_info';
$tableToTag{pattern}       = 'Pattern';
$tableToTag{remark}        = 'Remark';
$tableToTag{transgene}     = 'Transgene';
$tableToTag{protein}       = 'Protein_description';
$tableToTag{clone}         = 'Clone';
$tableToTag{strain}        = 'Strain';
$tableToTag{sequence}      = 'Sequence';
$tableToTag{movieurl}      = 'MovieURL';
$tableToTag{laboratory}    = 'Laboratory';


my %ontologyIdToName;                          # mappings for ids to names (only for life stage)

1;

sub getExprPattern {
  my ($flag) = shift;                          # can be all or the name for an expr_id

  &populateOntIdToName();                      # call the subroutine at the bottom of the page for life stage name mapping

  if ( $flag eq 'all' ) { $result = $dbh->prepare( "SELECT * FROM exp_name ; " ); }            # get all entries for type
    else { $result = $dbh->prepare( "SELECT * FROM exp_name WHERE exp_name = '$flag' ;" ); }   # get all entries for type of object name
  $result->execute();                          # execute the query
  while (my @row = $result->fetchrow) {        # it's going to do the following for every row of the query
    $theHash{object}{$row[0]} = $row[1];       # it's going to map the PGIDs to the Expr_ID
    $nameToIDs{object}{$row[1]}{$row[0]}++;    # for every Expr_ID we will get all the corresponding PGIDs
    $ids{$row[0]}++; }                         # list of all the PGIDs
  my $ids = ''; my $qualifier = '';            # if it looks for a specific subset of Expr_pattern it searches only for that subset of PGIDs from the %ids
  if ($flag ne 'all') { $ids = join"','", sort keys %ids; $qualifier = "WHERE joinkey IN ('$ids')"; }
  foreach my $table (@tables) {
    $result = $dbh->prepare( "SELECT * FROM exp_$table $qualifier;" );   # get data for table with qualifier (or not if not)
    $result->execute();
    while (my @row = $result->fetchrow) { $theHash{$table}{$row[0]} = $row[1]; }   # loops for all the values and stores them in the hash
  } # foreach my $table (@tables)

         my %e1 = &getData($table, $joinkey);

         my %e2 = &getData('qualifier', $joinkey);
         my %e3 = &getData('qualifiertext', $joinkey);
         my %e4 = &getData('qualifierls', $joinkey);
         my $l2_exists = 0; my $l3_exists = 0; my $l4_exists = 0;
         foreach my $e1 (sort keys %e1) {
           foreach my $e4 (sort keys %e4) {                            # dump anatomy to qualifierls in both directions for every crossproduct, if there is an anatomy.  2015 02 03
              $l4_exists++;
              $cur_entry{"$tag\t\"$e1\" Life_stage \"$e4\"\n"}++;
              $cur_entry{"Life_stage\t\"$e4\" Anatomy_term \"$e1\"\n"}++; }
           foreach my $e2 (sort keys %e2) {
             foreach my $e3 (sort keys %e3) {
               $cur_entry{"$tag\t\"$e1\" $e2 \"$e3\"\n"}++; $l3_exists++; }
             unless ($l3_exists) {
               $cur_entry{"$tag\t\"$e1\" $e2\n"}++; $l2_exists++; } }
           unless ( ($l2_exists) || ($l3_exists) || ($l4_exists) ) {
             $cur_entry{"$tag\t\"$e1\"\n"}++; } } }
       elsif ($table eq 'qualifierls') {                               # micropub data could have qualifierls without anatomy
         my %e1 = &getData($table, $joinkey);
         my %e2 = &getData('anatomy', $joinkey);
         if (scalar keys %e2 < 1) {                                    # if there is no anatomy data, dump each qualifierls (if there was anatomy it would have dumped above under the anatomy section)
           foreach my $e1 (sort keys %e1) {
             $cur_entry{"$tag\t\"$e1\"\n"}++; } } }

  foreach my $name (sort keys %{ $nameToIDs{object} }) {       # loops through all the names that are in the $nameToIDs{object}
    my $entry = ''; my $has_data;              # $entry has the .ace data for that expr_object. $has_data is a flag for objects that have data
    $entry .= "\nExpr_pattern : \"$name\"\n";  # adds to the .ace entry the header Expr_pattern : "Expr1234"
    my %cur_entry;                             # a hash for filtering things (duplicated rows for qualifier -partial certain uncertain- that overlap are excluded)
    foreach my $joinkey (sort {$a<=>$b} keys %{ $nameToIDs{object}{$name} }) {   # it loops through all the PGIDs for the current name
      next if ($theHash{nodump}{$joinkey});    # skips if the pgid has a NO DUMP flag
      foreach my $table (@maintables) {        # it loops through the main tables (the ones with the .ace tag)
        next unless ($tableToTag{$table});     # it skips it if there's no tag
        my $tag = $tableToTag{$table};         # gets the tag
        if ($table eq 'anatomy') {             # in case of anatomy it does the following
          my %e1 = &getData($table, $joinkey);             # gets the anatomy term list (based on the PGIDs)
          my %e2 = &getData('qualifier', $joinkey);        # gets the qualifier for the previous anatomy term list (based on the PGIDs)
          my %e3 = &getData('qualifiertext', $joinkey);    # gets the qualifier text for the previous anatomy term list (based on the PGIDs)
          my $l2_exists = 0; my $l3_exists = 0;            # by default there is no qualifier and no qualifier text
          foreach my $e1 (sort keys %e1) {                 # loops through all anatomy
            foreach my $e2 (sort keys %e2) {               # loops through all qualifier
              foreach my $e3 (sort keys %e3) {             # loops through all qualifier text
                $cur_entry{"$tag\t\"$e1\" $e2 \"$e3\"\n"}++; $l3_exists++; }   # qualifier and qualifier text found: adds the line to the filter for later printing and notes that it found a qualifier text
              unless ($l3_exists) {                        # if there is no qualifier text
                $cur_entry{"$tag\t\"$e1\" $e2\n"}++; $l2_exists++; } }         # but there is a qualifier: adds the line to the filter for later printing and notes that it found a qualifier
            unless ( ($l2_exists) || ($l3_exists) ) {      # if there is no qualifier nor qualifier text
              $cur_entry{"$tag\t\"$e1\"\n"}++; } } }       # adds to the filter just the Anatomy tag and data (e.g. Anatomy_term "WBbt:1234567")
        elsif ($table eq 'exprtype') {                     # it checks for expr_type
          my %entries = &getData($table, $joinkey);        # gets data for expr_type and PGID
          foreach my $entry (sort keys %entries) { $cur_entry{"$entry\n"}++; } }               # for each data it adds the data to the filter but does not add the .ace tag
        else {
          my %entries = &getData($table, $joinkey);        # gets data for every PGID and every other table that has a tag
          foreach my $entry (sort keys %entries) { $cur_entry{"$tag\t\"$entry\"\n"}++; } }     # for each data it adds the data and the .ace tag to the filter
      } } # foreach my $joinkey (sort {$a<=>$b} keys %{ $nameToIDs{$type}{$name} })
    foreach my $line (sort keys %cur_entry) { $entry .= $line; $has_data++; }   # for each line in the filter it adds it to the .ace entry and flags that it has data
    if ($has_data) { $all_entry .= $entry; }   # if it has data it adds this entry to all the entries
  } # foreach my $name (sort keys %{ $nameToIDs{$type} })
  return( $all_entry, $err_text );             # it returns all the results to the use_package script
} # sub getExprPattern

sub getData {                                  # get hash of values in this table
  my ($table, $joinkey) = @_;                  # gets the table and the PGID
  my %entries;                                 # it stores all the data for this table and PGID
  if ($theHash{$table}{$joinkey}) {            # if it has data
    my $data = $theHash{$table}{$joinkey};     # it gets the data
    unless ($table eq 'remark') {              # everywhere but not in the remark field
      if ($data =~ m/^\"/) { $data =~ s/^\"//; }               # it strips a leading "
      if ($data =~ m/\"$/) { $data =~ s/\"$//; } }             # it strips a trailing "
    if ($data =~ m/\//) { $data =~ s/\//\\\//g; }              # it escapes / with \
    if ($data =~ m/\r/) { $data =~ s/\r//g; }                  # it strips the ^M (carriage return) characters
    if ($data =~ m/\n/) { $data =~ s/\n/  /g; }                # it replaces line breaks with 2 spaces
    if ($data =~ m/^\s+/) { $data =~ s/^\s+//g; }
    if ($data =~ m/\s+$/) { $data =~ s/\s+$//g; }              # if it begins or ends with a space it gets rid of those
    my @data;                                  # this is an array for storing multiple data
    if ($data =~ m/\",\"/) { @data = split/\",\"/, $data; }    # if the data is a multiontology or multidropdown, it splits on the ","
      elsif ($pipeSplit{$table}) { @data = split/\|/, $data; } # otherwise if it is in the list of the pipe split tables it splits on the pipe
      else { push @data, $data; }                              # if it is neither of those it treats the data as the only entry
    foreach my $value (@data) {                                # for each of those multiple values
      if ($value =~ m/\"/) { $value =~ s/\"/\\\"/g; }          # if there is a " it adds a backslash to neutralize it for acedb
      if ($value =~ m/^\s+/) { $value =~ s/^\s+//g; }          # if the data begins with a space get rid of the space
      if ($value =~ m/\s+$/) { $value =~ s/\s+$//g; }          # if the data ends with a space get rid of the space
      if ($table eq 'lifestage') {                             # convert life stage ids to lifestage names. 2011 05 13
        if ($ontologyIdToName{$table}{$value}) { $value = $ontologyIdToName{$table}{$value}; } }   # if there's an ID to name mapping it uses the name instead of the ID
      if ($value) { $entries{$value}++; }                      # if after all of the above there is a value it adds it to a filter of values
    } }
  return %entries;                             # it returns all the data that it got for this table and PGID
} # sub getData

sub populateOntIdToName {                      # reads from obo_name_lifestage to get the mappings from life_stage id to name
  $result = $dbh->prepare( "SELECT * FROM obo_name_lifestage;" );
  $result->execute();
  while (my @row = $result->fetchrow) { $ontologyIdToName{'lifestage'}{$row[0]} = $row[1]; }
} # sub populateOntIdToName
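The rule used by the first anatomy branch (anatomy, qualifier and qualifier text only) can be restated without the database as a small standalone sketch. anatomy_lines is a hypothetical helper and the IDs in the usage lines are made up; it is meant to show which lines end up in the filter, not to replace the module code.

```perl
use strict;
use warnings;

# Hypothetical restatement: given the anatomy terms, qualifiers and
# qualifier texts of one PGID (three array refs), return the distinct
# .ace lines. A hash is used as the filter, so duplicates collapse.
sub anatomy_lines {
  my ($anatomy, $qualifier, $qualifiertext) = @_;
  my %lines;
  foreach my $term (@$anatomy) {
    my $found = 0;
    foreach my $q (@$qualifier) {
      if (@$qualifiertext) {                                    # qualifier plus qualifier text
        $lines{"Anatomy_term\t\"$term\" $q \"$_\"\n"}++ foreach @$qualifiertext; }
      else { $lines{"Anatomy_term\t\"$term\" $q\n"}++; }        # qualifier only
      $found++;
    }
    $lines{"Anatomy_term\t\"$term\"\n"}++ unless $found;        # bare term when there is no qualifier
  }
  return sort keys %lines;
}

print anatomy_lines(['WBbt:0000001'], ['Certain'], []);
print anatomy_lines(['WBbt:0000001', 'WBbt:0000001'], [], []);
```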

We have put an error check for dead genes and invalid papers on may 31st 2012:

      elsif ($table eq 'gene') {
        my %entries = &getData($table, $joinkey);
        foreach my $entry (sort keys %entries) {
          if ($deadObjects{gene}{$entry}) { $err_text .= "$name has dead gene $entry $deadObjects{gene}{$entry}\n"; }
            else { $cur_entry{"$tag\t\"$entry\"\n"}++; } } }
      elsif ($table eq 'paper') {
        my %entries = &getData($table, $joinkey);
        foreach my $entry (sort keys %entries) {
          if ($deadObjects{paper}{$entry}) { $err_text .= "$name has dead paper $entry $deadObjects{paper}{$entry}\n"; }
            else { $cur_entry{"$tag\t\"$entry\"\n"}++; } } }


sub populateDeadObjects {

$result = $dbh->prepare( "SELECT * FROM gin_dead;" ); $result->execute();
while (my @row = $result->fetchrow) { $deadObjects{gene}{"WBGene$row[0]"} = $row[1]; }
$result = $dbh->prepare( "SELECT * FROM pap_status WHERE pap_status = 'invalid';" ); $result->execute();
while (my @row = $result->fetchrow) { $deadObjects{paper}{"WBPaper$row[0]"} = $row[1]; }

} # sub populateDeadObjects

Historical Gene tag

Handling Dead Genes During Dump Process

The dumper script will now (as of May, 2013) run an automatic check for dead genes in any gene field. Any genes that are considered dead that are referenced in an Interaction object in the OA will be handled in the following manner:

1) If there is a replacement for the gene (i.e. the gene has merged into another gene), the dead gene will be dumped into a "Historical_gene" field in the .ACE file, the replacement gene will fill the original gene field. A comment will be added to the Historical_gene field via the #Evidence hash. The original gene field (now with the updated gene reference) will be printed with an "Inferred_automatically" tag after the gene. So, for example, if WBGene00001234 is now a dead gene that has been merged into WBGene00002345:

Gene  "WBGene00001234"

becomes

Gene  "WBGene00002345"  Inferred_automatically
Historical_gene  "WBGene00001234"  Remark  "Note: This object originally referred to WBGene00001234.
WBGene00001234 is now considered dead and has been merged into WBGene00002345. WBGene00002345 has 
replaced WBGene00001234 accordingly."

Notes:

Dead -> dead
Suppressed -> suppressed
merged_into WBGene -> merged
split_into -> split
looping through the genes where something happened to make sure they don't also point at something else
exp_gene
merged -> historical_gene + remark AND gene <gene> Inferred_automatically
dead -> historical_gene + remark
suppressed -> historical_gene + remark
split -> historical_gene + remark AND error message
normal ones -> just tag + value

Examples:

A split gene: WBGene00012507
A merged gene: WBGene00007524
A dead gene: WBGene00007814
A suppressed gene: WBGene00015490
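The handling summarised in the notes above can be sketched as one function. This is an illustration only: gene_lines is a hypothetical helper, the status is passed in rather than looked up, and the remark wording for dead, suppressed and split genes is assumed (only the merged wording appears above).

```perl
use strict;
use warnings;

# Hypothetical helper: return (.ace lines, error text) for one gene.
# $status is '', 'merged', 'dead', 'suppressed' or 'split';
# $new_gene is only used for merged genes.
sub gene_lines {
  my ($name, $gene, $status, $new_gene) = @_;
  return ("Gene\t\"$gene\"\n", '') unless $status;      # normal ones: just tag + value
  my $ace = '';
  my $err = '';
  my $remark = "Note: This object originally referred to $gene.";
  if ($status eq 'merged') {
    $remark .= " $gene is now considered dead and has been merged into $new_gene. $new_gene has replaced $gene accordingly.";
    $ace .= "Gene\t\"$new_gene\"\tInferred_automatically\n";
  } else {
    $remark .= " $gene is now considered $status.";     # wording assumed, not taken from the dumper
  }
  $ace .= "Historical_gene\t\"$gene\"\tRemark\t\"$remark\"\n";
  $err = "$name has split gene $gene\n" if $status eq 'split';   # split also raises an error message
  return ($ace, $err);
}

my ($ace, $err) = gene_lines('Expr1234', 'WBGene00001234', 'merged', 'WBGene00002345');
print $ace;
```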

Data parsing

File that was used for parsing is the WS226 dump and is located here: /home/postgres/work/pgpopulation/exp_exprpattern/ExprWS226.ace

There are 1802 objects without any Anatomy_term. I'm assuming this is okay -- J Yes, it is --D

What do we do with Marker objects ? Treat them the same as Expr_pattern objects ? -- J yes, treat the same --D

Life_stage in obo class have WBls:####### IDs, but data has lifestage names, is this bad data ? The OA only supports IDs (see phenotype, generegulation, picture OA) : can we convert the life stage names into WBls:#######? I asked Wen about this and she is fine with it --D Changed the parser to convert from name to ID, but still waiting until we talk to Karen

/home/postgres/work/pgpopulation/exp_exprpattern/invalid_ontology_values has many other objects that don't fit the ontologies. It would be best to either fix them in citace and redump, or to get mappings of bad-to-good values and put them in the parser. This was run on the sandbox, so if any values are real, the sandbox might not have all the values. -- J

I see, there are many objects with an invalid format for different classes. I will figure out what the problem was for each of them and get back to you. --D

  • 20 Anatomy terms having old IDs -> Daniela generated a mapping with new IDs.
  • 2 invalid objects for GO -> alerted Ranjana, waiting for answer.
  • 5 Antibody objects -> alerted Xiaodong; 2 fixed, 3 waiting for Wen's answer (did she create the objects already, or should we generate new ones?).
  • 37 transgene objects -> alerted Karen.

Strain and Clone don't have ontologies yet, once we have those we'll see if any data is bad -- J ok --D

Only looking at WBPictureID pictures, if we need to dump both ways, it will get conversions from the WBPictureID's name. -- J I am not sure I get this..D we talked about it

-D file for Citace Minus

When we tried to parse the -D file into Citace Minus the following errors occurred:

  • Pattern: 2 objects Expr98 did not parse in 2 pattern descriptions. Not in OA

I will add them manually in OA and -D those Done DR 06142011

  • Anatomy_term: 14 objects
  • 2 in Expr 120 checked OK only extra space at the end
  • 1 in Expr 1269 checked OK only extra space at the end
  • 1 in Expr 1569 checked OK only extra space at the end
  • 8 in Expr2812 checked OK only extra space at the end
  • 1 in Expr3211 checked OK only extra space at the end
  • 1 in Expr7467 checked OK only extra space at the end

We can delete them from Citace Minus, text is fine in OA. Done DR 06142011

  • Antibody_info: 4 objects

they are already on my list. Xiaodong should generate the objects. Will delete them Done DR 06142011 and add them manually in OA when ready. TODO

  • Reference: 3 objects

Expr_pattern : "Expr2916" Reference "WBPaper00006518"

Expr_pattern : "Expr2994" Reference "WBPaper00013501"

Expr_pattern : "Expr3715" Reference "WBPaper00025175"

Wen looked into it and these are obsolete IDs. We delete them from Citace Minus Done DR 06152011

  • Gene: 30 objects. This happened because the Gene field is an ontology. Some Expr_pattern objects are associated with multiple genes, therefore the data did not parse in correctly. This is not the only problem: Wen is looking into whether they could be obsolete IDs. Wen checked. They are obsolete, so we can -D them. DR06162011

  • Remark: 1 object Expr111 was not deleted as I added Pseudogene info in the remarks. Deleted from citace minus added into OA OK. Done DR 06142011
  • Pseudogene: 1 object

Expr_pattern : "Expr111" Pseudogene "F56D5.8" can fix this manually and delete it from Citace Minus. Done DR 06142011

  • Cell 26 objects

Expr_pattern : "Expr7477" Cell "P3.p" Certain Cell "P4.p" Certain Cell "P5.p" Certain Cell "P6.p" Certain Cell "P7.p" Certain Cell "P8.p" Certain

done DR 06142011

Expr_pattern : "Expr7595" Cell "CANL" Uncertain Cell "CANR" Uncertain

done DR 06142011

Expr_pattern : "Expr7605" Cell "M4" Certain


done DR 06142011

Expr_pattern : "Expr7632" Cell "AVG" Certain Cell "M5" Certain Cell "PVT" Certain Cell "PVCL" Uncertain Cell "PVCR" Uncertain Cell "PVNL" Uncertain Cell "PVNR" Uncertain Cell "PVQL" Uncertain Cell "PVQR" Uncertain

done DR 06142011

Expr_pattern : "Expr7691" Cell "P3.p" Certain Cell "P4.p" Certain Cell "P5.p" Certain Cell "P6.p" Certain Cell "P7.p" Certain Cell "P8.p" Certain

done DR 06142011

Expr_pattern : "Expr8715" Cell "M.dlpa" Certain Cell "M.drpa" Certain

done DR 06142011

  • Sequence

Expr_pattern : "Expr12"
-D Sequence "Z28375" -C "EMBL Z28375"
-D Sequence "Z28376" -C "EMBL Z28376"
-D Sequence "Z28377" -C "EMBL Z28377"

Expr_pattern : "Expr52"
-D Sequence "R11H6"
-D Sequence "Y40H4A"

Expr_pattern : "Expr979"
-D Sequence "F54E2"
-D Sequence "R05D8"

Expr_pattern : "Expr980"
-D Sequence "F54E2"
-D Sequence "R05D8"

Done Dr 06142011


  • Picture. All picture objects were -D DR06152011


Exporting Reporter Gene description from Expr_pattern OA to Transgene OA

IMPORTANT: whenever you curate an expr object, fill in all the fields before duplicating the object itself. E.g. if you need to put expression in the pharynx 'certain' and intestine 'uncertain', make sure to generate an object with pharynx 'certain', fill in all the other info (e.g. WBPaper, reporter gene, pattern, ...) and THEN duplicate the object. This is also important for the generation of new transgenes with the script below.


In the past transgene objects were generated only when authors did use standard nomenclature (e.g. adEx1256, acIs101). No new transgene objects were created for reporter fusions when there was no standard nomenclature.

From Jan 2012 we want to start generating transgene objects also for those reporter genes.

Action items:

Import all the transgene objects with no standard nomenclature present in Expression pattern OA into Transgene OA

In order to accomplish that we should

  • Generate a name for the objects that have exp_reportergene and no exp_transgene and assign it to the table exp_transgene in Expression pattern OA. The name should be: ExprID_Ex (e.g. Expr1234_Ex)
  • For all the ExprID_Ex that were generated in the previous step we should populate postgres tables in transgene OA as follows:
exp_transgene -> trp_name
exp_paper -> trp_paper 

435 expr objects don't have papers. Transfer those objects?

SELECT * FROM exp_reportergene WHERE joinkey NOT IN (SELECT joinkey FROM exp_transgene) AND joinkey NOT IN (SELECT joinkey FROM exp_paper);

-- J

I spot checked them. The ones I have seen are coming from Ian Hope large scale expression that he sent few years back to Wen but we need to check systematically if they all come from him. The objects have an Author field associated but the Author is not in OA. When we created the Expr_pattern OA we decided to keep Author, Date, and Curated_by in a separate file in Citace Minus as they were fields not used anymore (see wiki above for reference). For those objects we should put the author in trp_person. If it's hard to retrieve the author from Citace Minus we could get it from the file "ExprWS221.ace" on Tazendra in /home/acedb/draciti dir. D There are 4307 after filtering duplicates -- J

exp_reportergene -> trp_remark
trp_curator -> Daniela Raciti ( WBPerson12028 ). 
trp_nodump (for all Daniela Raciti) 

There is no trp_nodump table; using trp_objpap_falsepos (Fail field) instead. -- J

Karen said this is good. D

Attention: we will take from Expr_pattern OA all objects regardless of the curator -both Wen and Daniela- but when populating transgene OA we will populate the curator field just with Daniela.

Expr objects exist in multiple OA rows, so there are multiple pgids per Expr object, so multiple objects get created in the transgene OA. See Expr1416. Is this correct? I don't know if Transgene objects already have multiple pgids, and whether the dumper handles it. This is also going to make the deletion script more complicated: do all pgids for a given transgene name have to be dumpable for deletion? -- J

Right, we have multiple pgids for a single Expr object, but the exp_reportergene is the same for all. We should have only one Expr object in the Transgene OA, i.e. Expr1416_Ex pgid 9980, and get rid of the duplicates. D

13 non-"Hope IA" authors are on the sandbox at /home/postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/bad_authors. Let me know the mapping of those authors to WBPerson####. After the mapping is done, we'll see how many of the 48 Expr objects have neither person nor paper; e.g. Expr1684 doesn't. -- J

Mapping:

  • Arnold JM = WBPerson16468
  • Bauer PK = WBPerson5125
  • Britton C = WBPerson78
  • Hashmi S = WBPerson4368
  • Herbert R = WBPerson16472
  • Krause MW = WBPerson346
  • Lustigman S = WBPerson390
  • Lynch AS = WBPerson1232
  • McCarroll D = WBPerson16469
  • Mohler WA = WBPerson428
  • Mounsey A = WBPerson1716
  • Royall CM = WBPerson16473
  • Seydoux GC = WBPerson575

After mapping, 3 objects had neither paper nor person. All 3 are personal communications. J added them -> OK.

  • Expr1684 -> Catherine Wolkow ( WBPerson696 )
  • Expr1685 -> Massimo Hilliard ( WBPerson258 )
  • Expr2781 -> Aharon Solomon ( WBPerson3909 )


The transgene objects will be revised by Karen (in order to delete duplicates if any).

To run the population script, go to a dir where you have write permission, e.g. /home/acedb/draciti/Expr_pattern, and give the command:

  • ~postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/transfer_expr_to_transgene.pl > outputlog

The population script is /home/postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/transfer_expr_to_transgene.pl. It works as follows: -- J

  • Get the Expr objects that have a reportergene but no transgene: SELECT * FROM exp_reportergene WHERE joinkey NOT IN (SELECT joinkey FROM exp_transgene);
  • For each of those, get the exp_name and exp_paper.
  • The transgene name is the Expr name plus _Ex.
  • Add it to exp_transgene and exp_transgene_hst as multiontology, with doublequotes.
  • Get the highest transgene pgid and, for each new transgene, create a new transgene with that pgid, setting:
    • trp_name to the new transgene name
    • trp_curator to WBPerson12028
    • trp_objpap_falsepos to Fail
    • trp_remark to the exp_reportergene
    • trp_paper to the exp_paper, if there is a paper
    • trp_person (with doublequotes), if there is no paper: look at the authors in ExprWS221.ace and map them to persons from Daniela's list
  • If an author is not in the list, tell Daniela to get the WBPerson mappings.
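The selection step of the population script can be tried out on toy data. What follows is only a minimal sketch in Python, using an in-memory SQLite database as a stand-in for postgres. The table and column names (exp_reportergene, exp_transgene, joinkey) come from the notes above; the two-column schema, the rows, and the gene and transgene names are invented for illustration and are not real OA data.

```python
import sqlite3

# In-memory stand-in for the postgres database; schema and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exp_reportergene (joinkey TEXT, exp_reportergene TEXT);
    CREATE TABLE exp_transgene    (joinkey TEXT, exp_transgene TEXT);
    INSERT INTO exp_reportergene VALUES ('101', '[abc-1::gfp]');
    INSERT INTO exp_reportergene VALUES ('102', '[def-2::lacZ]');
    INSERT INTO exp_reportergene VALUES ('103', '[ghi-3::gfp]');
    INSERT INTO exp_transgene    VALUES ('102', '"WBTransgene00000001"');
""")

# Same shape as the query in the notes: reporter gene present, no transgene row.
rows = conn.execute("""
    SELECT joinkey, exp_reportergene FROM exp_reportergene
    WHERE joinkey NOT IN (SELECT joinkey FROM exp_transgene)
    ORDER BY joinkey
""").fetchall()

# pgids that would be handed on to the transfer step
candidates = [joinkey for joinkey, _ in rows]
print(candidates)
```

Only the rows whose joinkey has no entry in exp_transgene come back, which is the set the script would turn into Expr_Ex transgenes.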

Author -> Person: Hope IA = WBPerson266

To run the deletion script, go to a dir where you have write permission, e.g. /home/acedb/draciti/Expr_pattern, and give the command:

  • ~postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/delete_cleared_exp_reportergene.pl > output1

The deletion script is /home/postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/delete_cleared_exp_reportergene.pl. It works as follows: -- J

  • It looks for transgene names that have not been set as Fail: SELECT * FROM trp_name WHERE trp_name ~ 'Expr.*_Ex' AND joinkey NOT IN (SELECT joinkey FROM trp_objpap_falsepos);
  • It does the same for synonyms: SELECT * FROM trp_synonym WHERE trp_synonym ~ 'Expr.*_Ex' AND joinkey NOT IN (SELECT joinkey FROM trp_objpap_falsepos);
  • It looks at expression transgenes that match Expr.*_Ex: SELECT * FROM exp_transgene WHERE exp_transgene ~ '"Expr.*_Ex"' AND joinkey IN (SELECT joinkey FROM exp_reportergene);
  • It gets the individual transgenes, and if any of them doesn't match Expr.*_Ex it gives an error message.
  • If all of them match, it checks that all transgenes are dumpable.
  • If all are dumpable, it deletes the exp_reportergene for that pgid and inserts a null into exp_reportergene_hst for that pgid.
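The per-row decision made by the deletion script can be summarised in a few lines. This is a hedged sketch in Python, not the Perl script itself: the function name check_row and its return values are hypothetical, and the set of dumpable names is assumed to have been collected beforehand from the trp_name and trp_synonym queries described above.

```python
import re

# Same pattern the notes use with the postgres ~ operator (unanchored match).
EXPR_EX = re.compile(r"Expr.*_Ex")

def check_row(transgenes, dumpable_names):
    """Return 'error', 'skip' or 'delete' for one Expression OA row.

    transgenes: transgene names listed in exp_transgene for the row
    dumpable_names: names or synonyms NOT set as Fail in trp_objpap_falsepos
    """
    if any(not EXPR_EX.search(name) for name in transgenes):
        return "error"   # a transgene that is not Expr.*_Ex is mixed in
    if all(name in dumpable_names for name in transgenes):
        return "delete"  # clear exp_reportergene, insert a null in the _hst table
    return "skip"        # at least one transgene is still set as Fail

print(check_row(["Expr1416_Ex"], {"Expr1416_Ex"}))
```

A row is only cleared when every transgene on it matches the pattern and every one of them is dumpable; anything else is left untouched.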

We added a rule that checks for 'Expr.*_Ex' in trp_synonym too, because when the transgene already existed Karen added the Expr name as a synonym. We want the reporter gene field in the Expr_pattern OA to be deleted for those objects as well, so Juancarlos added a rule to the deletion script that looks for Expr.*_Ex in the synonym and, if it finds it, deletes the reporter gene field from the Expr_pattern OA. The line that he added:

"SELECT * FROM trp_synonym WHERE trp_synonym ~ 'Expr.*_Ex' AND joinkey NOT IN (SELECT joinkey FROM trp_objpap_falsepos);"

  • There were >1000 objects that had a standard transgene name (e.g. adIs1783) and a reporter gene field filled with redundant information (e.g. [mpk-1::gfp]). Daniela went through them manually and deleted the redundant info in the reporter gene field in the Expr_pattern OA; the info was already in transgene. Whenever the reporter gene field in Expr_pattern had more info (e.g. sequence), Daniela copied that info into the Transgene Remark field (in line with what we have done for the objects above). Also, whenever it was specified whether the construct was a transcriptional or translational fusion, that info was added to the reporter type in the transgene OA. Double-checked with Karen 02.16.2012 -> OK.

Whenever the info in the Reporter Gene field was more pertinent to Expr pattern it was left there. E.g. Expr1046: The larval expression pattern was studied by observing GFP expression in a strain carrying an APR-1::GFP reporter transgene on the integrated array zhIs2.

One example of redundant information was kxEx74, Expr4687: the info in the Expr reporter gene field was exactly the same as in the Transgene remark field.

  • February 29th 2012: Daniela ran the population script on tazendra after having tested the system on Mangolassi (everything was fine there). The outputlog file with the results of the transfer was copied to Daniela's pc, in Desktop/Wormbase/Expr_pattern_to_Transgene_transfer.
  • Daniela ran the deletion script on September 5th 2012. In this way we deleted the reporter gene info in the expression OA for all the objects that were transferred to the transgene OA with the population script in February. Karen, Juancarlos and Daniela agreed that we had to set all transgene objects as 'dumpable' in order to delete the reporter gene field from the expression OA. Daniela edited the transgenes' Fail field via the Batch mode. The results of the deletion script are on Lario in Desktop/Wormbase/Expr_pattern_to_Transgene_transfer.

Moving forward we will use the new pipeline. In the new pipeline the script immediately deletes the reporter gene field in Expression OA.

  • On September 11th 2012 Daniela ran the new population script for the first time. The output is on tazendra /home/acedb/draciti/Expr_pattern/Expr_to_Transgene newpopulationscript_09112012. Checked a few examples; everything went as it should have.
  • NB: there should still be some objects in the Expression OA that have both the reporter gene field AND the transgene. Those should be double-checked, as they may carry duplicated info. This happened for the objects that Karen merged or looked at before the new pipeline was put in place.


The population script was run at the end of February 2012; from August on we started using the script that is described in the section below, "Current Pipeline".

Even after running the deletion script we will not change anything in the Expression pattern page display. This is because there is already a link to the transgene page so all the information about the construct could be found there.

Addendum: Whenever the reporter gene field in the OA lists both a transcriptional fusion and a translational fusion, only one name, Expr123_Ex, will be generated, and that reporter gene will be transferred to the transgene OA. Karen will add Expr123_Ex as a synonym for both the transcriptional and the translational fusion. If we later want to populate the Expr OA with the real names of those transgenes, we have to ask Juancarlos to look for all objects that have the same Expr123_Ex name AND the same Paper but different transgene names, and to populate the transgene field in the Expr OA with both transgene names.


Current Pipeline

For each Expression pattern object whose "Reporter Gene" field is NOT blank and whose transgene field is BLANK, the following will happen:

  • For the ones that don't have an existing Expr_Ex name in either of the two tables trp_publicname and trp_synonym, the script will:
    • generate an object in the transgene OA called WBTransgene000##### and put it in the trp_name
    • add in the trp_synonym the name Expr1234_Ex (the numbering of the Expr objects is after the expression pattern object name)
    • add WBPerson12028 in the trp_curator -this has to be changed whenever somebody else will take over expression patterns
    • add the reporter gene field text into the trp_remark field
    • add the WBPaper into the trp_paper field
  • Now all the Expression objects have a mapping to the transgene and the script will
    • add into the exp_transgene field the WBTransgene000##### name
    • delete the text in the exp_reportergene field
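The steps above can be sketched on plain Python dictionaries. This is an illustration under stated assumptions, not the actual transfer_expr_to_transgene.pl: the function name transfer, the dictionary keys, the running counter next_number and the zero-padding to eight digits (read off the WBTransgene000##### placeholder) are all hypothetical, and the example rows and paper IDs are invented.

```python
def transfer(expr_rows, synonym_to_id, next_number, curator="WBPerson12028"):
    """Mimic the pipeline steps on in-memory data.

    expr_rows: list of dicts with keys name, reportergene, transgene, paper
    synonym_to_id: existing Expr_Ex public name/synonym -> WBTransgene ID
    next_number: first free transgene number (hypothetical counter)
    Returns the list of newly created transgene records.
    """
    created = []
    for row in expr_rows:
        # Only rows with a reporter gene and a blank transgene are processed.
        if not row["reportergene"] or row["transgene"]:
            continue
        synonym = row["name"] + "_Ex"
        if synonym not in synonym_to_id:
            trp_id = "WBTransgene%08d" % next_number
            next_number += 1
            created.append({
                "trp_name": trp_id,
                "trp_synonym": synonym,
                "trp_curator": curator,
                "trp_remark": row["reportergene"],
                "trp_paper": row["paper"],
            })
            synonym_to_id[synonym] = trp_id
        # Every processed row now has a mapping to a transgene ID.
        row["transgene"] = synonym_to_id[synonym]
        row["reportergene"] = ""
    return created

rows = [
    {"name": "Expr1234", "reportergene": "[abc-1::gfp]", "transgene": "",
     "paper": "WBPaper00000001"},
    {"name": "Expr1235", "reportergene": "", "transgene": "",
     "paper": "WBPaper00000002"},
]
new = transfer(rows, {}, 12345)
print(new[0]["trp_name"], rows[0]["transgene"])
```

The second row is skipped because its reporter gene field is blank; the first gets a new transgene record, has its transgene field filled, and loses its reporter gene text.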


Unlike in the old pipeline, every object here will be set as dumpable.

the script is located here:

~postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/transfer_expr_to_transgene.pl > outputlog

Daniela runs the script MANUALLY before each upload. The script is normally run from this dir on tazendra: /home/acedb/draciti/Transgene_generation

  • The script was run for the first time on September 11th 2012. The output is on tazendra /home/acedb/draciti/Expr_pattern/Expr_to_Transgene newpopulationscript_09112012. Checked a few examples; everything went as it should have.

In the script, the subroutine readExprAce gets the Author mappings from the file /home/acedb/draciti/Expr_pattern/ExprWS221.ace. If the file is not there, the script will fail. See above in this section for the author mappings.

The subroutine populateTrpNameToId gets each trp_name (Transgene ID, WBTransgene000#####) and maps the trp_publicname to it; it also gets the synonyms (trp_synonym) and maps them to the trp_name.

The synonym field is split on pipes, and spaces at the beginning and at the end of each synonym are removed.

$trpNameToId{$syn} = $row[0]; this line stores into a hash the mappings of name into ID
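The name-to-ID lookup described above can be illustrated in Python. This is a rough sketch, not a translation of the Perl subroutine: the function name build_name_to_id is hypothetical, the input rows are assumed to be already paired as (transgene ID, value), and the example names and IDs are invented.

```python
def build_name_to_id(publicname_rows, synonym_rows):
    """Map every public name and synonym to its transgene ID.

    Each row is (transgene_id, value); a synonym value may hold
    several names separated by pipes.
    """
    name_to_id = {}
    for trp_id, public_name in publicname_rows:
        name_to_id[public_name] = trp_id
    for trp_id, synonyms in synonym_rows:
        for syn in synonyms.split("|"):
            syn = syn.strip()  # remove spaces at the beginning and at the end
            if syn:
                name_to_id[syn] = trp_id
    return name_to_id

mapping = build_name_to_id(
    [("WBTransgene00000001", "xyIs1")],
    [("WBTransgene00000001", "Expr1234_Ex | oldName1 ")],
)
print(mapping)
```

With such a lookup, checking whether an Expr1234_Ex name already exists as a public name or synonym is a single dictionary test.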

merged into

Juancarlos added a "Merged into" box to the transgene OA so that Karen can merge transgenes. The transgene object that is merged into another one is marked as invalid. Say you are curating transgene2 and you see that it is identical to transgene1: you click on "Merged into" and select transgene1. Transgene2 becomes invalid, and Karen has to add transgene2 to the synonym field of transgene1.


bad strains

To check whether there are wrong strains in the OA, run this script from a dir where you have write permission:

/home/postgres/work/pgpopulation/exp_exprpattern/20121011_find_bad_strain/find_bad_strain.pl > bad

Miller paper- tiling arrays

We have added in CitaceMinus a static file with the links to http://www.vanderbilt.edu/wormdoc/wormmap/Expressed_genes.html

The paper is Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, McWhirter RD, Petersen S, Sreedharan VT, Widmer C, Jo J, Reinke V, Petrella L, Strome S, Von Stetina SE, Katz M, Shaham S, Rätsch G, Miller DM 3rd. A spatial and temporal map of C. elegans gene expression. Genome Res. 2011 Feb;21(2):325-41. Epub 2010 Dec 22. PubMed PMID: 21177967; PubMed Central PMCID: PMC3032935.

The file: miller_cell_type_expression.ace

We have added links to the site from the anatomy page.


Modelling large scale expression studies

In order to make the most of the large-scale expression studies we need to modify the Expr_pattern model. We are starting to model using Murray's paper [1] as an example, and we will extend to other studies as we go [2].

This is the current model -as of 02052013

?Expr_pattern	Expression_of	Gene ?Gene XREF Expr_pattern
				CDS        ?CDS XREF Expr_pattern // for coding genes
				Sequence   ?Sequence XREF Expr_pattern // for clones???
				Pseudogene ?Pseudogene XREF Expr_pattern // [030801 krb]
				Clone ?Clone XREF Expr_pattern
				Protein ?Protein XREF Expr_pattern
                                Protein_description Text   // stores information for Expr_patterns with unknown antigens [031105 krb]
		Homol Homol_homol ?Homol_data XREF Expr_homol ?Method Float Int UNIQUE Int Int UNIQUE Int #Homol_info //Expr_pattern mapping [060427 ar2]
		Expressed_in	Cell         ?Cell         XREF Expr_pattern #Qualifier
				Cell_group   ?Cell_group   XREF Expr_pattern #Qualifier
				Life_stage   ?Life_stage   XREF Expr_pattern #Qualifier
				Anatomy_term ?Anatomy_term XREF Expr_pattern #Qualifier
				GO_term      ?GO_term      XREF Expr_pattern #GR_condition
                Subcellular_localization ?Text
		Type	Reporter_gene ?Text
			In_situ Text
			Antibody ?Text
			Northern Text // Wen [krb 030425]
                        Western Text  // Wen 
                        RT_PCR Text   // Wen
			Localizome ?Text //added by Wen
		Expression_cluster ?Expression_cluster XREF Expr_pattern //added by Wen.for localizome
		Pattern ?Text
		Picture ?Picture XREF Expr_pattern
		MovieURL Text //Added by wen for link to movie URLs. 
		Movie    ?Movie    XREF Expr_pattern  //Added by Wen to curate Expr_pattern video
                Remark ?Text #Evidence
                Experiment	Laboratory ?Laboratory
				Author ?Author 
				Date UNIQUE DateType
				Strain UNIQUE ?Strain
		Reference ?Paper XREF Expr_pattern
                Transgene ?Transgene XREF Expr_pattern
		Associated_feature ?Feature XREF Associated_with_expression_pattern #Evidence
		Antibody_info ?Antibody XREF Expr_pattern // This applies to both Western & Antibody staining
                                                          // added [031120 krb]
		Curated_by UNIQUE Text // Hinxton (HX) or Caltech (CIT) Sylvia [010927 krb]
?Expr_pattern	Expression_of	Gene ?Gene XREF Expr_pattern
				CDS        ?CDS XREF Expr_pattern // for coding genes
				Sequence   ?Sequence XREF Expr_pattern // for clones???
				Pseudogene ?Pseudogene XREF Expr_pattern // [030801 krb]
				Clone ?Clone XREF Expr_pattern
				Protein ?Protein XREF Expr_pattern
                                Protein_description Text   // stores information for Expr_patterns with unknown antigens [031105 krb]
		Homol Homol_homol ?Homol_data XREF Expr_homol ?Method Float Int UNIQUE Int Int UNIQUE Int #Homol_info //Expr_pattern mapping [060427 ar2]
		Expressed_in	Positive_expression    ?Anatomy_term XREF Expr_pattern #Qualifier  UNIQUE Float  
                                                       ?Life_stage   XREF Expr_pattern #Qualifier
                                Negative_expression    ?Anatomy_term XREF Expr_pattern #Qualifier



Cell         ?Cell         XREF Expr_pattern #Qualifier
				Cell_group   ?Cell_group   XREF Expr_pattern #Qualifier
				Life_stage   ?Life_stage   XREF Expr_pattern #Qualifier
				Anatomy_term ?Anatomy_term XREF Expr_pattern #Qualifier
				GO_term      ?GO_term      XREF Expr_pattern #GR_condition
                Subcellular_localization ?Text
		Type	Reporter_gene ?Text
			In_situ Text
			Antibody ?Text
			Northern Text // Wen [krb 030425]
                        Western Text  // Wen 
                        RT_PCR Text   // Wen
			Localizome ?Text //added by Wen
		Expression_cluster ?Expression_cluster XREF Expr_pattern //added by Wen.for localizome
		Pattern ?Text
		Picture ?Picture XREF Expr_pattern
		MovieURL Text //Added by wen for link to movie URLs. 
		Movie    ?Movie    XREF Expr_pattern  //Added by Wen to curate Expr_pattern video
                Remark ?Text #Evidence
                Experiment	Laboratory ?Laboratory
				Author ?Author 
				Date UNIQUE DateType
				Strain UNIQUE ?Strain
		Reference ?Paper XREF Expr_pattern
                Transgene ?Transgene XREF Expr_pattern
		Associated_feature ?Feature XREF Associated_with_expression_pattern #Evidence
		Antibody_info ?Antibody XREF Expr_pattern // This applies to both Western & Antibody staining
                                                          // added [031120 krb]
		Curated_by UNIQUE Text // Hinxton (HX) or Caltech (CIT) Sylvia [010927 krb]


Modelling the Highlighter for daily curation

We want to test if the GO highlighter could be implemented for daily curation. We chose 4 papers that were already curated for expression pattern. The papers were chosen so that different types, classes were represented:

  • WBPaper00041734, PMID:23152612, J Neurosci (Society for Neuroscience): has gene, anatomy, qualifier, qualifier text, GO term, subcellular localization text, reporter gene, reporter gene text, pattern
  • WBPaper00041279, PMID:22770216, Cell (Elsevier): has gene, GO term, subcellular localization text, antibody, reporter gene, reporter gene text, antibody info, pattern
  • WBPaper00041737, PMID:23154983, Genes Dev (CSHLP): has gene, anatomy, qualifier, GO term, reporter, pattern, transgene
  • WBPaper00041513, PMID:22956537, J Cell Sci (The Company of Biologists): has gene, anatomy, qualifier, GO term, life stage, antibody AND reporter, reporter gene text, pattern


Generating graphs for microarray data

First trial: WBPaper00005767. The tab-delimited file was generated by Wen (00005767ExprGraph.csv)

The script that generates the graphs is here: /home/azurebrd/work/parsings/daniela/20130412_graphs/generate_png.pl

http://tazendra.caltech.edu/~azurebrd/var/work/20130412_graphs/


Other nematodes SVM analysis for gene expression

From Yuling (Nov 7th 2013) Results here:

Looks like only 1% is deemed positive...

  • 146 positives
  • 15253 negatives

Daniela will go through the list and evaluate

Alternative approach: we can check how many papers have been curated for other species and use those as positive training set. The list of papers is below

Other species

There is a script on tazendra to check the objects curated to non-elegans genes (the script looks in the gin_synonyms table and checks which genes do not have a CELE_ synonym):

/home/acedb/draciti/Expr_pattern/20140516_non_cele

find_non_cele.pl*

the output on May 16th 2014 was

WBGene00001198 not CELE_ in pgid 348
WBGene00001198 not CELE_ in pgid 350
WBGene00001198 not CELE_ in pgid 351
WBGene00001198 not CELE_ in pgid 352
WBGene00002126 not CELE_ in pgid 1424
WBGene00009821 not CELE_ in pgid 1520
WBGene00012263 not CELE_ in pgid 2914
WBGene00043408 not CELE_ in pgid 2914
WBGene00009175 not CELE_ in pgid 2914
WBGene00016878 not CELE_ in pgid 2914
WBGene00020512 not CELE_ in pgid 2914
WBGene00019252 not CELE_ in pgid 2914
WBGene00015732 not CELE_ in pgid 3033
WBGene00003454 not CELE_ in pgid 3330
WBGene00003440 not CELE_ in pgid 3330
WBGene00003441 not CELE_ in pgid 3330
WBGene00003427 not CELE_ in pgid 3330
WBGene00003461 not CELE_ in pgid 3330
WBGene00003459 not CELE_ in pgid 3330
WBGene00003428 not CELE_ in pgid 3330
WBGene00003447 not CELE_ in pgid 3330
WBGene00003455 not CELE_ in pgid 3330
WBGene00003453 not CELE_ in pgid 3330
WBGene00003436 not CELE_ in pgid 3330
WBGene00003439 not CELE_ in pgid 3330
WBGene00023572 not CELE_ in pgid 4733
WBGene00023572 not CELE_ in pgid 4734
WBGene00023575 not CELE_ in pgid 4737
WBGene00023572 not CELE_ in pgid 4739
WBGene00023572 not CELE_ in pgid 4740
WBGene00037006 not CELE_ in pgid 4743
WBGene00037006 not CELE_ in pgid 4744
WBGene00037005 not CELE_ in pgid 4747
WBGene00037005 not CELE_ in pgid 4748
WBGene00030970 not CELE_ in pgid 4750
WBGene00041435 not CELE_ in pgid 4753
WBGene00041435 not CELE_ in pgid 4754
WBGene00015274 not CELE_ in pgid 4829
WBGene00015274 not CELE_ in pgid 4830
WBGene00009175 not CELE_ in pgid 5797
WBGene00009175 not CELE_ in pgid 5798
WBGene00000600 not CELE_ in pgid 5823
WBGene00000604 not CELE_ in pgid 5894
WBGene00000605 not CELE_ in pgid 5910
WBGene00000607 not CELE_ in pgid 5990
WBGene00012263 not CELE_ in pgid 6244
WBGene00004041 not CELE_ in pgid 6776
WBGene00002126 not CELE_ in pgid 7395
WBGene00032753 not CELE_ in pgid 7603
WBGene00018677 not CELE_ in pgid 7886
WBGene00018677 not CELE_ in pgid 7887
WBGene00018677 not CELE_ in pgid 7888
WBGene00018677 not CELE_ in pgid 7889
WBGene00004041 not CELE_ in pgid 8272
WBGene00002485 not CELE_ in pgid 9192
WBGene00117029 not CELE_ in pgid 9327
WBGene00043222 not CELE_ in pgid 10105
WBGene00043320 not CELE_ in pgid 10671
WBGene00043320 not CELE_ in pgid 10672
WBGene00010154 not CELE_ in pgid 10897
WBGene00019581 not CELE_ in pgid 11074
WBGene00020312 not CELE_ in pgid 11399
WBGene00021255 not CELE_ in pgid 11756
WBGene00021255 not CELE_ in pgid 11757
WBGene00045485 not CELE_ in pgid 12323
WBGene00029022 not CELE_ in pgid 13372
WBGene00027230 not CELE_ in pgid 13550
WBGene00025707 not CELE_ in pgid 13949
WBGene00023404 not CELE_ in pgid 14000
WBGene00023404 not CELE_ in pgid 14001
WBGene00033342 not CELE_ in pgid 14026
WBGene00059989 not CELE_ in pgid 14027
WBGene00195119 not CELE_ in pgid 14036
WBGene00101073 not CELE_ in pgid 14036
WBGene00025707 not CELE_ in pgid 14037
WBGene00034222 not CELE_ in pgid 14037
WBGene00224104 not CELE_ in pgid 14038
WBGene00233940 not CELE_ in pgid 14039
WBGene00231085 not CELE_ in pgid 14040
WBGene00042594 not CELE_ in pgid 14109

some of these are dead genes and some came up because they do not have CELE_ in the synonyms (that is how the script identifies non elegans)
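The check itself is simple, which also explains the false hits. A minimal sketch in Python, assuming the synonyms have already been read from gin_synonyms into a dictionary; the function name find_non_cele and the toy genes and synonyms are made up, and the empty synonym list for a dead gene is only an assumption about why such genes get flagged.

```python
def find_non_cele(gene_synonyms):
    """Return the gene IDs that have no synonym containing 'CELE_'."""
    return sorted(
        gene for gene, synonyms in gene_synonyms.items()
        if not any("CELE_" in s for s in synonyms)
    )

toy = {
    "WBGene00000001": ["CELE_X00X0.0", "abc-1"],  # has a CELE_ synonym: not flagged
    "WBGene00000002": ["CBG00000"],               # other species: flagged
    "WBGene00000003": [],                         # no synonyms at all: also flagged
}
flagged = find_non_cele(toy)
print(flagged)
```

Because the test is purely "no CELE_ synonym", any elegans gene without such a synonym shows up next to the genuine non-elegans genes, which is why the raw output has to be cleaned by hand.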

the 'clean' list is:

WBGene00023572 not CELE_ in pgid 4733				briggsae	WBPaper00028961
WBGene00023572 not CELE_ in pgid 4734				briggsae        WBPaper00028961
WBGene00023575 not CELE_ in pgid 4737				briggsae        WBPaper00028961
WBGene00023572 not CELE_ in pgid 4739				briggsae        WBPaper00028961
WBGene00023572 not CELE_ in pgid 4740				briggsae        WBPaper00028961
WBGene00037006 not CELE_ in pgid 4743				briggsae        WBPaper00028961
WBGene00037006 not CELE_ in pgid 4744				briggsae        WBPaper00028961
WBGene00037005 not CELE_ in pgid 4747				briggsae        WBPaper00028961
WBGene00037005 not CELE_ in pgid 4748				briggsae        WBPaper00028961
WBGene00030970 not CELE_ in pgid 4750				briggsae        WBPaper00028961
WBGene00041435 not CELE_ in pgid 4753				briggsae        WBPaper00028961
WBGene00041435 not CELE_ in pgid 4754				briggsae	WBPaper00028961
WBGene00032753 not CELE_ in pgid 7603				briggsae	WBPaper00035320
WBGene00117029 not CELE_ in pgid 9327				pacificus	WBPaper00040360
WBGene00029022 not CELE_ in pgid 13372				briggsae	WBPaper00004520
WBGene00027230 not CELE_ in pgid 13550				briggsae	WBPaper00043890
WBGene00025707 not CELE_ in pgid 13949				briggsae	WBPaper00044493
WBGene00033342 not CELE_ in pgid 14026				briggsae	WBPaper00004832
WBGene00059989 not CELE_ in pgid 14027				remanei	        WBPaper00004832
WBGene00195119 not CELE_ in pgid 14036				pacificus	WBPaper00040023
WBGene00101073 not CELE_ in pgid 14036				pacificus	WBPaper00040023
WBGene00025707 not CELE_ in pgid 14037				briggsae	WBPaper00040859
WBGene00034222 not CELE_ in pgid 14037				briggsae	WBPaper00040859
WBGene00224104 not CELE_ in pgid 14038				brugia   	WBPaper00041825
WBGene00233940 not CELE_ in pgid 14039				brugia     	WBPaper00041825
WBGene00231085 not CELE_ in pgid 14040				brugia   	WBPaper00041825
WBGene00042594 not CELE_ in pgid 14109				briggsae	WBPaper00044831
WBGene00054802 in pgid 14251			          	remanei 	WBPaper00041071


the following were validated positive, not yet curated
WBPaper00004561 Haemonchus
WBPaper00004962 Volvulus
WBPaper00005646 Brugia
WBPaper00039907 Ascaris
WBPaper00041323 Brugia
WBPaper00041714 Stercoralis
WBPaper00041951 Haemonchus
WBPaper00042037 Haemonchus
WBPaper00044651 Ascaris suum

an additional list that can be checked is the following I got from Wen:

//Special Expr_pattern paper. Usually they contain expression patterns of un-specified genes.
 
cgc2796
cgc2714
cgc2559
cgc2475
cgc2449
cgc2274
cgc2005
cgc1542
cgc1984
cgc4994		ges-1 expression in C. briggsae
cgc4821		Expression of cpz-1 in O. volvulus
cgc4837		Expression of a Drosophila transposon in C.elegans using glh-2 promoter. 
cgc4895		nud-1 expression in other species.
cgc5831		expression of Od-mpp1 promoter corresponded to that produced by the T03F1.5 or the W09C3.6 promoter in C. elegans.
cgc5943		2-D protein gel dev. stage assay, too ambiguous to curate. 
pmid14504223	antibody 1CB4 staining with unknown antigen
cgc6097		expression in other species.
cgc6393		expression in briggsae.
cgc6588		expression in briggsae.
cgc6591		expression in briggsae
cgc6690		Curated as Gene_regulation.
pmid15826643	expression pattern in other species.
pmid15862576	expression pattern in other species.
pmid15630478	expression pattern in other species. In contrast with FOG-2, a highly conserved GLD-1 ortholog is present in C. briggsae (Table 1) and has a germline expression pattern essentially identical to that of C. elegans (Figure 5A, top right and middle right).
WBPaper00026965	expression pattern in other species.
WBPaper00028902 expression pattern in other species
00025105	expression pattern in other species
00025000	expression pattern in other species
00028902	expression pattern in other species
00032298	expression of lin-11 in three species.
WBPaper00035037 expression pattern of 

the following is a list of validated negatives that can be used for SVM training (randomly selected from http://131.215.52.209/daniela/nematode/summaryN_id_nematode)

22922012	negative
22922533	negative
22923372	negative
22924021	negative
22930820	negative
22932059	negative
22933846	negative
22935096	negative
22936386	negative
23315190	negative
22947621	negative
22949749	negative
22949753	negative
23307236	negative
22949756	negative
22949757	negative
22951972	negative
22952671	negative
23306387	negative
22952792	negative
22952922	negative
23300895	negative
2295622	        negative
22961235	negative
22961310	negative
22967068	negative
22969260	negative
22973231	negative
22983796	negative
22983799	negative
22983801	negative
22984141	negative
22984446	negative
22984536	negative
22992226	negative
22992297	negative
22992897	negative
23107597	negative
23107821	negative
23110936	negative
23110962	negative
23111012	negative
23111089	negative
23111398	negative
23112818	negative
23291463	negative
23029059	negative
23029330	negative
23029423	negative
23029572	negative
23289015	negative
2310180	        negative

parasitic nematodes papers containing expression data

sent to Jane Lomax on September 9 2015

O volvulus
http://www.ncbi.nlm.nih.gov/pubmed/11606224

H contortus
http://www.ncbi.nlm.nih.gov/pubmed/14698436
http://www.ncbi.nlm.nih.gov/pubmed/15003846
http://www.ncbi.nlm.nih.gov/pubmed/12062493
http://www.ncbi.nlm.nih.gov/pubmed/23360558
http://www.ncbi.nlm.nih.gov/pubmed/23416426
http://www.ncbi.nlm.nih.gov/pubmed/25128369
http://www.ncbi.nlm.nih.gov/pubmed/25388625

A caninum
http://www.ncbi.nlm.nih.gov/pubmed/11755191

A suum
http://www.ncbi.nlm.nih.gov/pubmed/12387846
http://www.ncbi.nlm.nih.gov/pubmed/21685128
http://www.ncbi.nlm.nih.gov/pubmed/24374308

S stercoralis
http://www.ncbi.nlm.nih.gov/pubmed/14572516
http://www.ncbi.nlm.nih.gov/pubmed/23145190

2A viral technology

Proof of principle described in

  • Simultaneous expression of multiple proteins under a single promoter in C. elegans via a versatile 2A-based toolkit

Arnaud Ahier & Sophie Jarriault, Genetics

'We report the use of viral 2A peptides, which trigger a “ribosomal-skip” or “STOP&GO” mechanism during translation, to express multiple proteins from a single vector in C. elegans. Although none of the viruses known to infect C. elegans contain 2A-like sequences, our results show that 2A peptides allow the production of separate functional proteins in all cell types and at all developmental stages tested in the worm. In addition, we constructed a toolkit including a 2A- based polycistronic plasmid and reagents to generate 2A-tagged fosmids. 2A peptides constitute an important tool to ensure the delivery of multiple polypeptides in specific cells enabling several novel applications, such as the reconstitution of multi-subunit complexes.'

We will keep an eye on whether it comes to be used more extensively and, if so, change the model.


SVM analysis for gene expression

051812_042012. The retraining for this batch was done by incorporating the curated results from 2009 till 2012.


old                        re-train      
14/37 = 37.8%              17/26  = 65.3% 

        

From this batch on we have started to manually manipulate the features by adding some and deleting others. The files are stored on Lario/Desktop/SVM

06/08-05/18 2012.


old                 re-train      feature_manipulation
29.30%              44.40%         55.00%
28.90%              55%            
 
        

September 21 2012


old                 re-train           feature_manipulation        section_model
54/98=55%           41/56=73.2%        54/79=68.3%                 34/47 = 72.3% (159 out of 459 papers have results)
        


November 02 2012


old                 re-train           feature_manipulation       
9/13=69%            16/22=72.7%        28/46=60.8%
        

June 14 2013


old                 re-train          section_model      
6/10 = 60%          6/6 = 100%        5/5 = 100%
        

In July 2014 Yuling retrained SVM with the latest results. He got the results from the curation status from:

validated positive
http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi?action=listCurationStatisticsPapersPage&select_curator=two12028&listDatatype=otherexpr&method=allval%20pos%20cur&checkbox_cfp=on&checkbox_afp=on&checkbox_svm=on
validated negative
http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi?action=listCurationStatisticsPapersPage&select_curator=two12028&listDatatype=otherexpr&method=allval%20neg&checkbox_cfp=on&checkbox_afp=on&checkbox_svm=on

Also, Yuling is trying to see what the difference was between SVM and feature manipulation on the paper range 44200 to 44700, roughly between October 2013 and January 2014.

http://131.215.52.209/celegans/svm_results/20131004/ http://131.215.52.209/celegans/svm_results/20140124/012414_011014_otherexpr

In the list we will take into account the papers that are curation negative but SVM positive, which are listed here: http://tazendra.caltech.edu/~postgres/cgi-bin/referenceform.cgi SELECT * FROM cur_curdata WHERE cur_selcomment ~ '1';

The results of the analysis are as follows (Oct 2013-Jan 2014):

testing papers 	total 176	69 positive	107 negatives								

model                   true positive   false positive   precision   recall            F score = 2*(precision*recall)/(precision+recall)
feature manipulation    47              15               75.80%      47/69 = 68.10%    0.717439889
current SVM model       32              7                82%         32/69 = 46.30%    0.591831645
new testing SVM model   42              18               70%         42/69 = 60.80%    0.650764526
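The figures in the table above can be recomputed from the raw counts. A small sketch in Python; the function name scores is hypothetical, and the formulas are the standard definitions (precision = true positives / all flagged, recall = true positives / all real positives) together with the F score formula given in the table header.

```python
def scores(true_pos, false_pos, total_pos):
    """Return (precision, recall, F score) from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / total_pos
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score

# Counts from the table; 69 of the 176 testing papers are positive.
for model, tp, fp in [
    ("feature manipulation", 47, 15),
    ("current SVM model", 32, 7),
    ("new testing SVM model", 42, 18),
]:
    p, r, f = scores(tp, fp, 69)
    print(f"{model}: precision {p:.1%}, recall {r:.1%}, F {f:.3f}")
```

The F scores recomputed this way agree with the table to two decimal places; the small differences in the third decimal appear to come from the table having been calculated from the already rounded percentages.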

Jan2014-Dec2014 -using feature manipulation models

testing papers 	total flagged 226	172 positive	54 negatives	precision 76.1%

Expression pattern remodel

Expression pattern remodel

uP

User data submission

the submission form is here:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/expr_pattern.cgi

and it appends data here

http://tazendra.caltech.edu/~azurebrd/cgi-bin/data/expr.ace

Numbers

  • 9260 Expression objects in WS180 ( 7176 + 2084 Chronograms)
  • 12110 Expression objects in WS233 (10026 + 2084 Chronograms)
  • 13380 Expression objects in WS243 (11296 + 2084 Chronograms)
  • 14273 Expression objects in WS243 (12189 + 2084 Chronograms)

Constructs

Over 4000 Construct objects that did not have a 'standard nomenclature' name as transgenes -e.g. hkdEx1202 for extrachromosomal arrays- were imported into Construct OA.

In order to atomize curation details from the construction summary description into different fields, Juancarlos wrote a script.

the script is located here:


/home/postgres/work/pgpopulation/cns_construct/20141110_daniela_constructs/update_cns_by_daniela.pl*

More information/files on Lario in the folder Construct/Clone stuff

new allele request

Go to the name server and log in: http://www.sanger.ac.uk/sanger/Worm_NameServer

Check whether the new variation already exists by clicking on 'find variation'.

If it does: generate an ID in the OA at http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/generic.cgi?action=TempVariationObo, putting in the public name and the ID.

If it doesn't: on the name server click on 'request a new variation ID' and put in the public name and the paper, with additional info. Then go to the OA as above and generate an ID.

Expression tables

Gene | ExprID | Assay Type | Description | Expressed in | Subcellular localization | GO | Life stage | Transgene | Construction summary | Reference | Images
lin-3 | Expr9878 | In situ hybridization | The expression pattern of lin-3 in wild-type animals was determined at the late L2 to early L3 stage when vulval induction occurs.. | Anchor cell, pharynx, tail, germ line | | | L2 larva, L3 larva | | | Saffer et al., 2011 | yes
lin-3 | Expr506 | Reporter gene | Expressed in Anchor Cell (AC) at L3. | Anchor cell | | | L3 larva | Expr506_Ex | [lin-3::lacZ] translational fusion. | Hill RJ et al. (1992) | yes
lin-3 | Expr9352 | Reporter gene | During larval stages, LIN-3 expression was detected.. | Anchor cell, intestinal cell, pharynx, hypodermal cell | Expressed in the nucleus | Nucleus | L2 larva, L3 larva | syIs107 | [unc-119(+); Plin-3::pes-10::gfp] | Liu et al. (2011) | yes