Difference between revisions of "Expression Pattern"
(614 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
== Expression Pattern== | == Expression Pattern== | ||
− | Tags | + | Current model (WS280) |
+ | |||
+ | <pre> | ||
+ | |||
+ | |||
+ | ?Expr_pattern Expression_of Gene ?Gene XREF Expr_pattern #Evidence | ||
+ | Reflects_endogenous_expression_of ?Gene | ||
+ | CDS ?CDS XREF Expr_pattern // for coding genes | ||
+ | Sequence ?Sequence XREF Expr_pattern // for clones??? | ||
+ | Pseudogene ?Pseudogene XREF Expr_pattern // [030801 krb] | ||
+ | Clone ?Clone XREF Expr_pattern | ||
+ | Protein ?Protein XREF Expr_pattern | ||
+ | Protein_description Text // information for Expr_patterns with unknown antigens [031105 krb] | ||
+ | Homol Homol_homol ?Homol_data XREF Expr_homol ?Method Float Int UNIQUE Int Int UNIQUE Int #Homol_info //Expr_pattern mapping [060427 ar2] | ||
+ | Expression_data Life_stage ?Life_stage XREF Expr_pattern #Qualifier | ||
+ | Anatomy_term ?Anatomy_term XREF Expr_pattern #Qualifier | ||
+ | GO_term ?GO_term XREF Expr_pattern #GR_condition | ||
+ | Not_in_Life_stage ?Life_stage #Qualifier | ||
+ | Not_in_Anatomy_term ?Anatomy_term #Qualifier | ||
+ | Not_in_GO_term ?GO_term #GR_condition | ||
+ | Subcellular_localization ?Text | ||
+ | Type Antibody ?Text | ||
+ | Cis_regulatory_element Text | ||
+ | EPIC ?Text | ||
+ | Genome_editing ?Text | ||
+ | In_situ Text | ||
+ | Localizome ?Text | ||
+ | Microarray ?Microarray_experiment | ||
+ | Northern Text | ||
+ | Reporter_gene ?Text | ||
+ | RNASeq ?Analysis | ||
+ | RT_PCR Text | ||
+ | Tiling_array ?Analysis | ||
+ | Western Text | ||
+ | Expression_cluster ?Expression_cluster XREF Expr_pattern //added for localizome | ||
+ | Microarray_results ?Microarray_results XREF Expr_Pattern | ||
+ | Pattern ?Text | ||
+ | Picture ?Picture XREF Expr_pattern | ||
+ | MovieURL Text //Added by wen for link to movie URLs. | ||
+ | Movie ?Movie XREF Expr_pattern //Added by Wen to curate Expr_pattern video | ||
+ | Species UNIQUE ?Species | ||
+ | Remark ?Text #Evidence | ||
+ | DB_info ?Database ?Database_field Text | ||
+ | Experiment Laboratory ?Laboratory | ||
+ | Author ?Author | ||
+ | Date UNIQUE DateType | ||
+ | Strain UNIQUE ?Strain | ||
+ | Reference ?Paper XREF Expr_pattern | ||
+ | Transgene ?Transgene XREF Expr_pattern | ||
+ | Variation ?Variation XREF Expr_pattern | ||
+ | Construct ?Construct XREF Expression_pattern | ||
+ | Associated_feature ?Feature XREF Associated_with_expression_pattern #Evidence | ||
+ | Antibody_info ?Antibody XREF Expr_pattern // This applies to both Western & Antibody staining | ||
+ | // added [031120 krb] | ||
+ | Curated_by UNIQUE Text // Hinxton (HX) or Caltech (CIT) Sylvia [010927 krb] | ||
+ | Historical_gene ?Gene Text | ||
+ | |||
+ | |||
+ | //Qualifier hash will be used for Expr_pattern curation to specify the reliability of data. | ||
+ | |||
+ | #Qualifier Certain | ||
+ | Uncertain //For faint or variable expression | ||
+ | Partial //For expression of unidentified cell in a cell group | ||
+ | Anatomy_term ?Anatomy_term //combines life stage with anatomy term in expr pattern annotation | ||
+ | Life_stage ?Life_stage //combines life stage with anatomy term in expr pattern annotation | ||
+ | Remark Text //New tag to take the optional text from the Certain/Uncertain/Partial nested tags above. | ||
+ | |||
+ | |||
+ | |||
+ | </pre> | ||
+ | |||
+ | |||
+ | Tags used in Expr_pattern objects (WS221): | ||
Laboratory | Laboratory | ||
Line 43: | Line 115: | ||
* multiontology / multidropdown : allows multiple values | * multiontology / multidropdown : allows multiple values | ||
* toggle : on / off | * toggle : on / off | ||
+ | |||
+ | == Genes with expression == | ||
+ | To check the number of genes that have expression objects, run this script on tazendra: | ||
+ | |||
+ | /home/postgres/work/get_stuff/for_daniela/20140715_exp_gene_distinct/get_exp_gene_distinct.pl | ||
+ | |||
+ | *5575 as of August 2014 | ||
+ | *5734 as of January 2016 | ||
+ | |||
+ | == WS248 numbers == | ||
+ | For expression in WS248 we have 19052 objects from Yanai elegans, 68097 from Yanai other species, 10545 from Miller tiling arrays, and 13877 manually curated, for a total of 111571. | ||
+ | For pictures in WS248 we have 19052 objects from Yanai elegans, 68097 from Yanai other species, and 13912 manually curated, for a total of 101061. | ||
+ | |||
+ | These are the statistics from citace that Wen pulled on May 11th 2015: | ||
+ | Here are the changes from WS243 to WS248: | ||
+ | |||
+ | <pre> | ||
+ | find Antibody: 2525 --> 2785, 260 added. | ||
+ | find Anatomy_term: 6839 --> 6842, 3 added. | ||
+ | find Anatomy_function: 598 --> 924, 326 added. | ||
+ | find DO_term: 6350 --> 6571, 221 added. | ||
+ | find Expr_pattern: 42979 --> 111571, 68592 added. | ||
+ | find Picture: 32636 --> 101061, 68425 added. | ||
+ | |||
+ | </pre> | ||
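The WS248 totals and the WS243 to WS248 deltas quoted above can be re-derived with a few lines of arithmetic. The sketch below only re-adds the numbers already listed on this page; the dictionary and function names are illustrative and not part of any WormBase script.

```python
# Counts copied from the WS248 section of this page.
EXPR_SOURCES = {
    "Yanai elegans": 19052,
    "Yanai other species": 68097,
    "Miller tiling arrays": 10545,
    "manually curated": 13877,
}
PICTURE_SOURCES = {
    "Yanai elegans": 19052,
    "Yanai other species": 68097,
    "manually curated": 13912,
}
# (WS243 count, WS248 count) per class, as pulled from citace.
CLASS_COUNTS = {
    "Antibody": (2525, 2785),
    "Anatomy_term": (6839, 6842),
    "Anatomy_function": (598, 924),
    "DO_term": (6350, 6571),
    "Expr_pattern": (42979, 111571),
    "Picture": (32636, 101061),
}


def total(sources):
    """Sum the per-source object counts."""
    return sum(sources.values())


def added(class_name):
    """Number of objects added between WS243 and WS248 for one class."""
    before, after = CLASS_COUNTS[class_name]
    return after - before


if __name__ == "__main__":
    print("Expr_pattern total:", total(EXPR_SOURCES))
    print("Picture total:", total(PICTURE_SOURCES))
    for name in CLASS_COUNTS:
        print(name, "added:", added(name))
```

The per-source sums agree with the Expr_pattern and Picture counts in the citace statistics.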
== OA interface == | == OA interface == | ||
− | OA editor label -- postgres table name -- type of table and description. | + | OA editor label -- postgres table name -- type of table and description. |
+ | |||
+ | === Dumper === | ||
+ | |||
+ | In February 2015 we added the qualifier life stage field so that we could capture an anatomy term and a life stage associated with each other. | ||
+ | We added a qualifier_lifestage field and modified the dumper so that whenever there is an anatomy term and a life stage in the qualifier life stage field it will dump: | ||
+ | |||
+ | <pre> | ||
+ | Expr_pattern : "Expr12000" | ||
+ | Anatomy_term "WBbt:0004575" Life_stage "WBls:0000264" | ||
+ | Life_stage "WBls:0000264" Anatomy_term "WBbt:0004575" | ||
+ | </pre> | ||
+ | |||
+ | We also set it up so that if only the qualifier life stage is filled in, the dumper still dumps it. This is because data entered in the life stage field of the micropublication form goes into exp_qualifierls. | ||
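A minimal sketch of the dumping rule described in this section, written in Python for illustration only (it is not the actual dumper). The function name and arguments are hypothetical. Two behaviours are assumptions rather than facts from this page: that multiple anatomy terms and life stages are paired as a cross product, and that a life stage without any anatomy term is dumped as a plain Life_stage line.

```python
def dump_qualifier_life_stage(expr_name, anatomy_terms, qualifier_life_stages):
    """Return .ace text for one Expr_pattern object.

    anatomy_terms and qualifier_life_stages are lists of ontology IDs.
    """
    lines = ['Expr_pattern : "%s"' % expr_name]
    for life_stage in qualifier_life_stages:
        if not anatomy_terms:
            # Assumption: only the qualifier life stage is filled in
            # (e.g. data coming from the micropublication form).
            lines.append('Life_stage "%s"' % life_stage)
            continue
        for anatomy in anatomy_terms:
            # Both directions are written, as in the example dump above.
            lines.append('Anatomy_term "%s" Life_stage "%s"' % (anatomy, life_stage))
            lines.append('Life_stage "%s" Anatomy_term "%s"' % (life_stage, anatomy))
    return "\n".join(lines)


if __name__ == "__main__":
    print(dump_qualifier_life_stage("Expr12000", ["WBbt:0004575"], ["WBls:0000264"]))
```

With one anatomy term and one life stage this reproduces the three-line example shown in the section.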
=== Tab1 === | === Tab1 === | ||
*''Pgdbid'' -- no table -- postgres database ID, generates automatically upon entry. | *''Pgdbid'' -- no table -- postgres database ID, generates automatically upon entry. | ||
− | *''Expr_pattern'' -- exp_name -- | + | *''Expr_pattern'' -- Expr_pattern : "exp_name" -- text -- The Expression Pattern ID is generated when creating a new object by taking the highest Expr_pattern ID and increasing it by one. When making a new row, the OA looks at all entries in exp_name that begin with "Expr", captures the numbers, finds the highest number, adds 1 to it, puts 'Expr' in front, and uses that as the new name. |
− | + | *''Reference'' -- Reference "exp_paper" -- multiontology on paper WBPaperID - multiontology there are Expr objects with multiple papers associated. A query for that is: testdb=> SELECT * FROM exp_paper WHERE exp_paper ~ ','; and the result: | |
− | + | 302 | "WBPaper00001926","WBPaper00001469" | 2011-05-31 11:37:14.153562-07 | |
− | *''Reference'' -- exp_paper -- | + | 5478 | "WBPaper00002573","WBPaper00002922" | 2011-05-31 12:04:27.053611-07 |
− | *''Gene'' -- exp_gene -- | + | 5479 | "WBPaper00001785","WBPaper00002922" | 2011-05-31 12:04:27.284807-07 |
− | *''Anatomy'' -- exp_anatomy -- multiontology. Daniela will associate different Anatomy-qualifier-qualifier_text in different OA rows, so some Expr objects will have multiple rows / multiple pgids. When querying by any of these fields, if editing a different field, the curator should query by Expr to make sure all pgids for that object have that other field edited. | + | 5501 | "WBPaper00003285","WBPaper00001812" | 2011-05-31 12:04:34.649968-07 |
− | *''Qualifier'' -- exp_qualifier -- dropdown -- Certain / Uncertain / Partial | + | 5502 | "WBPaper00003285","WBPaper00001812" | 2011-05-31 12:04:34.893549-07 |
− | *''Qualifier Text'' -- | + | 5557 | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:04:50.273589-07 |
− | *''GO_term'' -- exp_goid -- | + | 5558 | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:04:50.51613-07 |
− | *''Subcellular_localization'' -- exp_subcellloc -- bigtext, details on subcellular localization. | + | 5559 | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:04:50.724796-07 |
− | *''Life_stage'' -- exp_lifestage -- multiontology like in the phenotype OA and picture OA | + | 5689 | "WBPaper00002573","WBPaper00002922" | 2011-05-31 12:05:33.692997-07 |
+ | 5707 | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:05:42.104837-07 | ||
+ | 8260 | "WBPaper00031556","WBPaper00032077" | 2011-05-31 12:18:44.374796-07 | ||
+ | |||
+ | *''Person'' -- "exp_person" -- multiontology on WBPersons, used to capture people for personal communications and to capture what was stored in the 'Author' field for old annotations. Added on Feb 8th, 2021. | ||
+ | *''Gene'' -- Gene "exp_gene" -- multiontology on genes WBGeneID - show WBID, locus, and synonym in term info | ||
+ | *''Endogenous'' -- "exp_endogenous" toggle tag in ace file Reflects_endogenous_expression_of | ||
+ | *''Rel Anatomy'' -- "exp_relanatomy" dropdown on part_of | ||
+ | *''Anatomy'' -- Anatomy_term "exp_anatomy" exp_qualifier "exp_qualifiertext" -- multiontology. Daniela will associate different Anatomy-qualifier-qualifier_text in different OA rows, so some Expr objects will have multiple rows / multiple pgids. When querying by any of these fields, if editing a different field, the curator should query by Expr to make sure all pgids for that object have that other field edited. | ||
+ | *''Qualifier'' -- exp_qualifier -- dropdown -- Certain / Uncertain / Partial / NOT (NOT is not dumping for now. Added feb 2015 to capture negative expression) | ||
+ | *''Anatomy certain'' -- exp_certain -- multiontology. Controlled vocabulary found here: https://github.com/raymond91125/Wao/raw/master/WBbt.obo (same as in Picture OA). We need to have 3 different Anatomy term boxes, one for the Partial, one for the certain and one for the uncertain Qualifiers. | ||
+ | *''Anatomy Partial'' -- exp_partial -- multiontology. | ||
+ | *''Anatomy Uncertain'' -- exp_uncertain -- multiontology. | ||
+ | *''Anatomy no qualifier''-- exp_noqualifier -- multiontology. We added this field because when we parsed the old expr_pattern data (WS226) 5518 anatomy_term lines did not have a #Qualifier. | ||
+ | *''Qualifier Text'' -- exp_qualifiertext -- bigtext | ||
+ | *''GO_term'' -- GO_term "exp_goid" -- multiontology of GO_Term like gop_goid. | ||
+ | *''Subcellular_localization'' -- Subcellular_localization "exp_subcellloc"-- bigtext, details on subcellular localization. | ||
+ | *''Rel LS'' -- "exp_rellifestage" dropdown on part_of and happens_during | ||
+ | *''Life_stage'' -- Life_stage "exp_lifestage" Convert the life stage IDs into names from the obo_name_lifestage -- multiontology like in the phenotype OA and picture OA | ||
+ | *''Species'' -- "exp_species" | ||
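The naming rule given for the ''Expr_pattern'' field in Tab1 (take every exp_name value that begins with "Expr", capture the number, take the highest, add 1, and put 'Expr' back in front) can be sketched as follows. This is an illustration in Python, not the OA code. The function name is hypothetical, and starting from Expr1 when no names exist is an assumption.

```python
import re


def next_expr_name(existing_names):
    """Return the next Expr_pattern name given the current exp_name values."""
    numbers = []
    for name in existing_names:
        match = re.match(r"Expr(\d+)$", name)
        if match:
            numbers.append(int(match.group(1)))
    # Assumption: with no existing Expr names, numbering starts at 1.
    highest = max(numbers) if numbers else 0
    return "Expr%d" % (highest + 1)


if __name__ == "__main__":
    print(next_expr_name(["Expr7477", "Expr12000", "Expr8715"]))
```

Comparing the captured numbers as integers rather than as strings matters here: a string comparison would rank "Expr999" above "Expr12000".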
− | Juancarlos parsed .ace dump from | + | on Nov 3rd 2014 we have added 4 fields that will not be dumped yet but will be used to aid granular curation and to port annotation extension into GO (and implemented also relations on feb 2015) |
+ | * Qualifier LS -> multiontology on life stages -> exp_qualifierls dependent_on | ||
+ | * GR Anatomy -> multiontology on anatomy terms -> exp_granatomy | ||
+ | * GR LS -> multiontology on life stages -> exp_grlifestage | ||
+ | * Rel Cell Cycle -- "exp_relcellcycle" dropdown on part_of, independent_of, happens_during, dependent_on | ||
+ | * GR Cell Cycle -> multiontology on GO -> exp_grcellcycle | ||
+ | |||
+ | Juancarlos parsed .ace dump from WS226: 5518 anatomy_term lines without a #Qualifier at all in | ||
expr_no_qualifier | expr_no_qualifier | ||
Line 69: | Line 205: | ||
796 unique text-expr linked to various anat_terms in | 796 unique text-expr linked to various anat_terms in | ||
expr_data_with_extra_anatomy | expr_data_with_extra_anatomy | ||
− | for example, look at "Expressed | + | for example, look at "Expressed in ventral male specific muscles." |
which has a unique Expr to multiple anat_terms ; or "1 neuron" linked | which has a unique Expr to multiple anat_terms ; or "1 neuron" linked | ||
to multiple different expr / anat_term | to multiple different expr / anat_term | ||
=== Tab2 === | === Tab2 === | ||
− | *''Type'' -- exp_exprtype -- multidropdown select from: Antibody, Reporter_gene, In_situ, RT_PCR, Northern, Western | + | *''Type'' -- exp_exprtype -- multidropdown select from: Antibody, Reporter_gene, In_situ, RT_PCR, Northern, Western |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | * Antibody_Text -- Antibody "exp_antibodytext" -- bigtext " this tag was used 462 times in WS221 | |
− | + | * Reporter_gene_Text -- Reporter_gene "exp_reportergene" -- bigtext " this tag was used 7273 times in WS221 and has been used twice for the same object -> lines are | separated. Details on reporter gene construct. Multiline, the dumper dumps multiple lines | |
+ | * In_Situ -- In_Situ "exp_insitu" -- bigtext " this tag was used 434 times in WS221 | ||
+ | * RT_PCR -- RT_PCR "exp_rtpcr" -- bigtext " this tag was used 165 times in WS221 | ||
+ | * Northern -- Northern "exp_northern" -- bigtext " this tag was used 347 times in WS221 | ||
+ | * Western -- Western "exp_western" -- bigtext " this tag was used 19 times in WS221 | ||
− | We | + | We have a multidropdown on the values above AND we have bigtext fields for each of the values above. D&J decided this on March 21 |
− | *''Picture'' -- exp_picture -- Multiontology on Picture | + | *''Picture'' -- exp_picture -- Multiontology on Picture We will remove this tag: Picture objects will be created in Picture OA and XREF to Expr_pattern. They will not be entered here. Removed from OA -- J We removed Pictures from Expr_pattern as they are XREF'd to it |
− | *''Picture flag'' -- exp_pictureflag -- toggle notify picture person with a cronjob every 2 weeks | + | *''Picture flag'' -- exp_pictureflag -- toggle notify picture person with a cronjob every 2 weeks. We keep this even if we remove the Picture tag (not currently used) |
− | + | *''Antibody_info'' -- Antibody_info "exp_antibody" -- multiontology on antibodies | |
− | |||
− | |||
− | *''Antibody_info'' -- exp_antibody -- multiontology on antibodies | ||
*''Antibody flag'' -- exp_antibodyflag -- toggle -> notify antibody person with a cronjob every 2 weeks | *''Antibody flag'' -- exp_antibodyflag -- toggle -> notify antibody person with a cronjob every 2 weeks | ||
− | *'' | + | *''Pattern'' -- Pattern "exp_pattern" -- bigtext, details on tissue distribution. Multiline |
− | *'' | + | *''Remark'' -- Remark "exp_remark" -- bigtext, if any comments required. Multiline |
− | *'' | + | *''Transgene'' -- Transgene "exp_transgene" -- multiontology on transgenes. |
+ | *''Construct'' -- Construct "exp_construct" -- multiontology on constructs. | ||
*''Transgene flag'' -- exp_transgeneflag -- toggle -> notify transgene person with a cronjob every 2 weeks | *''Transgene flag'' -- exp_transgeneflag -- toggle -> notify transgene person with a cronjob every 2 weeks | ||
+ | *''Sequence_feature'' -- exp_seqfeature -- multiontology on Features (WBsfIDs) | ||
*''Curator'' -- exp_curator -- Multiontology on people | *''Curator'' -- exp_curator -- Multiontology on people | ||
− | *''No dump'' -- exp_nodump -- Toggle | + | *''No dump'' -- exp_nodump -- Toggle Expr_pattern objects not to dump. If an Expr_pattern object is flagged as no dump, don't dump any data for that pgid, nor any other pgid that corresponds to the Expr_pattern object. (Read all exp_nodump + exp_name into a hash of Expr_patterns to not-dump.) |
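The ''No dump'' rule above works per Expr_pattern object and not per pgid: one flagged row suppresses every row that shares its exp_name. A minimal Python sketch of that filtering, assuming each OA row is available as a dictionary with the keys pgid, exp_name and exp_nodump (this row layout and the function name are hypothetical):

```python
def rows_to_dump(rows):
    """Drop every row whose Expr_pattern has at least one no-dump flag."""
    # First pass: collect the names of all flagged Expr_pattern objects.
    blocked = {row["exp_name"] for row in rows if row.get("exp_nodump")}
    # Second pass: keep only rows whose object was never flagged.
    return [row for row in rows if row["exp_name"] not in blocked]


if __name__ == "__main__":
    example = [
        {"pgid": 1, "exp_name": "Expr100", "exp_nodump": False},
        {"pgid": 2, "exp_name": "Expr100", "exp_nodump": True},
        {"pgid": 3, "exp_name": "Expr200", "exp_nodump": False},
    ]
    for row in rows_to_dump(example):
        print(row["pgid"], row["exp_name"])
```

Two passes are needed because the flag can sit on any pgid of the object, including one that comes after rows that would otherwise have been dumped.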
=== Tab3 === | === Tab3 === | ||
− | *''Protein_description'' -- exp_protein -- text (30 objects) | + | *''Protein_description'' -- Protein_description "exp_protein" -- text (30 objects)- cleaned up data for Alliance harmonization, asked to remove tag on Feb 2022. Deleted field from OA on Feb 25 2022 |
− | *''Clone'' -- exp_clone -- multiontology on clones (341 objects) (when OA is in place discuss with Chris on the clone class). Is there a better place to get clones than http://textpresso-dev.caltech.edu/gsa/worm/known_entities/Clone? | + | |
− | Clone and Strain lists could taken from spica from /home/citpub/arun/wb_entities/known_entities All of these don't have any Term Info (nor synonyms) if you need either of those you'd have to query WS for it, Karen probably knows how, she does it for other objects -- J ok, I don't think I'll need a term info and I need it mainly to parse old data which have a clone attached. so for now is fine as it is D ok, I'll change the parser to read these. | + | *''Clone'' -- Clone "exp_clone" -- multiontology on clones (341 objects) (when OA is in place discuss with Chris on the clone class). Is there a better place to get clones than http://textpresso-dev.caltech.edu/gsa/worm/known_entities/Clone? |
+ | Clone and Strain lists could be taken from spica from /home/citpub/arun/wb_entities/known_entities. None of these have any Term Info (nor synonyms); if you need either of those you'd have to query WS for it. Karen probably knows how, she does it for other objects. -- J. OK, I don't think I'll need term info; I need it mainly to parse old data which have a clone attached, so for now it is fine as it is. -- D. OK, I'll change the parser to read these. | ||
These lists are kept up to date daily with what is in acedb using the following cronjob: | These lists are kept up to date daily with what is in acedb using the following cronjob: | ||
06 23 * * * cd /home/citpub/arun/wb_entities; ./01getEntities.pl | 06 23 * * * cd /home/citpub/arun/wb_entities; ./01getEntities.pl | ||
− | *''Strain'' -- exp_strain -- multiontology on strains (812 objects). is there a better place to take the strain list then http://textpresso-dev.caltech.edu/gsa/worm/known_entities/Strain? ''' | + | |
+ | <pre> | ||
+ | In July 2014 there was a change in Hinxton that affected the clone list. From Juancarlos: | ||
+ | Thanks Paul D. I've switched the script to look at clones2.ace.gz | ||
+ | instead of clones.ace.gz and the data seems to have read in fine. | ||
+ | |||
+ | The script is not in a repo, but I've symlinked it so it shows here | ||
+ | http://tazendra.caltech.edu/~postgres/out/geneace/nightly_geneace.pl | ||
+ | so it will always be the current version there. The best thing though | ||
+ | would probably be to look at the wiki, to see what the script is | ||
+ | supposed to do, and I don't know where or if there is a wiki for it. | ||
+ | Karen, do we have one ? | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | *''Strain'' -- Strain "exp_strain" -- multiontology on strains (812 objects). Is there a better place to take the strain list than http://textpresso-dev.caltech.edu/gsa/worm/known_entities/Strain? | ||
Clone and Strain lists could be taken from spica from /home/citpub/arun/wb_entities/known_entities | Clone and Strain lists could be taken from spica from /home/citpub/arun/wb_entities/known_entities | ||
These lists are kept up to date daily with what is in acedb using the following cronjob: | These lists are kept up to date daily with what is in acedb using the following cronjob: | ||
06 23 * * * cd /home/citpub/arun/wb_entities; ./01getEntities.pl | 06 23 * * * cd /home/citpub/arun/wb_entities; ./01getEntities.pl | ||
− | |||
− | |||
− | |||
− | |||
+ | *''Sequence'' -- Sequence "exp_sequence" -- text (13 objects) F54E2 2x (clone), R05D8 2x (clone), Y38B5A (clone), "Z28375" -C "EMBL Z28375" (sequence), "Z28376" -C "EMBL Z28376" (sequence), "Z28377" -C "EMBL Z28377" (sequence), R11H6 (clone), Y40H4A (clone), U14525, C47G2 (clone), Z32673 (sequence). Obsoleted February 2021 | ||
+ | *''MovieURL'' -- MovieURL "exp_movieurl" -- (32 objects) text Deleted field from OA on Feb 25 2022 | ||
+ | *''Laboratory'' -- Laboratory "exp_laboratory" -- ontology (17 objects) cleaned up data for Alliance harmonization, asked to remove tag on Feb 2022 Deleted field from OA on Feb 25 2022 | ||
+ | *''Variation'' -- Variation "exp_variation". Multiontology on variations | ||
+ | |||
+ | === Tab4 === | ||
+ | * Micropublication exp_micropublication toggle | ||
+ | * removed from OA on Feb 8 2021, the fields below were initially put in for micropubs but are no longer needed: | ||
+ | ** Contact- ontology on persons -exp_contact | ||
+ | ** e-mail exp_email | ||
+ | ** Co-authors exp_coaut | ||
+ | ** Funding exp_funding | ||
+ | |||
+ | === Protein field addition to OA === | ||
+ | Since we will have expression data dumped into protein-to-GO, we need a protein field to capture the right protein isoform whenever authors specify isoform-specific subcellular localization. | ||
+ | |||
+ | On Oct 23rd 2017 Juancarlos moved data that was present in the 'exp_protein' postgres table to a new table called 'exp_proteindesc'. | ||
+ | He then used the protein table to store the Protein IDs. | ||
+ | He added a new field on tab 3 of Expression OA called exp_protein, that autocompletes on protein names. It is a multi ontology field - it could be that 2 isoforms share the same pattern/reagents. | ||
+ | The field is dumped as Protein "WP:CE06704" | ||
+ | |||
+ | cleaned up data for Alliance harmonization, asked to remove tag on Feb 2022 | ||
+ | |||
+ | === Microarray_results === | ||
+ | |||
+ | The field Microarray_results has been added to the model for WS247. This will allow mapping to gene for other species (remanei, briggsae, japonica) coming from the Yanai study. | ||
+ | Hinxton is mapping Microarray_results to Gene on the fly. | ||
+ | |||
+ | Make sure in the future not to add Microarray_result to C elegans expression objects to avoid overwriting any curated Gene references- DR 01-07-2015 | ||
+ | |||
+ | === Curated_by === | ||
+ | * Curated_by -- exp_curatedby -- text (6228 objects) This is a legacy thing, the values are only Hinxton and Caltech. | ||
+ | In July 2014 we discussed removing the Curated_by data. We generated a -D file for Citace minus and deposited it on CitaceMinus for the WS245 upload. | ||
+ | /Users/danielaraciti/Desktop/Expression_pattern curation/Curated_by/Curated_by.ace.edited. In the Expression OA all the objects that had a Curated_by HX tag are now assigned to Sylvia Martinelli, who historically set the Curated_by tag in the model. The list of pgids that were changed is here: /Users/danielaraciti/Desktop/Expression_pattern curation/Curated_by/Curated_by_hinxton.rtf | ||
+ | * Curated_by is currently used just by chronograms | ||
+ | |||
+ | == obsolete fields == | ||
− | + | February 2021: Request a model change to get rid of the following tags, more info below: | |
+ | * Cell -removed | ||
+ | * Cell_group -removed | ||
+ | * Author -- exp_author - removed | ||
+ | * Date -- exp_date (2617 objects) - removed | ||
+ | * Protein_description - asked to remove Feb 2022. Deleted field from OA on Feb 25 2022 | ||
+ | * Expressed_in - text 1 entry. No info attached to this term. Left out DR 06062011 - removed | ||
+ | * Protein - text 1 entry could be put in Protein_description. Expr1941 done DR 06062011 -asked to remove Feb 2022 Deleted field from OA on Feb 25 2022 | ||
+ | * Pseudogene - text (1 object) Expr111 done DR 06062011 - asked to remove Feb 2022 | ||
+ | * CDS | ||
+ | * Sequence | ||
− | |||
− | -- | + | *''Cell'' -- exp_cell -- text (26 objects)-> Consolidate these objects with the Anatomy_term field done DR 06062011: |
− | |||
Expr_pattern : "Expr7477" | Expr_pattern : "Expr7477" | ||
Cell "P3.p" Certain | Cell "P3.p" Certain | ||
Line 136: | Line 324: | ||
Cell "P7.p" Certain | Cell "P7.p" Certain | ||
Cell "P8.p" Certain | Cell "P8.p" Certain | ||
+ | |||
+ | done DR 06062011 | ||
Expr_pattern : "Expr7595" | Expr_pattern : "Expr7595" | ||
Cell "CANL" Uncertain | Cell "CANL" Uncertain | ||
Cell "CANR" Uncertain | Cell "CANR" Uncertain | ||
+ | |||
+ | done DR 06062011 | ||
Expr_pattern : "Expr7605" | Expr_pattern : "Expr7605" | ||
Cell "M4" Certain | Cell "M4" Certain | ||
+ | |||
+ | |||
+ | done DR 06062011 | ||
Expr_pattern : "Expr7632" | Expr_pattern : "Expr7632" | ||
Line 154: | Line 349: | ||
Cell "PVQL" Uncertain | Cell "PVQL" Uncertain | ||
Cell "PVQR" Uncertain | Cell "PVQR" Uncertain | ||
+ | |||
+ | done DR 06062011 | ||
Expr_pattern : "Expr7691" | Expr_pattern : "Expr7691" | ||
Line 162: | Line 359: | ||
Cell "P7.p" Certain | Cell "P7.p" Certain | ||
Cell "P8.p" Certain | Cell "P8.p" Certain | ||
+ | |||
+ | done DR 06062011 | ||
Expr_pattern : "Expr8715" | Expr_pattern : "Expr8715" | ||
Line 167: | Line 366: | ||
Cell "M.drpa" Certain | Cell "M.drpa" Certain | ||
− | * | + | done DR 06062011 |
− | * | + | |
+ | * Authors and Date: Data stored in ?Author is legacy data. The file was on Citace minus. Wen created a -D file to delete such data for WS281. We decided on 02.11.2021 during the WB meeting that we will import the names of authors present in the ?Author tag that are not listed as coauthors of the publication attached to the Expr_pattern object. For those, we will also capture the date. This information was added to the remark field for the following objects on 02.12.2021: | ||
− | + | ||
+ | WBPaper00005281: Expr26, Expr92, Expr94, Expr107, Expr112 | ||
+ | WBPaper00001469: Expr55 | ||
+ | WBPaper00002318: Expr133, Expr134, Expr135, Expr136, Expr137, Expr27, Expr60, Expr61, Expr62, Expr63, Expr64, Expr65, Expr66, Expr67, Expr68 | ||
+ | WBPaper00001752: Expr46, Expr47, Expr48, Expr49, Expr50, Expr51, Expr52 | ||
+ | WBPaper00001358: Expr56 | ||
+ | WBPaper00002049: Expr59 | ||
+ | WBPaper00001456: Expr86 | ||
+ | WBPaper00002551: Expr87 | ||
+ | |||
+ | To see a list of Expr_objects that included Author and Date Data, check the WS279.ace file on the WB FTP site. | ||
+ | |||
+ | ** To DO still: upload -D file on Spica (done), request a model change (done) dump the author data (todo) | ||
+ | |||
+ | * Protein_description 33 objects -> Decision: Move to remarks -> Done DR 2021/02/26. Redundant info such as CPL-1 in Protein_description and CPL-1 in gene name were omitted. | ||
+ | <pre>Example: Expr_pattern : "Expr450" | ||
+ | Gene "WBGene00000776" | ||
+ | Protein_description "CPL-1" | ||
+ | |||
+ | Expr_pattern : "Expr552" | ||
+ | Gene "WBGene00006528" | ||
+ | Protein_description "Tubulin alpha"</pre> | ||
+ | |||
+ | * Sequence 12 objects -> Decision: Move to remarks. Done DR 2021/03/15 | ||
+ | <pre>Example: Expr_pattern : "Expr12" | ||
+ | Gene "WBGene00003976" | ||
+ | Sequence "Z28377|Z28375|Z28376"</pre> | ||
− | * | + | ** Laboratory 23 objects -> can infer via publication -> Decision: good to ignore. Cleaned up OA. Deleted field from OA on Feb 25 2022 |
+ | <pre>Example: Expr_pattern : "Expr87" | ||
+ | … | ||
+ | Laboratory "ML" | ||
+ | Gene "WBGene00003012"</pre> | ||
− | |||
− | |||
=== Tags used only once that should be fixed === | === Tags used only once that should be fixed === | ||
− | |||
− | |||
− | |||
*''Homol_homol'' tag is used in Chronograms -> we will not include Chronograms in the OA. | *''Homol_homol'' tag is used in Chronograms -> we will not include Chronograms in the OA. | ||
− | |||
− | |||
− | |||
==== Comments for Parsing ExprCitace226 into OA ==== | ==== Comments for Parsing ExprCitace226 into OA ==== | ||
+ | Parsing files in /home/postgres/work/pgpopulation/exp_exprpattern | ||
Many entries for Anatomy_term don't have one of the Certain/Partial/Uncertain. We leave them without the qualifier. | Many entries for Anatomy_term don't have one of the Certain/Partial/Uncertain. We leave them without the qualifier. | ||
Line 216: | Line 438: | ||
We will not include Chronograms in Expr_OA anyway as they are one time large scale exp. | We will not include Chronograms in Expr_OA anyway as they are one time large scale exp. | ||
+ | We have 2084 chronograms | ||
+ | |||
+ | === to fix manually === | ||
+ | '''* INVALID DATA antibody [WBPaper00032450]:capg-1 Expr8708''' | ||
+ | '''* INVALID DATA antibody [cgc3002]:beta-filagenin Expr1442''' | ||
+ | '''* INVALID DATA antibody [cgc4387]:hsp-16.2 Expr1117''' | ||
+ | '''* INVALID DATA antibody [cgc6057]:daf-21 Expr2687''' | ||
+ | * INVALID DATA goid GO:0000141 Expr3919 Done DR06062011 | ||
+ | * INVALID DATA goid GO:0008221 Expr7871 Done DR06062011 | ||
+ | * INVALID DATA transgene Is001 Expr2646 Done DR06062011 | ||
+ | * INVALID DATA transgene Is007 Expr2646 Done DR06062011 | ||
+ | * INVALID DATA transgene leals30 Expr9151 Done DR06062011 | ||
+ | * INVALID DATA transgene pZMI.1In1 Expr725 Done DR06062011 | ||
+ | * INVALID DATA transgene pZMI.1In2 Expr725 Done DR06062011 | ||
+ | |||
+ | Need to correct the expression pattern transgene name | ||
+ | * Is001 -> WBPaper00006024_Is001 for Expr2646 WBPaper00006024 Done DR06062011 | ||
+ | * Is007 -> WBPaper00006024_Is007 for Expr2646 WBPaper00006024 Done DR06062011 | ||
+ | * pZMI.1In1 -> WBPaper00002501_In1 for Expr725 WBPaper00002501 Done DR06062011 | ||
+ | * pZMI.1In2 -> WBPaper00002501_In2 for Expr725 WBPaper00002501 Done DR06062011 | ||
+ | * Add leals30 Expr9151 WBPaper00037728 Done DR06062011 | ||
+ | |||
+ | |||
+ | Need to correct the expression pattern GO name | ||
+ | |||
+ | *GO:0000141 is now GO:0032432 Done DR06062011 | ||
+ | *GO:0008221 is now GO:0016529 Done DR06062011 | ||
+ | |||
+ | There was a list of Anatomy term objects with invalid IDs. This is the mapping from the old IDs to the new IDs: | ||
+ | |||
+ | * Old ID New ID | ||
+ | * WBbt000:6748 WBbt:0006748 | ||
+ | * WBbt:0003852 WBbt:0003851 | ||
+ | * WBbt:0004397 WBbt:0008116 | ||
+ | * WBbt:0004398 WBbt:0008111 | ||
+ | * WBbt:0004401 WBbt:0004392 | ||
+ | * WBbt:0004459 WBbt:0003664 | ||
+ | * WBbt:0004514 WBbt:0008052 | ||
+ | * WBbt:0004515 WBbt:0008050 | ||
+ | * WBbt:0004717 WBbt:0008046 | ||
+ | * WBbt:0004718 WBbt:0008051 | ||
+ | * WBbt:0004719 WBbt:0008049 | ||
+ | * WBbt:0004720 WBbt:0008047 | ||
+ | * WBbt:0004721 WBbt:0008045 | ||
+ | * WBbt:0004722 WBbt:0008044 | ||
+ | * WBbt:0005099 WBbt:0005830 | ||
+ | * WBbt:0005211 WBbt:0005801 | ||
+ | * WBbt:0005228 WBbt:0005214 | ||
+ | * WBbt:0005323 WBbt:0005831 | ||
+ | * WBbt:0005814 WBbt:0006909 | ||
+ | * WBbt:6789 WBbt:0006789 | ||
+ | |||
+ | all OK | ||
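For reference, the anatomy ID corrections listed above can be expressed as a lookup table. The sketch below is illustrative Python, not the parser that was actually used; the function name is hypothetical, and the check that a valid ID has the form WBbt: followed by seven digits is an assumption based on the IDs shown on this page.

```python
import re

# Old ID -> new ID, copied from the list above.
ANATOMY_ID_FIXES = {
    "WBbt000:6748": "WBbt:0006748",
    "WBbt:0003852": "WBbt:0003851",
    "WBbt:0004397": "WBbt:0008116",
    "WBbt:0004398": "WBbt:0008111",
    "WBbt:0004401": "WBbt:0004392",
    "WBbt:0004459": "WBbt:0003664",
    "WBbt:0004514": "WBbt:0008052",
    "WBbt:0004515": "WBbt:0008050",
    "WBbt:0004717": "WBbt:0008046",
    "WBbt:0004718": "WBbt:0008051",
    "WBbt:0004719": "WBbt:0008049",
    "WBbt:0004720": "WBbt:0008047",
    "WBbt:0004721": "WBbt:0008045",
    "WBbt:0004722": "WBbt:0008044",
    "WBbt:0005099": "WBbt:0005830",
    "WBbt:0005211": "WBbt:0005801",
    "WBbt:0005228": "WBbt:0005214",
    "WBbt:0005323": "WBbt:0005831",
    "WBbt:0005814": "WBbt:0006909",
    "WBbt:6789": "WBbt:0006789",
}

# Assumption: a well-formed anatomy ID is "WBbt:" plus seven digits.
WELL_FORMED = re.compile(r"WBbt:\d{7}$")


def fix_anatomy_id(term_id):
    """Return the corrected ID, or the input unchanged if no fix is listed."""
    return ANATOMY_ID_FIXES.get(term_id, term_id)


if __name__ == "__main__":
    for old_id in ("WBbt000:6748", "WBbt:6789", "WBbt:0004575"):
        print(old_id, "->", fix_anatomy_id(old_id))
```

Only two of the old IDs are malformed as strings; the others are well-formed IDs that were replaced, which is why a format check alone would not have caught them.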
+ | |||
+ | |||
+ | |||
+ | ===Clean up of objects that did not have a reference nor Author=== | ||
+ | February 2021 | ||
+ | * Clean up of expr_pattern data that did not have a reference nor a Person associated. These are empty old pattern objects. They have been deleted from postgres. Can track if necessary by looking at .ace files older than WS280 with a 'Merged to' search. | ||
+ | |||
+ | <pre>Expr_pattern : "Expr1996" | ||
+ | Remark "Merged to Expr2436."</pre> | ||
+ | |||
+ | == Importing the large scale Expression_pattern left on Citace Minus into OA == | ||
+ | |||
+ | File is on tazendra WS232LargeScaleExpr.ace | ||
+ | |||
+ | -D file for the import generated by Juancarlos | ||
+ | /home/postgres/work/pgpopulation/exp_exprpattern/20120502_largescale/DashDWS232LargeScaleExpr.ace | ||
+ | |||
+ | There were only "Date" data, not Curated_by nor Author. | ||
+ | We kept the "Date" values on Citace minus as we did for the previous import. The other field we ignored was pictures, but we did not keep them in Citace Minus as we get them via Picture OA. | ||
+ | |||
+ | -D file deposited in CitaceMinus Data_for_Citace_minus/Data_from_Daniela on May 9th 2012 | ||
+ | |||
+ | == Serial numbers for large scale imports == | ||
+ | |||
+ | <pre> | ||
+ | Itai Yanai WBPaper00041190 C elegans (Expr starting with 101 and 102) | ||
+ | Expression Expr1010178 to Expr1029229 | ||
+ | Picture WBPicture0001011201 to WBPicture0001030252 | ||
+ | |||
+ | David Miller Wormviz (Expr starting with 103 and 104) | ||
+ | Expression Expr1030000 to Expr1040545 | ||
+ | No pictures associated to the study | ||
+ | |||
+ | Itai Yanai WBPaper00041190 Other species (briggsae, japonica, remanei. They never transferred brenneri) (Expr starting with 105 till 111) | ||
+ | Expression Expr1050000 to Expr1118096 | ||
+ | Picture WBPicture0001030253 to WBPicture0001098349 | ||
+ | Gap in numbering expression objects and pictures (Expr1118097 till Expr1142791, WBPicture0001098350 till WBPicture0001123044) | ||
+ | to leave the slot for the missing brenneri data, in case they will submit | ||
+ | |||
+ | Itai Yanai WBPaper00046121 | ||
+ | Expression Expr1142792 to Expr1163308 | ||
+ | Picture WBPicture0001123045 to WBPicture0001143561 | ||
+ | |||
+ | TransgeneOme project WBPaper00041419 | ||
+ | Reserved Expression Expr1200000 to Expr1300000 | ||
+ | Reserved Picture WBPicture0002000000 to WBPicture0003000000 | ||
+ | the first expression object generated will always be Expr1200000, and Picture WBPicture0002000000 | ||
+ | The system is dynamic, we will always have a different number of objects every release, according to what new they add in TransgeneOme DB | ||
+ | |||
+ | Endrov Hench Paper WBPaper00046864 | ||
+ | Expression from Expr1170000 till Expr1170087 | ||
+ | Pictures from WBPicture0001150000 till WBPicture0001150087 | ||
+ | Movies WBMovie0000100000 till WBMovie0000100087 | ||
+ | |||
+ | Waterston paper (Packer 2019) | ||
+ | Expression from Expr2000000 to Expr2036352 | ||
+ | Pictures from WBPicture0001160000 till WBPicture0001196352 | ||
+ | |||
+ | </pre> | ||
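Before reserving a block of serial numbers for a new large scale import, it helps to confirm that the block does not collide with any range listed above. A minimal sketch in Python, assuming the Expr ranges are copied by hand from the table; the names EXPR_RANGES, find_overlaps and is_free are hypothetical helpers, not part of any existing script.

```python
# Reserved Expr_pattern serial ranges copied from the table above (inclusive bounds).
EXPR_RANGES = {
    "Yanai WBPaper00041190 C. elegans": (1010178, 1029229),
    "Miller Wormviz": (1030000, 1040545),
    "Yanai WBPaper00041190 other species": (1050000, 1118096),
    "Gap kept for brenneri": (1118097, 1142791),
    "Yanai WBPaper00046121": (1142792, 1163308),
    "Hench Endrov WBPaper00046864": (1170000, 1170087),
    "TransgeneOme WBPaper00041419 (reserved)": (1200000, 1300000),
    "Waterston 2019": (2000000, 2036352),
}


def find_overlaps(ranges):
    """Return pairs of range names whose serial numbers overlap."""
    ordered = sorted(ranges.items(), key=lambda item: item[1][0])
    overlaps = []
    widest_name, widest_end = None, None
    for name, (start, end) in ordered:
        # Compare against the furthest-reaching range seen so far,
        # so nested ranges are caught as well as adjacent ones.
        if widest_end is not None and start <= widest_end:
            overlaps.append((widest_name, name))
        if widest_end is None or end > widest_end:
            widest_name, widest_end = name, end
    return overlaps


def is_free(ranges, first, last):
    """True if the inclusive block first..last touches no reserved range."""
    return all(last < low or first > high for low, high in ranges.values())


print(find_overlaps(EXPR_RANGES))
print(is_free(EXPR_RANGES, 1163309, 1169999))
```

The same two functions can be pointed at a second dictionary holding the WBPicture ranges.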
+ | |||
+ | == Itai Yanai large scale import -WBPaper00041190 == | ||
+ | |||
+ | In order to display pictures of the expression time course we needed to generate expression objects. The objects (Expression and Picture) will be deleted once Wen has finished curating microarray data for all species described in the paper and once we have a way to generate images of expression on the fly, with data retrieved directly from SPELL. | ||
+ | |||
+ | For now Daniela and Juancarlos have generated two .ace files, one for pictures and one for expression. The files are on CitaceMinus and are called expr_pattern_Yanai.ace and pictures_Yanai.ace. | ||
+ | |||
+ | Expression pattern and Picture objects were given high numbers so that, when the new display system is in place, they can be deleted without affecting anything in the OA. | ||
+ | |||
+ | Expression objects go from Expr1010178 to Expr1029229 | ||
+ | |||
+ | Picture objects go from WBPicture0001011201 to WBPicture0001030252 | ||
+ | |||
+ | there are 19052 objects for each class. | ||
+ | |||
+ | Files are also located here | ||
+ | /Users/danielaraciti/Desktop/Citace_upload/Citace Minus Yanai | ||
+ | |||
+ | On December 5th 2014 pictures for the other species (briggsae, remanei and japonica) were added too: | ||
+ | Expr from Expr1050000 | ||
+ | Pictures from WBPicture0001030253 on | ||
+ | |||
+ | Additional info is in Yanai_Instructions_other_species2 on Lario | ||
+ | |||
+ | total number of objects: 20294 briggsae+ 21908 japonica+ 25895 remanei = 68097 | ||
+ | *the objects will go in WS247 | ||
+ | Hinxton will generate the WBGene name on the fly according to Microarray_results | ||
+ | |||
+ | |||
+ | TOTAL Yanai import elegans + other species: 87149 | ||
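The totals quoted in this section follow directly from the serial ranges, since each range is inclusive at both ends. A short Python sketch of the arithmetic, using only numbers stated above; count_ids is an illustrative helper name.

```python
def count_ids(first, last):
    """Number of serial numbers in the inclusive range first..last."""
    return last - first + 1


elegans = count_ids(1010178, 1029229)        # Expr1010178 to Expr1029229
other_species = count_ids(1050000, 1118096)  # Expr1050000 to Expr1118096

# Per-species object counts as reported above.
species_counts = {"briggsae": 20294, "japonica": 21908, "remanei": 25895}

print(elegans, other_species, elegans + other_species)
# -> 19052 68097 87149
```

The per-species counts sum to the size of the other-species range, which is a quick consistency check on the numbering.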
+ | |||
+ | == Itai Yanai 2015 large import -WBPaper00046121 == | ||
+ | Files here | ||
+ | /Users/danielaraciti/Desktop/Expression_pattern curation/Large scale/yanai_2015 | ||
+ | |||
+ | *Uploaded 20,517 objects each for expression and pictures | ||
+ | *Expr from Expr1142792 till Expr1163308 | ||
+ | *Pictures from WBPicture0001123045 till WBPicture0001143561 | ||
+ | |||
+ | *NB: there is a gap in the numbering of expression objects (Expr1118097 till Expr1142791). This is because Yanai's lab has not submitted the brenneri pictures yet. We inquired a few times but they were never transferred. We left the numbers available for the future. The brenneri .ace files, to be transferred to spica once they submit the images, are located here (Lario): | ||
+ | |||
+ | */Users/danielaraciti/Desktop/brenneri | ||
+ | |||
+ | == TransgeneOme import == | ||
+ | |||
+ | We are going to import expression data (Images, constructs, and annotations) from the TransgeneOme project -Sarov et al., Cell, 2012. WBPaper00041419. | ||
+ | |||
+ | [[TransgeneOme import]] | ||
+ | |||
+ | == David Miller tiling arrays import -WBPaper00037950 == | ||
+ | |||
+ | We want to add links to Wormviz for each gene in order to display graphic expression profiling from tiling arrays. | ||
+ | We are going to request a model change for Expr_pattern by adding a DB_INFO tag. | ||
+ | |||
+ | DB_INFO ?Database ?Database_field Text | ||
+ | |||
+ | We will also request the inclusion of Microarray and Tiling Array for Type | ||
+ | |||
+ | <pre> | ||
+ | |||
+ | Type Reporter_gene ?Text | ||
+ | In_situ Text | ||
+ | Antibody ?Text | ||
+ | Northern Text // Wen [krb 030425] | ||
+ | Western Text // Wen | ||
+ | RT_PCR Text // Wen | ||
+ | Localizome ?Text //added by Wen | ||
+ | Microarray ?Microarray_experiment // Daniela | ||
+ | Tiling_array ?Analysis// Daniela | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | This will make it easier to filter out Yanai's and Miller's data so they can be displayed in a separate widget, possibly called 'Expression profiling graphs'. | ||
+ | Model change requested on 10-09-2013. | ||
+ | Daniela and Juancarlos have generated a .ace file that was tested and read fine in acedb. More info on how the file was generated is here: /home/acedb/draciti/Expr_pattern/Miller_import. Please note that the script generates "Tiling Array" in the Type, whereas the actual tag is Tiling_array; Daniela changed it manually in the .ace file. Since it was suggested during model approval to add ?Analysis for the tiling arrays, Daniela modified the file and replaced it on CitaceMinus. | ||
+ | The file is also stored here on Lario | ||
+ | /Users/danielaraciti/Desktop/Citace_upload/Citace Minus Miller/WBPaper00037950.ace | ||
+ | |||
+ | What the file looks like | ||
+ | |||
+ | <pre> | ||
+ | Database : "Wormviz" | ||
+ | Name "Wormviz" | ||
+ | URL "http:\/\/www.vanderbilt.edu\/wormdoc\/wormmap\/Welcome.html" | ||
+ | URL_constructor "http:\/\/jsp.weigelworld.org\/wormviz\/tileviz.jsp?experiment=wormviz&normalization=absolute&probesetcsv=%s" | ||
+ | |||
+ | Expr_pattern : "Expr1030000" | ||
+ | Gene "WBGene00000001" | ||
+ | Pattern "Tiling arrays expression graphs" | ||
+ | Reference "WBPaper00037950" | ||
+ | Tiling_array | ||
+ | DB_INFO "Wormviz" "id" "WBGene00000001" | ||
+ | |||
+ | Expr_pattern : "Expr1030001" | ||
+ | Gene "WBGene00000002" | ||
+ | Pattern "Tiling arrays expression graphs" | ||
+ | Reference "WBPaper00037950" | ||
+ | Tiling_array | ||
+ | DB_INFO "Wormviz" "id" "WBGene00000002" | ||
+ | </pre> | ||
+ | |||
+ | Object names from Expr1030000 to Expr1040545. 10,546 objects | ||
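The entries above follow a fixed template, one object per gene, with consecutive serial numbers starting at Expr1030000. The following Python sketch shows how such a file could be produced and how the URL_constructor resolves to a per-gene link. It is an illustration under assumptions, not the script stored in Miller_import; the function names are hypothetical and the bare Tiling_array line mirrors the example above.

```python
# URL_constructor exactly as stored in the .ace file ("/" is escaped as "\/").
URL_CONSTRUCTOR = (
    r"http:\/\/jsp.weigelworld.org\/wormviz\/tileviz.jsp"
    r"?experiment=wormviz&normalization=absolute&probesetcsv=%s"
)


def miller_ace_entries(gene_ids, first_serial=1030000, paper="WBPaper00037950"):
    """Build one Expr_pattern entry per gene, numbered consecutively."""
    entries = []
    for offset, gene in enumerate(gene_ids):
        entries.append(
            f'Expr_pattern : "Expr{first_serial + offset}"\n'
            f'Gene "{gene}"\n'
            'Pattern "Tiling arrays expression graphs"\n'
            f'Reference "{paper}"\n'
            "Tiling_array\n"
            f'DB_INFO "Wormviz" "id" "{gene}"\n'
        )
    return "\n".join(entries)


def wormviz_url(gene_id, constructor=URL_CONSTRUCTOR):
    """Undo the .ace slash escaping and fill the %s slot with the gene ID."""
    return constructor.replace("\\/", "/").replace("%s", gene_id)


ace_text = miller_ace_entries(["WBGene00000001", "WBGene00000002"])
print(ace_text)
print(wormviz_url("WBGene00000001"))
```

With 10,546 genes in the input list, the last object produced by this numbering scheme is Expr1040545, matching the range given above.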
+ | |||
+ | ==Hench large scale import -Endrov== | ||
+ | WBPaper00046864 | ||
+ | |||
+ | Files sent to Juancarlos to create the .ace file: on tazendra: /home/acedb/draciti/Hench | ||
+ | to rerun the script, in the draciti dir: /home/azurebrd/work/parsings/daniela/20210922_hench/hench_ls_set.pl | ||
+ | |||
+ | .ace file generated here: on tazendra /home/azurebrd/work/parsings/daniela/20210922_hench/ | ||
+ | |||
+ | |||
+ | Expr from Expr1170000 till Expr1170087 | ||
+ | Pictures from WBPicture0001150000 till WBPicture0001150087 | ||
+ | Movies WBMovie0000100000 till WBMovie0000100087 | ||
+ | |||
+ | ==Packer 2019== | ||
+ | single cell RNA seq embryonic data | ||
+ | Files sent to Juancarlos to generate the .ace file: On tazendra at /home/acedb/draciti/Paker2019 | ||
+ | |||
+ | .ace file generated here /home/azurebrd/work/parsings/daniela/20220418_waterston/paker.ace | ||
+ | |||
+ | Expression from Expr2000000 to Expr2036352 | ||
+ | Pictures from WBPicture0001160000 till WBPicture0001196352 | ||
+ | * May 2023: discovered that the image files had wrong mappings between the gene public name and the WBGene ID | ||
+ | ** generated a file 'Rename' that lists the correct mappings. On tazendra at /home/azurebrd/work/parsings/Daniela/20220418_waterston. | ||
+ | ** renamed the image files on Canopus: /home/daniela/OICR/Pictures/WBPerson1562 | ||
+ | ** generated a -D file for Citace Minus: | ||
+ | |||
+ | <pre> Picture : WBPicture0001171211 | ||
+ | -D Name "egl-1_WBGene00001186_embryo_terminal.jpg" | ||
+ | Name "egl-1_WBGene00001170_embryo_terminal.jpg" | ||
+ | |||
+ | Expression | ||
+ | Expr_pattern : Expr2011211 | ||
+ | -D Gene "WBGene00001186" | ||
+ | -D Reflects_endogenous_expression_of "WBGene00001186" | ||
+ | Gene "WBGene00001170" | ||
+ | Reflects_endogenous_expression_of "WBGene00001170" | ||
+ | </pre> | ||
+ | ** Tested on spica, all good | ||
+ | ** Patch the file for WS289 | ||
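The correction above pairs every -D line with the corrected value for both the Picture and the Expr_pattern object. A minimal Python sketch of how one such patch block could be generated from a row of the 'Rename' mapping; the function name rename_patch and its parameters are assumptions for illustration, and the actual -D file was not necessarily produced this way.

```python
def rename_patch(picture_id, expr_id, symbol, old_gene, new_gene,
                 suffix="embryo_terminal.jpg"):
    """Build the -D / re-add block for one wrongly mapped gene."""
    old_name = f"{symbol}_{old_gene}_{suffix}"
    new_name = f"{symbol}_{new_gene}_{suffix}"
    return "\n".join([
        f"Picture : {picture_id}",
        f'-D Name "{old_name}"',   # remove the wrongly named image
        f'Name "{new_name}"',      # add the corrected file name
        "",
        f"Expr_pattern : {expr_id}",
        f'-D Gene "{old_gene}"',
        f'-D Reflects_endogenous_expression_of "{old_gene}"',
        f'Gene "{new_gene}"',
        f'Reflects_endogenous_expression_of "{new_gene}"',
        "",
    ])


patch = rename_patch("WBPicture0001171211", "Expr2011211", "egl-1",
                     "WBGene00001186", "WBGene00001170")
print(patch)
```

Looping this over every row of the mapping file and concatenating the results would give a single patch file to test on spica.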
+ | |||
+ | == Reilly 2020 - WBPaper00060123 == | ||
+ | The large scale dataset from Reilly, 2020, WBPaper00060123 was imported into OA via script. | ||
+ | Files here: /Users/draciti/Desktop/Reilly/Reilly_202109_for_Juancarlos | ||
+ | |||
+ | Script to parse the supplemental tables- tazendra | ||
+ | /home/postgres/work/pgpopulation/exp_exprpattern/20210915_reilly_set/to_populate | ||
+ | |||
+ | starting at pgid 19442 Expr15560-Expr15660 | ||
+ | |||
+ | == EPIC detailed == | ||
+ | |||
+ | Cell/time specific expression data have been generated by Wen | ||
+ | /Users/danielaraciti/Desktop/Expression_pattern curation/Large scale/Murray/epic.ace | ||
+ | |||
+ | The folder also contains the digitized Sulston tree, the files that John Murray sent with positive/negative calls, and lifestage.ace, which contains all the new life stages. | ||
+ | |||
+ | The EPIC.ace file was uploaded to CitaceMinus for WS246. | ||
+ | |||
+ | == Deleting files from Citace Minus == | ||
+ | |||
+ | After parsing the WS226 data into the OA we dumped a .ace file to generate a -D file for deleting objects from Citace Minus. All the invalid objects found while parsing the data (e.g. old anatomy term IDs, old GO terms, invalid transgenes and antibody objects) were added to the file manually. | ||
+ | See the list under Data to fix manually in this wiki. | ||
+ | |||
+ | == Expression-paper association == | ||
+ | |||
+ | For papers curated: | ||
+ | |||
+ | find Expr_pattern; follow Reference | ||
+ | |||
+ | For genes related: | ||
+ | |||
+ | find Expr_pattern; follow Gene | ||
+ | |||
+ | |||
+ | == Dumper == | ||
+ | |||
+ | The Sequence field does not dump correctly, e.g. Expr_pattern : "Expr980" Sequence "R05D8|F54E2". Need to fix it. Fixed 06162011 | ||
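The problem arises because several values are stored in one postgres field separated by a pipe, and each value needs its own line in the .ace output. The sketch below illustrates that splitting step in Python; it is a simplified restatement of what the Perl module's %pipeSplit list and getData subroutine do, and the names PIPE_SPLIT_TABLES, split_values and ace_lines are hypothetical.

```python
# Tables whose text field may hold several pipe-separated values
# (same list as %pipeSplit in the Perl module).
PIPE_SPLIT_TABLES = {
    "subcellloc", "antibodytext", "reportergene", "insitu", "rtpcr",
    "northern", "western", "pattern", "remark", "sequence",
}


def split_values(table, data):
    """Split one stored field into its individual values."""
    data = data.strip()
    if '","' in data:
        # multi-ontology / multi-dropdown fields are stored as "a","b","c"
        parts = data.strip('"').split('","')
    elif table in PIPE_SPLIT_TABLES:
        parts = data.split("|")
    else:
        parts = [data]
    return [part.strip() for part in parts if part.strip()]


def ace_lines(tag, table, data):
    """One .ace line per distinct value, sorted for stable output."""
    return [f'{tag}\t"{value}"' for value in sorted(set(split_values(table, data)))]


print(ace_lines("Sequence", "sequence", "R05D8|F54E2"))
```

For the Expr980 example, this yields separate Sequence lines for F54E2 and R05D8 instead of a single line containing the pipe.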
+ | |||
+ | |||
+ | Module located here: /home/postgres/work/citace_upload/expr_pattern/get_expr_pattern_ace.pm | ||
+ | |||
+ | Script that calls the module located here: /home/postgres/work/citace_upload/expr_pattern/use_package.pl | ||
+ | |||
+ | use lib qw( /home/postgres/work/citace_upload/expr_pattern ); # this command line tells where to look for the module | ||
+ | use get_expr_pattern_ace; # tells to use the module | ||
+ | |||
+ | my $outfile = 'expr_pattern.ace'; | ||
+ | my $errfile = 'err.out'; # we did not set any rule for errors yet | ||
+ | |||
+ | open (OUT, ">$outfile") or die "Cannot create $outfile : $!\n"; | ||
+ | open (ERR, ">$errfile") or die "Cannot create $errfile : $!\n"; | ||
+ | |||
+ | |||
+ | my ($all_entry, $err_text) = &getExprPattern('all'); # uses the module to get all the Expr_pattern objects | ||
+ | |||
+ | print OUT "$all_entry\n"; # prints everything into the output expr_pattern file | ||
+ | if ($err_text) { print ERR "$err_text\n"; } # prints error into the output error file | ||
+ | |||
+ | close (OUT) or die "Cannot close $outfile : $!"; | ||
+ | close (ERR) or die "Cannot close $errfile : $!"; | ||
+ | |||
+ | |||
+ | === Module === | ||
+ | |||
+ | package get_expr_pattern_ace; #name of the package | ||
+ | require Exporter; # exports so that other perl scripts can use it | ||
+ | |||
+ | |||
+ | our @ISA = qw(Exporter); | ||
+ | our @EXPORT = qw( getExprPattern ); # we are only exporting the getExprPattern subroutine | ||
+ | our $VERSION = 1.00; | ||
+ | |||
+ | |||
+ | use strict; | ||
+ | use diagnostics; | ||
+ | use DBI; | ||
+ | |||
+ | |||
+ | my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n"; # connect to postgres and the testdb database | ||
+ | |||
+ | my $result; | ||
+ | |||
+ | my %theHash; # where all the data are going to be stored | ||
+ | |||
+ | my @tables = qw( name paper gene endogenous anatomy qualifier qualifiertext qualifierls goid subcellloc lifestage exprtype antibodytext reportergene insitu rtpcr northern western antibody pattern remark transgene construct curator nodump protein clone strain seqfeature sequence movieurl laboratory variation species ); # all the tables that have data | ||
+ | |||
+ | my @maintables = qw( paper gene anatomy goid subcellloc lifestage qualifierls exprtype antibodytext reportergene insitu rtpcr northern western antibody pattern remark transgene construct protein clone strain seqfeature sequence movieurl laboratory variation species ); # tables that have .ace tags | ||
+ | |||
+ | |||
+ | my $all_entry = ''; # where all the .ace data is going to go | ||
+ | my $err_text = ''; # where all the error data is going to go | ||
+ | |||
+ | my %nameToIDs; # maps the expr_object id to PGID # type -> name -> ids -> count | ||
+ | my %ids; # list of PGIDs | ||
+ | |||
+ | my %pipeSplit; # tables that need to split on pipes | ||
+ | $pipeSplit{subcellloc}++; | ||
+ | $pipeSplit{antibodytext}++; | ||
+ | $pipeSplit{reportergene}++; | ||
+ | $pipeSplit{insitu}++; | ||
+ | $pipeSplit{rtpcr}++; | ||
+ | $pipeSplit{northern}++; | ||
+ | $pipeSplit{western}++; | ||
+ | $pipeSplit{pattern}++; | ||
+ | $pipeSplit{remark}++; | ||
+ | $pipeSplit{sequence}++; | ||
+ | |||
+ | my %tableToTag; # mapping table to the .ace tag | ||
+ | $tableToTag{paper} = 'Reference'; | ||
+ | $tableToTag{gene} = 'Gene'; | ||
+ | $tableToTag{anatomy} = 'Anatomy_term'; | ||
+ | $tableToTag{qualifierls} = 'Life_stage'; | ||
+ | $tableToTag{goid} = 'GO_term'; | ||
+ | $tableToTag{subcellloc} = 'Subcellular_localization'; | ||
+ | $tableToTag{lifestage} = 'Life_stage'; | ||
+ | $tableToTag{exprtype} = 'Special'; | ||
+ | $tableToTag{antibodytext} = 'Antibody'; | ||
+ | $tableToTag{reportergene} = 'Reporter_gene'; | ||
+ | $tableToTag{insitu} = 'In_situ'; | ||
+ | $tableToTag{rtpcr} = 'RT_PCR'; | ||
+ | $tableToTag{northern} = 'Northern'; | ||
+ | $tableToTag{western} = 'Western'; | ||
+ | $tableToTag{antibody} = 'Antibody_info'; | ||
+ | $tableToTag{pattern} = 'Pattern'; | ||
+ | $tableToTag{remark} = 'Remark'; | ||
+ | $tableToTag{transgene} = 'Transgene'; | ||
+ | $tableToTag{protein} = 'Protein_description'; | ||
+ | $tableToTag{clone} = 'Clone'; | ||
+ | $tableToTag{strain} = 'Strain'; | ||
+ | $tableToTag{sequence} = 'Sequence'; | ||
+ | $tableToTag{movieurl} = 'MovieURL'; | ||
+ | $tableToTag{laboratory} = 'Laboratory'; | ||
+ | |||
+ | |||
+ | my %ontologyIdToName; # mappings for ids to names (only for life stage) | ||
+ | |||
+ | 1; | ||
+ | |||
+ | sub getExprPattern { | ||
+ | my ($flag) = shift; # can be 'all' or the name of an expr_id | ||
+ | |||
+ | &populateOntIdToName(); # call the subroutine at the bottom of the page for life stage name mapping | ||
+ | |||
+ | if ( $flag eq 'all' ) { $result = $dbh->prepare( "SELECT * FROM exp_name ; " ); } # get all entries for type | ||
+ | else { $result = $dbh->prepare( "SELECT * FROM exp_name WHERE exp_name = '$flag' ;" ); } # get all entries for type of object name | ||
+ | $result->execute(); # execute the query | ||
+ | while (my @row = $result->fetchrow) { # it's going to do the following for every row of the query | ||
+ | $theHash{object}{$row[0]} = $row[1]; # it's going to map the PGIDs to the Expr_ID | ||
+ | $nameToIDs{object}{$row[1]}{$row[0]}++; # for every Expr_ID we will get all the corresponding PGIDs | ||
+ | $ids{$row[0]}++; } # list of all the PGIDs | ||
+ | my $ids = ''; my $qualifier = ''; # if it looks for a specific subset of Expr_pattern it searches only for that subset of PGIDs from the %ids | ||
+ | if ($flag ne 'all') { $ids = join"','", sort keys %ids; $qualifier = "WHERE joinkey IN ('$ids')"; } | ||
+ | foreach my $table (@tables) { | ||
+ | $result = $dbh->prepare( "SELECT * FROM exp_$table $qualifier;" ); # get data for table with qualifier (or not if not) | ||
+ | $result->execute(); | ||
+ | while (my @row = $result->fetchrow) { $theHash{$table}{$row[0]} = $row[1]; } # loops for all the values and store them in the hash | ||
+ | } # foreach my $table (@tables) | ||
+ | foreach my $name (sort keys %{ $nameToIDs{object} }) { # loops through all the names that are in $nameToIDs{object} | ||
+ | my $entry = ''; my $has_data; # entry has .ace data for that expr_object. $has_data is a flag for objects that have data | ||
+ | $entry .= "\nExpr_pattern : \"$name\"\n"; # add to the .ace entry the header Expr_pattern : "Expr1234" | ||
+ | |||
+ | my %cur_entry; # hash used as a filter: it excludes duplicated, overlapping rows (e.g. the same qualifier -partial, certain, uncertain- dumped twice) | ||
+ | foreach my $joinkey (sort {$a<=>$b} keys %{ $nameToIDs{object}{$name} }) { # it loops through all the PGIDs for the current name | ||
+ | next if ($theHash{nodump}{$joinkey}); # skips if the pgid has a NO DUMP flag | ||
+ | foreach my $table (@maintables) { # it loops through the main tables (the ones with the .ace tag) | ||
+ | next unless ($tableToTag{$table}); # it skips it if there's no tag | ||
+ | my $tag = $tableToTag{$table}; # gets the tag | ||
+ | if ($table eq 'anatomy') { # in case of anatomy it does the following | ||
+ | my %e1 = &getData($table, $joinkey); # gets the anatomy term list (based on the PGIDs) | ||
+ | my %e2 = &getData('qualifier', $joinkey); # gets the qualifier for the previous anatomy term list (based on the PGIDs) | ||
+ | my %e3 = &getData('qualifiertext', $joinkey); # gets the qualifier text for the previous anatomy term list (based on the PGIDs) | ||
+ | my %e4 = &getData('qualifierls', $joinkey); # gets the life stage qualifier for the previous anatomy term list (based on the PGIDs) | ||
+ | my $l2_exists = 0; my $l3_exists = 0; my $l4_exists = 0; # by default there is no qualifier, no qualifier text and no life stage qualifier | ||
+ | foreach my $e1 (sort keys %e1) { # loops through all anatomy | ||
+ | foreach my $e4 (sort keys %e4) { # dump anatomy to qualifierls in both directions for every crossproduct, if there is an anatomy. 2015 02 03 | ||
+ | $l4_exists++; | ||
+ | $cur_entry{"$tag\t\"$e1\" Life_stage \"$e4\"\n"}++; | ||
+ | $cur_entry{"Life_stage\t\"$e4\" Anatomy_term \"$e1\"\n"}++; } | ||
+ | foreach my $e2 (sort keys %e2) { # loops through all qualifier | ||
+ | foreach my $e3 (sort keys %e3) { # loops through all qualifier text | ||
+ | $cur_entry{"$tag\t\"$e1\" $e2 \"$e3\"\n"}++; $l3_exists++; } # if it finds the qualifier and the qualifier text it adds them to the filter for later printing and notes that it found a qualifier text | ||
+ | unless ($l3_exists) { # if there is no qualifier text | ||
+ | $cur_entry{"$tag\t\"$e1\" $e2\n"}++; $l2_exists++; } } # and it finds the qualifier, it adds it to the filter for later printing and notes that it found a qualifier | ||
+ | unless ( ($l2_exists) || ($l3_exists) || ($l4_exists) ) { # if there is no qualifier, qualifier text or life stage qualifier | ||
+ | $cur_entry{"$tag\t\"$e1\"\n"}++; } } } # it adds to the filter just the Anatomy tag and data (e.g. Anatomy_term^t "WBbt:1234567") | ||
+ | elsif ($table eq 'qualifierls') { # micropub data could have qualifierls without anatomy | ||
+ | my %e1 = &getData($table, $joinkey); | ||
+ | my %e2 = &getData('anatomy', $joinkey); | ||
+ | if (scalar keys %e2 < 1) { # if there is no anatomy data, dump each qualifierls (if there was anatomy it would have dumped above under the anatomy section) | ||
+ | foreach my $e1 (sort keys %e1) { | ||
+ | $cur_entry{"$tag\t\"$e1\"\n"}++; } } } | ||
+ | elsif ($table eq 'exprtype') { # it checks for expr_type | ||
+ | my %entries = &getData($table, $joinkey); # gets data for expr_type and PGID | ||
+ | foreach my $entry (sort keys %entries) { $cur_entry{"$entry\n"}++; } } # for each data it adds the data to the filter but does not add the .ace tag | ||
+ | else { | ||
+ | my %entries = &getData($table, $joinkey); # gets data for every PGID and every other table that has a tag | ||
+ | foreach my $entry (sort keys %entries) { $cur_entry{"$tag\t\"$entry\"\n"}++; } } # for each data it adds the data and the .ace tag to the filter | ||
+ | } | ||
+ | } # foreach my $joinkey (sort {$a<=>$b} keys %{ $nameToIDs{$type}{$name} }) | ||
+ | foreach my $line (sort keys %cur_entry) { $entry .= $line; $has_data++; } # for each line in the filter it adds it to the .ace entry and it flag it has data | ||
+ | if ($has_data) { $all_entry .= $entry; } # if it has data it adds this entry to all the entries | ||
+ | } # foreach my $name (sort keys %{ $nameToIDs{$type} }) | ||
+ | return( $all_entry, $err_text ); # it returns all the results to the use package script | ||
+ | } # sub getExprPattern | ||
+ | |||
+ | sub getData { # get hash of values in this table | ||
+ | my ($table, $joinkey) = @_; # gets the table and the PGID | ||
+ | my %entries; # it stores all the data for this table and PGID | ||
+ | if ($theHash{$table}{$joinkey}) { # if it has data | ||
+ | my $data = $theHash{$table}{$joinkey}; # it gets the data | ||
+ | unless ($table eq 'remark') { | ||
+ | if ($data =~ m/^\"/) { $data =~ s/^\"//; } # it strips a leading and a trailing " everywhere but not in the remarks field | ||
+ | if ($data =~ m/\"$/) { $data =~ s/\"$//; } } | ||
+ | if ($data =~ m/\//) { $data =~ s/\//\\\//g; } # it escapes / with \ | ||
+ | if ($data =~ m/\r/) { $data =~ s/\r//g; } # it strips the ^M (carriage return) characters | ||
+ | if ($data =~ m/\n/) { $data =~ s/\n/ /g; } # it replaces line breaks with 2 spaces | ||
+ | if ($data =~ m/^\s+/) { $data =~ s/^\s+//g; } if ($data =~ m/\s+$/) { $data =~ s/\s+$//g; } # if it begins or ends with a space it gets rid of those | ||
+ | my @data; # this is an array for storing multiple data | ||
+ | if ($data =~ m/\",\"/) { @data = split/\",\"/, $data; } # if the data is a multiontology or multidropdown, it splits on the "," | ||
+ | elsif ($pipeSplit{$table}) { @data = split/\|/, $data; } # otherwise if it is in the list of the pipe split tables it splits on the pipe | ||
+ | else { push @data, $data; } # if it is neither of those treats the data as if it is the only entry | ||
+ | foreach my $value (@data) { # for each of those multiple values | ||
+ | if ($value =~ m/\"/) { $value =~ s/\"/\\\"/g; } # if there is a " it adds a backslash to neutralize it for acedb | ||
+ | if ($value =~ m/^\s+/) { $value =~ s/^\s+//g; } # if the data begins with a space get rid of the space | ||
+ | if ($value =~ m/\s+$/) { $value =~ s/\s+$//g; } # if the data ends with a space get rid of the space | ||
+ | if ($table eq 'lifestage') { if ($ontologyIdToName{$table}{$value}) { $value = $ontologyIdToName{$table}{$value}; } } # convert life stage ids to lifestage names. 2011 05 13 # if it's a life stage and there's an ID to name mapping for these data then it uses the name instead of the ID | ||
+ | if ($value) { $entries{$value}++; } # if after all of the above there is a value it adds to a filter of values | ||
+ | } | ||
+ | } | ||
+ | return %entries; # it returns all the data that it got for this table and PGID | ||
+ | } # sub getData | ||
+ | |||
+ | sub populateOntIdToName { # reads from obo_name_lifestage to get the mappings from life_stage id to name | ||
+ | $result = $dbh->prepare( "SELECT * FROM obo_name_lifestage;" ); $result->execute(); | ||
+ | while (my @row = $result->fetchrow) { $ontologyIdToName{'lifestage'}{$row[0]} = $row[1]; } | ||
+ | } # sub populateOntIdToName | ||
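The core of the dumper is the way anatomy terms are combined with qualifiers, qualifier text and life stage qualifiers, with a hash acting as a duplicate filter. The Python sketch below restates that logic in a simplified form: it decides per anatomy term whether a bare line is needed, whereas the Perl flags persist across terms within one PGID. The function name anatomy_lines is hypothetical.

```python
def anatomy_lines(anatomy, qualifiers=(), qualifier_texts=(), life_stages=()):
    """Return the de-duplicated, sorted .ace lines for one PGID."""
    lines = set()  # plays the role of %cur_entry: identical lines collapse
    for term in sorted(anatomy):
        emitted = False
        for stage in sorted(life_stages):
            # anatomy and life stage are cross-referenced in both directions
            lines.add(f'Anatomy_term\t"{term}" Life_stage "{stage}"')
            lines.add(f'Life_stage\t"{stage}" Anatomy_term "{term}"')
            emitted = True
        for qualifier in sorted(qualifiers):
            if qualifier_texts:
                for text in sorted(qualifier_texts):
                    lines.add(f'Anatomy_term\t"{term}" {qualifier} "{text}"')
            else:
                lines.add(f'Anatomy_term\t"{term}" {qualifier}')
            emitted = True
        if not emitted:
            # no qualifier of any kind: dump the bare anatomy term
            lines.add(f'Anatomy_term\t"{term}"')
    return sorted(lines)


print(anatomy_lines(["WBbt:0005323"], qualifiers=["Certain"]))
print(anatomy_lines(["WBbt:0005323"]))
```

Using a set for the collected lines mirrors the Perl hash: two PGIDs of the same object that produce the same line contribute it only once.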
+ | |||
+ | We added an error check for dead genes and invalid papers on May 31st 2012: | ||
+ | |||
+ | elsif ($table eq 'gene') { | ||
+ | my %entries = &getData($table, $joinkey); | ||
+ | foreach my $entry (sort keys %entries) { | ||
+ | if ($deadObjects{gene}{$entry}) { $err_text .= "$name has dead gene $entry $deadObjects{gene}{$entry}\n"; } | ||
+ | else { $cur_entry{"$tag\t\"$entry\"\n"}++; } } } | ||
+ | elsif ($table eq 'paper') { | ||
+ | my %entries = &getData($table, $joinkey); | ||
+ | foreach my $entry (sort keys %entries) { | ||
+ | if ($deadObjects{paper}{$entry}) { $err_text .= "$name has dead paper $entry $deadObjects{paper}{$entry}\n"; } | ||
+ | else { $cur_entry{"$tag\t\"$entry\"\n"}++; } } } | ||
+ | |||
+ | |||
+ | sub populateDeadObjects { | ||
+ | $result = $dbh->prepare( "SELECT * FROM gin_dead;" ); $result->execute(); | ||
+ | while (my @row = $result->fetchrow) { $deadObjects{gene}{"WBGene$row[0]"} = $row[1]; } | ||
+ | $result = $dbh->prepare( "SELECT * FROM pap_status WHERE pap_status = 'invalid';" ); $result->execute(); | ||
+ | while (my @row = $result->fetchrow) { $deadObjects{paper}{"WBPaper$row[0]"} = $row[1]; } | ||
+ | } # sub populateDeadObjects | ||
+ | |||
+ | ===Historical Gene tag=== | ||
+ | |||
+ | === Handling Dead Genes During Dump Process === | ||
+ | |||
+ | The dumper script will now (as of May, 2013) run an automatic check for dead genes in any gene field. Any dead genes that are referenced in an Expr_pattern object in the OA will be handled in the following manner: | ||
+ | |||
+ | 1) If there is a replacement for the gene (i.e. the gene has merged into another gene), the dead gene will be dumped into a "Historical_gene" field in the .ACE file, the replacement gene will fill the original gene field. A comment will be added to the Historical_gene field via the #Evidence hash. The original gene field (now with the updated gene reference) will be printed with an "Inferred_automatically" tag after the gene. So, for example, if WBGene00001234 is now a dead gene that has been merged into WBGene00002345: | ||
+ | |||
+ | <pre> | ||
+ | Gene "WBGene00001234" | ||
+ | </pre> | ||
+ | |||
+ | becomes | ||
+ | |||
+ | <pre> | ||
+ | Gene "WBGene00002345" Inferred_automatically | ||
+ | Historical_gene "WBGene00001234" Remark "Note: This object originally referred to WBGene00001234. | ||
+ | WBGene00001234 is now considered dead and has been merged into WBGene00002345. WBGene00002345 has | ||
+ | replaced WBGene00001234 accordingly." | ||
+ | </pre> | ||
+ | |||
+ | Notes: | ||
+ | |||
+ | Dead -> dead | ||
+ | Suppressed -> suppressed | ||
+ | merged_into WBGene -> merged | ||
+ | split_into -> split | ||
+ | looping through the genes where something happened to make sure they don't also point at something else | ||
+ | exp_gene | ||
+ | merged -> historical_gene + remark AND gene <gene> Inferred_automatically | ||
+ | dead -> historical_gene + remark | ||
+ | suppressed -> historical_gene + remark | ||
+ | split -> historical_gene + remark AND error message | ||
+ | normal ones -> just tag + value | ||
+ | |||
+ | Examples: | ||
+ | |||
+ | A split gene: WBGene00012507 | ||
+ | A merged gene: WBGene00007524 | ||
+ | A dead gene: WBGene00007814 | ||
+ | A suppressed gene: WBGene00015490 | ||
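The rules in the notes above can be summarised as a small decision function. The following Python sketch is an illustration only: the merged remark uses the wording shown earlier in this section, while the remark wording for dead, suppressed and split genes and the error message text are assumptions, as are the function and parameter names.

```python
def dump_gene(gene, status="live", replacement=None):
    """Return (ace_lines, errors) for one value of the exp_gene field."""
    if status not in {"live", "merged", "dead", "suppressed", "split"}:
        raise ValueError(f"unknown gene status: {status}")
    lines, errors = [], []
    if status == "live":
        lines.append(f'Gene\t"{gene}"')  # normal genes: just tag + value
        return lines, errors
    if status == "merged":
        remark = (
            f"Note: This object originally referred to {gene}. {gene} is now "
            f"considered dead and has been merged into {replacement}. "
            f"{replacement} has replaced {gene} accordingly."
        )
        lines.append(f'Gene\t"{replacement}" Inferred_automatically')
    else:
        # hypothetical wording; the actual remark text is not shown above
        remark = f"Note: This object originally referred to {gene}, which is now {status}."
        if status == "split":
            errors.append(f"{gene} has been split and needs manual review")
    lines.append(f'Historical_gene\t"{gene}" Remark "{remark}"')
    return lines, errors


merged_lines, merged_errors = dump_gene("WBGene00001234", "merged", "WBGene00002345")
print("\n".join(merged_lines))
```

Only split genes produce an entry in the error list, since they are the one case the dumper cannot resolve to a single replacement.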
+ | |||
+ | == Data parsing == | ||
+ | |||
+ | File that was used for parsing is the WS226 dump and is located here: /home/postgres/work/pgpopulation/exp_exprpattern/ExprWS226.ace | ||
+ | |||
+ | There are 1802 objects without any Anatomy_term. I'm assuming this is okay -- J Yes, it is --D | ||
+ | |||
+ | What do we do with Marker objects ? Treat them the same as Expr_pattern objects ? -- J yes, treat the same --D | ||
+ | |||
+ | Life_stage in obo class have WBls:####### IDs, but data has lifestage names, is this bad data ? The OA only supports IDs (see phenotype, generegulation, picture OA) : can we convert the life stage names into WBls:#######? I asked Wen about this and she is fine with it --D Changed the parser to convert from name to ID, but still waiting until we talk to Karen | ||
+ | |||
+ | * WS gene reg obj : http://wormbase.org/db/misc/etree?name=WBPaper00036764_lin-28.b;class=Gene_regulation | ||
+ | |||
+ | * WS expr pat obj : http://wormbase.org/db/misc/etree?name=Expr2201;class=Expr_pattern | ||
+ | |||
+ | /home/postgres/work/pgpopulation/exp_exprpattern/invalid_ontology_values has many other objects that don't fit the ontologies. It would be best to either fix them in citace and redump, or to get mappings of bad-to-good values and put them in the parser. This was run on the sandbox, so if any values are real, the sandbox might not have all the values. -- J I see, there are many objects with invalid format for different classes. i will figure out what was the problem for each of them and get back to you --D. | ||
+ | * 20 Anatomy terms having old IDs -> Daniela generated a mapping with new IDs. | ||
+ | * 2 invalid objects for GO -> alerted Ranjana, waiting for answer. | ||
+ | * 5 Antibody objects -> alerted Xiaodong; 2 fixed, 3 waiting for Wen's answer (did she create the objects already or should we generate new ones?). | ||
+ | * 37 transgene objects -> alerted Karen. | ||
+ | |||
+ | Strain and Clone don't have ontologies yet, once we have those we'll see if any data is bad -- J ok --D | ||
+ | |||
+ | Only looking at WBPictureID pictures, if we need to dump both ways, it will get conversions from the WBPictureID's name. -- J I am not sure I get this..D we talked about it | ||
+ | |||
+ | ==-D file for Citace Minus== | ||
+ | |||
+ | When we tried to parse the -D file into Citace Minus the following errors occurred: | ||
+ | |||
+ | * Pattern: 2 objects Expr98 did not parse in 2 pattern descriptions. Not in OA | ||
+ | |||
+ | I will add them manually in OA and -D those Done DR 06142011 | ||
+ | |||
+ | * Anatomy_term: 14 objects | ||
+ | * 2 in Expr 120 checked OK only extra space at the end | ||
+ | * 1 in Expr 1269 checked OK only extra space at the end | ||
+ | * 1 in Expr 1569 checked OK only extra space at the end | ||
+ | * 8 in Expr2812 checked OK only extra space at the end | ||
+ | * 1 in Expr3211 checked OK only extra space at the end | ||
+ | * 1 in Expr7467 checked OK only extra space at the end | ||
+ | |||
+ | We can delete them from Citace Minus, text is fine in OA. Done DR 06142011 | ||
+ | |||
+ | * Antibody_info: 4 objects | ||
+ | they are already on my list. Xiaodong should generate the objects. Will delete them Done DR 06142011 and add them manually in OA when ready. '''TODO''' | ||
+ | |||
+ | * Reference: 3 objects | ||
+ | Expr_pattern : "Expr2916" | ||
+ | Reference "WBPaper00006518" | ||
+ | |||
+ | Expr_pattern : "Expr2994" | ||
+ | Reference "WBPaper00013501" | ||
+ | |||
+ | Expr_pattern : "Expr3715" | ||
+ | Reference "WBPaper00025175" | ||
+ | |||
+ | Wen looked into it and these are obsolete IDs. We delete them from Citace Minus Done DR 06152011 | ||
+ | |||
+ | * Gene: 30 objects. This happened because the Gene field is an ontology. Some Expr_pattern objects are associated with multiple genes, therefore the data did not parse correctly. This is not the only problem: Wen looked into it and the IDs are obsolete, so we can -D them. DR06162011 | ||
+ | |||
+ | * Remark: 1 object Expr111 was not deleted as I added Pseudogene info in the remarks. Deleted from citace minus added into OA OK. Done DR 06142011 | ||
+ | |||
+ | * Pseudogene: 1 object | ||
+ | Expr_pattern : "Expr111" | ||
+ | Pseudogene "F56D5.8" | ||
+ | can fix this manually and delete it from Citace Minus. Done DR 06142011 | ||
+ | |||
+ | * Cell 26 objects | ||
+ | |||
+ | Expr_pattern : "Expr7477" | ||
+ | Cell "P3.p" Certain | ||
+ | Cell "P4.p" Certain | ||
+ | Cell "P5.p" Certain | ||
+ | Cell "P6.p" Certain | ||
+ | Cell "P7.p" Certain | ||
+ | Cell "P8.p" Certain | ||
+ | |||
+ | done DR 06142011 | ||
+ | |||
+ | Expr_pattern : "Expr7595" | ||
+ | Cell "CANL" Uncertain | ||
+ | Cell "CANR" Uncertain | ||
+ | |||
+ | done DR 06142011 | ||
+ | |||
+ | Expr_pattern : "Expr7605" | ||
+ | Cell "M4" Certain | ||
+ | |||
+ | |||
+ | done DR 06142011 | ||
+ | |||
+ | Expr_pattern : "Expr7632" | ||
+ | Cell "AVG" Certain | ||
+ | Cell "M5" Certain | ||
+ | Cell "PVT" Certain | ||
+ | Cell "PVCL" Uncertain | ||
+ | Cell "PVCR" Uncertain | ||
+ | Cell "PVNL" Uncertain | ||
+ | Cell "PVNR" Uncertain | ||
+ | Cell "PVQL" Uncertain | ||
+ | Cell "PVQR" Uncertain | ||
+ | |||
+ | done DR 06142011 | ||
+ | |||
+ | Expr_pattern : "Expr7691" | ||
+ | Cell "P3.p" Certain | ||
+ | Cell "P4.p" Certain | ||
+ | Cell "P5.p" Certain | ||
+ | Cell "P6.p" Certain | ||
+ | Cell "P7.p" Certain | ||
+ | Cell "P8.p" Certain | ||
+ | |||
+ | done DR 06142011 | ||
+ | |||
+ | Expr_pattern : "Expr8715" | ||
+ | Cell "M.dlpa" Certain | ||
+ | Cell "M.drpa" Certain | ||
+ | |||
+ | done DR 06142011 | ||
+ | |||
+ | * Sequence | ||
+ | Expr_pattern : "Expr12" | ||
+ | -D Sequence "Z28375" -C "EMBL Z28375" | ||
+ | -D Sequence "Z28376" -C "EMBL Z28376" | ||
+ | -D Sequence "Z28377" -C "EMBL Z28377" | ||
+ | |||
+ | Expr_pattern : "Expr52" | ||
+ | -D Sequence "R11H6" | ||
+ | -D Sequence "Y40H4A" | ||
+ | |||
+ | Expr_pattern : "Expr979" | ||
+ | -D Sequence "F54E2" | ||
+ | -D Sequence "R05D8" | ||
+ | |||
+ | Expr_pattern : "Expr980" | ||
+ | -D Sequence "F54E2" | ||
+ | -D Sequence "R05D8" | ||
+ | |||
+ | Done DR 06142011 | ||
+ | |||
+ | |||
+ | * Picture. All picture objects were -D DR06152011 | ||
+ | |||
+ | |||
+ | ==Exporting Reporter Gene description from Expr_pattern OA to Transgene OA== | ||
+ | |||
+ | IMPORTANT: whenever you curate an Expr object, fill in all the fields before duplicating the object. E.g., if you need to record expression in the pharynx as 'certain' and in the intestine as 'uncertain', first generate an object with pharynx 'certain', fill in all the other info (e.g. WBPaper, reporter gene, pattern, ...) and THEN duplicate the object. | ||
+ | This is also important for the generation of new transgenes with the script below. | ||
+ | |||
+ | |||
+ | |||
+ | In the past, transgene objects were generated only when authors used standard nomenclature (e.g. adEx1256, acIs101). No new transgene objects were created for reporter fusions when there was no standard nomenclature. | ||
+ | |||
+ | From Jan 2012 we want to start generating transgene objects also for those reporter genes. | ||
+ | |||
+ | '''Action items:''' | ||
+ | |||
+ | Import all the transgene objects with no standard nomenclature present in Expression pattern OA into Transgene OA | ||
+ | |||
+ | In order to accomplish that we should | ||
+ | |||
+ | * Generate a name for the objects that have exp_reportergene and no exp_transgene and assign it to the table exp_transgene in Expression pattern OA. The name should be: ExprID_Ex (e.g. Expr1234_Ex) | ||
+ | |||
+ | * For all the ExprID_Ex that were generated in the previous step we should populate postgres tables in transgene OA as follows: | ||
+ | |||
+ | exp_transgene -> trp_name | ||
+ | |||
+ | exp_paper -> trp_paper | ||
+ | |||
+ | 435 expr objects don't have papers. transfer those objects ? SELECT * FROM exp_reportergene WHERE joinkey NOT IN (SELECT joinkey FROM exp_transgene) AND joinkey NOT IN (SELECT joinkey FROM exp_paper); -- J | ||
+ | |||
+ | I spot-checked them. The ones I have seen come from Ian Hope's large-scale expression data, which he sent to Wen a few years back, but we need to check systematically whether they all come from him. The objects have an Author field associated, but the Author is not in OA. When we created the Expr_pattern OA we decided to keep Author, Date, and Curated_by in a separate file in Citace Minus, as those fields were not used anymore (see wiki above for reference). For those objects we should put the author in trp_person. If it is hard to retrieve the author from Citace Minus, we could get it from the file "ExprWS221.ace" on Tazendra in the /home/acedb/draciti dir. D There are 4307 after filtering duplicates -- J | ||
+ | |||
+ | exp_reportergene -> trp_remark | ||
+ | |||
+ | trp_curator -> Daniela Raciti ( WBPerson12028 ). | ||
+ | |||
+ | trp_nodump (for all Daniela Raciti) | ||
+ | |||
+ | There is no trp_nodump table; using trp_objpap_falsepos (Fail field) instead -- J. Karen said this is good. D | ||
+ | |||
+ | Attention: we will take from Expr_pattern OA all objects regardless of the curator -both Wen and Daniela- but when populating transgene OA we will populate the curator field just with Daniela. | ||
+ | |||
+ | Expr objects exist in multiple OA rows, so there are multiple pgids per Expr object, and multiple objects get created in the transgene OA. See Expr1416. Is this correct? I don't know if Transgene objects already have multiple pgids, and whether the dumper handles it. This is also going to make the deletion script more complicated: do all pgids for a given transgene name have to be dumpable for deletion? -- J Right, we have multiple pgids for a single Expr object, but the exp_reportergene is the same for all. We should have only one object per Expr object in Transgene OA, i.e. Expr1416_Ex pgid 9980, and get rid of the duplicates. D | ||
+ | |||
+ | 13 non-"Hope IA" authors are on the sandbox at /home/postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/bad_authors let me know the mapping of those authors to WBPerson#### . After the mapping is done, we'll see how many of the 48 Expr objects have neither person nor paper; e.g. Expr1684 doesn't. -- J | ||
+ | |||
+ | Mapping: | ||
+ | |||
+ | *Arnold JM = WBPerson16468 | ||
+ | *Bauer PK = WBPerson5125 | ||
+ | *Britton C = WBPerson78 | ||
+ | *Hashmi S = WBPerson4368 | ||
+ | *Herbert R = WBPerson16472 | ||
+ | *Krause MW = WBPerson346 | ||
+ | *Lustigman S = WBPerson390 | ||
+ | *Lynch AS = WBPerson1232 | ||
+ | *McCarroll D = WBPerson16469 | ||
+ | *Mohler WA = WBPerson428 | ||
+ | *Mounsey A = WBPerson1716 | ||
+ | *Royall CM = WBPerson16473 | ||
+ | *Seydoux GC = WBPerson575 | ||
+ | |||
+ | After mapping, 3 objects had neither paper nor person. All 3 are personal communications. J added them -> OK. | ||
+ | |||
+ | *Expr1684 -> Catherine Wolkow ( WBPerson696 ) | ||
+ | *Expr1685 -> Massimo Hilliard ( WBPerson258 ) | ||
+ | *Expr2781 -> Aharon Solomon ( WBPerson3909 ) | ||
+ | |||
+ | |||
+ | The transgene objects will be revised by Karen (in order to delete duplicates if any). | ||
+ | |||
+ | To run the population script do it from a dir where you have write permission e.g. /home/acedb/draciti/Expr_pattern and then give the command | ||
+ | |||
+ | * ~postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/transfer_expr_to_transgene.pl > outputlog | ||
+ | |||
+ | Populate script is /home/postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/transfer_expr_to_transgene.pl -- J | ||
+ | get expr objects that have a reportergene but no transgene : SELECT * FROM exp_reportergene WHERE joinkey NOT IN (SELECT joinkey FROM exp_transgene); for each of those get the exp_name and exp_paper | ||
+ | Transgene name is ExprName plus _Ex. Add to exp_transgene and exp_transgene_hst as multiontology with doublequotes. | ||
+ | Get highest transgene pgid, and for each new transgene, create a new transgene with that pgid, trp_name the new transgene name, trp_curator WBPerson12028, trp_objpap_falsepos Fail, trp_remark the exp_reportergene, if there's a paper trp_paper is the exp_paper, if there is no paper look at authors in ExprWS221.ace, and map to persons from Daniela's list into trp_person with doublequotes. If it's not in the list, tell Daniela to get WBPerson mappings. -- J | ||
+ | |||
+ | Author -> Person: | ||
+ | Hope IA = WBPerson266 | ||
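+ | |||
+ | The population steps described above can be sketched as follows. This is a minimal Python illustration, not the actual Perl script: the dictionaries standing in for the postgres tables, the function name build_transgene_rows and the rule of creating a single transgene per Expr name are assumptions made for the example. | ||
+ | |||
```python
# Hypothetical in-memory stand-ins for the postgres tables (the real script is Perl).
CURATOR = "WBPerson12028"                      # Daniela Raciti, as stated in the notes
AUTHOR_TO_PERSON = {"Hope IA": "WBPerson266"}  # Author -> Person mapping from the notes


def build_transgene_rows(expr_rows, next_pgid):
    """Build transgene OA rows for Expr objects with a reporter gene but no transgene.

    expr_rows: list of dicts with keys name, reportergene, transgene, paper, author.
    Returns (rows, unmapped_authors).
    """
    rows, unmapped, seen = [], [], set()
    for expr in expr_rows:
        # Only objects with exp_reportergene filled and exp_transgene empty qualify.
        if not expr.get("reportergene") or expr.get("transgene"):
            continue
        name = f'{expr["name"]}_Ex'            # e.g. Expr1234 -> Expr1234_Ex
        if name in seen:                       # assumption: one transgene per Expr object
            continue
        seen.add(name)
        row = {
            "pgid": next_pgid,
            "trp_name": name,
            "trp_curator": CURATOR,
            "trp_objpap_falsepos": "Fail",
            "trp_remark": expr["reportergene"],
        }
        if expr.get("paper"):
            row["trp_paper"] = expr["paper"]
        else:
            person = AUTHOR_TO_PERSON.get(expr.get("author"))
            if person is None:
                unmapped.append(expr.get("author"))  # needs a WBPerson mapping
            else:
                row["trp_person"] = f'"{person}"'    # stored with doublequotes
        rows.append(row)
        next_pgid += 1
    return rows, unmapped
```
+ | |||
+ | Authors returned in the second value are the ones that would have to be mapped to a WBPerson by hand before the transfer. | ||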
+ | |||
+ | To run the deletion script do it from a dir where you have write permission e.g. /home/acedb/draciti/Expr_pattern and then give the command | ||
+ | |||
+ | * ~postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/delete_cleared_exp_reportergene.pl > output1 | ||
+ | |||
+ | Deletion script is /home/postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/delete_cleared_exp_reportergene.pl -- J | ||
+ | It looks for SELECT * FROM trp_name WHERE trp_name ~ 'Expr.*_Ex' AND joinkey NOT IN (SELECT joinkey FROM trp_objpap_falsepos); transgene names that have not been set as Fail. | ||
+ | "SELECT * FROM trp_synonym WHERE trp_synonym ~ 'Expr.*_Ex' AND joinkey NOT IN (SELECT joinkey FROM trp_objpap_falsepos);" | ||
+ | It looks at expression transgenes that match Expr.*_Ex : SELECT * FROM exp_transgene WHERE exp_transgene ~ '"Expr.*_Ex"' AND joinkey IN (SELECT joinkey FROM exp_reportergene); | ||
+ | It gets individual transgenes and if any of them don't match Expr.*_Ex it gives an error message. If all of them match, it checks that all transgenes are dumpable. If all are dumpable, it deletes the exp_reportergene for that pgid and inserts a null into the exp_reportergene_hst for that pgid --- J | ||
+ | |||
+ | We added a rule that checks for 'Expr.*_Ex' in trp_synonym too, because when the transgene already existed Karen added the Expr name as a synonym. We want the reporter gene field for those objects in Expr_pattern OA to be deleted as well, so Juancarlos added a rule to the deletion script that looks for Expr.*_Ex in the synonym and, if it finds it, deletes the reporter gene field from Expr_pattern OA. | ||
+ | Line that he added: | ||
+ | |||
+ | "SELECT * FROM trp_synonym WHERE trp_synonym ~ 'Expr.*_Ex' AND joinkey NOT IN (SELECT joinkey FROM trp_objpap_falsepos);" | ||
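+ | |||
+ | The per-pgid check made by the deletion script can be sketched in Python as below. This only illustrates the logic described above; the function name clearable, the set of dumpable names and the return messages are hypothetical and not taken from the Perl script. | ||
+ | |||
```python
import re

# Same pattern as in the SQL above; postgres "~" is an unanchored regex match.
EXPR_EX = re.compile(r"Expr.*_Ex")


def clearable(exp_transgene_value, dumpable_names):
    """Decide whether exp_reportergene may be cleared for one pgid.

    exp_transgene_value: multiontology string such as '"Expr1_Ex","Expr2_Ex"'.
    dumpable_names: names or synonyms of transgenes that are not set as Fail.
    Returns (ok, message).
    """
    names = re.findall(r'"([^"]*)"', exp_transgene_value)
    if not names:
        return False, "no transgene found"
    bad = [n for n in names if not EXPR_EX.search(n)]
    if bad:
        # Mixed entries are reported instead of being processed.
        return False, "does not match Expr.*_Ex: " + ", ".join(bad)
    blocked = [n for n in names if n not in dumpable_names]
    if blocked:
        return False, "not dumpable: " + ", ".join(blocked)
    return True, "ok"
```
+ | |||
+ | Only when the first value is True would the reporter gene text be deleted and a null written to the history table. | ||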
+ | |||
+ | * There were >1000 objects that had a standard transgene name (e.g. adIs1783) and the reporter gene field filled with redundant information (e.g. [mpk-1::gfp]). Daniela went through them manually and deleted the redundant info in the reporter gene field in Expr_pattern OA. The info was already in transgene. Whenever the reporter gene field in Expr_pattern had more info (e.g. sequence), Daniela copied that info into the Transgene Remark field (in line with what we have done for the objects above). Also, whenever it was specified whether the fusion was transcriptional or translational, that info was added to the reporter type in transgene OA. Double-checked with Karen 02.16.2012 -> OK. | ||
+ | Whenever the info in the Reporter Gene field was more pertinent to Expr pattern it was left there. E.g. Expr1046: The larval expression pattern was studied by observing GFP expression in a strain carrying an APR-1::GFP reporter transgene on the integrated array zhIs2. | ||
+ | |||
+ | One example of redundant information was kxEx74, Expr4687. The info present in the Expr reporter gene field was exactly the same as in the Transgene remark field. | ||
+ | |||
+ | * '''February 29th 2012: Daniela ran the population script on tazendra after having tested the system on Mangolassi (everything was fine there). The outputlog file with the results of the transfer was copied to Daniela's pc, Desktop/Wormbase/Expr_pattern_to_Transgene_transfer''' | ||
+ | * '''Daniela ran the deletion script on September 5th 2012. In this way we deleted the reporter gene info in expression OA for all the objects that were transferred to transgene OA with the population script in February. Karen, Juancarlos and Daniela agreed that we had to set all transgene objects as 'dumpable' in order to delete the reporter gene field from expression OA. Daniela edited the Fail field of the transgenes via the Batch mode. The results of the deletion script are on Lario in Desktop/Wormbase/Expr_pattern_to_Transgene_transfer''' | ||
+ | '''Moving forward we will use the new pipeline. In the new pipeline the script immediately deletes the reporter gene field in Expression OA.''' | ||
+ | |||
+ | * '''On September 11th 2012 Daniela ran the new population script for the first time. The output is on tazendra /home/acedb/draciti/Expr_pattern/Expr_to_Transgene newpopulationscript_09112012. Checked a few examples, everything went as it should have''' | ||
+ | |||
+ | * '''NB: there should still be some objects in Expression OA that have both the reporter gene field AND the transgene. Those should be double-checked, as they may carry duplicated info. This happened for the objects that Karen merged or looked at before the new pipeline was put in place.''' | ||
+ | |||
+ | |||
+ | The population script was run at the end of February 2012; from August on we started using the script that is described in the section below, "Current pipeline". | ||
+ | |||
+ | Even after running the deletion script we will not change anything in the Expression pattern page display. This is because there is already a link to the transgene page so all the information about the construct could be found there. | ||
+ | |||
+ | Addendum: Whenever the reporter gene field in OA lists both a transcriptional fusion and a translational fusion, only one name, Expr123_Ex, will be generated, and that reporter gene will be transferred to transgene OA. Karen will add Expr123_Ex as a synonym for both the transcriptional and the translational fusion. If we later want to populate Expr_OA with the real names of those transgenes, we have to ask Juancarlos to look for all objects that have the same Expr123_Ex name AND the same Paper but different transgene names, and populate the transgene field in Expr_OA with both transgene names. | ||
+ | |||
+ | |||
+ | ===Current Pipeline=== | ||
+ | |||
+ | For each Expression pattern object that has a non-blank "Reporter Gene" field and a BLANK transgene field, the following will happen: | ||
+ | |||
+ | * For the ones that don't have an existing Expr_Ex in these 2 tables trp_publicname and trp_synonym it will: | ||
+ | |||
+ | **generate an object in the transgene OA called WBTransgene000##### and put it in the trp_name | ||
+ | **add in the trp_synonym the name Expr1234_Ex (the number is taken from the expression pattern object name) | ||
+ | **add WBPerson12028 in the trp_curator -this has to be changed whenever somebody else will take over expression patterns | ||
+ | **add the reporter gene field text into the trp_remark field | ||
+ | **add the WBPaper into the trp_paper field | ||
+ | |||
+ | *Now all the Expression objects have a mapping to the transgene and the script will | ||
+ | |||
+ | **add into the exp_transgene field the WBTransgene000##### name | ||
+ | **delete the text in the exp_reportergene field | ||
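+ | |||
+ | The two stages listed above can be sketched in Python as follows. This is a simplified illustration, not the Perl script itself: the function name run_pipeline, the dictionary layout and the way the next WBTransgene number is passed in are assumptions made for the example. | ||
+ | |||
```python
CURATOR = "WBPerson12028"  # to be changed if somebody else takes over expression patterns


def run_pipeline(expr_rows, name_to_id, next_number):
    """Create missing transgenes, then link every qualifying Expr object to one.

    expr_rows: list of dicts with keys name, reportergene, transgene, paper
               (mutated in place).
    name_to_id: public names and synonyms already mapped to a WBTransgene ID.
    next_number: first unused WBTransgene number (hypothetical bookkeeping).
    Returns the list of newly created transgene rows.
    """
    created = []
    for expr in expr_rows:
        # Only objects with a reporter gene and a blank transgene field qualify.
        if not expr["reportergene"] or expr["transgene"]:
            continue
        synonym = f'{expr["name"]}_Ex'
        transgene_id = name_to_id.get(synonym)
        if transgene_id is None:
            # Stage 1: no existing Expr_Ex name or synonym, so make a new object.
            transgene_id = f"WBTransgene{next_number:08d}"
            next_number += 1
            name_to_id[synonym] = transgene_id
            created.append({
                "trp_name": transgene_id,
                "trp_synonym": synonym,
                "trp_curator": CURATOR,
                "trp_remark": expr["reportergene"],
                "trp_paper": expr["paper"],
            })
        # Stage 2: record the mapping and clear the reporter gene text.
        expr["transgene"] = transgene_id
        expr["reportergene"] = ""
    return created
```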
+ | |||
+ | |||
+ | Unlike in the old pipeline, every object here will be set as dumpable. | ||
+ | |||
+ | the script is located here: | ||
+ | |||
+ | '''~postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/transfer_expr_to_transgene.pl > outputlog''' | ||
+ | |||
+ | '''Daniela runs the script MANUALLY before each upload, the script is normally run from this dir on tazendra /home/acedb/draciti/Transgene_generation''' | ||
+ | |||
+ | * The script was run for the first time on September 11th 2012. The output is on tazendra /home/acedb/draciti/Expr_pattern/Expr_to_Transgene newpopulationscript_09112012. Checked a few examples, everything went as it should have. | ||
+ | |||
+ | In the script, the subroutine readExprAce gets the Author mappings from the file /home/acedb/draciti/Expr_pattern/ExprWS221.ace. If the file is not there, the script will fail. See above in this section for the author mapping. | ||
+ | |||
+ | The subroutine populateTrpNameToId reads trp_name (the Transgene ID, WBTransgene000#####) and maps trp_publicname to it; it also reads the synonyms (trp_synonym) and maps each one to the trp_name. | ||
+ | |||
+ | The synonym field is split on pipes, and spaces at the beginning and at the end of each synonym are removed. | ||
+ | |||
+ | $trpNameToId{$syn} = $row[0]; this line stores the name-to-ID mapping in a hash. | ||
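+ | |||
+ | The name-to-ID lookup just described can be illustrated with a short Python equivalent. It is a sketch under assumptions: the input dictionaries stand in for the trp_publicname and trp_synonym tables, and the function name mirrors the Perl subroutine only for readability. | ||
+ | |||
```python
def populate_trp_name_to_id(publicnames, synonyms):
    """Map every public name and synonym to its Transgene ID.

    publicnames: {transgene_id: public name}
    synonyms:    {transgene_id: "syn1 | syn2 | ..."} (pipe-separated field)
    """
    name_to_id = {}
    for transgene_id, publicname in publicnames.items():
        name_to_id[publicname] = transgene_id
    for transgene_id, field in synonyms.items():
        for syn in field.split("|"):
            syn = syn.strip()          # drop spaces at the beginning and end
            if syn:                    # ignore empty pieces such as a trailing pipe
                name_to_id[syn] = transgene_id
    return name_to_id
```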
+ | |||
+ | ===merged into=== | ||
+ | |||
+ | Juancarlos added a "Merged into" box to transgene OA so that Karen is able to merge transgenes. The transgene object that is merged into the other one will be marked as invalid. Say that you are curating transgene2 and you see that it is identical to transgene1: you click on "merge into" and select transgene1. Transgene2 becomes invalid, and Karen will have to add transgene2 to the synonym field of transgene1. | ||
+ | |||
+ | |||
+ | ===bad strains=== | ||
+ | |||
+ | To check if there are wrong strains in the OA, run this script from a dir where you have write permission | ||
+ | |||
+ | /home/postgres/work/pgpopulation/exp_exprpattern/20121011_find_bad_strain/find_bad_strain.pl | ||
+ | > bad | ||
+ | |||
+ | ==Miller paper- tiling arrays== | ||
+ | |||
+ | We have added in CitaceMinus a static file with the links to http://www.vanderbilt.edu/wormdoc/wormmap/Expressed_genes.html | ||
+ | |||
+ | The paper is | ||
+ | Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, McWhirter RD, Petersen | ||
+ | S, Sreedharan VT, Widmer C, Jo J, Reinke V, Petrella L, Strome S, Von Stetina SE, | ||
+ | Katz M, Shaham S, Rätsch G, Miller DM 3rd. A spatial and temporal map of C. | ||
+ | elegans gene expression. Genome Res. 2011 Feb;21(2):325-41. Epub 2010 Dec 22. | ||
+ | PubMed PMID: 21177967; PubMed Central PMCID: PMC3032935. | ||
+ | |||
+ | The file: miller_cell_type_expression.ace | ||
+ | |||
+ | We have added links to the site from the anatomy page. | ||
+ | |||
+ | |||
+ | == Other nematodes SVM analysis for gene expression== | ||
+ | |||
+ | From Yuling (Nov 7th 2013) | ||
+ | Results here: | ||
+ | *http://131.215.52.209/daniela/nematode/ | ||
+ | |||
+ | Looks like only 1% is deemed positive... | ||
+ | |||
+ | *146 positives | ||
+ | *15253 negatives | ||
+ | |||
+ | Daniela will go through the list and evaluate | ||
+ | |||
+ | Alternative approach: we can check how many papers have been curated for other species and use those as positive training set. | ||
+ | The list of papers is below | ||
+ | |||
+ | == Other species == | ||
+ | |||
+ | There is a script on tazendra to check the objects curated to non-elegans genes (the script looks into the gin_synonyms table and checks which genes have no CELE_ synonym) | ||
+ | |||
+ | /home/acedb/draciti/Expr_pattern/20140516_non_cele | ||
+ | |||
+ | find_non_cele.pl* | ||
+ | |||
+ | the output on May 16th 2014 was | ||
+ | <pre> | ||
+ | WBGene00001198 not CELE_ in pgid 348 | ||
+ | WBGene00001198 not CELE_ in pgid 350 | ||
+ | WBGene00001198 not CELE_ in pgid 351 | ||
+ | WBGene00001198 not CELE_ in pgid 352 | ||
+ | WBGene00002126 not CELE_ in pgid 1424 | ||
+ | WBGene00009821 not CELE_ in pgid 1520 | ||
+ | WBGene00012263 not CELE_ in pgid 2914 | ||
+ | WBGene00043408 not CELE_ in pgid 2914 | ||
+ | WBGene00009175 not CELE_ in pgid 2914 | ||
+ | WBGene00016878 not CELE_ in pgid 2914 | ||
+ | WBGene00020512 not CELE_ in pgid 2914 | ||
+ | WBGene00019252 not CELE_ in pgid 2914 | ||
+ | WBGene00015732 not CELE_ in pgid 3033 | ||
+ | WBGene00003454 not CELE_ in pgid 3330 | ||
+ | WBGene00003440 not CELE_ in pgid 3330 | ||
+ | WBGene00003441 not CELE_ in pgid 3330 | ||
+ | WBGene00003427 not CELE_ in pgid 3330 | ||
+ | WBGene00003461 not CELE_ in pgid 3330 | ||
+ | WBGene00003459 not CELE_ in pgid 3330 | ||
+ | WBGene00003428 not CELE_ in pgid 3330 | ||
+ | WBGene00003447 not CELE_ in pgid 3330 | ||
+ | WBGene00003455 not CELE_ in pgid 3330 | ||
+ | WBGene00003453 not CELE_ in pgid 3330 | ||
+ | WBGene00003436 not CELE_ in pgid 3330 | ||
+ | WBGene00003439 not CELE_ in pgid 3330 | ||
+ | WBGene00023572 not CELE_ in pgid 4733 | ||
+ | WBGene00023572 not CELE_ in pgid 4734 | ||
+ | WBGene00023575 not CELE_ in pgid 4737 | ||
+ | WBGene00023572 not CELE_ in pgid 4739 | ||
+ | WBGene00023572 not CELE_ in pgid 4740 | ||
+ | WBGene00037006 not CELE_ in pgid 4743 | ||
+ | WBGene00037006 not CELE_ in pgid 4744 | ||
+ | WBGene00037005 not CELE_ in pgid 4747 | ||
+ | WBGene00037005 not CELE_ in pgid 4748 | ||
+ | WBGene00030970 not CELE_ in pgid 4750 | ||
+ | WBGene00041435 not CELE_ in pgid 4753 | ||
+ | WBGene00041435 not CELE_ in pgid 4754 | ||
+ | WBGene00015274 not CELE_ in pgid 4829 | ||
+ | WBGene00015274 not CELE_ in pgid 4830 | ||
+ | WBGene00009175 not CELE_ in pgid 5797 | ||
+ | WBGene00009175 not CELE_ in pgid 5798 | ||
+ | WBGene00000600 not CELE_ in pgid 5823 | ||
+ | WBGene00000604 not CELE_ in pgid 5894 | ||
+ | WBGene00000605 not CELE_ in pgid 5910 | ||
+ | WBGene00000607 not CELE_ in pgid 5990 | ||
+ | WBGene00012263 not CELE_ in pgid 6244 | ||
+ | WBGene00004041 not CELE_ in pgid 6776 | ||
+ | WBGene00002126 not CELE_ in pgid 7395 | ||
+ | WBGene00032753 not CELE_ in pgid 7603 | ||
+ | WBGene00018677 not CELE_ in pgid 7886 | ||
+ | WBGene00018677 not CELE_ in pgid 7887 | ||
+ | WBGene00018677 not CELE_ in pgid 7888 | ||
+ | WBGene00018677 not CELE_ in pgid 7889 | ||
+ | WBGene00004041 not CELE_ in pgid 8272 | ||
+ | WBGene00002485 not CELE_ in pgid 9192 | ||
+ | WBGene00117029 not CELE_ in pgid 9327 | ||
+ | WBGene00043222 not CELE_ in pgid 10105 | ||
+ | WBGene00043320 not CELE_ in pgid 10671 | ||
+ | WBGene00043320 not CELE_ in pgid 10672 | ||
+ | WBGene00010154 not CELE_ in pgid 10897 | ||
+ | WBGene00019581 not CELE_ in pgid 11074 | ||
+ | WBGene00020312 not CELE_ in pgid 11399 | ||
+ | WBGene00021255 not CELE_ in pgid 11756 | ||
+ | WBGene00021255 not CELE_ in pgid 11757 | ||
+ | WBGene00045485 not CELE_ in pgid 12323 | ||
+ | WBGene00029022 not CELE_ in pgid 13372 | ||
+ | WBGene00027230 not CELE_ in pgid 13550 | ||
+ | WBGene00025707 not CELE_ in pgid 13949 | ||
+ | WBGene00023404 not CELE_ in pgid 14000 | ||
+ | WBGene00023404 not CELE_ in pgid 14001 | ||
+ | WBGene00033342 not CELE_ in pgid 14026 | ||
+ | WBGene00059989 not CELE_ in pgid 14027 | ||
+ | WBGene00195119 not CELE_ in pgid 14036 | ||
+ | WBGene00101073 not CELE_ in pgid 14036 | ||
+ | WBGene00025707 not CELE_ in pgid 14037 | ||
+ | WBGene00034222 not CELE_ in pgid 14037 | ||
+ | WBGene00224104 not CELE_ in pgid 14038 | ||
+ | WBGene00233940 not CELE_ in pgid 14039 | ||
+ | WBGene00231085 not CELE_ in pgid 14040 | ||
+ | WBGene00042594 not CELE_ in pgid 14109 | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | some of these are dead genes and some came up because they do not have CELE_ in the synonyms (that is how the script identifies non elegans) | ||
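+ | |||
+ | The heuristic used to flag non-elegans genes can be sketched in Python as follows. The real find_non_cele.pl is a Perl script working on postgres tables; the function below, its inputs and the choice of a prefix test on each synonym are assumptions made for illustration. | ||
+ | |||
```python
def find_non_cele(gene_pgids, gin_synonyms):
    """Report genes curated in the Expr OA that have no CELE_ synonym.

    gene_pgids:   list of (gene_id, pgid) pairs taken from the expression OA.
    gin_synonyms: {gene_id: [synonym, ...]}; a missing gene counts as no synonyms.
    """
    report = []
    for gene_id, pgid in gene_pgids:
        synonyms = gin_synonyms.get(gene_id, [])
        # A gene is treated as C. elegans only if one synonym starts with CELE_.
        if not any(s.startswith("CELE_") for s in synonyms):
            report.append(f"{gene_id} not CELE_ in pgid {pgid}")
    return report
```
+ | |||
+ | Because a gene with no synonyms at all is also reported, dead genes show up in the output, which matches the false hits noted above. | ||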
+ | |||
+ | the 'clean' list is: | ||
+ | |||
+ | <pre> | ||
+ | WBGene00023572 not CELE_ in pgid 4733 briggsae WBPaper00028961 | ||
+ | WBGene00023572 not CELE_ in pgid 4734 briggsae WBPaper00028961 | ||
+ | WBGene00023575 not CELE_ in pgid 4737 briggsae WBPaper00028961 | ||
+ | WBGene00023572 not CELE_ in pgid 4739 briggsae WBPaper00028961 | ||
+ | WBGene00023572 not CELE_ in pgid 4740 briggsae WBPaper00028961 | ||
+ | WBGene00037006 not CELE_ in pgid 4743 briggsae WBPaper00028961 | ||
+ | WBGene00037006 not CELE_ in pgid 4744 briggsae WBPaper00028961 | ||
+ | WBGene00037005 not CELE_ in pgid 4747 briggsae WBPaper00028961 | ||
+ | WBGene00037005 not CELE_ in pgid 4748 briggsae WBPaper00028961 | ||
+ | WBGene00030970 not CELE_ in pgid 4750 briggsae WBPaper00028961 | ||
+ | WBGene00041435 not CELE_ in pgid 4753 briggsae WBPaper00028961 | ||
+ | WBGene00041435 not CELE_ in pgid 4754 briggsae WBPaper00028961 | ||
+ | WBGene00032753 not CELE_ in pgid 7603 briggsae WBPaper00035320 | ||
+ | WBGene00117029 not CELE_ in pgid 9327 pacificus WBPaper00040360 | ||
+ | WBGene00029022 not CELE_ in pgid 13372 briggsae WBPaper00004520 | ||
+ | WBGene00027230 not CELE_ in pgid 13550 briggsae WBPaper00043890 | ||
+ | WBGene00025707 not CELE_ in pgid 13949 briggsae WBPaper00044493 | ||
+ | WBGene00033342 not CELE_ in pgid 14026 briggsae WBPaper00004832 | ||
+ | WBGene00059989 not CELE_ in pgid 14027 remanei WBPaper00004832 | ||
+ | WBGene00195119 not CELE_ in pgid 14036 pacificus WBPaper00040023 | ||
+ | WBGene00101073 not CELE_ in pgid 14036 pacificus WBPaper00040023 | ||
+ | WBGene00025707 not CELE_ in pgid 14037 briggsae WBPaper00040859 | ||
+ | WBGene00034222 not CELE_ in pgid 14037 briggsae WBPaper00040859 | ||
+ | WBGene00224104 not CELE_ in pgid 14038 brugia WBPaper00041825 | ||
+ | WBGene00233940 not CELE_ in pgid 14039 brugia WBPaper00041825 | ||
+ | WBGene00231085 not CELE_ in pgid 14040 brugia WBPaper00041825 | ||
+ | WBGene00042594 not CELE_ in pgid 14109 briggsae WBPaper00044831 | ||
+ | WBGene00054802 in pgid 14251 remanei WBPaper00041071 | ||
+ | |||
+ | |||
+ | </pre> | ||
+ | |||
+ | <pre> | ||
+ | the following were validated positive, not yet curated | ||
+ | WBPaper00004561 Haemonchus | ||
+ | WBPaper00004962 Volvulus | ||
+ | WBPaper00005646 Brugia | ||
+ | WBPaper00039907 Ascaris | ||
+ | WBPaper00041323 Brugia | ||
+ | WBPaper00041714 Stercoralis | ||
+ | WBPaper00041951 Haemonchus | ||
+ | WBPaper00042037 Haemonchus | ||
+ | WBPaper00044651 Ascaris Suum | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | an additional list that can be checked is the following I got from Wen: | ||
+ | |||
+ | <pre> | ||
+ | //Special Expr_pattern paper. Usually they contain expression patterns of un-specified genes. | ||
+ | |||
+ | cgc2796 | ||
+ | cgc2714 | ||
+ | cgc2559 | ||
+ | cgc2475 | ||
+ | cgc2449 | ||
+ | cgc2274 | ||
+ | cgc2005 | ||
+ | cgc1542 | ||
+ | cgc1984 | ||
+ | cgc4994 ges-1 expression in C. briggsae | ||
+ | cgc4821 Expression of cpz-1 in O. volvulus | ||
+ | cgc4837 Expression of a Drosophila transposon in C.elegans using glh-2 promoter. | ||
+ | cgc4895 nud-1 expression in other species. | ||
+ | cgc5831 expression of Od-mpp1 promoter corresponded to that produced by the T03F1.5 or the W09C3.6 promoter in C. elegans. | ||
+ | cgc5943 2-D protein gel dev. stage assay, too ambiguous to curate. | ||
+ | pmid14504223 antibody 1CB4 staining with unknown antigen | ||
+ | cgc6097 expression in other species. | ||
+ | cgc6393 expression in briggsae. | ||
+ | cgc6588 expression in briggsae. | ||
+ | cgc6591 expression in briggsae | ||
+ | cgc6690 Curated as Gene_regulation. | ||
+ | pmid15826643 expression pattern in other species. | ||
+ | pmid15862576 expression pattern in other species. | ||
+ | pmid15630478 expression pattern in other species. In contrast with FOG-2, a highly conserved GLD-1 ortholog is present in C. briggsae (Table 1) and has a germline expression pattern essentially identical to that of C. elegans (Figure 5A, top right and middle right). | ||
+ | WBPaper00026965 expression pattern in other species. | ||
+ | WBPaper00028902 expression pattern in other species | ||
+ | 00025105 expression pattern in other species | ||
+ | 00025000 expression pattern in other species | ||
+ | 00028902 expression pattern in other species | ||
+ | 00032298 expression of lin-11 in three species. | ||
+ | WBPaper00035037 expression pattern of | ||
+ | </pre> | ||
+ | |||
+ | the following is a list of validated negatives that can be used for SVM training (randomly selected from http://131.215.52.209/daniela/nematode/summaryN_id_nematode) | ||
+ | |||
+ | <pre> | ||
+ | 22922012 negative | ||
+ | 22922533 negative | ||
+ | 22923372 negative | ||
+ | 22924021 negative | ||
+ | 22930820 negative | ||
+ | 22932059 negative | ||
+ | 22933846 negative | ||
+ | 22935096 negative | ||
+ | 22936386 negative | ||
+ | 23315190 negative | ||
+ | 22947621 negative | ||
+ | 22949749 negative | ||
+ | 22949753 negative | ||
+ | 23307236 negative | ||
+ | 22949756 negative | ||
+ | 22949757 negative | ||
+ | 22951972 negative | ||
+ | 22952671 negative | ||
+ | 23306387 negative | ||
+ | 22952792 negative | ||
+ | 22952922 negative | ||
+ | 23300895 negative | ||
+ | 2295622 negative | ||
+ | 22961235 negative | ||
+ | 22961310 negative | ||
+ | 22967068 negative | ||
+ | 22969260 negative | ||
+ | 22973231 negative | ||
+ | 22983796 negative | ||
+ | 22983799 negative | ||
+ | 22983801 negative | ||
+ | 22984141 negative | ||
+ | 22984446 negative | ||
+ | 22984536 negative | ||
+ | 22992226 negative | ||
+ | 22992297 negative | ||
+ | 22992897 negative | ||
+ | 23107597 negative | ||
+ | 23107821 negative | ||
+ | 23110936 negative | ||
+ | 23110962 negative | ||
+ | 23111012 negative | ||
+ | 23111089 negative | ||
+ | 23111398 negative | ||
+ | 23112818 negative | ||
+ | 23291463 negative | ||
+ | 23029059 negative | ||
+ | 23029330 negative | ||
+ | 23029423 negative | ||
+ | 23029572 negative | ||
+ | 23289015 negative | ||
+ | 2310180 negative | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | ==parasitic nematodes papers containing expression data== | ||
+ | sent to Jane Lomax on September 9 2015 | ||
+ | |||
+ | <pre> | ||
+ | O volvulus | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/11606224 | ||
+ | |||
+ | H contortus | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/14698436 | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/15003846 | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/12062493 | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/23360558 | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/23416426 | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/25128369 | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/25388625 | ||
+ | |||
+ | A caninum | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/11755191 | ||
+ | |||
+ | A suum | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/12387846 | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/21685128 | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/24374308 | ||
+ | |||
+ | S stercoralis | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/14572516 | ||
+ | http://www.ncbi.nlm.nih.gov/pubmed/23145190 | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | == 2A viral technology == | ||
+ | |||
+ | Proof of principle described in | ||
+ | |||
+ | *Simultaneous expression of multiple proteins under a single promoter in C. elegans via a versatile 2A-based toolkit | ||
+ | |||
+ | Arnaud Ahier & Sophie Jarriault, Genetics | ||
+ | |||
+ | 'We report the use of viral 2A peptides, which trigger a “ribosomal-skip” or “STOP&GO” mechanism during translation, to express multiple proteins from a single vector in C. elegans. Although none of the viruses known to infect C. elegans contain 2A-like sequences, our results show that 2A peptides allow the production of separate functional proteins in all cell types and at all developmental stages tested in the worm. In addition, we constructed a toolkit including a 2A- based polycistronic plasmid and reagents to generate 2A-tagged fosmids. 2A peptides constitute an important tool to ensure the delivery of multiple polypeptides in specific cells enabling several novel applications, such as the reconstitution of multi-subunit complexes.' | ||
+ | |||
+ | We will keep an eye on whether it gets used more extensively and, if so, change the model. | ||
+ | |||
+ | |||
+ | ==SVM analysis for gene expression== | ||
+ | |||
+ | |||
+ | |||
+ | 051812_042012. The retraining for this batch was done by incorporating the curated results from 2009 till 2012. | ||
+ | <pre> | ||
+ | |||
+ | old re-train | ||
+ | 14/37 = 37.8% 17/26 = 65.3% | ||
+ | |||
+ | |||
+ | </pre> | ||
+ | |||
+ | From this batch on we have started to manually manipulate the features by adding some and deleting others. The files are stored on Lario/Desktop/SVM | ||
+ | |||
+ | 06/08-05/18 2012. | ||
+ | <pre> | ||
+ | |||
+ | old re-train feature_manipulation | ||
+ | 29.30% 44.40% 55.00% | ||
+ | 28.90% 55% | ||
+ | |||
+ | |||
+ | </pre> | ||
+ | |||
+ | September 21 2012 | ||
+ | <pre> | ||
+ | |||
+ | old re-train feature_manipulation section_model | ||
+ | 54/98=55% 41/56=73.2% 54/79=68.3% 34/47 = 72.3% (159 out of 459 papers have results) | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | |||
+ | November 02 2012 | ||
+ | <pre> | ||
+ | |||
+ | old re-train feature_manipulation | ||
+ | 9/13=69% 16/22=72.7% 28/46=60.8% | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | June 14 2013 | ||
+ | <pre> | ||
+ | |||
+ | old re-train section_model | ||
+ | 6/10 = 60% 6/6 = 100% 5/5 = 100% | ||
+ | |||
+ | </pre> | ||
+ | |||
+ | In July 2014 Yuling retrained SVM with the latest results. He got the results from the curation status from: | ||
+ | |||
+ | validated positive | ||
+ | http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi?action=listCurationStatisticsPapersPage&select_curator=two12028&listDatatype=otherexpr&method=allval%20pos%20cur&checkbox_cfp=on&checkbox_afp=on&checkbox_svm=on | ||
+ | |||
+ | validated negative | ||
+ | http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi?action=listCurationStatisticsPapersPage&select_curator=two12028&listDatatype=otherexpr&method=allval%20neg&checkbox_cfp=on&checkbox_afp=on&checkbox_svm=on | ||
+ | |||
+ | Yuling is also comparing the SVM and feature manipulation results on the paper range | ||
+ | 44200 to 44700, | ||
+ | roughly October 2013 to January 2014: | ||
+ | |||
+ | http://131.215.52.209/celegans/svm_results/20131004/ | ||
+ | http://131.215.52.209/celegans/svm_results/20140124/012414_011014_otherexpr | ||
+ | |||
+ | The list will also take into account the papers that are curation negative but SVM positive, which are listed here: | ||
+ | http://tazendra.caltech.edu/~postgres/cgi-bin/referenceform.cgi | ||
+ | SELECT * FROM cur_curdata WHERE cur_selcomment ~ '1'; | ||
+ | |||
+ | The results of the analysis (Oct 2013 - Jan 2014) are as follows: | ||
+ | |||
+ | <pre> | ||
+ | testing papers: 176 total, 69 positive, 107 negative | ||
+ | |||
+ | model                   true positive   false positive   precision   recall            F score | ||
+ | feature manipulation    47              15               75.80%      47/69 = 68.10%    0.717439889 | ||
+ | current SVM model       32              7                82%         32/69 = 46.30%    0.591831645 | ||
+ | new testing SVM model   42              18               70%         42/69 = 60.80%    0.650764526 | ||
+ | |||
+ | F score = 2*(precision*recall)/(precision+recall) | ||
+ | </pre> | ||
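The precision, recall and F score columns above can be recomputed from the true positive and false positive counts. The following is a minimal Python sketch, assuming recall is taken against the 69 positive testing papers; the function name is illustrative and not part of the SVM pipeline. Values computed from unrounded fractions can differ in the last digits from the table, which appears to use rounded percentages.

```python
def precision_recall_f(true_pos, false_pos, total_pos):
    """Return (precision, recall, F score) for one model.

    total_pos is the number of truly positive testing papers (69 above).
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / total_pos
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score


# Counts copied from the table above: (true positive, false positive).
models = {
    "feature manipulation": (47, 15),
    "current SVM model": (32, 7),
    "new testing SVM model": (42, 18),
}

for name, (tp, fp) in models.items():
    p, r, f = precision_recall_f(tp, fp, total_pos=69)
    print(f"{name}: precision {p:.1%}, recall {r:.1%}, F {f:.3f}")
```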
+ | |||
+ | Jan2014-Dec2014 -using feature manipulation models | ||
+ | <pre> | ||
+ | testing papers flagged: 226 total, 172 positive, 54 negative; precision 76.1% | ||
+ | </pre> | ||
+ | |||
+ | |||
+ | ==Micropublications== | ||
+ | go to [[Micropublications]] | ||
+ | |||
+ | ==User data submission== | ||
+ | |||
+ | the submission form is here: | ||
+ | |||
+ | http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/expr_pattern.cgi | ||
+ | |||
+ | and it appends data here | ||
+ | |||
+ | http://tazendra.caltech.edu/~azurebrd/cgi-bin/data/expr.ace | ||
+ | |||
+ | ==Numbers== | ||
+ | *9260 Expression objects in WS180 ( 7176 + 2084 Chronograms) | ||
+ | *12110 Expression objects in WS233 (10026 + 2084 Chronograms) | ||
+ | *13380 Expression objects in WS243 (11296 + 2084 Chronograms) | ||
+ | *14273 Expression objects in WS243 (12189 + 2084 Chronograms) | ||
+ | |||
+ | ==Transgenic Alleles from Constructs== | ||
+ | ====The initial population of cns tables for expression constructs==== | ||
+ | Over 4000 Construct objects that did not have a 'standard nomenclature' name as transgenes -e.g. hkdEx1202 for extrachromosomal arrays- were imported into Construct OA. | ||
+ | |||
+ | To split the curation details in the construction summary description into separate fields, Juancarlos wrote a script. | ||
+ | |||
+ | the script is located here: | ||
+ | |||
+ | |||
+ | /home/postgres/work/pgpopulation/cns_construct/20141110_daniela_constructs/update_cns_by_daniela.pl* | ||
+ | |||
+ | More information/files on Lario in the folder Construct/Clone stuff | ||
+ | |||
+ | === Alignment of Construct Data for Alliance February 2021 === | ||
+ | * WB Expression curation associates Expr_patterns with both constructs and transgenic alleles. To simplify expression schema changes for Alliance we will give transgene IDs to all constructs used in Expression Pattern. The Alliance schema will therefore have a transgenic allele tag, which will be populated with transgene data coming from WB expression curation. | ||
+ | |||
+ | ** As of 02.10.2021: There are 14053 pgids associated with constructs. Files are located on tazendra: /home/postgres/work/pgpopulation/exp_exprpattern/20210210_construct_transgene | ||
+ | ** the file is_in contains 6947 pgids: these are constructs in Expression OA that have a matching transgene in the Expression OA transgene field. For these objects Juancarlos will delete the construct from Expr OA as it is redundant with the transgene. | ||
+ | |||
+ | ** the file is_not contains 55 pgids: these are objects for which there is a construct that has a corresponding transgene in transgene OA, but the transgene is not in the transgene field for that expression object. Daniela will go through the list. If the transgene happens to be identical to the construct she will put that transgene in Expression OA and delete the construct in Expression OA. --Daniela done Feb 11 2021 | ||
+ | |||
+ | *script1* /home/postgres/work/pgpopulation/exp_exprpattern/20210210_construct_transgene/query_expression_constructs_transgenes.pl | ||
+ | This script finds Constructs that have a transgene, removes the construct from exp_cns field and adds the corresponding transgene in the Exp_trp field | ||
+ | |||
+ | ** the file does_not contains 7051 pgids: for these we will need to create new transgenes. See following paragraph: | ||
+ | |||
+ | ====Populating exp_transgene based on exp_construct==== | ||
+ | For back population of trp tables with existing constructs in the expression tables for which there is no associated transgene: | ||
+ | * find all exp_constructs that are not associated with a trp_construct | ||
+ | * create transgene IDs for each exp_construct following the data porting as laid out below | ||
+ | **Postgres creates a transgene pgid, and populates trp tables as follows | ||
+ | ***cns_summary copied to trp_summary | ||
+ | ***cns_name copied to trp_construct | ||
+ | ***cns_paper copied to trp_paper | ||
+ | ***cns_curator -default to Daniela | ||
+ | ***trp_name copied to exp_transgene | ||
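The field porting listed above can be sketched in Python over plain dictionaries. This is only an illustration of the mapping, not the Postgres script itself: the function name, the `trp_curator` key (assumed here to be where the default curator lands) and the placeholder IDs are assumptions.

```python
def construct_to_transgene(construct, new_transgene_name, default_curator="Daniela"):
    """Build the trp_* values for a new transgene from a construct record.

    construct is a dict with cns_summary, cns_name and cns_paper keys.
    new_transgene_name is supplied by the caller; how IDs are minted is
    not covered by this sketch.
    """
    return {
        "trp_name": new_transgene_name,
        "trp_summary": construct["cns_summary"],   # cns_summary -> trp_summary
        "trp_construct": construct["cns_name"],    # cns_name -> trp_construct
        "trp_paper": construct["cns_paper"],       # cns_paper -> trp_paper
        "trp_curator": default_curator,            # assumption: curator defaults to Daniela
    }


# Placeholder values, for illustration only.
example_construct = {
    "cns_name": "CNS_PLACEHOLDER_1",
    "cns_summary": "[example summary]",
    "cns_paper": "PAPER_PLACEHOLDER_1",
}
transgene = construct_to_transgene(example_construct, "TRP_PLACEHOLDER_1")

# trp_name is then what gets copied into exp_transgene.
exp_transgene = transgene["trp_name"]
```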
+ | |||
+ | NOTE: There were 1849 transgenes and 2259 constructs with no paper info; of these, 447 constructs and 159 transgenes were used in expression. We have now populated the paper field in construct and transgene for these objects via the Expression pattern connection: | ||
+ | /home/postgres/work/pgpopulation/exp_exprpattern/20210331_cns_trp_exp_paper | ||
+ | cns_no_paper_with_expr 447 constructs -> 405 have paper, spot checked the ones that have no paper and they had a Person association | ||
+ | trp_no_paper_with_expr 159 transgenes -> 125 have paper, spot checked the ones that have no paper and they had a Person association | ||
+ | The script to populate the paper info from expression objects is here: | ||
+ | /home/postgres/work/pgpopulation/exp_exprpattern/20210331_cns_trp_exp_paper/transfer_cns_trp_with_exp_no_paper.pl* | ||
+ | |||
+ | |||
+ | After data clean up we will need to suppress the data [delete construct] in exp_construct. <br> | ||
+ | |||
+ | All these changes were discussed and agreed upon on 02.10.2021 in a meeting (attendees: Daniela, Juancarlos, Chris, Karen). | ||
+ | |||
+ | Script2 /home/postgres/work/pgpopulation/exp_exprpattern/20210316_construct_to_transgene/copy_construct_to_transgene.pg | ||
+ | |||
+ | ====Construct/Transgene curation moving forward==== | ||
+ | * When authors are not using standard nomenclature for a transgene, the Expression curator will: | ||
+ | ** go to construct OA and create a new construct | ||
+ | ** Take that construct ID and put it in the construct field in expression OA | ||
+ | ** A cronjob (concatenates script 2 and script 1 above) will run overnight and will | ||
+ | *** look for all constructs listed in Expression OA, | ||
+ | *** create a transgene object for such construct | ||
+ | *** Populate the trp_fields as above copying data over from the construct object. | ||
+ | *** Add the transgeneID just created in the transgene field of Expression OA for which the construct was made | ||
+ | *** Delete the construct from the construct field | ||
+ | |||
+ | Cronjob: # 0 4 * * * /home/postgres/work/pgpopulation/exp_exprpattern/cronjobs/transfer_exp_cns_trp/wrapper.pl | ||
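The overnight steps described above (the cron fields `0 4 * * *` mean 04:00 every day) can be sketched in Python over in-memory rows. This is an assumption-laden illustration, not the wrapper.pl logic: the row layout, the `transfer_constructs` function and the placeholder ID minting are all hypothetical.

```python
import itertools


def transfer_constructs(expr_rows, transgene_for_construct, mint_transgene):
    """Replace every construct in the expression rows with a transgene.

    transgene_for_construct maps construct IDs to already existing
    transgene IDs; missing ones are created through mint_transgene.
    """
    for row in expr_rows:
        for construct in list(row["exp_construct"]):  # copy: we remove while iterating
            transgene = transgene_for_construct.get(construct)
            if transgene is None:
                # No transgene yet: create one for this construct.
                transgene = mint_transgene(construct)
                transgene_for_construct[construct] = transgene
            if transgene not in row["exp_transgene"]:
                row["exp_transgene"].append(transgene)
            # The construct is now redundant with the transgene.
            row["exp_construct"].remove(construct)
    return expr_rows


_counter = itertools.count(1)


def mint_placeholder(construct_id):
    """Hypothetical ID generator; real transgene IDs are assigned elsewhere."""
    return f"TRP_PLACEHOLDER_{next(_counter)}"


rows = [
    {"pgid": 1, "exp_construct": ["CNS_A"], "exp_transgene": []},
    {"pgid": 2, "exp_construct": ["CNS_A", "CNS_B"], "exp_transgene": ["TRP_EXISTING"]},
]
known = {"CNS_B": "TRP_EXISTING"}
transfer_constructs(rows, known, mint_placeholder)
```

In this sketch a construct shared by several expression rows gets a single transgene, because the mapping is updated as soon as the transgene is created.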
+ | |||
+ | ==new allele request== | ||
+ | |||
+ | go to the name server and log in: | ||
+ | http://www.sanger.ac.uk/sanger/Worm_NameServer | ||
+ | |||
+ | check if the new variation already exists by clicking on find variation. | ||
+ | |||
+ | If it does: | ||
+ | generate an ID in the OA: | ||
+ | http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/generic.cgi?action=TempVariationObo | ||
+ | putting in the public name and the ID | ||
+ | |||
+ | if it doesn't: | ||
+ | on the name server click on 'request a new variation ID' | ||
+ | put in the public name and the paper -with additional info | ||
+ | |||
+ | then go to the OA as above and generate an ID | ||
+ | |||
+ | ==bulk OA uploads== | ||
+ | Chris' master table- leave as is | ||
+ | https://docs.google.com/spreadsheets/d/13Dt1ZjrEYUe6uolqbsdjbDRexftoXBXWV4knyb5j4f4/edit#gid=0 | ||
+ | |||
+ | Copy table in my drive | ||
+ | https://docs.google.com/spreadsheets/d/1I4eVTuHR3GW3m5CFVPM8Ll2GQoQP26XZaMldlLg78iQ/edit#gid=0 | ||
+ | |||
+ | ==Expression pattern remodel== | ||
+ | [[Expression pattern remodel]] | ||
+ | |||
+ | ==OA and GO CC curation comparison== | ||
+ | |||
+ | ===Overview=== | ||
+ | Daniela and Kimberly are comparing CC annotations between the OA and GO pipelines. The document for tracking is here: | ||
+ | https://docs.google.com/spreadsheets/d/1QnRd0TE7zZGC4ThQvLX1d-9sQH2luqrYLb_A26o8WCE/edit#gid=579706084 | ||
+ | |||
+ | ===Scenarios=== | ||
+ | * no OA but GO_CC | ||
+ | ** import in OA exp with IDA as evidence code | ||
+ | ** Markers for GO: keep just the first reference and get rid of the ones that are confirmatory. | ||
+ | |||
+ | * no GO but OA annotation | ||
+ | ** create a gpad file and ask Tony Sawford to upload it to the protein to GO database | ||
+ | ** the gpad should contain annotation extensions to tissues | ||
+ | |||
+ | * Secreted proteins: | ||
+ | ** clean up the secreted proteins annotations | ||
+ | |||
+ | === Importing GO_CC annotations in the Paper term info === | ||
+ | |||
+ | We can parse a .ace file that Kimberly generates for every build located here (Tazendra) | ||
+ | |||
+ | /home/acedb/kimberly/citace_upload/go/gpad2ace/gpad_parsing | ||
+ | |||
+ | and called gp_annotation.ace | ||
+ | |||
+ | We want to consider only objects that have | ||
+ | Annotation_relation "part_of" or "colocalizes_with" | ||
+ | and | ||
+ | Reference "WBPaper000nnnn" | ||
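The selection rule above can be sketched in Python as a filter over .ace paragraphs. This is a rough illustration under stated assumptions: objects are taken to be separated by blank lines with quoted tag values, and the class name, IDs and the `Gene` tag in the sample are made up, since the real layout of gp_annotation.ace is not shown here.

```python
import re

WANTED_RELATIONS = {"part_of", "colocalizes_with"}


def split_objects(ace_text):
    """Split .ace text into objects (lists of lines) on blank lines."""
    blocks = re.split(r"\n\s*\n", ace_text.strip())
    return [block.strip().splitlines() for block in blocks if block.strip()]


def tag_values(lines, tag):
    """Collect every quoted value found on lines that start with tag."""
    values = []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == tag:
            values.extend(re.findall(r'"([^"]*)"', parts[1]))
    return values


def select_objects(ace_text, paper_id):
    """Return header lines of objects with a wanted relation and the given paper."""
    selected = []
    for lines in split_objects(ace_text):
        relations = set(tag_values(lines, "Annotation_relation"))
        papers = tag_values(lines, "Reference")
        if relations & WANTED_RELATIONS and paper_id in papers:
            selected.append(lines[0])
    return selected


# Hypothetical sample; class name, IDs and the Gene tag are placeholders.
SAMPLE = '''
GO_annotation : "A1"
Gene "GENE_PLACEHOLDER_1"
Annotation_relation "part_of"
Reference "PAPER_PLACEHOLDER_1"

GO_annotation : "A2"
Annotation_relation "enables"
Reference "PAPER_PLACEHOLDER_1"

GO_annotation : "A3"
Annotation_relation "colocalizes_with"
Reference "PAPER_PLACEHOLDER_2"
'''

hits = select_objects(SAMPLE, "PAPER_PLACEHOLDER_1")
```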
+ | |||
+ | Once I enter the WBPaperID in OA, I would like to see in the term info: | ||
+ | |||
+ | *1) Gene: Display locus | ||
+ | *2) 'part_of' or 'colocalizes_with' | ||
+ | *3) GO Term: Display ‘name’ For example for id : GO:0005634 display nucleus | ||
+ | *4) GO_code: e.g.’IDA' | ||
+ | *5) Extension: get values from GO_term_relation and Display them. Example: | ||
+ | **part_of(Anatomy name) | ||
+ | For example, if you have: part_of(WBbt:0004821)|part_of(WBbt:0006786)|part_of(WBbt:0006787) | ||
+ | Display: | ||
+ | part_of(DVC)|part_of(ut2)|part_of(ut3) | ||
+ | **exists_during(anaphase) | ||
+ | *6) get values from Life_stage_relation and display them | ||
+ | **Life_stage_relation | ||
+ | Life_stage_relation "exists_during" "embryo" | ||
+ | *7) get the values from Anatomy_relation and display them | ||
+ | *8) contributed_by: import the value whenever it is NOT WormBase | ||
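The display rule for extensions in point 5 can be sketched in Python as a substitution of anatomy IDs by names inside each `relation(ID)` unit. The lookup table below only holds the three IDs from the example above, and the function name is illustrative; unknown IDs are left unchanged in this sketch.

```python
import re

# ID -> name pairs taken from the example in point 5.
ANATOMY_NAMES = {
    "WBbt:0004821": "DVC",
    "WBbt:0006786": "ut2",
    "WBbt:0006787": "ut3",
}


def display_extension(extension, names):
    """Replace the ID inside each relation(ID) unit with its display name."""
    def swap(match):
        relation, term = match.group(1), match.group(2)
        return f"{relation}({names.get(term, term)})"

    return re.sub(r"(\w+)\(([^()]+)\)", swap, extension)


shown = display_extension(
    "part_of(WBbt:0004821)|part_of(WBbt:0006786)|part_of(WBbt:0006787)",
    ANATOMY_NAMES,
)
print(shown)
```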
+ | |||
+ | For point #8 we could store the info in Curated_by in the expression pattern model. If so, create a free small-text field Curated_by in Expression OA, tab 4. (For now we are not dumping it; we want to check how often it is used and will dump it in the future if need be.) | ||
+ | |||
+ | Name of the table: tin_paper_legocc | ||
+ | |||
+ | the links to protein to GO are: | ||
+ | ftp://ftp.ebi.ac.uk/pub/contrib/goa/ | ||
+ | File:gp_association.6239_wormbase.gz | ||
+ | |||
+ | the gp_association.6239_wormbase.gz File is the source file for Juancarlos' script here: | ||
+ | /home/acedb/kimberly/citace_upload/go/gpad2ace/2017_January | ||
+ | |||
+ | the gp_association.6239_wormbase.gz gets updated weekly, every Monday morning UK time, so we can set up a cronjob for Tuesday morning 8:00 am California time. | ||
+ | |||
+ | |||
+ | Example: | ||
+ | <pre> | ||
+ | I would want to have | ||
+ | *Gene1, part of GO_Termx, GO_codex, Extensionsx | ||
+ | GO_Termy, GO_codey, Extensionsy | ||
+ | |||
+ | *Gene2, part of GO_Termz, GO_codez, Extensionsz | ||
+ | </pre> | ||
+ | |||
+ | All relevant files are on tazendra here: | ||
+ | /home/acedb/draciti/lego_cc_annotations | ||
+ | |||
+ | We are using the gp2protein file in that folder. Kimberly has plans to change the format into gpi (we will modify the source accordingly when ready). | ||
+ | |||
+ | We have a cronjob running every Tuesday morning (8:00 am Pasadena time) that gets data from the latest version of gp_association.6239_wormbase.gz and populates the tables on postgres. | ||
+ | |||
+ | 0 8 * * tue /home/acedb/draciti/lego_cc_annotations/wrapper.sh | ||
+ | |||
+ | '''When Kimberly changes the gp2protein file into a gpi file we need to update the script''' | ||
+ | |||
+ | === Comparison CGI === | ||
+ | |||
+ | Go to [[Comparison CGI]] | ||
+ | |||
+ | == AGR data transfer == | ||
+ | Annotations with Uncertain tag have not been included in the initial upload. | ||
+ | |||
+ | Mappings to LinkML | ||
+ | <pre> | ||
+ | /////////////////////////////////// | ||
+ | // | ||
+ | // ?Expr_pattern class | ||
+ | // | ||
+ | /////////////////////////////////// | ||
+ | |||
+ | |||
+ | ?Expr_pattern Expression_of Gene ?Gene OUTXREF Expr_pattern #Evidence Gene -> BiologicalEntity in LinkML | ||
+ | Reflects_endogenous_expression_of ?Gene -> endogenous tag in expression_qualifier_set in LinkML | ||
+ | // CDS ?CDS OUTXREF Expr_pattern // for coding genes -> double check with Paul D but probably good to delete | ||
+ | Sequence ?Sequence OUTXREF Expr_pattern // for clones??? -> moved to remarks (12 objects) good to ignore | ||
+ | // Pseudogene ?Pseudogene OUTXREF Expr_pattern // [030801 krb] -> asked to remove from schema Feb 2022 | ||
+ | Clone ?Clone OUTXREF Expr_pattern -> reagents in LinkML | ||
+ | Protein ?Protein OUTXREF Expr_pattern -> asked to remove from schema Feb 2022 | ||
+ | Protein_description Text // information for Expr_patterns with unknown antigens [031105 krb] -> asked to remove from schema Feb 2022 | ||
+ | // Homol Homol_homol ?Homol_data XREF Expr_homol ?Method Float Int UNIQUE Int Int UNIQUE Int #Homol_info //Expr_pattern mapping [060427 ar2] -> for chronogram, ignore if chronograms are going to be imported as pictures | ||
+ | Expression_data Life_stage ?Life_stage OUTXREF Expr_pattern #Qualifier -> developmental_stage in LinkML | ||
+ | Anatomy_term ?Anatomy_term OUTXREF Expr_pattern #Qualifier -> anatomical_structure in LinkML | ||
+ | GO_term ?GO_term OUTXREF Expr_pattern #GR_condition -> cellular_component in LinkML | ||
+ | Not_in_Life_stage ?Life_stage #Qualifier -> developmental_stage negated in LinkML | ||
+ | Not_in_Anatomy_term ?Anatomy_term #Qualifier -> anatomical_structure negated in LinkML | ||
+ | Not_in_GO_term ?GO_term #GR_condition -> cellular_component negated in LinkML | ||
+ | Subcellular_localization ?Text -> ExpressionExperimentStatement in LinkML | ||
+ | Type Antibody ?Text FILL_DEFAULT -> MMO immunohistochemistry -> assay_used in LinkML | ||
+ | *Cis_regulatory_element* Text FILL_DEFAULT -> this tag was used in WB CV for tagging sequence feature annotations. As such, it is not a method tag, can probably drop it. | ||
+ | |||
+ | EPIC ?Text FILL_DEFAULT -> used for Murray study, good to ignore | ||
+ | Genome_editing ?Text FILL_DEFAULT -> MMO knock-in in situ reporter assay -> assay_used in LinkML | ||
+ | In_situ Text FILL_DEFAULT -> MMO RNA in situ -> assay_used in LinkML | ||
+ | *Localizome* ?Text FILL_DEFAULT Localizome -> for chronograms good to ignore. We can bring chronograms as image objects. They don’t hold any anatomy/ls annotations per se. Example chronograph page on WB: https://wormbase.org/species/all/expr_pattern/Chronogram1954#0213--10 | ||
+ | Microarray ?Microarray_experiment -> holds images from Yanai -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen | ||
+ | Northern Text FILL_DEFAULT -> MMO northern -> assay_used in LinkML | ||
+ | Reporter_gene ?Text FILL_DEFAULT -> MMO in situ reporter -> assay_used in LinkML | ||
+ | RNASeq ?Analysis -> holds images from Yanai -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen | ||
+ | RT_PCR Text FILL_DEFAULT -> MMO RT PCR -> assay_used in LinkML | ||
+ | Tiling_array ?Analysis -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen | ||
+ | Western Text FILL_DEFAULT -> MMO western -> assay_used in LinkML | ||
+ | Expression_cluster ?Expression_cluster INXREF Expr_pattern //added for localizome | ||
+ | Microarray_results ?Microarray_results INXREF Expr_Pattern -> in high throughput dataset | ||
+ | Pattern ?Text //Multi_value ordering issue in Datomic possibly -> ExpressionExperimentStatement | ||
+ | Picture ?Picture INXREF Expr_pattern -> image in LinkML | ||
+ | MovieURL Text //Added by wen for link to movie URLs. -> asked to remove from schema Feb 2022 | ||
+ | Movie ?Movie INXREF Expr_pattern //Added by Wen to curate Expr_pattern video -> movie in LinkML | ||
+ | Species UNIQUE ?Species -> Can be inferred from biological entity. Not needed in LinkML | ||
+ | Remark ?Text #Evidence -> ExpressionAnnotationStatement in LinkML | ||
+ | DB_info ?Database ^database ?Database_field ^field Text ^accession Example: Expr1040545 - Miller study -> Link out can be handled via the Links to third party expression resources | ||
+ | Experiment Laboratory ?Laboratory -> asked to remove from schema Feb 2022 | ||
+ | Strain UNIQUE ?Strain -> specimen_genomic_model (AGM) in LinkML | ||
+ | Person UNIQUE ?Person -> talk to Kimberly to mint WBPaperIDs for personal communications | ||
+ | Reference ?Paper OUTXREF Expr_pattern -> reference in LinkML | ||
+ | Transgene ?Transgene OUTXREF Expr_pattern -> specimen_alleles in LinkML | ||
+ | Variation ?Variation INXREF Expr_pattern -> specimen_alleles in LinkML | ||
+ | Construct ?Construct OUTXREF Expression_pattern -> all constructs have been converted to transgenes -> specimen_alleles in LinkML | ||
+ | Associated_feature ?Feature OUTXREF Associated_with_expression_pattern #Evidence -> BiologicalEntity in LinkML?? | ||
+ | Antibody_info ?Antibody OUTXREF Expr_pattern // This applies to both Western & Antibody staining -> reagents in LinkML | ||
+ | // added [031120 krb] | ||
+ | Curated_by UNIQUE Text // Hinxton (HX) or Caltech (CIT) Sylvia [010927 krb] -> used by chronograms curated by caltech (good to ignore) | ||
+ | Historical_gene ?Gene Text -> double check with Chris how these type of info will be dealt with generally at Alliance | ||
+ | |||
+ | |||
+ | //Qualifer hash will be used for Expr_pattern curation to specify the reliability of data. | ||
+ | |||
+ | #Qualifier Certain | ||
+ | Uncertain //For faint or variable expression | ||
+ | Partial //For expression of unidentified cell in a cell group | ||
+ | Anatomy_term ?Anatomy_term //combines life stage with anatomy term in expr pattern annotation | ||
+ | Life_stage ?Life_stage //combines life stage with anatomy term in expr pattern annotation | ||
+ | Remark Text //New tag to take the optional text from the Certain/Uncertain/Partial nested tags above. | ||
+ | </pre> | ||
+ | |||
+ | ===detailed=== | ||
+ | * Gene ?Gene OUTXREF Expr_pattern #Evidence Gene -> BiologicalEntity in LinkML | ||
+ | * Reflects_endogenous_expression_of ?Gene -> endogenous tag in expression_qualifier_set in LinkML | ||
+ | * Clone ?Clone OUTXREF Expr_pattern -> reagents in LinkML | ||
+ | * Life_stage ?Life_stage OUTXREF Expr_pattern #Qualifier -> developmental_stage in LinkML | ||
+ | * Anatomy_term ?Anatomy_term OUTXREF Expr_pattern #Qualifier -> anatomical_structure in LinkML | ||
+ | * GO_term ?GO_term OUTXREF Expr_pattern #GR_condition -> cellular_component in LinkML | ||
+ | * Not_in_Life_stage ?Life_stage #Qualifier -> developmental_stage negated in LinkML | ||
+ | * Not_in_Anatomy_term ?Anatomy_term #Qualifier -> anatomical_structure negated in LinkML | ||
+ | * Not_in_GO_term ?GO_term #GR_condition -> cellular_component negated in LinkML | ||
+ | * Subcellular_localization ?Text -> ExpressionExperimentStatement in LinkML | ||
+ | * Type | ||
+ | ** Antibody ?Text FILL_DEFAULT -> MMO immunohistochemistry -> assay_used in LinkML | ||
+ | ** Genome_editing ?Text FILL_DEFAULT -> MMO knock-in in situ reporter assay -> assay_used in LinkML | ||
+ | ** In_situ Text FILL_DEFAULT -> MMO RNA in situ -> assay_used in LinkML | ||
+ | ** Northern Text FILL_DEFAULT -> MMO northern -> assay_used in LinkML | ||
+ | ** Reporter_gene ?Text FILL_DEFAULT -> MMO in situ reporter -> assay_used in LinkML | ||
+ | ** RT_PCR Text FILL_DEFAULT -> MMO RT PCR -> assay_used in LinkML | ||
+ | ** Western Text FILL_DEFAULT -> MMO western -> assay_used in LinkML | ||
+ | * Pattern ?Text //Multi_value ordering issue in Datomic possibly -> ExpressionExperimentStatement | ||
+ | * Picture ?Picture INXREF Expr_pattern -> image in LinkML | ||
+ | * Movie ?Movie INXREF Expr_pattern //Added by Wen to curate Expr_pattern video -> movie in LinkML | ||
+ | * Species UNIQUE ?Species -> Can be inferred from biological entity. Not needed in LinkML | ||
+ | * Remark ?Text #Evidence -> ExpressionAnnotationStatement in LinkML | ||
+ | * DB_info ?Database ^database ?Database_field ^field Text ^accession Example: Expr1040545 - Miller study -> Link out can be handled via the Links to third party expression resources | ||
+ | * Strain UNIQUE ?Strain -> specimen_genomic_model (AGM) in LinkML | ||
+ | * Reference ?Paper OUTXREF Expr_pattern -> reference in LinkML | ||
+ | * Transgene ?Transgene OUTXREF Expr_pattern -> specimen_alleles in LinkML | ||
+ | * Variation ?Variation INXREF Expr_pattern -> specimen_alleles in LinkML | ||
+ | * Construct ?Construct OUTXREF Expression_pattern -> all constructs have been converted to transgenes -> specimen_alleles in LinkML | ||
+ | * Associated_feature ?Feature OUTXREF Associated_with_expression_pattern #Evidence -> BiologicalEntity in LinkML?? | ||
+ | * Antibody_info ?Antibody OUTXREF Expr_pattern // This applies to both Western & Antibody staining -> reagents in LinkML | ||
+ | |||
+ | * Historical_gene ?Gene Text -> double check with Chris how these type of info will be dealt with generally at Alliance | ||
+ | |||
+ | //Qualifer hash will be used for Expr_pattern curation to specify the reliability of data. | ||
+ | |||
+ | #Qualifier Certain | ||
+ | Uncertain //For faint or variable expression | ||
+ | Partial //For expression of unidentified cell in a cell group | ||
+ | Anatomy_term ?Anatomy_term //combines life stage with anatomy term in expr pattern annotation | ||
+ | Life_stage ?Life_stage //combines life stage with anatomy term in expr pattern annotation | ||
+ | Remark Text //New tag to take the optional text from the Certain/Uncertain/Partial nested tags above. | ||
+ | |||
+ | To ignore or already removed from schema: | ||
+ | * CDS ?CDS OUTXREF Expr_pattern // for coding genes -> double check with Paul D but probably good to delete | ||
+ | * Sequence ?Sequence OUTXREF Expr_pattern // for clones??? -> moved to remarks (12 objects) good to ignore | ||
+ | * Protein ?Protein OUTXREF Expr_pattern -> asked to remove from schema Feb 2022 | ||
+ | * Protein_description Text -> asked to remove from schema Feb 2022 | ||
+ | * Pseudogene ?Pseudogene OUTXREF Expr_pattern // [030801 krb] -> asked to remove from schema Feb 2022 | ||
+ | * MovieURL Text //Added by wen for link to movie URLs. -> asked to remove from schema Feb 2022 | ||
+ | * Laboratory ?Laboratory -> asked to remove from schema Feb 2022 | ||
+ | * Curated_by UNIQUE Text // Hinxton (HX) or Caltech (CIT) Sylvia [010927 krb] -> used by chronograms curated by caltech (good to ignore) | ||
+ | * Type | ||
+ | **Cis_regulatory_element Text FILL_DEFAULT -> this tag was used in WB CV for tagging sequence feature annotations. As such, it is not a method tag, can probably drop it. | ||
+ | ** EPIC ?Text FILL_DEFAULT -> used for Murray study, good to ignore | ||
+ | |||
+ | Chronograms: | ||
+ | * Homol Homol_homol ?Homol_data XREF Expr_homol ?Method Float Int UNIQUE Int Int UNIQUE Int #Homol_info //Expr_pattern mapping [060427 ar2] -> for chronogram, ignore if chronograms are going to be imported as pictures | ||
+ | * Type | ||
+ | **Localizome ?Text FILL_DEFAULT Localizome -> for chronograms good to ignore. We can bring chronograms as image objects. They don’t hold any anatomy/ls annotations per se. Example chronograph page on WB: https://wormbase.org/species/all/expr_pattern/Chronogram1954#0213--10 | ||
+ | |||
+ | |||
+ | Talked to Wen- no need to import right now | ||
+ | * Expression_cluster ?Expression_cluster INXREF Expr_pattern //added for localizome -> only chronograms, no need to import | ||
+ | * Microarray_results ?Microarray_results INXREF Expr_Pattern -> in high throughput dataset | ||
+ | |||
+ | Coming in with tags in images: | ||
+ | * Type | ||
+ | ** Microarray ?Microarray_experiment -> holds images from Yanai -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen again once we will work on the image class | ||
+ | ** RNASeq ?Analysis -> holds images from Yanai -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen again once we will work on the image class | ||
+ | ** Tiling_array ?Analysis -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen again once we will work on the image class | ||
+ | |||
+ | |||
+ | Talk to Kimberly | ||
+ | * Person UNIQUE ?Person -> talk to Kimberly to mint WBPaperIDs for personal communications | ||
+ | |||
+ | === UBERON to WBbt anatomy mappings for Alliance === | ||
− | + | The csv file with the mappings is in this excel spreadsheet | |
− | + | https://docs.google.com/spreadsheets/d/1cC6jnDP7x2mQJmFQ6QRlZxv-nS_9vSROCJAm2W-SCwY/edit#gid=770823761 | |
− | |||
− | == | + | == Single Cell RNAseq graphs == |
− | + | For Articles that contain single cell RNA seq analysis, we can offer authors: | |
− | + | * Display of interactive graph TPM / % cell expressing per gene | |
+ | * Inclusion of enriched gene sets per cell (to the extent that we can translate their cell group to anatomy term). | ||
+ | * Link out to their analysis tools page | ||
− | + | The CeNGEN example can be seen here: https://wormbase.org/species/c_elegans/gene/WBGene00001170#-1-10 | |
− | + | Data are also stored here: http://caltech.wormbase.org/pub/wormbase/datasets-published/packer2019/ |
Latest revision as of 19:39, 11 May 2023
Contents
- 1 Expression Pattern
- 2 Genes with expression
- 3 WS248 numbers
- 4 OA interface
- 5 obsolete fields
- 6 Importing the large large scale Expression_pattern left on Citace Minus into OA
- 7 Serial numbers for large scale imports
- 8 Itai Yanai large scale import -WBPaper00041190
- 9 Itai Yanai 2015 large import -WBPaper00046121
- 10 TransgeneOme import
- 11 David Miller tiling arrays import -WBPaper00037950
- 12 Hench large scale import -Endrov
- 13 Paker 2019
- 14 Reilly 2020 - WBPaper00060123
- 15 EPIC detailed
- 16 Deleting files from Citace Minus
- 17 Expression-paper association
- 18 Dumper
- 19 Data parsing
- 20 -D file for Citace Minus
- 21 Exporting Reporter Gene description from Expr_pattern OA to Transgene OA
- 22 Miller paper- tiling arrays
- 23 Other nematodes SVM analysis for gene expression
- 24 Other species
- 25 parasitic nematodes papers containing expression data
- 26 2A viral technology
- 27 SVM analysis for gene expression
- 28 Micropublications
- 29 User data submission
- 30 Numbers
- 31 Transgenic Alleles from Constructs
- 32 new allele request
- 33 bulk OA uploads
- 34 Expression pattern remodel
- 35 OA and GO CC curation comparison
- 36 AGR data transfer
- 37 Single Cell RNAseq graphs
Expression Pattern
Current model (WS280)
?Expr_pattern Expression_of Gene ?Gene XREF Expr_pattern #Evidence
                            Reflects_endogenous_expression_of ?Gene
                            CDS ?CDS XREF Expr_pattern // for coding genes
                            Sequence ?Sequence XREF Expr_pattern // for clones???
                            Pseudogene ?Pseudogene XREF Expr_pattern // [030801 krb]
                            Clone ?Clone XREF Expr_pattern
                            Protein ?Protein XREF Expr_pattern
                            Protein_description Text // information for Expr_patterns with unknown antigens [031105 krb]
              Homol Homol_homol ?Homol_data XREF Expr_homol ?Method Float Int UNIQUE Int Int UNIQUE Int #Homol_info //Expr_pattern mapping [060427 ar2]
              Expression_data Life_stage ?Life_stage XREF Expr_pattern #Qualifier
                              Anatomy_term ?Anatomy_term XREF Expr_pattern #Qualifier
                              GO_term ?GO_term XREF Expr_pattern #GR_condition
                              Not_in_Life_stage ?Life_stage #Qualifier
                              Not_in_Anatomy_term ?Anatomy_term #Qualifier
                              Not_in_GO_term ?GO_term #GR_condition
                              Subcellular_localization ?Text
              Type Antibody ?Text
                   Cis_regulatory_element Text
                   EPIC ?Text
                   Genome_editing ?Text
                   In_situ Text
                   Localizome ?Text
                   Microarray ?Microarray_experiment
                   Northern Text
                   Reporter_gene ?Text
                   RNASeq ?Analysis
                   RT_PCR Text
                   Tiling_array ?Analysis
                   Western Text
              Expression_cluster ?Expression_cluster XREF Expr_pattern //added for localizome
              Microarray_results ?Microarray_results XREF Expr_Pattern
              Pattern ?Text
              Picture ?Picture XREF Expr_pattern
              MovieURL Text //Added by wen for link to movie URLs.
              Movie ?Movie XREF Expr_pattern //Added by Wen to curate Expr_pattern video
              Species UNIQUE ?Species
              Remark ?Text #Evidence
              DB_info ?Database ?Database_field Text
              Experiment Laboratory ?Laboratory
                         Author ?Author
                         Date UNIQUE DateType
                         Strain UNIQUE ?Strain
              Reference ?Paper XREF Expr_pattern
              Transgene ?Transgene XREF Expr_pattern
              Variation ?Variation XREF Expr_pattern
              Construct ?Construct XREF Expression_pattern
              Associated_feature ?Feature XREF Associated_with_expression_pattern #Evidence
              Antibody_info ?Antibody XREF Expr_pattern // This applies to both Western & Antibody staining
                                                        // added [031120 krb]
              Curated_by UNIQUE Text // Hinxton (HX) or Caltech (CIT) Sylvia [010927 krb]
              Historical_gene ?Gene Text

//Qualifer hash will be used for Expr_pattern curation to specify the reliability of data.

#Qualifier Certain
           Uncertain //For faint or variable expression
           Partial //For expression of unidentified cell in a cell group
           Anatomy_term ?Anatomy_term //combines life stage with anatomy term in expr pattern annotation
           Life_stage ?Life_stage //combines life stage with anatomy term in expr pattern annotation
           Remark Text //New tag to take the optional text from the Certain/Uncertain/Partial nested tags above.
Tags used in Expr_pattern objects (WS221):
Laboratory Expr_pattern Pattern Life_stage Gene Antibody Subcellular_localization GO_term Western Transgene Protein_description In_Situ Author Anatomy_term Reporter_gene Picture Date Reference Expressed_in Antibody_info Protein Northern Clone Cell RT_PCR Strain Remark MovieURL Pseudogene Curated_by Sequence
Types of fields Juancarlos can implement:
- text : text
- bigtext : text box expanded
- dropdown : few values
- ontology : controlled vocabulary
- multiontology / multidropdown : allows multiple values
- toggle : on / off
Genes with expression
To check the number of genes that have expression objects, run this script on tazendra:
/home/postgres/work/get_stuff/for_daniela/20140715_exp_gene_distinct/get_exp_gene_distinct.pl
- 5575 as of August 2014
- 5734 as of January 2016
WS248 numbers
For expression in WS248 we have 19052 objects coming from Yanai elegans, 68097 for Yanai other species, 10545 for Miller tiling arrays and 13877 manually curated, for a total of 111571.
For pictures in WS248 we have 19052 objects coming from Yanai elegans, 68097 for Yanai other species, and 13912 manually curated, for a total of 101061.
These are the statistics from citace that Wen pulled on May 11th 2015. Here are the changes from WS243 to WS248:
- find Antibody: 2525 --> 2785, 260 added.
- find Anatomy_term: 6839 --> 6842, 3 added.
- find Anatomy_function: 598 --> 924, 326 added.
- find DO_term: 6350 --> 6571, 221 added.
- find Expr_pattern: 42979 --> 111571, 68592 added.
- find Picture: 32636 --> 101061, 68425 added.
OA interface
OA editor label -- postgres table name -- type of table and description.
Dumper
In February 2015 we added the qualifier life stage field so we could capture an anatomy term and a life stage associated with each other. We added a qualifier_lifestage field and modified the dumper so that whenever there is an anatomy term and a life stage in the qualifier life stage field, it dumps:
Expr_pattern : "Expr12000"
Anatomy_term "WBbt:0004575" Life_stage "WBls:0000264"
Life_stage "WBls:0000264" Anatomy_term "WBbt:0004575"
We also set it up so that if only the qualifier life stage is filled in, it is dumped on its own. This is because data entered in the life stage field of the micropublication form goes into exp_qualifierls.
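The dumping rule for anatomy terms and qualifier life stages can be sketched as follows. This is an illustration in Python, not the dumper itself (which is a Perl module); the function name `dump_anatomy_lifestage` is hypothetical, and the Certain/Uncertain/Partial qualifiers are deliberately left out to keep the pairing logic visible.

```python
def dump_anatomy_lifestage(anatomy_terms, qualifier_lifestages):
    """Return .ace lines for one pgid (hypothetical helper, qualifiers ignored)."""
    lines = []
    if anatomy_terms:
        for anat in sorted(anatomy_terms):
            if not qualifier_lifestages:
                # Anatomy without a qualifier life stage is dumped on its own.
                lines.append(f'Anatomy_term "{anat}"')
            for ls in sorted(qualifier_lifestages):
                # Every anatomy / life stage pair is dumped in both directions.
                lines.append(f'Anatomy_term "{anat}" Life_stage "{ls}"')
                lines.append(f'Life_stage "{ls}" Anatomy_term "{anat}"')
    else:
        # Micropublication data may carry a qualifier life stage with no anatomy.
        for ls in sorted(qualifier_lifestages):
            lines.append(f'Life_stage "{ls}"')
    return lines


for line in dump_anatomy_lifestage(["WBbt:0004575"], ["WBls:0000264"]):
    print(line)
```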
Tab1
- Pgdbid -- no table -- postgres database ID, generates automatically upon entry.
- Expr_pattern -- Expr_pattern : "exp_name" -- text -- The Expression Pattern ID is generated when a new object is created. When making a new row, the OA looks at all entries in exp_name that begin with "Expr", captures the numbers, finds the highest, adds 1 to it, puts 'Expr' in front, and uses that as the new name.
- Reference -- Reference "exp_paper" -- multiontology on paper WBPaperID. It is a multiontology because there are Expr objects with multiple papers associated. A query for those is: testdb=> SELECT * FROM exp_paper WHERE exp_paper ~ ','; and the result:
302 | "WBPaper00001926","WBPaper00001469" | 2011-05-31 11:37:14.153562-07
5478 | "WBPaper00002573","WBPaper00002922" | 2011-05-31 12:04:27.053611-07
5479 | "WBPaper00001785","WBPaper00002922" | 2011-05-31 12:04:27.284807-07
5501 | "WBPaper00003285","WBPaper00001812" | 2011-05-31 12:04:34.649968-07
5502 | "WBPaper00003285","WBPaper00001812" | 2011-05-31 12:04:34.893549-07
5557 | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:04:50.273589-07
5558 | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:04:50.51613-07
5559 | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:04:50.724796-07
5689 | "WBPaper00002573","WBPaper00002922" | 2011-05-31 12:05:33.692997-07
5707 | "WBPaper00001014","WBPaper00002319","WBPaper00000647" | 2011-05-31 12:05:42.104837-07
8260 | "WBPaper00031556","WBPaper00032077" | 2011-05-31 12:18:44.374796-07
- Person -- "exp_person" -- multiontology on WBPersons, used to capture people for personal communications and to capture what was stored in the 'Author' field for old annotations. Added on Feb 8th, 2021.
- Gene -- Gene "exp_gene" -- multiontology on genes WBGeneID - show WBID, locus, and synonym in term info
- Endogenous -- "exp_endogenous" toggle tag in ace file Reflects_endogenous_expression_of
- Rel Anatomy -- "exp_relanatomy" dropdown on part_of
- Anatomy -- Anatomy_term "exp_anatomy" exp_qualifier "exp_qualifiertext" -- multiontology. Daniela will associate different Anatomy-qualifier-qualifier_text in different OA rows, so some Expr objects will have multiple rows / multiple pgids. When querying by any of these fields, if editing a different field, the curator should query by Expr to make sure all pgids for that object have that other field edited.
- Qualifier -- exp_qualifier -- dropdown -- Certain / Uncertain / Partial / NOT (NOT is not dumping for now. Added feb 2015 to capture negative expression)
- Anatomy certain -- exp_certain -- multiontology. Controlled vocabulary found here: https://github.com/raymond91125/Wao/raw/master/WBbt.obo (same as in Picture OA). We need to have 3 different Anatomy term boxes, one for the Partial, one for the certain and one for the uncertain Qualifiers.
- Anatomy Partial -- exp_partial -- multiontology.
- Anatomy Uncertain -- exp_uncertain -- multiontology.
- Anatomy no qualifier-- exp_noqualifier -- multiontology. We added this field because when we parsed the old expr_pattern data (WS226) 5518 anatomy_term lines did not have a #Qualifier.
- Qualifier Text -- exp_qualifiertext -- bigtext
- GO_term -- GO_term "exp_goid" -- multiontology of GO_Term like gop_goid.
- Subcellular_localization -- Subcellular_localization "exp_subcellloc"-- bigtext, details on subcellular localization.
- Rel LS -- "exp_rellifestage" dropdown on part_of and happens_during
- Life_stage -- Life_stage "exp_lifestage" Convert the life stage IDs into names from the obo_name_lifestage -- multiontology like in the phenotype OA and picture OA
- Species -- "exp_species"
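The naming rule described for the Expr_pattern field (take every exp_name value that begins with "Expr", find the highest number, add one) can be sketched like this. The function name `next_expr_name` is hypothetical, and the sketch assumes the names have already been read out of the exp_name table into a list.

```python
import re


def next_expr_name(existing_names):
    """Return the next Expr name given the names already in exp_name."""
    numbers = []
    for name in existing_names:
        match = re.match(r"Expr(\d+)", name)  # only names that begin with "Expr"
        if match:
            numbers.append(int(match.group(1)))
    # With no existing names the counter starts at 1 (an assumption of this sketch).
    return f"Expr{max(numbers, default=0) + 1}"


print(next_expr_name(["Expr12000", "Expr15660", "not_an_expr"]))
```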
On Nov 3rd 2014 we added 4 fields that will not be dumped yet but will be used to aid granular curation and to port annotation extensions into GO (relations were also implemented in Feb 2015):
- Qualifier LS -> multiontology on life stages -> exp_qualifierls dependent_on
- GR Anatomy -> multiontology on anatomy terms -> exp_granatomy
- GR LS -> multiontology on life stages -> exp_grlifestage
- Rel Cell Cycle -- "exp_relcellcycle" dropdown on part_of, independent_of, happens_during, dependent_on
- GR Cell Cycle -> multiontology on GO -> exp_grcellcycle
Juancarlos parsed .ace dump from WS226: 5518 anatomy_term lines without a #Qualifier at all in expr_no_qualifier
2703 anatomy_term lines with #qualifier and extra text in expr_data_with_extra_anatomy. expr_data_with_extra_anatomy_categorized 796 unique text-expr linked to various anat_terms in expr_data_with_extra_anatomy for example, look at "Expressed in ventral male specific muscles." which has a unique Expr to multiple anat_terms ; or "1 neuron" linked to multiple different expr / anat_term
Tab2
- Type -- exp_exprtype -- multidropdown select from: Antibody, Reporter_gene, In_situ, RT_PCR, Northern, Western
- Antibody_Text -- Antibody "exp_antibodytext" -- bigtext. This tag was used 462 times in WS221.
- Reporter_gene_Text -- Reporter_gene "exp_reportergene" -- bigtext. This tag was used 7273 times in WS221 and has been used twice for the same object -> lines are | separated. Details on reporter gene construct. Multiline, the dumper dumps multiple lines.
- In_Situ -- In_Situ "exp_insitu" -- bigtext. This tag was used 434 times in WS221.
- RT_PCR -- RT_PCR "exp_rtpcr" -- bigtext. This tag was used 165 times in WS221.
- Northern -- Northern "exp_northern" -- bigtext. This tag was used 347 times in WS221.
- Western -- Western "exp_western" -- bigtext. This tag was used 19 times in WS221.
We have a multidropdown on the values above AND we have bigtext fields for each of the values above. D&J decided this on March 21
- Picture -- exp_picture -- multiontology on Picture. We will remove this tag: Picture objects will be created in Picture OA and XREF to Expr_pattern. They will not be entered here. Removed from OA -- J. We removed Pictures from Expr_pattern as they are XREF'd to it.
- Picture flag -- exp_pictureflag -- toggle notify picture person with a cronjob every 2 weeks. We keep this even if we remove the Picture tag (not currently used)
- Antibody_info -- Antibody_info "exp_antibody" -- multiontology on antibodies
- Antibody flag -- exp_antibodyflag -- toggle -> notify antibody person with a cronjob every 2 weeks
- Pattern -- Pattern "exp_pattern" -- bigtext, details on tissue distribution. Multiline
- Remark -- Remark "exp_remark" -- bigtext, if any comments required. Multiline
- Transgene -- Transgene "exp_transgene" -- multiontology on transgenes.
- Construct -- Construct "exp_construct" -- multiontology on constructs.
- Transgene flag -- exp_transgeneflag -- toggle -> notify transgene person with a cronjob every 2 weeks
- Sequence_feature -- exp_seqfeature -- multiontology on Features (WBsfIDs)
- Curator -- exp_curator -- Multiontology on people
- No dump -- exp_nodump -- Toggle Expr_pattern objects not to dump. If an Expr_pattern object is flagged as no dump, don't dump any data for that pgid, nor any other pgid that corresponds to the Expr_pattern object. (Read all exp_nodump + exp_name into a hash of Expr_patterns to not-dump.)
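The no-dump rule above works per Expr_pattern object, not per pgid: one flagged row suppresses every row that shares its name. A minimal sketch, assuming rows arrive as (pgid, exp_name, nodump flag) tuples; the function name and row layout are illustrative and not taken from the dumper.

```python
def objects_to_dump(rows):
    """Filter out every pgid whose Expr_pattern object is flagged as no dump.

    rows: list of (pgid, expr_name, nodump) tuples.
    """
    # First pass: collect the names of all objects flagged on any of their rows.
    blocked = {name for _, name, nodump in rows if nodump}
    # Second pass: keep only rows whose object was never flagged.
    return [(pgid, name) for pgid, name, _ in rows if name not in blocked]


rows = [
    (1, "Expr1", False),
    (2, "Expr1", True),   # flags the whole Expr1 object, including pgid 1
    (3, "Expr2", False),
]
print(objects_to_dump(rows))
```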
Tab3
- Protein_description -- Protein_description "exp_protein" -- text (30 objects)- cleaned up data for Alliance harmonization, asked to remove tag on Feb 2022. Deleted field from OA on Feb 25 2022
- Clone -- Clone "exp_clone" -- multiontology on clones (341 objects) (when OA is in place discuss with Chris on the clone class). Is there a better place to get clones than http://textpresso-dev.caltech.edu/gsa/worm/known_entities/Clone?
Clone and Strain lists could be taken from spica from /home/citpub/arun/wb_entities/known_entities
All of these don't have any Term Info (nor synonyms). If you need either of those you'd have to query WS for it; Karen probably knows how, she does it for other objects -- J
ok, I don't think I'll need a term info and I need it mainly to parse old data which have a clone attached, so for now it is fine as it is -- D
ok, I'll change the parser to read these.
These lists are kept updated with what is in acedb daily using the following cronjob:
06 23 * * * cd /home/citpub/arun/wb_entities; ./01getEntities.pl
In July 2014 there was a change in Hinxton that affected the clone list. From Juancarlos: Thanks Paul D. I've switched the script to look at clones2.ace.gz instead of clones.ace.gz and the data seems to have read in fine. The script is not in a repo, but I've symlinked it so it shows here http://tazendra.caltech.edu/~postgres/out/geneace/nightly_geneace.pl so it will always be the current version there. The best thing though would probably be to look at the wiki, to see what the script is supposed to do, and I don't know where or if there is a wiki for it. Karen, do we have one ?
- Strain -- Strain "exp_strain" -- multiontology on strains (812 objects). Is there a better place to take the strain list than http://textpresso-dev.caltech.edu/gsa/worm/known_entities/Strain?
Clone and Strain lists could be taken from spica from /home/citpub/arun/wb_entities/known_entities. These lists are kept updated with what is in acedb daily using the following cronjob: 06 23 * * * cd /home/citpub/arun/wb_entities; ./01getEntities.pl
- Sequence -- Sequence "exp_sequence" -- text (13 objects) F54E2 2x (clone), R05D8 2x (clone), Y38B5A (clone), "Z28375" -C "EMBL Z28375" (sequence), "Z28376" -C "EMBL Z28376" (sequence), "Z28377" -C "EMBL Z28377" (sequence), R11H6 (clone), Y40H4A (clone), U14525, C47G2 (clone), Z32673 (sequence). Obsoleted February 2021
- MovieURL -- MovieURL "exp_movieurl" -- (32 objects) text Deleted field from OA on Feb 25 2022
- Laboratory -- Laboratory "exp_laboratory" -- ontology (17 objects) cleaned up data for Alliance harmonization, asked to remove tag on Feb 2022 Deleted field from OA on Feb 25 2022
- Variation -- Variation "exp_variation". Multiontology on variations
Tab4
- Micropublication exp_micropublication toggle
- removed from OA on Feb 8 2021, the fields below were initially put in for micropubs but are no longer needed:
- Contact- ontology on persons -exp_contact
- e-mail exp_email
- Co-authors exp_coaut
- Funding exp_funding
Protein field addition to OA
Since we will have expression data dumped into protein-to-GO, we need a protein field to capture the right protein isoform whenever authors specify isoform-specific subcellular localization.
On Oct 23rd 2017 Juancarlos moved data that was present in the 'exp_protein' postgres table to a new table called 'exp_proteindesc'. He then used the protein table to store the Protein IDs. He added a new field on tab 3 of Expression OA called exp_protein, that autocompletes on protein names. It is a multi ontology field - it could be that 2 isoforms share the same pattern/reagents. The field is dumped as Protein "WP:CE06704"
cleaned up data for Alliance harmonization, asked to remove tag on Feb 2022
Microarray_results
The field Microarray_results has been added to the model for WS247. This will allow mapping to gene for other species (remanei, briggsae, japonica) coming from the Yanai study. Hinxton is mapping Microarray_results to Gene on the fly.
Make sure in the future not to add Microarray_result to C elegans expression objects to avoid overwriting any curated Gene references- DR 01-07-2015
Curated_by
- Curated_by -- exp_curatedby -- text (6228 objects) This is a legacy thing, the values are only Hinxton and Caltech.
In July 2014 we discussed removing the Curated_by data. We generated a -D file for Citace Minus and deposited it on CitaceMinus for the WS245 upload: /Users/danielaraciti/Desktop/Expression_pattern curation/Curated_by/Curated_by.ace.edited. In the Expression OA all the objects that had a Curated_by HX tag are now assigned to Sylvia Martinelli, who historically set the Curated_by tag in the model. The list of pgids that were changed is here: /Users/danielaraciti/Desktop/Expression_pattern curation/Curated_by/Curated_by_hinxton.rtf
- Curated_by is currently used just by chronograms
obsolete fields
February 2021: Request a model change to get rid of the following tags, more info below:
- Cell -removed
- Cell_group -removed
- Author -- exp_author - removed
- Date -- exp_date (2617 objects) - removed
- Protein_description - asked to remove Feb 2022. Deleted field from OA on Feb 25 2022
- Expressed_in - text 1 entry. No info attached to this term. Left out DR 06062011 - removed
- Protein - text 1 entry could be put in Protein_description. Expr1941 done DR 06062011 -asked to remove Feb 2022 Deleted field from OA on Feb 25 2022
- Pseudogene - text (1 object) Expr111 done DR 06062011 - asked to remove Feb 2022
- CDS
- Sequence
- Cell -- exp_cell -- text (26 objects)-> Consolidate these objects with the Anatomy_term field done DR 06062011:
Expr_pattern : "Expr7477"
Cell "P3.p" Certain
Cell "P4.p" Certain
Cell "P5.p" Certain
Cell "P6.p" Certain
Cell "P7.p" Certain
Cell "P8.p" Certain
done DR 06062011

Expr_pattern : "Expr7595"
Cell "CANL" Uncertain
Cell "CANR" Uncertain
done DR 06062011

Expr_pattern : "Expr7605"
Cell "M4" Certain
done DR 06062011

Expr_pattern : "Expr7632"
Cell "AVG" Certain
Cell "M5" Certain
Cell "PVT" Certain
Cell "PVCL" Uncertain
Cell "PVCR" Uncertain
Cell "PVNL" Uncertain
Cell "PVNR" Uncertain
Cell "PVQL" Uncertain
Cell "PVQR" Uncertain
done DR 06062011

Expr_pattern : "Expr7691"
Cell "P3.p" Certain
Cell "P4.p" Certain
Cell "P5.p" Certain
Cell "P6.p" Certain
Cell "P7.p" Certain
Cell "P8.p" Certain
done DR 06062011

Expr_pattern : "Expr8715"
Cell "M.dlpa" Certain
Cell "M.drpa" Certain
done DR 06062011
- Authors and Date: Data stored in ?Author is legacy data. The file was on Citace Minus. Wen created a -D file to delete such data for WS281. We decided on 02.11.2021 during the WB meeting that we will import the names of authors present in the ?Author tag that are not listed as coauthors of the publication attached to the Expr_pattern object. For those, we will also capture the date. This information was added to the remark field for the following objects on 02.12.2021:
WBPaper00005281: Expr26, Expr92, Expr94, Expr107, Expr112
WBPaper00001469: Expr55
WBPaper00002318: Expr133, Expr134, Expr135, Expr136, Expr137, Expr27, Expr60, Expr61, Expr62, Expr63, Expr64, Expr65, Expr66, Expr67, Expr68
WBPaper00001752: Expr46, Expr47, Expr48, Expr49, Expr50, Expr51, Expr52
WBPaper00001358: Expr56
WBPaper00002049: Expr59
WBPaper00001456: Expr86
WBPaper00002551: Expr87
To see a list of Expr_objects that included Author and Date Data, check the WS279.ace file on the WB FTP site.
- To DO still: upload -D file on Spica (done), request a model change (done) dump the author data (todo)
- Protein_description 33 objects -> Decision: Move to remarks -> Done DR 2021/02/26. Redundant info such as CPL-1 in Protein_description and CPL-1 in gene name were omitted.
Example:
Expr_pattern : "Expr450"
Gene "WBGene00000776"
Protein_description "CPL-1"

Expr_pattern : "Expr552"
Gene "WBGene00006528"
Protein_description "Tubulin alpha"
- Sequence 12 objects -> Decision: Move to remarks. Done DR 2021/03/15
Example: Expr_pattern : "Expr12" Gene "WBGene00003976" Sequence "Z28377|Z28375|Z28376"
- Laboratory 23 objects -> can infer via publication -> Decision: good to ignore. Cleaned up OA. Deleted field from OA on Feb 25 2022
Example: Expr_pattern : "Expr87" … Laboratory "ML" Gene "WBGene00003012"
Tags used only once that should be fixed
- Homol_homol tag is used in Chronograms -> we will not include Chronograms in the OA.
Comments for Parsing ExprCitace226 into OA
Parsing files in /home/postgres/work/pgpopulation/exp_exprpattern Many entries for Anatomy_term don't have one of the Certain/Partial/Uncertain. We leave them without the qualifier.
Chronogram tags
Right_priority Localizome Show_up_strand GFF_source Width Picture Reporter_gene Reference Gene Allow_misalign GFF_feature Transgene Homol_homol Remark Strain Colour Curated_by
the script to get the tags (e.g. from ExprWS221.ace or from Chronograms.ace) was written by Yuling, is called get_tags.pl and is located under desktop/Varia_protocols/get_tags
We will not include Chronograms in Expr_OA anyway as they are one time large scale exp. We have 2084 chronograms
to fix manually
* INVALID DATA antibody [WBPaper00032450]:capg-1 Expr8708
* INVALID DATA antibody [cgc3002]:beta-filagenin Expr1442
* INVALID DATA antibody [cgc4387]:hsp-16.2 Expr1117
* INVALID DATA antibody [cgc6057]:daf-21 Expr2687
- INVALID DATA goid GO:0000141 Expr3919 Done DR06062011
- INVALID DATA goid GO:0008221 Expr7871 Done DR06062011
- INVALID DATA transgene Is001 Expr2646 Done DR06062011
- INVALID DATA transgene Is007 Expr2646 Done DR06062011
- INVALID DATA transgene leals30 Expr9151 Done DR06062011
- INVALID DATA transgene pZMI.1In1 Expr725 Done DR06062011
- INVALID DATA transgene pZMI.1In2 Expr725 Done DR06062011
Need to correct the expression pattern transgene name
- Is001 -> WBPaper00006024_Is001 for Expr2646 WBPaper00006024 Done DR06062011
- Is007 -> WBPaper00006024_Is007 for Expr2646 WBPaper00006024 Done DR06062011
- pZMI.1In1 -> WBPaper00002501_In1 for Expr725 WBPaper00002501 Done DR06062011
- pZMI.1In2 -> WBPaper00002501_In2 for Expr725 WBPaper00002501 Done DR06062011
- Add leals30 Expr9151 WBPaper00037728 Done DR06062011
Need to correct the expression pattern GO name
- GO:0000141 is now GO:0032432 Done DR06062011
- GO:0008221 is now GO:0016529 Done DR06062011
There was a list of Anatomy term objects with invalid IDs. this is the mapping for the new ids:
- Old ID New ID
- WBbt000:6748 WBbt:0006748
- WBbt:0003852 WBbt:0003851
- WBbt:0004397 WBbt:0008116
- WBbt:0004398 WBbt:0008111
- WBbt:0004401 WBbt:0004392
- WBbt:0004459 WBbt:0003664
- WBbt:0004514 WBbt:0008052
- WBbt:0004515 WBbt:0008050
- WBbt:0004717 WBbt:0008046
- WBbt:0004718 WBbt:0008051
- WBbt:0004719 WBbt:0008049
- WBbt:0004720 WBbt:0008047
- WBbt:0004721 WBbt:0008045
- WBbt:0004722 WBbt:0008044
- WBbt:0005099 WBbt:0005830
- WBbt:0005211 WBbt:0005801
- WBbt:0005228 WBbt:0005214
- WBbt:0005323 WBbt:0005831
- WBbt:0005814 WBbt:0006909
- WBbt:6789 WBbt:0006789
all OK
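Corrections like the GO and anatomy ID replacements listed in this section amount to a lookup table applied to each annotation. The sketch below shows the idea with a few of the pairs from the lists above; `remap_ids` and the two dictionaries are illustrative names, and only a subset of the mappings is included.

```python
# A few of the old -> new pairs listed on this page (not the full table).
ANATOMY_ID_MAP = {
    "WBbt000:6748": "WBbt:0006748",
    "WBbt:0003852": "WBbt:0003851",
    "WBbt:0004397": "WBbt:0008116",
    "WBbt:6789": "WBbt:0006789",
}
GO_ID_MAP = {
    "GO:0000141": "GO:0032432",
    "GO:0008221": "GO:0016529",
}


def remap_ids(ids, mapping):
    """Replace obsolete IDs with their new IDs; leave every other ID untouched."""
    return [mapping.get(identifier, identifier) for identifier in ids]


print(remap_ids(["WBbt:6789", "WBbt:0004575"], ANATOMY_ID_MAP))
print(remap_ids(["GO:0000141"], GO_ID_MAP))
```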
Clean up of objects that did not have a reference nor Author
February 2021
- Clean up of expr_pattern data that did not have a reference nor a Person associated. These are empty old pattern objects. They have been deleted from postgres. Can track if necessary by looking at .ace files older than WS280 with a 'Merged to' search.
Expr_pattern : "Expr1996" Remark "Merged to Expr2436."
Importing the large scale Expression_pattern left on Citace Minus into OA
File is on tazendra WS232LargeScaleExpr.ace
-D file for the import generated by Juancarlos /home/postgres/work/pgpopulation/exp_exprpattern/20120502_largescale/DashDWS232LargeScaleExpr.ace
There were only "Date" data, not Curated_by nor Author. We kept the "Date" values on Citace Minus as we did for the previous import. The other field we ignored was pictures, but we did not keep them in Citace Minus as we get them via Picture OA.
-D file deposited in CitaceMinus Data_for_Citace_minus/Data_from_Daniela on May 9th 2012
Serial numbers for large scale imports
Itai Yanai WBPaper00041190 C elegans (Expr starting with 101 and 102)
- Expression Expr1010178 to Expr1029229
- Picture WBPicture0001011201 to WBPicture0001030252
David Miller Wormviz (Expr starting with 103 and 104)
- Expression Expr1030000 to Expr1040545
- No pictures associated to the study
Itai Yanai WBPaper00041190 Other species (briggsae, japonica, remanei. They never transferred brenneri) (Expr starting with 105 till 111)
- Expression Expr1050000 to Expr1118096
- Picture WBPicture0001030253 to WBPicture0001098349
- Gap in numbering expression objects and pictures (Expr1118097 till Expr1142791, WBPicture0001098350 till WBPicture0001123044) to leave the slot for the missing brenneri data, in case they will submit
Itai Yanai WBPaper00046121
- Expression Expr1142792 to Expr1163308
- Picture WBPicture0001123045 to WBPicture0001143561
TransgeneOme project WBPaper00041419
- Reserved Expression Expr1200000 to Expr1300000
- Reserved Picture WBPicture0002000000 to WBPicture0003000000
- the first expression object generated will always be Expr1200000, and Picture WBPicture0002000000
- The system is dynamic, we will always have a different number of objects every release, according to what new they add in TransgeneOme DB
Endrov Hench Paper WBPaper00046864
- Expression from Expr1170000 till Expr1170087
- Pictures from WBPicture0001150000 till WBPicture0001150087
- Movies WBMovie0000100000 till WBMovie0000100087
Waterston paper (Paker 2019)
- Expression from Expr2000000 to Expr2036352
- Pictures from WBPicture0001160000 till WBPicture0001196352
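Because each import reserves an inclusive serial range, the number of objects in a range follows directly from its end points, which gives a quick consistency check against the object counts quoted elsewhere on this page. A minimal sketch; `serial_count` is a hypothetical helper, not an existing script.

```python
import re


def serial_count(first, last):
    """Number of objects in an inclusive serial range, e.g. Expr1010178..Expr1029229."""
    prefix_a, num_a = re.fullmatch(r"(\D+)(\d+)", first).groups()
    prefix_b, num_b = re.fullmatch(r"(\D+)(\d+)", last).groups()
    if prefix_a != prefix_b:
        raise ValueError("both ends of a range must share the same prefix")
    return int(num_b) - int(num_a) + 1


# Yanai elegans: expression and picture ranges should have the same size.
print(serial_count("Expr1010178", "Expr1029229"))
print(serial_count("WBPicture0001011201", "WBPicture0001030252"))
# Yanai other species.
print(serial_count("Expr1050000", "Expr1118096"))
```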
Itai Yanai large scale import -WBPaper00041190
In order to display pictures of expression time courses we needed to generate expression objects. The objects (Expression and Picture) will be deleted once Wen finishes curating microarray data for all species described in the paper and once we have in place a way to generate images of expression on the fly, with data retrieved directly from SPELL.
For now Daniela and Juancarlos have generated 2 .ace files, one for pictures and one for expression. the files are on CitaceMinus. The files are called expr_pattern_Yanai.ace and pictures_Yanai.ace
Expression pattern and Picture objects were given high numbers so when the new display system will be in place those could be deleted without affecting anything in OA.
Expression objects go from Expr1010178 to Expr1029229
Picture objects go from WBPicture0001011201 to WBPicture0001030252
there are 19052 objects for each class.
Files are also located here /Users/danielaraciti/Desktop/Citace_upload/Citace Minus Yanai
On December 5th 2014 pictures for the other species (briggsae, remanei and japonica) were added too.
Expr from Expr1050000 on, Pictures from WBPicture0001030253 on
Additional info on Yanai_Instructions_other_species2 on Lario
total number of objects: 20294 briggsae+ 21908 japonica+ 25895 remanei = 68097
- the objects will go in WS247
Hinxton will generate the WBGene name on the fly according to Microarray_results
TOTAL Yanai import elegans + other species: 87149
Itai Yanai 2015 large import -WBPaper00046121
Files here /Users/danielaraciti/Desktop/Expression_pattern curation/Large scale/yanai_2015
- Uploaded 20,517 objects for expression and pictures
- Expr from Expr1142792 till Expr1163308
- Pictures from WBPicture0001123045 till WBPicture0001143561
- NB: there is a gap in the numbering of expression objects (Expr1118097 till Expr1142791). This is because Yanai's lab did not submit brenneri's pictures yet. We inquired a few times but they were never transferred. We left the numbers available for the future. The brenneri.ace files to be transferred to spica once they submit the images are located here (Lario):
- /Users/danielaraciti/Desktop/brenneri
TransgeneOme import
We are going to import expression data (Images, constructs, and annotations) from the TransgeneOme project -Sarov et al., Cell, 2012. WBPaper00041419.
David Miller tiling arrays import -WBPaper00037950
We want to add links to Wormviz for each gene in order to display graphic expression profiling from tiling arrays. We are going to request a model change for Expr_pattern by adding a DB_INFO tag.
DB_INFO ?Database ?Database_field Text
We will also request the inclusion of Microarray and Tiling Array for Type
Type Reporter_gene ?Text
     In_situ Text
     Antibody ?Text
     Northern Text // Wen [krb 030425]
     Western Text // Wen
     RT_PCR Text // Wen
     Localizome ?Text //added by Wen
     Microarray ?Microarray_experiment // Daniela
     Tiling_array ?Analysis// Daniela
This way it will be easier to filter out Yanai's and Miller's data for display in a separate widget, possibly called 'Expression profiling graphs'. Model change requested on 10-09-2013. Daniela and Juancarlos generated a .ace file that was tested and read fine in acedb. More info on how the file was generated here: /home/acedb/draciti/Expr_pattern/Miller_import. Please note that the script generates "Tiling Array" in the Type. The actual tag is Tiling_array; D changed it manually in the .ace file. Since during the model approval process it was suggested to add ?Analysis for the tiling arrays, D modified the file and replaced it on CitaceMinus. The file is also stored here on Lario: /Users/danielaraciti/Desktop/Citace_upload/Citace Minus Miller/WBPaper00037950.ace
What the file looks like
Database : "Wormviz"
Name "Wormviz"
URL "http:\/\/www.vanderbilt.edu\/wormdoc\/wormmap\/Welcome.html"
URL_constructor "http:\/\/jsp.weigelworld.org\/wormviz\/tileviz.jsp?experiment=wormviz&normalization=absolute&probesetcsv=%s"

Expr_pattern : "Expr1030000"
Gene "WBGene00000001"
Pattern "Tiling arrays expression graphs"
Reference "WBPaper00037950"
Tiling_array
DB_INFO "Wormviz" "id" "WBGene00000001"

Expr_pattern : "Expr1030001"
Gene "WBGene00000002"
Pattern "Tiling arrays expression graphs"
Reference "WBPaper00037950"
Tiling_array
DB_INFO "Wormviz" "id" "WBGene00000002"
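Entries of this shape are regular enough to be generated from a gene list and a starting serial number. The sketch below only illustrates that pattern; it is not the script kept under Miller_import, and the function name, argument names and default values are assumptions based on the example shown here.

```python
def wormviz_entries(gene_ids, first_serial=1030000, paper="WBPaper00037950"):
    """Build one Expr_pattern paragraph per gene, numbered from first_serial."""
    entries = []
    for offset, gene in enumerate(gene_ids):
        entries.append("\n".join([
            f'Expr_pattern : "Expr{first_serial + offset}"',
            f'Gene "{gene}"',
            'Pattern "Tiling arrays expression graphs"',
            f'Reference "{paper}"',
            "Tiling_array",
            f'DB_INFO "Wormviz" "id" "{gene}"',
        ]))
    # .ace paragraphs are separated by a blank line.
    return "\n\n".join(entries) + "\n"


print(wormviz_entries(["WBGene00000001", "WBGene00000002"]))
```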
Object names from Expr1030000 to Expr1040545. 10,546 objects
Hench large scale import -Endrov
WBPaper00046864
Files sent to Juancarlos to create the .ace file, on tazendra: /home/acedb/draciti/Hench
To rerun the script, in the draciti dir: /home/azurebrd/work/parsings/daniela/20210922_hench/hench_ls_set.pl
.ace file generated here: on tazendra /home/azurebrd/work/parsings/daniela/20210922_hench/
Expr from Expr1170000 till Expr1170087
Pictures from WBPicture0001150000 till WBPicture0001150087
Movies WBMovie0000100000 till WBMovie0000100087
Paker 2019
single cell RNA seq embryonic data Files sent to Juancarlos to generate the .ace file: On tazendra at /home/acedb/draciti/Paker2019
.ace file generated here /home/azurebrd/work/parsings/daniela/20220418_waterston/paker.ace
Expression from Expr2000000 to Expr2036352 Pictures from WBPicture0001160000 till WBPicture0001196352
- May 2023: discovered the image files had wrong mappings between gene name public ID and WBGeneID
- generated a file 'Rename' that has the list of the correct mappings. On tazendra at /home/azurebrd/work/parsings/Daniela/20220418_waterston.
- renamed the image files on Canopus: /home/daniela/OICR/Pictures/WBPerson1562
- generated a -D file for Citace Minus:
Picture : WBPicture0001171211
-D Name "egl-1_WBGene00001186_embryo_terminal.jpg"
egl-1_WBGene00001170_embryo_terminal.jpg
Expression
Expr_pattern : Expr2011211
-D Gene "WBGene00001186"
-D Reflects_endogenous_expression_of "WBGene00001186"
Gene "WBGene00001170"
Reflects_endogenous_expression_of "WBGene00001170"
- Tested on spica, all good
- Patch the file for WS289
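The Expr_pattern part of such a patch follows a fixed shape: delete the wrong gene from both tags, then add the right one. A minimal sketch of building that paragraph from one row of a corrections list; `gene_fix_patch` is a hypothetical helper and the layout simply mirrors the example above.

```python
def gene_fix_patch(expr_name, wrong_gene, right_gene):
    """Return a .ace paragraph that swaps the gene attached to one Expr_pattern."""
    return "\n".join([
        f"Expr_pattern : {expr_name}",
        f'-D Gene "{wrong_gene}"',
        f'-D Reflects_endogenous_expression_of "{wrong_gene}"',
        f'Gene "{right_gene}"',
        f'Reflects_endogenous_expression_of "{right_gene}"',
    ]) + "\n"


print(gene_fix_patch("Expr2011211", "WBGene00001186", "WBGene00001170"))
```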
Reilly 2020 - WBPaper00060123
The large scale dataset from Reilly, 2020, WBPaper00060123 was imported into OA via script. Files here: /Users/draciti/Desktop/Reilly/Reilly_202109_for_Juancarlos
Script to parse the supplemental tables- tazendra /home/postgres/work/pgpopulation/exp_exprpattern/20210915_reilly_set/to_populate
starting at pgid 19442 Expr15560-Expr15660
EPIC detailed
Cell/time specific expression data have been generated by Wen /Users/danielaraciti/Desktop/Expression_pattern curation/Large scale/Murray/epic.ace
the folder contains also the digitized sulston tree, the files that John Murray sent with positive/negative calls and the lifestage.ace containing all the new life stages
the EPIC.ace file was uploaded on CitaceMinus for WS246
Deleting files from Citace Minus
After parsing the WS226 data into OA we dumped a .ace file for generating a -D file to delete objects from Citace Minus. To the file were added manually all the invalid objects found while parsing the data (e.g. old anatomy term IDs, old GO terms, invalid transgenes and antibody objects) See list in Data to fix manually in this wiki.
Expression-paper association
For papers curated:
find Expr_pattern; follow Reference
For genes related:
find Expr_pattern; follow Gene
Dumper
The Sequence field does not dump correctly, e.g. Expr_pattern : "Expr980" Sequence "R05D8|F54E2". Need to fix it. Fixed 06162011
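Fields that store several values in one pipe-separated string (Sequence here, and the multiline text fields) need one .ace line per value. The sketch below shows that splitting step in Python; the real dumper is a Perl module, and `dump_pipe_field` is an illustrative name only.

```python
def dump_pipe_field(tag, value):
    """Split a pipe-separated postgres value into one .ace line per entry."""
    parts = [part.strip() for part in value.split("|")]
    # Empty fragments (e.g. from a trailing pipe) are skipped.
    return [f'{tag} "{part}"' for part in parts if part]


for line in dump_pipe_field("Sequence", "R05D8|F54E2"):
    print(line)
```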
Module located here: /home/postgres/work/citace_upload/expr_pattern/get_expr_pattern_ace.pm
Script that calls the module located here: /home/postgres/work/citace_upload/expr_pattern/use_package.pl*
use lib qw( /home/postgres/work/citace_upload/expr_pattern ); # this command line tells where to look for the module
use get_expr_pattern_ace; # tells to use the module

my $outfile = 'expr_pattern.ace';
my $errfile = 'err.out'; # we did not set any rule for errors yet

open (OUT, ">$outfile") or die "Cannot create $outfile : $!\n";
open (ERR, ">$errfile") or die "Cannot create $errfile : $!\n";

my ($all_entry, $err_text) = &getExprPattern('all'); # uses the module to get all the Expr_pattern objects

print OUT "$all_entry\n"; # prints everything into the output expr_pattern file
if ($err_text) { print ERR "$err_text\n"; } # prints error into the output error file

close (OUT) or die "Cannot close $outfile : $!";
close (ERR) or die "Cannot close $errfile : $!";
Module
package get_expr_pattern_ace; # name of the package
require Exporter; # exports so that other perl scripts can use it
our @ISA = qw(Exporter);
our @EXPORT = qw( getExprPattern ); # we are only exporting the getExprPattern subroutine
our $VERSION = 1.00;
use strict;
use diagnostics;
use DBI;
my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n"; # connect to postgres and the testDB database
my $result;
my %theHash; # where all the data are going to be stored
my @tables = qw( name paper gene endogenous anatomy qualifier qualifiertext qualifierls goid subcellloc lifestage exprtype antibodytext reportergene insitu rtpcr northern western antibody pattern remark transgene construct curator nodump protein clone strain seqfeature sequence movieurl laboratory variation species ); # all the tables that have data
my @maintables = qw( paper gene anatomy goid subcellloc lifestage qualifierls exprtype antibodytext reportergene insitu rtpcr northern western antibody pattern remark transgene construct protein clone strain seqfeature sequence movieurl laboratory variation species ); # tables that have .ace tags
my $all_entry = ''; # where all the .ace data is going to go
my $err_text = ''; # where all the error data is going to go
my %nameToIDs; # maps the expr_object id to PGID # type -> name -> ids -> count
my %ids; # list of PGIDs
my %pipeSplit; # tables that need to split on pipes
$pipeSplit{subcellloc}++;
$pipeSplit{antibodytext}++;
$pipeSplit{reportergene}++;
$pipeSplit{insitu}++;
$pipeSplit{rtpcr}++;
$pipeSplit{northern}++;
$pipeSplit{western}++;
$pipeSplit{pattern}++;
$pipeSplit{remark}++;
$pipeSplit{sequence}++;
my %tableToTag; # mapping table to the .ace tag
$tableToTag{paper} = 'Reference';
$tableToTag{gene} = 'Gene';
$tableToTag{anatomy} = 'Anatomy_term';
$tableToTag{qualifierls} = 'Life_stage';
$tableToTag{goid} = 'GO_term';
$tableToTag{subcellloc} = 'Subcellular_localization';
$tableToTag{lifestage} = 'Life_stage';
$tableToTag{exprtype} = 'Special';
$tableToTag{antibodytext} = 'Antibody';
$tableToTag{reportergene} = 'Reporter_gene';
$tableToTag{insitu} = 'In_situ';
$tableToTag{rtpcr} = 'RT_PCR';
$tableToTag{northern} = 'Northern';
$tableToTag{western} = 'Western';
$tableToTag{antibody} = 'Antibody_info';
$tableToTag{pattern} = 'Pattern';
$tableToTag{remark} = 'Remark';
$tableToTag{transgene} = 'Transgene';
$tableToTag{protein} = 'Protein_description';
$tableToTag{clone} = 'Clone';
$tableToTag{strain} = 'Strain';
$tableToTag{sequence} = 'Sequence';
$tableToTag{movieurl} = 'MovieURL';
$tableToTag{laboratory} = 'Laboratory';
my %ontologyIdToName; # mappings for ids to names (only for life stage)
1;
sub getExprPattern {
  my ($flag) = shift; # can be 'all' or the name of an expr_id
  &populateOntIdToName(); # call the subroutine at the bottom of the page for life stage name mapping
  if ( $flag eq 'all' ) {
    $result = $dbh->prepare( "SELECT * FROM exp_name ; " ); } # get all entries for type
  else {
    $result = $dbh->prepare( "SELECT * FROM exp_name WHERE exp_name = '$flag' ;" ); } # get all entries for type of object name
  $result->execute(); # execute the query
  while (my @row = $result->fetchrow) { # it's going to do the following for every row of the query
    $theHash{object}{$row[0]} = $row[1]; # it's going to map the PGIDs to the Expr_ID
    $nameToIDs{object}{$row[1]}{$row[0]}++; # for every Expr_ID we will get all the corresponding PGIDs
    $ids{$row[0]}++; } # list of all the PGIDs
  my $ids = ''; my $qualifier = ''; # if it looks for a specific subset of Expr_pattern it searches only for that subset of PGIDs from the %ids
  if ($flag ne 'all') { $ids = join"','", sort keys %ids; $qualifier = "WHERE joinkey IN ('$ids')"; }
  foreach my $table (@tables) {
    $result = $dbh->prepare( "SELECT * FROM exp_$table $qualifier;" ); # get data for table with qualifier (or not if not)
    $result->execute();
    while (my @row = $result->fetchrow) { $theHash{$table}{$row[0]} = $row[1]; } # loops for all the values and store them in the hash
  } # foreach my $table (@tables)
# anatomy branch with life stage qualifiers (qualifierls), 2015 02 03:
        my %e1 = &getData($table, $joinkey);
        my %e2 = &getData('qualifier', $joinkey);
        my %e3 = &getData('qualifiertext', $joinkey);
        my %e4 = &getData('qualifierls', $joinkey);
        my $l2_exists = 0; my $l3_exists = 0; my $l4_exists = 0;
        foreach my $e1 (sort keys %e1) {
          foreach my $e4 (sort keys %e4) { # dump anatomy to qualifierls in both directions for every crossproduct, if there is an anatomy. 2015 02 03
            $l4_exists++;
            $cur_entry{"$tag\t\"$e1\" Life_stage \"$e4\"\n"}++;
            $cur_entry{"Life_stage\t\"$e4\" Anatomy_term \"$e1\"\n"}++; }
          foreach my $e2 (sort keys %e2) {
            foreach my $e3 (sort keys %e3) { $cur_entry{"$tag\t\"$e1\" $e2 \"$e3\"\n"}++; $l3_exists++; }
            unless ($l3_exists) { $cur_entry{"$tag\t\"$e1\" $e2\n"}++; $l2_exists++; } }
          unless ( ($l2_exists) || ($l3_exists) || ($l4_exists) ) { $cur_entry{"$tag\t\"$e1\"\n"}++; } } }
      elsif ($table eq 'qualifierls') { # micropub data could have qualifierls without anatomy
        my %e1 = &getData($table, $joinkey);
        my %e2 = &getData('anatomy', $joinkey);
        if (scalar keys %e2 < 1) { # if there is no anatomy data, dump each qualifierls (if there was anatomy it would have dumped above under the anatomy section)
          foreach my $e1 (sort keys %e1) { $cur_entry{"$tag\t\"$e1\"\n"}++; } } }
  foreach my $name (sort keys %{ $nameToIDs{object} }) { # loops through all the names that are in $nameToIDs{object}
    my $entry = ''; my $has_data; # $entry has the .ace data for that expr_object. $has_data is a flag for objects that have data
    $entry .= "\nExpr_pattern : \"$name\"\n"; # add to the .ace entry the header Expr_pattern : "Expr1234"
    my %cur_entry; # a hash for filtering things (duplicated objects for qualifier -partial certain uncertain-); it excludes the duplicated rows that are overlapping
    foreach my $joinkey (sort {$a<=>$b} keys %{ $nameToIDs{object}{$name} }) { # it loops through all the PGIDs for the current name
      next if ($theHash{nodump}{$joinkey}); # skips if the pgid has a NO DUMP flag
      foreach my $table (@maintables) { # it loops through the main tables (the ones with the .ace tag)
        next unless ($tableToTag{$table}); # it skips it if there's no tag
        my $tag = $tableToTag{$table}; # gets the tag
        if ($table eq 'anatomy') { # in case of anatomy it does the following
          my %e1 = &getData($table, $joinkey); # gets the anatomy term list (based on the PGIDs)
          my %e2 = &getData('qualifier', $joinkey); # gets the qualifier for the previous anatomy term list (based on the PGIDs)
          my %e3 = &getData('qualifiertext', $joinkey); # gets the qualifier text for the previous anatomy term list (based on the PGIDs)
          my $l2_exists = 0; my $l3_exists = 0; # by default there is no qualifier and no qualifier text
          foreach my $e1 (sort keys %e1) { # loops through all anatomy
            foreach my $e2 (sort keys %e2) { # loops through all qualifier
              foreach my $e3 (sort keys %e3) { # loops through all qualifier text
                $cur_entry{"$tag\t\"$e1\" $e2 \"$e3\"\n"}++; $l3_exists++; } # if it finds the qualifier and the qualifier text it adds the line to the filter for later printing and makes a note that it found a qualifier text
              unless ($l3_exists) { # if there is no qualifier text
                $cur_entry{"$tag\t\"$e1\" $e2\n"}++; $l2_exists++; } } # and it finds the qualifier, it adds the line to the filter for later printing and makes a note that it found a qualifier
            unless ( ($l2_exists) || ($l3_exists) ) { # if there is no qualifier nor qualifier text
              $cur_entry{"$tag\t\"$e1\"\n"}++; } } } # then it adds to the filter just the Anatomy tag and data (e.g. Anatomy_term^t "WBbt:1234567")
        elsif ($table eq 'exprtype') { # it checks for expr_type
          my %entries = &getData($table, $joinkey); # gets data for expr_type and PGID
          foreach my $entry (sort keys %entries) { $cur_entry{"$entry\n"}++; } } # for each data it adds the data to the filter but does not add the .ace tag
        else {
          my %entries = &getData($table, $joinkey); # gets data for every PGID and every other table that has a tag
          foreach my $entry (sort keys %entries) { $cur_entry{"$tag\t\"$entry\"\n"}++; } } # for each data it adds the data and the .ace tag to the filter
      } } # foreach my $joinkey (sort {$a<=>$b} keys %{ $nameToIDs{$type}{$name} })
    foreach my $line (sort keys %cur_entry) { $entry .= $line; $has_data++; } # for each line in the filter it adds it to the .ace entry and flags that it has data
    if ($has_data) { $all_entry .= $entry; } # if it has data it adds this entry to all the entries
  } # foreach my $name (sort keys %{ $nameToIDs{$type} })
  return( $all_entry, $err_text ); # it returns all the results to the script that uses the package
} # sub getExprPattern
sub getData { # get hash of values in this table
  my ($table, $joinkey) = @_; # gets the table and the PGID
  my %entries; # it stores all the data for this table and PGID
  if ($theHash{$table}{$joinkey}) { # if it has data
    my $data = $theHash{$table}{$joinkey}; # it gets the data
    unless ($table eq 'remark') { # it strips a leading and a trailing " everywhere but not in the remark field
      if ($data =~ m/^\"/) { $data =~ s/^\"//; }
      if ($data =~ m/\"$/) { $data =~ s/\"$//; } }
    if ($data =~ m/\//) { $data =~ s/\//\\\//g; } # it escapes / with \
    if ($data =~ m/\r/) { $data =~ s/\r//g; } # it strips the ^M (carriage return) characters
    if ($data =~ m/\n/) { $data =~ s/\n/  /g; } # it replaces line breaks with 2 spaces
    if ($data =~ m/^\s+/) { $data =~ s/^\s+//g; }
    if ($data =~ m/\s+$/) { $data =~ s/\s+$//g; } # if it begins or ends with a space it gets rid of those
    my @data; # this is an array for storing multiple data
    if ($data =~ m/\",\"/) { @data = split/\",\"/, $data; } # if the data is a multiontology or multidropdown, it splits on the ","
    elsif ($pipeSplit{$table}) { @data = split/\|/, $data; } # otherwise if it is in the list of the pipe split tables it splits on the pipe
    else { push @data, $data; } # if it is neither of those it treats the data as the only entry
    foreach my $value (@data) { # for each of those multiple values
      if ($value =~ m/\"/) { $value =~ s/\"/\\\"/g; } # if there is a " it adds a backslash to neutralize it for acedb
      if ($value =~ m/^\s+/) { $value =~ s/^\s+//g; } # if the data begins with a space get rid of the space
      if ($value =~ m/\s+$/) { $value =~ s/\s+$//g; } # if the data ends with a space get rid of the space
      if ($table eq 'lifestage') { # convert life stage ids to lifestage names. 2011 05 13
        if ($ontologyIdToName{$table}{$value}) { $value = $ontologyIdToName{$table}{$value}; } } # if it's a life stage and there's an ID to name mapping for this value it uses the name instead of the ID
      if ($value) { $entries{$value}++; } # if after all of the above there is a value it adds it to a filter of values
    } }
  return %entries; # it returns all the data that it got for this table and PGID
} # sub getData
sub populateOntIdToName { # reads from obo_name_lifestage to get the mappings from life_stage id to name
  $result = $dbh->prepare( "SELECT * FROM obo_name_lifestage;" );
  $result->execute();
  while (my @row = $result->fetchrow) { $ontologyIdToName{'lifestage'}{$row[0]} = $row[1]; }
} # sub populateOntIdToName
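The value clean-up done in getData can be tried on its own, without a database. The following is a minimal sketch, assuming a hypothetical helper split_field (not part of the dumper) that receives the raw field text and a flag saying whether the table is listed in %pipeSplit. It leaves out the remark exception and the life stage ID-to-name conversion, and the example values are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: split one raw OA field into a list of cleaned values.
# $pipe_split is true for tables that are listed in %pipeSplit.
sub split_field {
    my ($data, $pipe_split) = @_;
    return () unless defined $data && length $data;
    $data =~ s/^"//;                       # strip the leading double quote
    $data =~ s/"$//;                       # strip the trailing double quote
    my @parts = $data =~ /","/ ? split(/","/, $data)   # multiontology / multidropdown
              : $pipe_split    ? split(/\|/, $data)    # free text split on pipes
              :                  ($data);              # single value
    my (%seen, @values);
    for my $value (@parts) {
        $value =~ s/^\s+//;                # trim leading whitespace
        $value =~ s/\s+$//;                # trim trailing whitespace
        $value =~ s/"/\\"/g;               # escape inner double quotes for acedb
        next unless length $value;
        push @values, $value unless $seen{$value}++;   # drop duplicates
    }
    return @values;
}

my @anatomy = split_field('"WBbt:0003681","WBbt:0005772"', 0);
my @pattern = split_field('pharynx | intestine ', 1);
print join(', ', @anatomy), "\n";
print join(', ', @pattern), "\n";
```

Unlike getData, which returns a hash that the caller sorts, this sketch returns the values in their original order.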
An error check for dead genes and invalid papers was added on May 31st, 2012:
elsif ($table eq 'gene') {
  my %entries = &getData($table, $joinkey);
  foreach my $entry (sort keys %entries) {
    if ($deadObjects{gene}{$entry}) { $err_text .= "$name has dead gene $entry $deadObjects{gene}{$entry}\n"; }
    else { $cur_entry{"$tag\t\"$entry\"\n"}++; } } }
elsif ($table eq 'paper') {
  my %entries = &getData($table, $joinkey);
  foreach my $entry (sort keys %entries) {
    if ($deadObjects{paper}{$entry}) { $err_text .= "$name has dead paper $entry $deadObjects{paper}{$entry}\n"; }
    else { $cur_entry{"$tag\t\"$entry\"\n"}++; } } }
sub populateDeadObjects {
  $result = $dbh->prepare( "SELECT * FROM gin_dead;" );
  $result->execute();
  while (my @row = $result->fetchrow) { $deadObjects{gene}{"WBGene$row[0]"} = $row[1]; }
  $result = $dbh->prepare( "SELECT * FROM pap_status WHERE pap_status = 'invalid';" );
  $result->execute();
  while (my @row = $result->fetchrow) { $deadObjects{paper}{"WBPaper$row[0]"} = $row[1]; }
} # sub populateDeadObjects
Historical Gene tag
Handling Dead Genes During Dump Process
The dumper script will now (as of May, 2013) run an automatic check for dead genes in any gene field. Any genes that are considered dead that are referenced in an Interaction object in the OA will be handled in the following manner:
1) If there is a replacement for the gene (i.e. the gene has merged into another gene), the dead gene will be dumped into a "Historical_gene" field in the .ACE file and the replacement gene will fill the original gene field. A comment will be added to the Historical_gene field via the #Evidence hash. The original gene field (now with the updated gene reference) will be printed with an "Inferred_automatically" tag after the gene. So, for example, if WBGene00001234 is now a dead gene that has been merged into WBGene00002345:
Gene "WBGene00001234"
becomes
Gene "WBGene00002345" Inferred_automatically
Historical_gene "WBGene00001234" Remark "Note: This object originally referred to WBGene00001234. WBGene00001234 is now considered dead and has been merged into WBGene00002345. WBGene00002345 has replaced WBGene00001234 accordingly."
Notes:
Gene status mapping:
- Dead -> dead
- Suppressed -> suppressed
- merged_into WBGene -> merged
- split_into -> split
Looping through the genes where something happened to make sure they don't also point at something else.
exp_gene:
- merged -> historical_gene + remark AND gene <gene> Inferred_automatically
- dead -> historical_gene + remark
- suppressed -> historical_gene + remark
- split -> historical_gene + remark AND error message
- normal ones -> just tag + value
Examples:
- A split gene: WBGene00012507
- A merged gene: WBGene00007524
- A dead gene: WBGene00007814
- A suppressed gene: WBGene00015490
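The rules in the notes above can be sketched as a small function. This is only an illustration, assuming a hypothetical %gene_status table and a hypothetical gene_lines helper; in the dumper the status would come from postgres. The remark wording for merged genes follows the example above, while the wording for dead, suppressed and split genes and the text of the error message are assumptions.

```perl
use strict;
use warnings;

# Hypothetical status table: gene => [ status, replacement gene if merged ].
my %gene_status = (
    'WBGene00001234' => [ 'merged', 'WBGene00002345' ],
    'WBGene00007814' => [ 'dead' ],
    'WBGene00012507' => [ 'split' ],
);

# Returns a reference to the .ace lines for one gene, plus an error string
# that is empty unless the gene was split.
sub gene_lines {
    my ($name, $gene) = @_;
    my $info = $gene_status{$gene};
    return ( [ qq{Gene\t"$gene"\n} ], '' ) unless $info;   # normal gene: just tag + value
    my ($status, $replacement) = @$info;
    my (@lines, $remark);
    my $err = '';
    if ($status eq 'merged') {
        # the replacement gene fills the original Gene field
        push @lines, qq{Gene\t"$replacement" Inferred_automatically\n};
        $remark = "Note: This object originally referred to $gene. $gene is now considered dead "
                . "and has been merged into $replacement. $replacement has replaced $gene accordingly.";
    }
    else {
        # assumed wording for dead, suppressed and split genes
        $remark = "Note: This object originally referred to $gene. $gene is now considered $status.";
        $err = "$name has split gene $gene\n" if $status eq 'split';
    }
    push @lines, qq{Historical_gene\t"$gene" Remark "$remark"\n};
    return ( \@lines, $err );
}

my ($ace_lines, $error) = gene_lines('Expr1234', 'WBGene00001234');
print @$ace_lines;
```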
Data parsing
File that was used for parsing is the WS226 dump and is located here: /home/postgres/work/pgpopulation/exp_exprpattern/ExprWS226.ace
There are 1802 objects without any Anatomy_term. I'm assuming this is okay -- J Yes, it is --D
What do we do with Marker objects ? Treat them the same as Expr_pattern objects ? -- J yes, treat the same --D
Life_stage in obo class have WBls:####### IDs, but data has lifestage names, is this bad data ? The OA only supports IDs (see phenotype, generegulation, picture OA) : can we convert the life stage names into WBls:#######? I asked Wen about this and she is fine with it --D Changed the parser to convert from name to ID, but still waiting until we talk to Karen
- WS gene reg obj : http://wormbase.org/db/misc/etree?name=WBPaper00036764_lin-28.b;class=Gene_regulation
- WS expr pat obj : http://wormbase.org/db/misc/etree?name=Expr2201;class=Expr_pattern
/home/postgres/work/pgpopulation/exp_exprpattern/invalid_ontology_values has many other objects that don't fit the ontologies. It would be best to either fix them in citace and redump, or to get mappings of bad-to-good values and put them in the parser. This was run on the sandbox, so if any values are real, the sandbox might not have all the values. -- J I see, there are many objects with invalid format for different classes. I will figure out what the problem was for each of them and get back to you --D.
- 20 Anatomy terms having old ids -> Daniela generated mapping with new IDs.
- 2 invalid objects for GO -> Alerted Ranjana, waiting for answer
- 5 Antibody objects -> alerted Xiaodong, 2 fixed, 3 waiting for Wen's answer (did she create the objects already or should we generate new ones?).
- 37 transgene objects -> alerted Karen
Strain and Clone don't have ontologies yet, once we have those we'll see if any data is bad -- J ok --D
Only looking at WBPictureID pictures, if we need to dump both ways, it will get conversions from the WBPictureID's name. -- J I am not sure I get this..D we talked about it
-D file for Citace Minus
When we tried to parse the -D file into Citace Minus, the following errors occurred:
- Pattern: 2 objects Expr98 did not parse in 2 pattern descriptions. Not in OA
I will add them manually in OA and -D those Done DR 06142011
- Anatomy_term: 14 objects
- 2 in Expr 120 checked OK only extra space at the end
- 1 in Expr 1269 checked OK only extra space at the end
- 1 in Expr 1569 checked OK only extra space at the end
- 8 in Expr2812 checked OK only extra space at the end
- 1 in Expr3211 checked OK only extra space at the end
- 1 in Expr7467 checked OK only extra space at the end
We can delete them from Citace Minus, text is fine in OA. Done DR 06142011
- Antibody_info: 4 objects
they are already on my list. Xiaodong should generate the objects. Will delete them Done DR 06142011 and add them manually in OA when ready. TODO
- Reference: 3 objects
Expr_pattern : "Expr2916" Reference "WBPaper00006518"
Expr_pattern : "Expr2994" Reference "WBPaper00013501"
Expr_pattern : "Expr3715" Reference "WBPaper00025175"
Wen looked into it and these are obsolete IDs. We delete them from Citace Minus Done DR 06152011
- Gene: 30 objects. This happened because the Gene field is an ontology. Some Expr_pattern objects are associated to multiple genes, therefore the data was not parsed in correctly. This is not the only problem: Wen is looking into it, they could be obsolete IDs. Wen checked. They are obsolete, we can -D. DR06162011
- Remark: 1 object Expr111 was not deleted as I added Pseudogene info in the remarks. Deleted from citace minus added into OA OK. Done DR 06142011
- Pseudogene: 1 object
Expr_pattern : "Expr111" Pseudogene "F56D5.8" can fix this manually and delete it from Citace Minus. Done DR 06142011
- Cell 26 objects
Expr_pattern : "Expr7477" Cell "P3.p" Certain Cell "P4.p" Certain Cell "P5.p" Certain Cell "P6.p" Certain Cell "P7.p" Certain Cell "P8.p" Certain
done DR 06142011
Expr_pattern : "Expr7595" Cell "CANL" Uncertain Cell "CANR" Uncertain
done DR 06142011
Expr_pattern : "Expr7605" Cell "M4" Certain
done DR 06142011
Expr_pattern : "Expr7632" Cell "AVG" Certain Cell "M5" Certain Cell "PVT" Certain Cell "PVCL" Uncertain Cell "PVCR" Uncertain Cell "PVNL" Uncertain Cell "PVNR" Uncertain Cell "PVQL" Uncertain Cell "PVQR" Uncertain
done DR 06142011
Expr_pattern : "Expr7691" Cell "P3.p" Certain Cell "P4.p" Certain Cell "P5.p" Certain Cell "P6.p" Certain Cell "P7.p" Certain Cell "P8.p" Certain
done DR 06142011
Expr_pattern : "Expr8715" Cell "M.dlpa" Certain Cell "M.drpa" Certain
done DR 06142011
- Sequence
Expr_pattern : "Expr12" -D Sequence "Z28375" -C "EMBL Z28375" -D Sequence "Z28376" -C "EMBL Z28376" -D Sequence "Z28377" -C "EMBL Z28377"
Expr_pattern : "Expr52" -D Sequence "R11H6" -D Sequence "Y40H4A"
Expr_pattern : "Expr979" -D Sequence "F54E2" -D Sequence "R05D8"
Expr_pattern : "Expr980" -D Sequence "F54E2" -D Sequence "R05D8"
Done Dr 06142011
- Picture. All picture objects were -D DR06152011
Exporting Reporter Gene description from Expr_pattern OA to Transgene OA
IMPORTANT: whenever you curate an Expr object, fill in all the fields before duplicating the object itself. E.g. if you need to put expression in the pharynx as 'certain' and in the intestine as 'uncertain', make sure to generate an object with pharynx 'certain', fill in all the other info (e.g. WBPaper, reporter gene, pattern, ...) and THEN duplicate the object. This is also important for the generation of new transgenes with the script below.
In the past transgene objects were generated only when authors did use standard nomenclature (e.g. adEx1256, acIs101). No new transgene objects were created for reporter fusions when there was no standard nomenclature.
From Jan 2012 we want to start generating transgene objects also for those reporter genes.
Action items:
Import all the transgene objects with no standard nomenclature present in Expression pattern OA into Transgene OA
In order to accomplish that we should
- Generate a name for the objects that have exp_reportergene and no exp_transgene and assign it to the table exp_transgene in Expression pattern OA. The name should be: ExprID_Ex (e.g. Expr1234_Ex)
- For all the ExprID_Ex that were generated in the previous step we should populate postgres tables in transgene OA as follows:
exp_transgene -> trp_name
exp_paper -> trp_paper
435 expr objects don't have papers. transfer those objects ? SELECT * FROM exp_reportergene WHERE joinkey NOT IN (SELECT joinkey FROM exp_transgene) AND joinkey NOT IN (SELECT joinkey FROM exp_paper); -- J
I spot checked them. The ones I have seen are coming from Ian Hope large scale expression that he sent few years back to Wen but we need to check systematically if they all come from him. The objects have an Author field associated but the Author is not in OA. When we created the Expr_pattern OA we decided to keep Author, Date, and Curated_by in a separate file in Citace Minus as they were fields not used anymore (see wiki above for reference). For those objects we should put the author in trp_person. If it's hard to retrieve the author from Citace Minus we could get it from the file "ExprWS221.ace" on Tazendra in /home/acedb/draciti dir. D There are 4307 after filtering duplicates -- J
exp_reportergene -> trp_remark
trp_curator -> Daniela Raciti ( WBPerson12028 ).
trp_nodump (for all Daniela Raciti)
There is no trp_nodump table using trp_objpap_falsepos (Fail field) -- J Karen said is good. D
Attention: we will take from Expr_pattern OA all objects regardless of the curator -both Wen and Daniela- but when populating transgene OA we will populate the curator field just with Daniela.
Expr objects exists in multiple OA rows, so there are multiple pgids per Expr object, so multiple objects get created in the transgene OA. See Expr1416. Is this correct ? I don't know if Transgene objects already have multiple pgids, and whether the dumper handles it. This is also going to make the deletion script more complicated, do all pgids for a given transgene name have to be dumpable for deletion ? -- J right, we have multiple pgids for a single Expr_object but the exp_reportergene is the same for all. We should have in Transgene OA only one Expr object i.e. Expr1416_Ex pgid 9980 and get rid of the duplicates. D
13 non-"Hope IA" authors are on the sandbox at /home/postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/bad_authors let me know the mapping of those authors to WBPerson#### . After the mapping is done, we'll see how many of the 48 Expr objects have neither person nor paper; e.g. Expr1684 doesn't. -- J
Mapping:
- Arnold JM = WBPerson16468
- Bauer PK = WBPerson5125
- Britton C = WBPerson78
- Hashmi S = WBPerson4368
- Herbert R = WBPerson16472
- Krause MW = WBPerson346
- Lustigman S = WBPerson390
- Lynch AS = WBPerson1232
- McCarroll D = WBPerson16469
- Mohler WA = WBPerson428
- Mounsey A = WBPerson1716
- Royall CM = WBPerson16473
- Seydoux GC = WBPerson575
After mapping 3 objects had no paper nor person. All 3 are personal communications. J added them -> OK.
- Expr1684 -> Catherine Wolkow ( WBPerson696 )
- Expr1685 -> Massimo Hilliard ( WBPerson258 )
- Expr2781 -> Aharon Solomon ( WBPerson3909 )
The transgene objects will be revised by Karen (in order to delete duplicates if any).
To run the population script do it from a dir where you have write permission e.g. /home/acedb/draciti/Expr_pattern and then give the command
- ~postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/transfer_expr_to_transgene.pl > outputlog
Populate script is /home/postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/transfer_expr_to_transgene.pl -- J
- Get expr objects that have a reportergene but no transgene : SELECT * FROM exp_reportergene WHERE joinkey NOT IN (SELECT joinkey FROM exp_transgene);
- For each of those get the exp_name and exp_paper.
- Transgene name is ExprName plus _Ex.
- Add to exp_transgene and exp_transgene_hst as multiontology with doublequotes.
- Get highest transgene pgid, and for each new transgene, create a new transgene with that pgid, trp_name the new transgene name, trp_curator WBPerson12028, trp_objpap_falsepos Fail, trp_remark the exp_reportergene.
- If there's a paper, trp_paper is the exp_paper; if there is no paper, look at authors in ExprWS221.ace and map to persons from Daniela's list into trp_person with doublequotes.
- If it's not in the list, tell Daniela to get WBPerson mappings. -- J
Author -> Person: Hope IA = WBPerson266
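The population steps above can be sketched with in-memory hashes standing in for the postgres tables. This is a minimal illustration, not the actual transfer_expr_to_transgene.pl: the helper name build_transgene_rows, the %exp_author hash (standing in for the authors read from ExprWS221.ace), the reporter gene texts, the paper ID and the transgene syIs1 are all made up. It creates one transgene per Expr name, in line with Daniela's comment on duplicates above.

```perl
use strict;
use warnings;

# Illustrative stand-ins for the postgres tables (pgid => value).
my %exp_name         = ( 10 => 'Expr1416', 11 => 'Expr1416', 12 => 'Expr20', 13 => 'Expr30' );
my %exp_reportergene = ( 10 => '[abc-1::gfp]', 11 => '[abc-1::gfp]',
                         12 => '[xyz-2::lacZ]', 13 => '[def-3::gfp]' );
my %exp_transgene    = ( 13 => '"syIs1"' );            # already has a transgene, so it is skipped
my %exp_paper        = ( 10 => 'WBPaper00001111', 11 => 'WBPaper00001111' );
my %exp_author       = ( 12 => 'Hope IA' );             # would be read from ExprWS221.ace
my %author_to_person = ( 'Hope IA' => 'WBPerson266' );

sub build_transgene_rows {
    my ($highest_pgid) = @_;
    my (%done, @rows);
    for my $pgid (sort { $a <=> $b } keys %exp_reportergene) {
        next if $exp_transgene{$pgid};                  # reportergene but no transgene only
        my $trp_name = $exp_name{$pgid} . '_Ex';        # ExprName plus _Ex
        next if $done{$trp_name}++;                     # one transgene per Expr object
        my %row = (
            pgid                => ++$highest_pgid,
            trp_name            => $trp_name,
            trp_curator         => 'WBPerson12028',
            trp_objpap_falsepos => 'Fail',
            trp_remark          => $exp_reportergene{$pgid},
        );
        if ($exp_paper{$pgid}) {
            $row{trp_paper} = $exp_paper{$pgid};
        }
        elsif (my $person = $author_to_person{ $exp_author{$pgid} // '' }) {
            $row{trp_person} = $person;                 # no paper: fall back to the author mapping
        }
        else {
            warn "$trp_name has no paper and no mapped author\n";
        }
        push @rows, \%row;
    }
    return @rows;
}

my @rows = build_transgene_rows(9979);
print "$_->{pgid}\t$_->{trp_name}\n" for @rows;
```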
To run the deletion script do it from a dir where you have write permission e.g. /home/acedb/draciti/Expr_pattern and then give the command
- ~postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/delete_cleared_exp_reportergene.pl > output1
Deletion script is /home/postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/delete_cleared_exp_reportergene.pl -- J
- It looks for transgene names that have not been set as Fail: SELECT * FROM trp_name WHERE trp_name ~ 'Expr.*_Ex' AND joinkey NOT IN (SELECT joinkey FROM trp_objpap_falsepos);
- It does the same for synonyms: "SELECT * FROM trp_synonym WHERE trp_synonym ~ 'Expr.*_Ex' AND joinkey NOT IN (SELECT joinkey FROM trp_objpap_falsepos);"
- It looks at expression transgenes that match Expr.*_Ex : SELECT * FROM exp_transgene WHERE exp_transgene ~ '"Expr.*_Ex"' AND joinkey IN (SELECT joinkey FROM exp_reportergene);
- It gets individual transgenes and if any of them doesn't match Expr.*_Ex it gives an error message.
- If all of them match, it checks that all transgenes are dumpable.
- If all are dumpable, it deletes the exp_reportergene for that pgid and inserts a null into the exp_reportergene_hst for that pgid --- J
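The decision the deletion script takes for one pgid can be sketched as follows. This is a hedged illustration only: reportergene_action and %is_fail are hypothetical names, the transgene names are made up, and the pattern is anchored here (^...$) although the SQL above uses an unanchored match.

```perl
use strict;
use warnings;

# Hypothetical helper. $transgene_field is the raw exp_transgene value for one
# pgid, e.g. '"Expr1_Ex","Expr2_Ex"'; $is_fail maps transgene names that are
# still flagged Fail (not dumpable) to a true value.
sub reportergene_action {
    my ($transgene_field, $is_fail) = @_;
    (my $data = $transgene_field) =~ s/^"|"$//g;              # strip the outer double quotes
    my @names = split /","/, $data;                           # individual transgenes
    return 'keep'  unless @names;
    return 'error' if grep { $_ !~ /^Expr.*_Ex$/ } @names;    # a name does not match Expr.*_Ex
    return 'keep'  if grep { $is_fail->{$_} } @names;         # not all transgenes are dumpable
    return 'delete';                                          # safe to clear exp_reportergene
}

my %is_fail = ( 'Expr20_Ex' => 1 );
for my $field ('"Expr1416_Ex"', '"Expr20_Ex"', '"Expr30_Ex","syIs1"') {
    print "$field => ", reportergene_action($field, \%is_fail), "\n";
}
```

On 'delete' the real script removes the exp_reportergene value and inserts a null into exp_reportergene_hst; the sketch only returns the decision.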
We added a rule that checks for 'Expr.*_Ex' in trp_synonym too, because when the transgene already existed Karen added the Expr name under synonym. We want the reporter gene field for those objects in the Expr OA to be deleted as well, therefore Juancarlos added a rule in the deletion script that looks for Expr.*_Ex in the synonym and, if it finds it, deletes the reporter gene field from the Expr_pattern OA. Line that he added:
"SELECT * FROM trp_synonym WHERE trp_synonym ~ 'Expr.*_Ex' AND joinkey NOT IN (SELECT joinkey FROM trp_objpap_falsepos);"
- There were >1000 objects that had a standard transgene name (eg adIs1783 and the reporter gene field filled with redundant information, e.g. [mpk-1::gfp]) Daniela went manually through them and deleted the redundant info in the reporter gene field in Expr_pattern OA. The info was already in transgene. Whenever the reporter gene field in Expr_pattern had more info (e.g. sequence) Daniela copied that info into the Transgene Remark field (in line with what we have done for objects above). also, whenever it was specified if transcriptional or translational fusion, that info was added to the reporter type in transgene OA. Double checked with Karen 02.16.2012 -> OK.
Whenever the info in the Reporter Gene field was more pertinent to Expr pattern it was left there. E.g. Expr1046: The larval expression pattern was studied by observing GFP expression in a strain carrying an APR-1::GFP reporter transgene on the integrated array zhIs2.
One example of redundant information was for kxEx74, Expr4687. The info present in the Expr reporter gene field were exactly the same as in the Transgene remark field.
- February 29th 2012: Daniela ran the population script on tazendra after having tested the system on Mangolassi (everything was fine there). The outputlog file with the results of the transfer was copied to Daniela's pc, Desktop/Wormbase/Expr_pattern_to_Transgene_transfer
- Daniela ran the deletion script on September 5th 2012. In this way we deleted the reporter gene info in expression OA for all the objects that were transferred to transgene OA with the population script in February. Karen, Juancarlos and Daniela agreed that we had to set all transgene objects as 'dumpable' in order to delete the reporter gene field from expression OA. Daniela edited the transgenes Fail via the Batch mode. The results of the deletion script are on Lario in Desktop/Wormbase/Expr_pattern_to_Transgene_transfer
Moving forward we will use the new pipeline. In the new pipeline the script immediately deletes the reporter gene field in Expression OA.
- On September 11th 2012 Daniela ran the new population script for the first time. The output is on tazendra /home/acedb/draciti/Expr_pattern/Expr_to_Transgene newpopulationscript_09112012. Checked a few examples, everything went as it should have.
- NB: there should be still some objects in Expression OA that have the reporter gene field AND the transgene. Those should be double checked. It can be that they will bear duplicated info. This happened for the object that Karen merged or looked at before the new pipeline was set in place.
The population script was run at the end of February 2012, from August on we started using the script that is described in the section below -"Current pipeline"
Even after running the deletion script we will not change anything in the Expression pattern page display. This is because there is already a link to the transgene page so all the information about the construct could be found there.
Addendum: Whenever the reporter gene field in the OA lists both a transcriptional fusion and a translational fusion, only one name Expr123_Ex will be generated, and that reporter gene will be transferred to transgene OA. Karen will add Expr123_Ex as synonym for both the transcriptional and the translational fusion. If we later want to populate Expr_OA back with the real names of those transgenes, we have to ask Juancarlos to look for all objects that have the same Expr123_Ex name AND the same Paper but different transgene names, and populate the transgene field in Expr_OA with both transgene names.
Current Pipeline
For each Expression pattern object that does have a "Reporter Gene" NOT blank and a transgene field BLANK the following will happen:
- For the ones that don't have an existing Expr_Ex in these 2 tables trp_publicname and trp_synonym it will:
- generate an object in the transgene OA called WBTransgene000##### and put it in the trp_name
- add in the trp_synonym the name Expr1234_Ex (the numbering of the Expr objects is after the expression pattern object name)
- add WBPerson12028 in the trp_curator -this has to be changed whenever somebody else will take over expression patterns
- add the reporter gene field text into the trp_remark field
- add the WBPaper into the trp_paper field
- Now all the Expression objects have a mapping to the transgene and the script will
- add into the expr_transgene field the WBTransgene000##### name
- delete the text in the exp_reportergene field
Differently from the old pipeline every object here will be set as dump.
the script is located here:
~postgres/work/pgpopulation/transgene/20120127_expr_to_transgene/transfer_expr_to_transgene.pl > outputlog
Daniela runs the script MANUALLY before each upload, the script is normally run from this dir on tazendra /home/acedb/draciti/Transgene_generation
- the script was run for the first time on September 11th 2012. The output is on tazendra /home/acedb/draciti/Expr_pattern/Expr_to_Transgene newpopulationscript_09112012. Checked few examples, everything went as it should have.
In the script, sub readExprAce gets the mappings to Authors from the file /home/acedb/draciti/Expr_pattern/ExprWS221.ace. If the file is not there the script will fail. See above in this section for the author mappings.
sub populateTrpNameToId gets the trp_name (Transgene ID WBTransgene000#####) and maps the trp_publicname to it; it also gets the synonyms (trp_synonym) and maps them to the trp_name.
Synonyms are split on pipes, and spaces at the beginning and at the end are removed.
$trpNameToId{$syn} = $row[0]; this line stores the mapping of name to ID in a hash
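The name-to-ID mapping described for populateTrpNameToId can be sketched without a database. This is an illustration under assumptions: the rows below are made up and stand in for what the script reads from trp_name, trp_publicname and trp_synonym, and the final loop only shows how such a map could answer whether an Expr object already has a transgene.

```perl
use strict;
use warnings;

# Illustrative rows: [ WBTransgene ID, public name, pipe-separated synonyms ].
my @rows = (
    [ 'WBTransgene00000001', 'syIs1',    'Expr1234_Ex | oldName1 ' ],
    [ 'WBTransgene00000002', 'adEx1256', '' ],
);

my %trpNameToId;
for my $row (@rows) {
    my ($id, $publicname, $synonyms) = @$row;
    $trpNameToId{$publicname} = $id;             # public name -> transgene ID
    for my $syn (split /\|/, $synonyms) {        # synonyms are split on pipes
        $syn =~ s/^\s+//;                        # remove spaces at the beginning
        $syn =~ s/\s+$//;                        # remove spaces at the end
        $trpNameToId{$syn} = $id if length $syn; # synonym -> transgene ID
    }
}

# An Expr object needs a new transgene only if its ExprNNN_Ex name is not mapped yet.
for my $expr ('Expr1234', 'Expr5678') {
    my $id = $trpNameToId{"${expr}_Ex"};
    print "$expr: ", ($id ? "use existing $id" : 'generate a new WBTransgene ID'), "\n";
}
```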
merged into
Juancarlos added in transgene OA a "Merged into" box so that Karen will be able to merge transgenes. The transgene object that you merged into the other will be marked as invalid. Say that you are curating transgene2 and you see that is identical to transgene 1, you now click on to "merge into" and select transgene1. Transgene 2 becomes invalid. And Karen will have to add transgene2 into the synonym field of transgene1.
bad strains
To check if there are wrong strains in the OA, run this script from a dir where you have write permission:
/home/postgres/work/pgpopulation/exp_exprpattern/20121011_find_bad_strain/find_bad_strain.pl > bad
Miller paper- tiling arrays
We have added in CitaceMinus a static file with the links to http://www.vanderbilt.edu/wormdoc/wormmap/Expressed_genes.html
The paper is Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, McWhirter RD, Petersen S, Sreedharan VT, Widmer C, Jo J, Reinke V, Petrella L, Strome S, Von Stetina SE, Katz M, Shaham S, Rätsch G, Miller DM 3rd. A spatial and temporal map of C. elegans gene expression. Genome Res. 2011 Feb;21(2):325-41. Epub 2010 Dec 22. PubMed PMID: 21177967; PubMed Central PMCID: PMC3032935.
The file: miller_cell_type_expression.ace
We have added links to the site from the anatomy page.
Other nematodes SVM analysis for gene expression
From Yuling (Nov 7th 2013) Results here:
Looks like only 1% is deemed positive...
- 146 positives
- 15253 negatives
Daniela will go through the list and evaluate
Alternative approach: we can check how many papers have been curated for other species and use those as positive training set. The list of papers is below
Other species
There is a script on tazendra to check the objects curated to non-elegans genes (the script looks into the gin_synonyms table and reports the genes that have no CELE_ synonym):
/home/acedb/draciti/Expr_pattern/20140516_non_cele
find_non_cele.pl*
the output on May 16th 2014 was
WBGene00001198 not CELE_ in pgid 348
WBGene00001198 not CELE_ in pgid 350
WBGene00001198 not CELE_ in pgid 351
WBGene00001198 not CELE_ in pgid 352
WBGene00002126 not CELE_ in pgid 1424
WBGene00009821 not CELE_ in pgid 1520
WBGene00012263 not CELE_ in pgid 2914
WBGene00043408 not CELE_ in pgid 2914
WBGene00009175 not CELE_ in pgid 2914
WBGene00016878 not CELE_ in pgid 2914
WBGene00020512 not CELE_ in pgid 2914
WBGene00019252 not CELE_ in pgid 2914
WBGene00015732 not CELE_ in pgid 3033
WBGene00003454 not CELE_ in pgid 3330
WBGene00003440 not CELE_ in pgid 3330
WBGene00003441 not CELE_ in pgid 3330
WBGene00003427 not CELE_ in pgid 3330
WBGene00003461 not CELE_ in pgid 3330
WBGene00003459 not CELE_ in pgid 3330
WBGene00003428 not CELE_ in pgid 3330
WBGene00003447 not CELE_ in pgid 3330
WBGene00003455 not CELE_ in pgid 3330
WBGene00003453 not CELE_ in pgid 3330
WBGene00003436 not CELE_ in pgid 3330
WBGene00003439 not CELE_ in pgid 3330
WBGene00023572 not CELE_ in pgid 4733
WBGene00023572 not CELE_ in pgid 4734
WBGene00023575 not CELE_ in pgid 4737
WBGene00023572 not CELE_ in pgid 4739
WBGene00023572 not CELE_ in pgid 4740
WBGene00037006 not CELE_ in pgid 4743
WBGene00037006 not CELE_ in pgid 4744
WBGene00037005 not CELE_ in pgid 4747
WBGene00037005 not CELE_ in pgid 4748
WBGene00030970 not CELE_ in pgid 4750
WBGene00041435 not CELE_ in pgid 4753
WBGene00041435 not CELE_ in pgid 4754
WBGene00015274 not CELE_ in pgid 4829
WBGene00015274 not CELE_ in pgid 4830
WBGene00009175 not CELE_ in pgid 5797
WBGene00009175 not CELE_ in pgid 5798
WBGene00000600 not CELE_ in pgid 5823
WBGene00000604 not CELE_ in pgid 5894
WBGene00000605 not CELE_ in pgid 5910
WBGene00000607 not CELE_ in pgid 5990
WBGene00012263 not CELE_ in pgid 6244
WBGene00004041 not CELE_ in pgid 6776
WBGene00002126 not CELE_ in pgid 7395
WBGene00032753 not CELE_ in pgid 7603
WBGene00018677 not CELE_ in pgid 7886
WBGene00018677 not CELE_ in pgid 7887
WBGene00018677 not CELE_ in pgid 7888
WBGene00018677 not CELE_ in pgid 7889
WBGene00004041 not CELE_ in pgid 8272
WBGene00002485 not CELE_ in pgid 9192
WBGene00117029 not CELE_ in pgid 9327
WBGene00043222 not CELE_ in pgid 10105
WBGene00043320 not CELE_ in pgid 10671
WBGene00043320 not CELE_ in pgid 10672
WBGene00010154 not CELE_ in pgid 10897
WBGene00019581 not CELE_ in pgid 11074
WBGene00020312 not CELE_ in pgid 11399
WBGene00021255 not CELE_ in pgid 11756
WBGene00021255 not CELE_ in pgid 11757
WBGene00045485 not CELE_ in pgid 12323
WBGene00029022 not CELE_ in pgid 13372
WBGene00027230 not CELE_ in pgid 13550
WBGene00025707 not CELE_ in pgid 13949
WBGene00023404 not CELE_ in pgid 14000
WBGene00023404 not CELE_ in pgid 14001
WBGene00033342 not CELE_ in pgid 14026
WBGene00059989 not CELE_ in pgid 14027
WBGene00195119 not CELE_ in pgid 14036
WBGene00101073 not CELE_ in pgid 14036
WBGene00025707 not CELE_ in pgid 14037
WBGene00034222 not CELE_ in pgid 14037
WBGene00224104 not CELE_ in pgid 14038
WBGene00233940 not CELE_ in pgid 14039
WBGene00231085 not CELE_ in pgid 14040
WBGene00042594 not CELE_ in pgid 14109
some of these are dead genes and some came up because they do not have CELE_ in the synonyms (that is how the script identifies non elegans)
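The synonym check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual script: the function names, the shape of the synonym table, and the example gene names are all hypothetical, and treating a gene with no synonyms as "not CELE_" is an assumption about how dead genes end up in the list.

```python
def has_cele_synonym(synonyms):
    """True if at least one synonym starts with the CELE_ prefix."""
    return any(name.startswith("CELE_") for name in synonyms)


def report_non_elegans(rows, synonyms_by_gene):
    """Return report lines for (pgid, gene) pairs whose gene lacks a CELE_ synonym.

    Genes missing from the synonym table are reported too (assumed here to be
    how dead genes get flagged), which is why the raw list needs cleaning.
    """
    report = []
    for pgid, gene in rows:
        if not has_cele_synonym(synonyms_by_gene.get(gene, [])):
            report.append(f"{gene} not CELE_ in pgid {pgid}")
    return report


# Hypothetical synonym table and rows, for illustration only.
synonyms_by_gene = {
    "WBGene_elegans_example": ["CELE_EXAMPLE.1"],
    "WBGene_briggsae_example": ["CBG_EXAMPLE"],
}
rows = [
    (1, "WBGene_elegans_example"),
    (2, "WBGene_briggsae_example"),
    (3, "WBGene_dead_example"),
]
print("\n".join(report_non_elegans(rows, synonyms_by_gene)))
```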
the 'clean' list is:
WBGene00023572 not CELE_ in pgid 4733 briggsae WBPaper00028961
WBGene00023572 not CELE_ in pgid 4734 briggsae WBPaper00028961
WBGene00023575 not CELE_ in pgid 4737 briggsae WBPaper00028961
WBGene00023572 not CELE_ in pgid 4739 briggsae WBPaper00028961
WBGene00023572 not CELE_ in pgid 4740 briggsae WBPaper00028961
WBGene00037006 not CELE_ in pgid 4743 briggsae WBPaper00028961
WBGene00037006 not CELE_ in pgid 4744 briggsae WBPaper00028961
WBGene00037005 not CELE_ in pgid 4747 briggsae WBPaper00028961
WBGene00037005 not CELE_ in pgid 4748 briggsae WBPaper00028961
WBGene00030970 not CELE_ in pgid 4750 briggsae WBPaper00028961
WBGene00041435 not CELE_ in pgid 4753 briggsae WBPaper00028961
WBGene00041435 not CELE_ in pgid 4754 briggsae WBPaper00028961
WBGene00032753 not CELE_ in pgid 7603 briggsae WBPaper00035320
WBGene00117029 not CELE_ in pgid 9327 pacificus WBPaper00040360
WBGene00029022 not CELE_ in pgid 13372 briggsae WBPaper00004520
WBGene00027230 not CELE_ in pgid 13550 briggsae WBPaper00043890
WBGene00025707 not CELE_ in pgid 13949 briggsae WBPaper00044493
WBGene00033342 not CELE_ in pgid 14026 briggsae WBPaper00004832
WBGene00059989 not CELE_ in pgid 14027 remanei WBPaper00004832
WBGene00195119 not CELE_ in pgid 14036 pacificus WBPaper00040023
WBGene00101073 not CELE_ in pgid 14036 pacificus WBPaper00040023
WBGene00025707 not CELE_ in pgid 14037 briggsae WBPaper00040859
WBGene00034222 not CELE_ in pgid 14037 briggsae WBPaper00040859
WBGene00224104 not CELE_ in pgid 14038 brugia WBPaper00041825
WBGene00233940 not CELE_ in pgid 14039 brugia WBPaper00041825
WBGene00231085 not CELE_ in pgid 14040 brugia WBPaper00041825
WBGene00042594 not CELE_ in pgid 14109 briggsae WBPaper00044831
WBGene00054802 in pgid 14251 remanei WBPaper00041071
the following were validated positive but are not yet curated:
WBPaper00004561 Haemonchus
WBPaper00004962 Volvulus
WBPaper00005646 Brugia
WBPaper00039907 Ascaris
WBPaper00041323 Brugia
WBPaper00041714 Stercoralis
WBPaper00041951 Haemonchus
WBPaper00042037 Haemonchus
WBPaper00044651 Ascaris suum
an additional list that can be checked is the following I got from Wen:
//Special Expr_pattern paper. Usually they contain expression patterns of un-specified genes.
cgc2796
cgc2714
cgc2559
cgc2475
cgc2449
cgc2274
cgc2005
cgc1542
cgc1984
cgc4994 ges-1 expression in C. briggsae
cgc4821 Expression of cpz-1 in O. volvulus
cgc4837 Expression of a Drosophila transposon in C. elegans using glh-2 promoter.
cgc4895 nud-1 expression in other species.
cgc5831 expression of Od-mpp1 promoter corresponded to that produced by the T03F1.5 or the W09C3.6 promoter in C. elegans.
cgc5943 2-D protein gel dev. stage assay, too ambiguous to curate.
pmid14504223 antibody 1CB4 staining with unknown antigen
cgc6097 expression in other species.
cgc6393 expression in briggsae.
cgc6588 expression in briggsae.
cgc6591 expression in briggsae
cgc6690 Curated as Gene_regulation.
pmid15826643 expression pattern in other species.
pmid15862576 expression pattern in other species.
pmid15630478 expression pattern in other species. In contrast with FOG-2, a highly conserved GLD-1 ortholog is present in C. briggsae (Table 1) and has a germline expression pattern essentially identical to that of C. elegans (Figure 5A, top right and middle right).
WBPaper00026965 expression pattern in other species.
WBPaper00028902 expression pattern in other species
00025105 expression pattern in other species
00025000 expression pattern in other species
00028902 expression pattern in other species
00032298 expression of lin-11 in three species.
WBPaper00035037 expression pattern of
the following is a list of validated negatives that can be used for SVM training (randomly selected from http://131.215.52.209/daniela/nematode/summaryN_id_nematode)
22922012 negative
22922533 negative
22923372 negative
22924021 negative
22930820 negative
22932059 negative
22933846 negative
22935096 negative
22936386 negative
23315190 negative
22947621 negative
22949749 negative
22949753 negative
23307236 negative
22949756 negative
22949757 negative
22951972 negative
22952671 negative
23306387 negative
22952792 negative
22952922 negative
23300895 negative
2295622 negative
22961235 negative
22961310 negative
22967068 negative
22969260 negative
22973231 negative
22983796 negative
22983799 negative
22983801 negative
22984141 negative
22984446 negative
22984536 negative
22992226 negative
22992297 negative
22992897 negative
23107597 negative
23107821 negative
23110936 negative
23110962 negative
23111012 negative
23111089 negative
23111398 negative
23112818 negative
23291463 negative
23029059 negative
23029330 negative
23029423 negative
23029572 negative
23289015 negative
2310180 negative
parasitic nematodes papers containing expression data
sent to Jane Lomax on September 9 2015
O volvulus
http://www.ncbi.nlm.nih.gov/pubmed/11606224
H contortus
http://www.ncbi.nlm.nih.gov/pubmed/14698436
http://www.ncbi.nlm.nih.gov/pubmed/15003846
http://www.ncbi.nlm.nih.gov/pubmed/12062493
http://www.ncbi.nlm.nih.gov/pubmed/23360558
http://www.ncbi.nlm.nih.gov/pubmed/23416426
http://www.ncbi.nlm.nih.gov/pubmed/25128369
http://www.ncbi.nlm.nih.gov/pubmed/25388625
A caninum
http://www.ncbi.nlm.nih.gov/pubmed/11755191
A suum
http://www.ncbi.nlm.nih.gov/pubmed/12387846
http://www.ncbi.nlm.nih.gov/pubmed/21685128
http://www.ncbi.nlm.nih.gov/pubmed/24374308
S stercoralis
http://www.ncbi.nlm.nih.gov/pubmed/14572516
http://www.ncbi.nlm.nih.gov/pubmed/23145190
2A viral technology
Proof of principle described in
- Simultaneous expression of multiple proteins under a single promoter in C. elegans via a versatile 2A-based toolkit
Arnaud Ahier & Sophie Jarriault, Genetics
'We report the use of viral 2A peptides, which trigger a “ribosomal-skip” or “STOP&GO” mechanism during translation, to express multiple proteins from a single vector in C. elegans. Although none of the viruses known to infect C. elegans contain 2A-like sequences, our results show that 2A peptides allow the production of separate functional proteins in all cell types and at all developmental stages tested in the worm. In addition, we constructed a toolkit including a 2A-based polycistronic plasmid and reagents to generate 2A-tagged fosmids. 2A peptides constitute an important tool to ensure the delivery of multiple polypeptides in specific cells enabling several novel applications, such as the reconstitution of multi-subunit complexes.'
We will keep an eye on whether this technology gets used more extensively and, if so, possibly change the model.
SVM analysis for gene expression
051812_042012. The retraining for this batch was done by incorporating the curated results from 2009 till 2012.
old: 14/37 = 37.8%
re-train: 17/26 = 65.3%
From this batch on we have started to manually manipulate the features by adding some and deleting others. The files are stored on Lario/Desktop/SVM
06/08-05/18 2012.
old re-train feature_manipulation 29.30% 44.40% 55.00% 28.90% 55%
September 21 2012
old: 54/98 = 55%
re-train: 41/56 = 73.2%
feature_manipulation: 54/79 = 68.3%
section_model: 34/47 = 72.3%
(159 out of 459 papers have results)
November 02 2012
old: 9/13 = 69%
re-train: 16/22 = 72.7%
feature_manipulation: 28/46 = 60.8%
June 14 2013
old: 6/10 = 60%
re-train: 6/6 = 100%
section_model: 5/5 = 100%
In July 2014 Yuling retrained SVM with the latest results. He got the results from the curation status from:
validated positive: http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi?action=listCurationStatisticsPapersPage&select_curator=two12028&listDatatype=otherexpr&method=allval%20pos%20cur&checkbox_cfp=on&checkbox_afp=on&checkbox_svm=on
validated negative: http://tazendra.caltech.edu/~postgres/cgi-bin/curation_status.cgi?action=listCurationStatisticsPapersPage&select_curator=two12028&listDatatype=otherexpr&method=allval%20neg&checkbox_cfp=on&checkbox_afp=on&checkbox_svm=on
Yuling is also comparing SVM and feature manipulation on the paper range 44200 to 44700, which corresponds roughly to October 2013 through January 2014:
http://131.215.52.209/celegans/svm_results/20131004/ http://131.215.52.209/celegans/svm_results/20140124/012414_011014_otherexpr
In the list we will take into account the papers that are curation negative but SVM positive. They can be listed here: http://tazendra.caltech.edu/~postgres/cgi-bin/referenceform.cgi with the query: SELECT * FROM cur_curdata WHERE cur_selcomment ~ '1';
the results of the analysis are as follows (oct2013-jan2014)
testing papers: 176 total (69 positive, 107 negative)
F score = 2*(precision*recall)/(precision+recall)
model | true positive | false positive | precision | recall | F score
feature manipulation | 47 | 15 | 75.80% | 47/69 = 68.10% | 0.717439889
current SVM model | 32 | 7 | 82% | 32/69 = 46.30% | 0.591831645
new testing SVM model | 42 | 18 | 70% | 42/69 = 60.80% | 0.650764526
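The precision, recall, and F score figures for the Oct 2013 to Jan 2014 test set can be recomputed from the true positive and false positive counts and the 69 positive papers. The snippet below is a minimal sketch; the function name is made up for illustration. Because it works from the raw counts rather than from rounded percentages, the last decimals of the F score may differ slightly from the recorded values.

```python
def precision_recall_f(true_pos, false_pos, total_pos):
    """Compute precision, recall and F score from raw counts.

    total_pos is the number of truly positive papers in the test set,
    so total_pos - true_pos papers were missed (false negatives).
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / total_pos
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score


# Counts taken from the Oct 2013 - Jan 2014 comparison (69 positive papers).
models = [
    ("feature manipulation", 47, 15),
    ("current SVM model", 32, 7),
    ("new testing SVM model", 42, 18),
]
for name, tp, fp in models:
    p, r, f = precision_recall_f(tp, fp, 69)
    print(f"{name}: precision {p:.1%}, recall {r:.1%}, F score {f:.3f}")
```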
Jan2014-Dec2014 -using feature manipulation models
testing papers: 226 total flagged (172 positive, 54 negative); precision 76.1%
Micropublications
go to Micropublications
User data submission
the submission form is here:
http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/expr_pattern.cgi
and it appends data here
http://tazendra.caltech.edu/~azurebrd/cgi-bin/data/expr.ace
Numbers
- 9260 Expression objects in WS180 ( 7176 + 2084 Chronograms)
- 12110 Expression objects in WS233 (10026 + 2084 Chronograms)
- 13380 Expression objects in WS243 (11296 + 2084 Chronograms)
- 14273 Expression objects in WS243 (12189 + 2084 Chronograms)
Transgenic Alleles from Constructs
The initial population of cns tables for expression constructs
Over 4000 Construct objects that did not have a 'standard nomenclature' name as transgenes -e.g. hkdEx1202 for extrachromosomal arrays- were imported into Construct OA.
To atomize the curation details from the construction summary description into separate fields, Juancarlos wrote a script.
The script is located here:
/home/postgres/work/pgpopulation/cns_construct/20141110_daniela_constructs/update_cns_by_daniela.pl*
More information/files on Lario in the folder Construct/Clone stuff
Alignment of Construct Data for Alliance February 2021
- WB Expression curation associates Expr_patterns with both constructs and transgenic alleles. To simplify expression schema changes for Alliance we will give transgene IDs to all constructs used in Expression Pattern. The Alliance schema will therefore have a transgenic allele tag, which will be populated with transgene data coming from WB expression curation.
- As of 02.10.2021: There are 14053 pgids associated with constructs. Files are located on tazendra : /home/postgres/work/pgpopulation/exp_exprpattern/20210210_construct_transgene
- the file is_in contains 6947 pgids: these are constructs in expression OA that have a matching transgene in the Expression OA transgene field. For these objects Juancarlos will delete the construct from Expr OA, as it is redundant with the transgene.
- the file is_not contains 55 pgids: these are objects for which there’s a construct that has a corresponding transgene in transgene OA but the transgene is not in the transgene field for that expression object. Daniela will go through the list. If the transgene happens to be identical to the construct she will put that transgene in expression OA and she will delete the construct in Expression OA. --Daniela done Feb 11 2021
- script1* /home/postgres/work/pgpopulation/exp_exprpattern/20210210_construct_transgene/query_expression_constructs_transgenes.pl
This script finds Constructs that have a transgene, removes the construct from exp_cns field and adds the corresponding transgene in the Exp_trp field
- the file does_not contains 7051 pgids: for these we will need to create new transgenes, as described under 'Populating exp_transgene based on exp_construct'.
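The three-way split into is_in, is_not and does_not can be sketched as follows. This is an illustration of the classification logic only, not the actual Perl script: the function name, the input shapes and the example IDs are placeholders.

```python
def partition_pgids(exp_rows, transgene_for_construct):
    """Split expression pgids into the is_in / is_not / does_not groups.

    exp_rows maps pgid -> (construct, set of transgenes in the transgene field).
    transgene_for_construct maps a construct to its transgene, when one exists.
    """
    is_in, is_not, does_not = [], [], []
    for pgid, (construct, exp_transgenes) in exp_rows.items():
        transgene = transgene_for_construct.get(construct)
        if transgene is None:
            does_not.append(pgid)   # no transgene yet: one must be created
        elif transgene in exp_transgenes:
            is_in.append(pgid)      # construct is redundant with the transgene
        else:
            is_not.append(pgid)     # transgene exists but is not on the object
    return is_in, is_not, does_not


# Placeholder IDs, for illustration only.
exp_rows = {
    1: ("construct_A", {"transgene_A"}),
    2: ("construct_B", set()),
    3: ("construct_C", set()),
}
transgene_for_construct = {"construct_A": "transgene_A", "construct_B": "transgene_B"}
print(partition_pgids(exp_rows, transgene_for_construct))

# The recorded file sizes add up to the 14053 pgids associated with constructs.
print(6947 + 55 + 7051 == 14053)
```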
Populating exp_transgene based on exp_construct
For back population of trp tables with existing constructs in the expression tables for which there is no associated transgene:
- find all exp_constructs that are not associated with a trp_construct
- create transgene IDs for each exp_construct following the data porting as laid out below
- Postgres creates a transgene pgid, and populates trp tables as follows
- cns_summary copied to trp_summary
- cns_name copied to trp_construct
- cns_paper copied to trp_paper
- cns_curator -default to Daniela
- trp_name copied to exp_transgene
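The data porting laid out above amounts to a small field mapping. The sketch below shows one way to express it; it is not the production script. The function name and example values are hypothetical, and storing the default curator in a field called trp_curator is an assumption.

```python
# Field mapping from the construct (cns) tables to the transgene (trp) tables.
CNS_TO_TRP = {
    "cns_summary": "trp_summary",
    "cns_name": "trp_construct",
    "cns_paper": "trp_paper",
}


def transgene_from_construct(cns_row, trp_name, curator="Daniela"):
    """Build the new transgene row and the matching expression update."""
    trp_row = {trp: cns_row.get(cns) for cns, trp in CNS_TO_TRP.items()}
    trp_row["trp_name"] = trp_name
    trp_row["trp_curator"] = curator  # assumed field name for the default curator
    exp_update = {"exp_transgene": trp_name}  # trp_name copied to exp_transgene
    return trp_row, exp_update


# Placeholder values, for illustration only.
cns_row = {
    "cns_summary": "example summary",
    "cns_name": "example_construct_id",
    "cns_paper": "example_paper_id",
}
print(transgene_from_construct(cns_row, "example_transgene_id"))
```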
NOTE: There were 1849 transgenes and 2259 constructs with no paper info; of these, 447 constructs and 159 transgenes were used in expression. We have now populated the paper field in construct and transgene for these objects via the Expression pattern connection:
/home/postgres/work/pgpopulation/exp_exprpattern/20210331_cns_trp_exp_paper
cns_no_paper_with_expr 447 constructs -> 405 have paper, spot checked the ones that have no paper and they had a Person association
trp_no_paper_with_expr 159 transgenes -> 125 have paper, spot checked the ones that have no paper and they had a Person association
The script to populate the paper info from expression objects is here: /home/postgres/work/pgpopulation/exp_exprpattern/20210331_cns_trp_exp_paper/transfer_cns_trp_with_exp_no_paper.pl*
After data clean up will need to suppress data [delete construct] in the exp_construct
All these changes were discussed and agreed upon on 02.10.2021 in a meeting (attendees: Daniela, Juancarlos, Chris, Karen).
Script2 /home/postgres/work/pgpopulation/exp_exprpattern/20210316_construct_to_transgene/copy_construct_to_transgene.pg
Construct/Transgene curation moving forward
- When authors are not using standard nomenclature for a transgene, the Expression curator will:
- go to construct OA and create a new construct
- Take that construct ID and put it in the construct field in expression OA
- A cronjob (concatenates script 2 and script 1 above) will run overnight and will
- look for all constructs listed in Expression OA,
- create a transgene object for such construct
- Populate the trp_fields as above copying data over from the construct object.
- Add the transgeneID just created in the transgene field of Expression OA for which the construct was made
- Delete the construct from the construct field
Cronjob: # 0 4 * * * /home/postgres/work/pgpopulation/exp_exprpattern/cronjobs/transfer_exp_cns_trp/wrapper.pl
new allele request
go to the name server and log in: http://www.sanger.ac.uk/sanger/Worm_NameServer
check if the new variation already exists by clicking on find variation.
If it does: generate an ID in the OA: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/generic.cgi?action=TempVariationObo putting in the public name and the ID
if it doesn't: on the name server click on 'request a new variation ID' put in the public name and the paper -with additional info
then go to the OA as above and generate an ID
bulk OA uploads
Chris' master table- leave as is https://docs.google.com/spreadsheets/d/13Dt1ZjrEYUe6uolqbsdjbDRexftoXBXWV4knyb5j4f4/edit#gid=0
Copy table in my drive https://docs.google.com/spreadsheets/d/1I4eVTuHR3GW3m5CFVPM8Ll2GQoQP26XZaMldlLg78iQ/edit#gid=0
Expression pattern remodel
OA and GO CC curation comparison
Overview
Daniela and Kimberly are comparing CC annotations between the OA and GO pipelines. The document for tracking is here: https://docs.google.com/spreadsheets/d/1QnRd0TE7zZGC4ThQvLX1d-9sQH2luqrYLb_A26o8WCE/edit#gid=579706084
Scenarios
- no OA but GO_CC
- import in OA exp with IDA as evidence code
- Markers for GO: keep just the first reference and get rid of the ones that are confirmatory.
- no GO but OA annotation
- create a gpad file and ask Tony Sawford to upload it to the protein to GO database
- the gpad should contain annotation extensions to tissues
- Secreted proteins:
- clean up the secreted proteins annotations
Importing GO_CC annotations in the Paper term info
We can parse a .ace file that Kimberly generates for every build located here (Tazendra)
/home/acedb/kimberly/citace_upload/go/gpad2ace/gpad_parsing
and called gp_annotation.ace
We want to consider only objects that have Annotation_relation "part_of" or "colocalizes_with" and Reference "WBPaper000nnnn"
Once I enter the WBPaperID in OA, I would like to see in the term info:
- 1) Gene: Display locus
- 2) 'part_of' or 'colocalizes_with'
- 3) GO Term: Display 'name'. For example, for id GO:0005634 display nucleus
- 4) GO_code: e.g. 'IDA'
- 5) Extension: get values from GO_term_relation and Display them. Example:
- part_of(Anatomy name)
For example, if you have: part_of(WBbt:0004821)|part_of(WBbt:0006786)|part_of(WBbt:0006787) Display: part_of(DVC)|part_of(ut2)|part_of(ut3)
- exists_during(anaphase)
- 6) get values from Life_stage_relation and display them
- Life_stage_relation
Life_stage_relation "exists_during" "embryo"
- 7) get the values from Anatomy_relation and display them
- 8) contributed_by: whenever it is NOT WormBase, import the value
For point #8 we could store the info in Curated_by in the expression pattern model. If so, create a field in Expression OA (tab 4) called Curated_by, free small-text. For now we are not dumping it: we want to check how often it is used, and will dump it in the future if need be.
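The extension display in point 5 (replacing anatomy IDs with their names) can be sketched as a simple substitution. The function name is made up for illustration, and the lookup table only contains the three IDs from the example above; a real version would read names from the anatomy ontology.

```python
import re

# ID -> name pairs taken from the example in point 5.
ANATOMY_NAMES = {
    "WBbt:0004821": "DVC",
    "WBbt:0006786": "ut2",
    "WBbt:0006787": "ut3",
}


def display_extension(extension, names):
    """Replace each WBbt ID in an extension string with its display name.

    IDs that are not in the lookup table are left unchanged.
    """
    return re.sub(r"WBbt:\d+", lambda m: names.get(m.group(0), m.group(0)), extension)


raw = "part_of(WBbt:0004821)|part_of(WBbt:0006786)|part_of(WBbt:0006787)"
print(display_extension(raw, ANATOMY_NAMES))
# part_of(DVC)|part_of(ut2)|part_of(ut3)
```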
Name of the table: tin_paper_legocc
the links to protein to GO are: ftp://ftp.ebi.ac.uk/pub/contrib/goa/ File:gp_association.6239_wormbase.gz
the gp_association.6239_wormbase.gz File is the source file for Juancarlos' script here: /home/acedb/kimberly/citace_upload/go/gpad2ace/2017_January
the gp_association.6239_wormbase.gz file gets updated weekly, every Monday morning UK time, so we can set up a cronjob for Tuesday morning 8:00 am California time.
Example:
I would want to have
*Gene1, part of
GO_Termx, GO_codex, Extensionsx
GO_Termy, GO_codey, Extensionsy
*Gene2, part of
GO_Termz, GO_codez, Extensionsz
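A display grouped by gene, as in the example, could be produced along these lines. This is a sketch under assumptions: the function names are hypothetical, and each annotation is assumed to arrive as a (gene, relation, GO term, GO code, extension) tuple after parsing.

```python
from collections import defaultdict


def group_by_gene(annotations):
    """Group (gene, relation, go_term, go_code, extension) tuples by gene and relation."""
    grouped = defaultdict(list)
    for gene, relation, go_term, go_code, extension in annotations:
        grouped[(gene, relation)].append((go_term, go_code, extension))
    return grouped


def render(grouped):
    """Render one header line per gene/relation, then one line per GO annotation."""
    lines = []
    for (gene, relation), entries in grouped.items():
        lines.append(f"*{gene}, {relation}")
        for go_term, go_code, extension in entries:
            # Skip empty fields so annotations without extensions still read well.
            lines.append(", ".join(x for x in (go_term, go_code, extension) if x))
    return "\n".join(lines)


# Placeholder annotations mirroring the example layout.
annotations = [
    ("Gene1", "part_of", "GO_Termx", "GO_codex", "Extensionsx"),
    ("Gene1", "part_of", "GO_Termy", "GO_codey", "Extensionsy"),
    ("Gene2", "part_of", "GO_Termz", "GO_codez", "Extensionsz"),
]
print(render(group_by_gene(annotations)))
```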
All relevant files are on tazendra here: /home/acedb/draciti/lego_cc_annotations
We are using the gp2protein file in that folder. Kimberly plans to change the format to gpi (we will modify the source accordingly when ready).
We have a cronjob running every Tuesday morning (8:00 am Pasadena time) that gets data from the latest version of gp_association.6239_wormbase.gz and populates the tables on postgres.
0 8 * * tue /home/acedb/draciti/lego_cc_annotations/wrapper.sh
When Kimberly changes the gp2protein file into a gpi file we need to update the script.
Comparison CGI
Go to Comparison CGI
AGR data transfer
Annotations with Uncertain tag have not been included in the initial upload.
Mappings to LinkML
///////////////////////////////////
//
// ?Expr_pattern class
//
///////////////////////////////////
?Expr_pattern Expression_of Gene ?Gene OUTXREF Expr_pattern #Evidence Gene -> BiologicalEntity in LinkML
Reflects_endogenous_expression_of ?Gene -> endogenous tag in expression_qualifier_set in LinkML
// CDS ?CDS OUTXREF Expr_pattern // for coding genes -> double check with Paul D but probably good to delete
Sequence ?Sequence OUTXREF Expr_pattern // for clones??? -> moved to remarks (12 objects) good to ignore
// Pseudogene ?Pseudogene OUTXREF Expr_pattern // [030801 krb] -> asked to remove from schema Feb 2022
Clone ?Clone OUTXREF Expr_pattern -> reagents in LinkML
Protein ?Protein OUTXREF Expr_pattern -> asked to remove from schema Feb 2022
Protein_description Text // information for Expr_patterns with unknown antigens [031105 krb] -> asked to remove from schema Feb 2022
// Homol Homol_homol ?Homol_data XREF Expr_homol ?Method Float Int UNIQUE Int Int UNIQUE Int #Homol_info //Expr_pattern mapping [060427 ar2] -> for chronogram, ignore if chronograms are going to be imported as pictures
Expression_data Life_stage ?Life_stage OUTXREF Expr_pattern #Qualifier -> developmental_stage in LinkML
Anatomy_term ?Anatomy_term OUTXREF Expr_pattern #Qualifier -> anatomical_structure in LinkML
GO_term ?GO_term OUTXREF Expr_pattern #GR_condition -> cellular_component in LinkML
Not_in_Life_stage ?Life_stage #Qualifier -> developmental_stage negated in LinkML
Not_in_Anatomy_term ?Anatomy_term #Qualifier -> anatomical_structure negated in LinkML
Not_in_GO_term ?GO_term #GR_condition -> cellular_component negated in LinkML
Subcellular_localization ?Text -> ExpressionExperimentStatement in LinkML
Type Antibody ?Text FILL_DEFAULT -> MMO immunohistochemistry -> assay_used in LinkML
*Cis_regulatory_element* Text FILL_DEFAULT -> this tag was used in WB CV for tagging sequence feature annotations. As such, it is not a method tag, can probably drop it.
EPIC ?Text FILL_DEFAULT -> used for Murray study, good to ignore
Genome_editing ?Text FILL_DEFAULT -> MMO knock-in in situ reporter assay -> assay_used in LinkML
In_situ Text FILL_DEFAULT -> MMO RNA in situ -> assay_used in LinkML
*Localizome* ?Text FILL_DEFAULT Localizome -> for chronograms good to ignore. We can bring chronograms as image objects. They don’t hold any anatomy/ls annotations per se. Example chronograph page on WB: https://wormbase.org/species/all/expr_pattern/Chronogram1954#0213--10
Microarray ?Microarray_experiment -> holds images from Yanai -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen
Northern Text FILL_DEFAULT -> MMO northern -> assay_used in LinkML
Reporter_gene ?Text FILL_DEFAULT -> MMO in situ reporter -> assay_used in LinkML
RNASeq ?Analysis -> holds images from Yanai -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen
RT_PCR Text FILL_DEFAULT -> MMO RT PCR -> assay_used in LinkML
Tiling_array ?Analysis -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen
Western Text FILL_DEFAULT -> MMO western -> assay_used in LinkML
Expression_cluster ?Expression_cluster INXREF Expr_pattern //added for localizome
Microarray_results ?Microarray_results INXREF Expr_Pattern -> in high throughput dataset
Pattern ?Text //Multi_value ordering issue in Datomic possibly -> ExpressionExperimentStatement
Picture ?Picture INXREF Expr_pattern -> image in LinkML
MovieURL Text //Added by wen for link to movie URLs. -> asked to remove from schema Feb 2022
Movie ?Movie INXREF Expr_pattern //Added by Wen to curate Expr_pattern video -> movie in LinkML
Species UNIQUE ?Species -> Can be inferred from biological entity. Not needed in LinkML
Remark ?Text #Evidence -> ExpressionAnnotationStatement in LinkML
DB_info ?Database ^database ?Database_field ^field Text ^accession Example: Expr1040545 - Miller study -> Link out can be handled via the Links to third party expression resources
Experiment Laboratory ?Laboratory -> asked to remove from schema Feb 2022
Strain UNIQUE ?Strain -> specimen_genomic_model (AGM) in LinkML
Person UNIQUE ?Person -> talk to Kimberly to mint WBPaperIDs for personal communications
Reference ?Paper OUTXREF Expr_pattern -> reference in LinkML
Transgene ?Transgene OUTXREF Expr_pattern -> specimen_alleles in LinkML
Variation ?Variation INXREF Expr_pattern -> specimen_alleles in LinkML
Construct ?Construct OUTXREF Expression_pattern -> all constructs have been converted to transgenes -> specimen_alleles in LinkML
Associated_feature ?Feature OUTXREF Associated_with_expression_pattern #Evidence -> BiologicalEntity in LinkML??
Antibody_info ?Antibody OUTXREF Expr_pattern // This applies to both Western & Antibody staining -> reagents in LinkML
// added [031120 krb]
Curated_by UNIQUE Text // Hinxton (HX) or Caltech (CIT) Sylvia [010927 krb] -> used by chronograms curated by caltech (good to ignore)
Historical_gene ?Gene Text -> double check with Chris how this type of info will be dealt with generally at Alliance
//Qualifier hash will be used for Expr_pattern curation to specify the reliability of data.
#Qualifier Certain
Uncertain //For faint or variable expression
Partial //For expression of unidentified cell in a cell group
Anatomy_term ?Anatomy_term //combines life stage with anatomy term in expr pattern annotation
Life_stage ?Life_stage //combines life stage with anatomy term in expr pattern annotation
Remark Text //New tag to take the optional text from the Certain/Uncertain/Partial nested tags above.
detailed
- Gene ?Gene OUTXREF Expr_pattern #Evidence Gene -> BiologicalEntity in LinkML
- Reflects_endogenous_expression_of ?Gene -> endogenous tag in expression_qualifier_set in LinkML
- Clone ?Clone OUTXREF Expr_pattern -> reagents in LinkML
- Life_stage ?Life_stage OUTXREF Expr_pattern #Qualifier -> developmental_stage in LinkML
- Anatomy_term ?Anatomy_term OUTXREF Expr_pattern #Qualifier -> anatomical_structure in LinkML
- GO_term ?GO_term OUTXREF Expr_pattern #GR_condition -> cellular_component in LinkML
- Not_in_Life_stage ?Life_stage #Qualifier -> developmental_stage negated in LinkML
- Not_in_Anatomy_term ?Anatomy_term #Qualifier -> anatomical_structure negated in LinkML
- Not_in_GO_term ?GO_term #GR_condition -> cellular_component negated in LinkML
- Subcellular_localization ?Text -> ExpressionExperimentStatement in LinkML
- Type
- Antibody ?Text FILL_DEFAULT -> MMO immunohistochemistry -> assay_used in LinkML
- Genome_editing ?Text FILL_DEFAULT -> MMO knock-in in situ reporter assay -> assay_used in LinkML
- In_situ Text FILL_DEFAULT -> MMO RNA in situ -> assay_used in LinkML
- Northern Text FILL_DEFAULT -> MMO northern -> assay_used in LinkML
- Reporter_gene ?Text FILL_DEFAULT -> MMO in situ reporter -> assay_used in LinkML
- RT_PCR Text FILL_DEFAULT -> MMO RT PCR -> assay_used in LinkML
- Western Text FILL_DEFAULT -> MMO western -> assay_used in LinkML
- Pattern ?Text //Multi_value ordering issue in Datomic possibly -> ExpressionExperimentStatement
- Picture ?Picture INXREF Expr_pattern -> image in LinkML
- Movie ?Movie INXREF Expr_pattern //Added by Wen to curate Expr_pattern video -> movie in LinkML
- Species UNIQUE ?Species -> Can be inferred from biological entity. Not needed in LinkML
- Remark ?Text #Evidence -> ExpressionAnnotationStatement in LinkML
- DB_info ?Database ^database ?Database_field ^field Text ^accession Example: Expr1040545 - Miller study -> Link out can be handled via the Links to third party expression resources
- Strain UNIQUE ?Strain -> specimen_genomic_model (AGM) in LinkML
- Reference ?Paper OUTXREF Expr_pattern -> reference in LinkML
- Transgene ?Transgene OUTXREF Expr_pattern -> specimen_alleles in LinkML
- Variation ?Variation INXREF Expr_pattern -> specimen_alleles in LinkML
- Construct ?Construct OUTXREF Expression_pattern -> all constructs have been converted to transgenes -> specimen_alleles in LinkML
- Associated_feature ?Feature OUTXREF Associated_with_expression_pattern #Evidence -> BiologicalEntity in LinkML??
- Antibody_info ?Antibody OUTXREF Expr_pattern // This applies to both Western & Antibody staining -> reagents in LinkML
- Historical_gene ?Gene Text -> double check with Chris how this type of info will be dealt with generally at Alliance
//Qualifier hash will be used for Expr_pattern curation to specify the reliability of data.
- Qualifier Certain
Uncertain //For faint or variable expression
Partial //For expression of unidentified cell in a cell group
Anatomy_term ?Anatomy_term //combines life stage with anatomy term in expr pattern annotation
Life_stage ?Life_stage //combines life stage with anatomy term in expr pattern annotation
Remark Text //New tag to take the optional text from the Certain/Uncertain/Partial nested tags above.
To ignore or already removed from schema:
- CDS ?CDS OUTXREF Expr_pattern // for coding genes -> double check with Paul D but probably good to delete
- Sequence ?Sequence OUTXREF Expr_pattern // for clones??? -> moved to remarks (12 objects) good to ignore
- Protein ?Protein OUTXREF Expr_pattern -> asked to remove from schema Feb 2022
- Protein_description Text -> asked to remove from schema Feb 2022
- Pseudogene ?Pseudogene OUTXREF Expr_pattern // [030801 krb] -> asked to remove from schema Feb 2022
- MovieURL Text //Added by wen for link to movie URLs. -> asked to remove from schema Feb 2022
- Laboratory ?Laboratory -> asked to remove from schema Feb 2022
- Curated_by UNIQUE Text // Hinxton (HX) or Caltech (CIT) Sylvia [010927 krb] -> used by chronograms curated by caltech (good to ignore)
- Type
- Cis_regulatory_element Text FILL_DEFAULT -> this tag was used in WB CV for tagging sequence feature annotations. As such, it is not a method tag, can probably drop it.
- EPIC ?Text FILL_DEFAULT -> used for Murray study, good to ignore
Chronograms:
- Homol Homol_homol ?Homol_data XREF Expr_homol ?Method Float Int UNIQUE Int Int UNIQUE Int #Homol_info //Expr_pattern mapping [060427 ar2] -> for chronogram, ignore if chronograms are going to be imported as pictures
- Type
- Localizome ?Text FILL_DEFAULT Localizome -> for chronograms good to ignore. We can bring chronograms as image objects. They don’t hold any anatomy/ls annotations per se. Example chronograph page on WB: https://wormbase.org/species/all/expr_pattern/Chronogram1954#0213--10
Talked to Wen- no need to import right now
- Expression_cluster ?Expression_cluster INXREF Expr_pattern //added for localizome -> only chronograms, no need to import
- Microarray_results ?Microarray_results INXREF Expr_Pattern -> in high throughput dataset
Coming in with tags in images:
- Type
- Microarray ?Microarray_experiment -> holds images from Yanai -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen again once we will work on the image class
- RNASeq ?Analysis -> holds images from Yanai -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen again once we will work on the image class
- Tiling_array ?Analysis -> can add a tag in Picture class. Link to high throughput data, too? Talk to Wen again once we will work on the image class
Talk to Kimberly
- Person UNIQUE ?Person -> talk to Kimberly to mint WBPaperIDs for personal communications
UBERON to WBbt anatomy mappings for Alliance
The csv file with the mappings is in this excel spreadsheet https://docs.google.com/spreadsheets/d/1cC6jnDP7x2mQJmFQ6QRlZxv-nS_9vSROCJAm2W-SCwY/edit#gid=770823761
Single Cell RNAseq graphs
For Articles that contain single cell RNA seq analysis, we can offer authors:
- Display of interactive graph TPM / % cell expressing per gene
- Inclusion of enriched gene sets per cell (to the extent that we can translate their cell group to anatomy term).
- Link out to their analysis tools page
The CeNGEN example can be seen here: https://wormbase.org/species/c_elegans/gene/WBGene00001170#-1-10
Data are also stored here: http://caltech.wormbase.org/pub/wormbase/datasets-published/packer2019/