Difference between revisions of "WormBase-Caltech Weekly Calls"

From WormBaseWiki
 
(94 intermediate revisions by 5 users not shown)
Line 41: Line 41:
 
[[WormBase-Caltech_Weekly_Calls_August_2019|August]]
 
[[WormBase-Caltech_Weekly_Calls_August_2019|August]]
  
 +
[[WormBase-Caltech_Weekly_Calls_September_2019|September]]
  
== September 12, 2019 ==
+
[[WormBase-Caltech_Weekly_Calls_October_2019|October]]
  
=== Update on SVM pipeline ===
 
* New SVM pipeline: more analysis and more parameter tuning
 
* avoiding precision (and F-value) as a measure (dependent on ratio of positives and negatives in test set)
 
* For example shown, "dumb" machine starts out with precision above 0.6
 
* G-value (Michael's invention); does not depend on distribution of sets
 
* Applied to various data types
 
* Analysis: 10-fold cross validation
 
** Randomly select 10% pos and neg (without replacement) and repeat until all papers sampled
 
* F-value changes over different p/n values; G-value does not (essentially flat)
 
* Area Under the Curve (AUC): probability that a random positive scores higher than random negative
 
* AUC values for many WB data types range from the upper 80s into the 90s (percent)
 
* Ranjana: How many papers for a good training set? Michael: we don't know yet
 
* Can't reproduce old training sets (for old SVM); provide Michael better training sets if you want improved SVM
 
* If SVM still not good enough, Michael will work on deep neural networks (TensorFlow)
 
* Michael can provide training sets he has used recently
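
A minimal sketch of three points from the notes above: why a "dumb" classifier's precision simply tracks the positive/negative ratio of the test set, how AUC can be read as the probability that a random positive outscores a random negative, and how a 10-fold split samples every paper exactly once. All function names and example numbers are hypothetical, not taken from Michael's pipeline. The G-value is not shown because these notes do not define it.

```python
import random


def dumb_precision(n_pos, n_neg):
    """Precision of a classifier that labels every paper positive.

    Every positive is a true positive and every negative a false positive,
    so precision equals the fraction of positives in the test set.
    """
    return n_pos / (n_pos + n_neg)


def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a win. This is the pairwise reading of AUC.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def stratified_folds(positives, negatives, k=10, seed=0):
    """Shuffle positives and negatives separately and deal each into k folds.

    Sampling is without replacement, so every paper lands in exactly one fold.
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    return [(pos[i::k], neg[i::k]) for i in range(k)]


# Precision of the "dumb" classifier moves with the test-set ratio.
print(dumb_precision(60, 40), dumb_precision(10, 90))

# AUC is unchanged when the negative set is made five times larger.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]
print(auc(pos, neg), auc(pos, neg * 5))
```

In this toy example the pairwise AUC stays the same when the negatives are replicated, whereas the precision of the all-positive classifier drops as negatives are added.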
 
  
=== Clarifying definitions of "defective" and "deficient" for phenotypes ===
+
== November 7, 2019 ==
* WB phenotype ontology has many "variant/abnormal" terms and distinct subclass terms for "defective/deficient"
 
* Have tried to create a logical definition pattern for these terms, but the vagueness of the meaning of "defective" and how it is distinct from "abnormal" has stalled the process
 
* What do we mean exactly by "defective" and how, specifically, is this distinct from "abnormal"?
 
* Definitions include meanings or words:
 
** "Variations in the ability"
 
** "aberrant"
 
** "defect"
 
** "defective"
 
** "defects"
 
** "deficiency"
 
** "deficient"
 
** "disrupted"
 
** "impaired"
 
** "incompetent"
 
** "ineffective"
 
** "perturbation that disrupts"
 
** Failure to execute the characteristic response = abnormal?
 
** abnormal
 
** abnormality leading to specific outcomes
 
** fail to exhibit the same taxis behavior = abnormal?
 
** failure
 
** failure OR delayed
 
** failure, slower OR late
 
** failure/abnormal
 
** reduced
 
** slower
 
  
=== Citace upload ===
+
=== WS275 Citace upload ===
** Tuesday, Sep 24th
+
* Maybe Nov 22 upload to Hinxton
 +
* CIT curators upload to Spica on Tues Nov 19
  
=== Strain to ID mapping ===
+
=== ?Genotype class ===
* Waiting on Hinxton to send strain ID mapping file?
+
* [https://docs.google.com/document/d/19hP9r6BpPW3FSAeC_67FNyNq58NGp4eaXBT42Ch3gDE/edit?usp=sharing Working data model document]
* Hopefully we can all get that well before the upload deadline
+
* Several classes have a "Genotype" tag with text entry
* Will do global replacement at time of citace upload (at least for now)
+
** Strain
 +
** 2_point_data
 +
** Pos_neg_data
 +
** Multi_pt_data
 +
** RNAi
 +
** Phenotype_info
 +
** Mass_spec_experiment (no data as of WS273)
 +
** Condition
 +
* Collecting all genotype text entries yields ~33,000 unique entries, with many different forms:
 +
** Species entries, like "Acrobeloides butschlii wild isolate" or "C. briggsae"
 +
** Strain entries, like "BA17[fem-1(hc-17)]" or "BB21" or "BL1[pK08F4.7::K08F4.7::GFP; rol-6(+)]"
 +
** Anonymous transgenes, like "BEC-1::GFP" or "CAM-1-GFP" or "Ex[Pnpr-9::unc-103(gf)]"
 +
** Complex constructs, like "C56C10.9(gk5253[loxP + Pmyo-2::GFP::unc-54 3' UTR + Prps-27::neoR::unc-54 3' UTR + loxP]) II"
 +
** Text descriptions, like "Control" or "WT" or "Control worms fed on HT115 containing the L4440 vector without insert" or "N.A."
 +
** Bacterial genotypes, like "E. coli [argA, lysA, mcrA, mcrB, IN(rrnD-rrnE)1, lambda-, rcn14::Tn10(DE3 lysogen::lavUV5 promoter -T7 polymerase]"
 +
** Including balancers, like "F26H9.8(ok2510) I/hT2 [bli-4(e937) let-?(q782) qIs48] (I;III)"
 +
** Reference to parent strain, like "Parent strain is AG359"
 +
** Referring to RNAi, like "Pglr-1::wrm-1(RNAi)" or "Phsp-6::gfp; phb-1(RNAi)"
 +
** Referring to apparent null or loss of function alleles, like "Phsp-4::GFP(zcIs4); daf-2(-)" or "ced-10(lf)"
  
=== New name server ===
+
=== Gene comparison SObA ===
* When will this officially go live?
+
* http://wobr2.caltech.edu/~azurebrd/cgi-bin/soba_multi.cgi?action=Gene+Pair+to+SObA+Graph
* Will we now be able to request strain IDs through the server? Yes
 
  
=== SObA Graphs ===
 
* New graphs now live on site (Expression, Gene Ontology, Human Diseases, Phenotypes)
 
* A lot of whitespace padding above and below the graph; maybe trim? Trimming vertically would ultimately limit the view pane when a user wants to zoom in, so we should leave it as is for now
 
* Diff tool: Raymond and Juancarlos created a prototype diff tool (for comparing two genes, for example)
 
** Paul: compared two genes that should be very similar, but there are a lot of differences; may reflect annotation coverage rather than biology
 
  
 +
== November 14, 2019 ==
  
== September 19, 2019 ==
+
=== TAGC meeting ===
 +
* The Allied Genetics Conference next April (2020) in/near Washington DC
 +
* Abstract deadline is Dec 5th
 +
* Alliance has a shared booth (3 adjacent booths)
 +
* Micropublications will have a booth (Karen and Daniela will attend)
 +
* Focus will be on highlighting the Alliance
 +
* Workshop at NLM in days following TAGC about curation at scale (Kimberly attending and chairing session)
  
=== Strains ===
+
=== Alliance all hands meeting ===
* Need to wait for new strain IDs from Hinxton before running dumping scripts
+
* Lightning talk topics?
* Don't edit multi-ontology strain fields in OA for now!
+
** Single cell RNA Seq (Eduardo)
* Juancarlos will map free text strain entries to strain IDs once we have the complete mapping file
+
** SimpleMine? (Wen)
* "Requested strain" field in Disease OA; not dumped, so don't need to worry about right now
+
** SObA? (Raymond); still working on multi-species SObA
 +
** Phenotype community curation?
 +
** Micropublications?
 +
** AFP?
  
=== Alliance literature curation ===
+
=== Alliance general ===
* Working group will be formed soon
+
* Alliance needs a curation database
* Will work out general common pipelines for literature curation
+
** A curation working group was proposed
 +
** What needs to happen to get this going?
 +
** Would include text mining tools/resources
 +
** Would be good to have something like the curation status form
 +
** MODs likely have their own special requirements, but there should probably be at least a common minimal set of features
 +
** Variant sequence curation could be a good first start (if all MODs handle their own variant sequence curation) as a common data type
 +
* Micropubs pushing data submission forms; might as well house them within the Alliance
 +
* Would be good to have a common (or individually relevant) AFP form(s) for all Alliance members
 +
** Maybe MOD curators can manage configuration files to indicate what is relevant for their species
 +
** First priority is to focus on automatically recognizable entities/features from papers
 +
 
 +
 
 +
== November 21, 2019 ==
 +
 
 +
=== Textpresso: merging main docs and supps? ===
 +
* Currently, Textpresso searches in paper main documents and all individual supplemental documents separately
 +
* This results in possibly getting many results for the same publication, each scored and displayed separately
 +
* Do we want Textpresso to search on a single, consolidated file containing the main document of a paper AND the supplementals?
 +
* Currently, the scoring algorithm is often scoring supplemental documents higher than main papers, presumably due to a weighting of documents in which there is a higher percentage of sentences with matches to the keyword(s)
 +
* This cannot be done completely manually; agreed, this would have to be largely (completely?) automated
 +
* Would be good to check how PMC/Europe PMC handles articles in which main docs and supps are consolidated into a single PDF already (in addition to individual files)
 +
* Detecting duplicated sentences would be useful, but may be quite a thorny issue (need to research)
 +
* Chris will update GitHub ticket to ask Sibyl to NOT search on C. elegans supplementals, for now, and only search on main documents
 +
 
 +
=== Europe PMC: biocuration landscape analysis ===
 +
* Dayane Araújo has asked that a curator (Chris currently) attend a conference call (next Monday, Nov 25) hosted by Europe PMC about assessing biocuration across databases
 +
* Chris has asked for details but has so far not received anything specific
 +
* Should we attend? Yes, at least to listen. If complex questions come up, we can just tell them we'll look it up
 +
* Would be great if there were aggregated references for particular datasets so that users of data and analyses could be given all references to properly cite in their own article

Latest revision as of 17:29, 21 November 2019

Previous Years

2009 Meetings

2011 Meetings

2012 Meetings

2013 Meetings

2014 Meetings

2015 Meetings

2016 Meetings

2017 Meetings

2018 Meetings


GoToMeeting link: https://www.gotomeet.me/wormbase1


2019 Meetings

January

February

March

April

May

June

July

August

September

October


November 7, 2019

WS275 Citace upload

  • Maybe Nov 22 upload to Hinxton
  • CIT curators upload to Spica on Tues Nov 19

?Genotype class

  • Working data model document
  • Several classes have a "Genotype" tag with text entry
    • Strain
    • 2_point_data
    • Pos_neg_data
    • Multi_pt_data
    • RNAi
    • Phenotype_info
    • Mass_spec_experiment (no data as of WS273)
    • Condition
  • Collecting all genotype text entries yields ~33,000 unique entries, with many different forms:
    • Species entries, like "Acrobeloides butschlii wild isolate" or "C. briggsae"
    • Strain entries, like "BA17[fem-1(hc-17)]" or "BB21" or "BL1[pK08F4.7::K08F4.7::GFP; rol-6(+)]"
    • Anonymous transgenes, like "BEC-1::GFP" or "CAM-1-GFP" or "Ex[Pnpr-9::unc-103(gf)]"
    • Complex constructs, like "C56C10.9(gk5253[loxP + Pmyo-2::GFP::unc-54 3' UTR + Prps-27::neoR::unc-54 3' UTR + loxP]) II"
    • Text descriptions, like "Control" or "WT" or "Control worms fed on HT115 containing the L4440 vector without insert" or "N.A."
    • Bacterial genotypes, like "E. coli [argA, lysA, mcrA, mcrB, IN(rrnD-rrnE)1, lambda-, rcn14::Tn10(DE3 lysogen::lavUV5 promoter -T7 polymerase]"
    • Including balancers, like "F26H9.8(ok2510) I/hT2 [bli-4(e937) let-?(q782) qIs48] (I;III)"
    • Reference to parent strain, like "Parent strain is AG359"
    • Referring to RNAi, like "Pglr-1::wrm-1(RNAi)" or "Phsp-6::gfp; phb-1(RNAi)"
    • Referring to apparent null or loss of function alleles, like "Phsp-4::GFP(zcIs4); daf-2(-)" or "ced-10(lf)"
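
A rough sketch of how the free-text genotype entries listed above might be sorted into those forms as a first pass. The category names, the rule order, and every regular expression here are hypothetical heuristics written only against the examples quoted in these notes; they are not part of any WormBase data model or script and would need curator review before real use.

```python
import re
from collections import Counter

# Hypothetical patterns, tuned only to the example entries in the notes.
BALANCER = re.compile(r"/\s*[a-z]{1,2}(?:T|C|In|Df|Dp)\d+")
STRAIN = re.compile(r"[A-Z]{2,3}\d+(?:\[.*\])?")
NULL_OR_LF = re.compile(r"\((?:-|lf|0|null)\)")
SPECIES = re.compile(r"(?:[A-Z][a-z]+|[A-Z]\.) [a-z]+(?: .*)?")
COMPLEX = re.compile(r"\([a-z]+\d+\[")
PLAIN_TEXT = {"Control", "WT", "N.A."}


def classify_genotype(text):
    """Return a coarse, heuristic category for one genotype text entry."""
    t = text.strip()
    if t.lower().startswith("parent strain"):
        return "parent_strain_reference"
    if t in PLAIN_TEXT or t.lower().startswith("control "):
        return "text_description"
    if "(RNAi)" in t:
        return "rnai"
    if t.startswith("E. coli"):
        return "bacterial_genotype"
    if BALANCER.search(t):
        return "balancer"
    if STRAIN.fullmatch(t):
        return "strain"
    if NULL_OR_LF.search(t):
        return "apparent_null_or_lf"
    if SPECIES.fullmatch(t):
        return "species"
    if COMPLEX.search(t):
        return "complex_construct"
    if "::" in t or t.upper().endswith("-GFP") or t.startswith("Ex["):
        return "anonymous_transgene"
    return "unclassified"


examples = [
    "C. briggsae",
    "BB21",
    "BA17[fem-1(hc-17)]",
    "CAM-1-GFP",
    "WT",
    "Parent strain is AG359",
    "Pglr-1::wrm-1(RNAi)",
    "ced-10(lf)",
    "F26H9.8(ok2510) I/hT2 [bli-4(e937) let-?(q782) qIs48] (I;III)",
]
tally = Counter(classify_genotype(e) for e in examples)
print(tally)
```

Rule order matters in this sketch: an entry such as "Phsp-6::gfp; phb-1(RNAi)" contains "::" but is caught by the RNAi rule first. Anything that falls through to "unclassified" would be left for manual review.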

Gene comparison SObA


November 14, 2019

TAGC meeting

  • The Allied Genetics Conference next April (2020) in/near Washington DC
  • Abstract deadline is Dec 5th
  • Alliance has a shared booth (3 adjacent booths)
  • Micropublications will have a booth (Karen and Daniela will attend)
  • Focus will be on highlighting the Alliance
  • Workshop at NLM in days following TAGC about curation at scale (Kimberly attending and chairing session)

Alliance all hands meeting

  • Lightning talk topics?
    • Single cell RNA Seq (Eduardo)
    • SimpleMine? (Wen)
    • SObA? (Raymond); still working on multi-species SObA
    • Phenotype community curation?
    • Micropublications?
    • AFP?

Alliance general

  • Alliance needs a curation database
    • A curation working group was proposed
    • What needs to happen to get this going?
    • Would include text mining tools/resources
    • Would be good to have something like the curation status form
    • MODs likely have their own special requirements, but there should probably be at least a common minimal set of features
    • Variant sequence curation could be a good first start (if all MODs handle their own variant sequence curation) as a common data type
  • Micropubs pushing data submission forms; might as well house them within the Alliance
  • Would be good to have a common (or individually relevant) AFP form(s) for all Alliance members
    • Maybe MOD curators can manage configuration files to indicate what is relevant for their species
    • First priority is to focus on automatically recognizable entities/features from papers


November 21, 2019

Textpresso: merging main docs and supps?

  • Currently, Textpresso searches in paper main documents and all individual supplemental documents separately
  • This results in possibly getting many results for the same publication, each scored and displayed separately
  • Do we want Textpresso to search on a single, consolidated file containing the main document of a paper AND the supplementals?
  • Currently, the scoring algorithm is often scoring supplemental documents higher than main papers, presumably due to a weighting of documents in which there is a higher percentage of sentences with matches to the keyword(s)
  • This cannot be done completely manually; agreed, this would have to be largely (completely?) automated
  • Would be good to check how PMC/Europe PMC handles articles in which main docs and supps are consolidated into a single PDF already (in addition to individual files)
  • Detecting duplicated sentences would be useful, but may be quite a thorny issue (need to research)
  • Chris will update GitHub ticket to ask Sibyl to NOT search on C. elegans supplementals, for now, and only search on main documents
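
A toy illustration of the scoring concern above, assuming (as the notes only presume) that a document's score grows with the fraction of its sentences that match the keyword. The `DocHit` type, the `fraction_score` formula, and the example counts are all hypothetical stand-ins, not Textpresso's actual algorithm or data. The sketch also ignores sentences duplicated between main documents and supplements.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DocHit:
    paper_id: str
    doc_kind: str  # "main" or "supplement"
    matching_sentences: int
    total_sentences: int


def fraction_score(matching, total):
    """Stand-in score: the fraction of sentences that match the keyword."""
    return matching / total if total else 0.0


def consolidate(hits):
    """Pool all documents of a paper and score the paper as one unit."""
    totals = defaultdict(lambda: [0, 0])
    for h in hits:
        totals[h.paper_id][0] += h.matching_sentences
        totals[h.paper_id][1] += h.total_sentences
    return {pid: fraction_score(m, t) for pid, (m, t) in totals.items()}


hits = [
    DocHit("paperA", "main", 10, 400),
    DocHit("paperA", "supplement", 5, 20),
    DocHit("paperB", "main", 30, 300),
]

# Scored separately, the short supplement of paperA outranks both main documents.
per_doc = sorted(
    hits,
    key=lambda h: fraction_score(h.matching_sentences, h.total_sentences),
    reverse=True,
)
print([(h.paper_id, h.doc_kind) for h in per_doc])

# Scored per paper, each publication appears once.
per_paper = consolidate(hits)
print(sorted(per_paper, key=per_paper.get, reverse=True))
```

Under these made-up numbers, per-document scoring lists paperA twice with its supplement on top, while per-paper scoring returns one result per publication.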

Europe PMC: biocuration landscape analysis

  • Dayane Araújo has asked that a curator (Chris currently) attend a conference call (next Monday, Nov 25) hosted by Europe PMC about assessing biocuration across databases
  • Chris has asked for details but has so far not received anything specific
  • Should we attend? Yes, at least to listen. If complex questions come up, we can just tell them we'll look it up
  • Would be great if there were aggregated references for particular datasets so that users of data and analyses could be given all references to properly cite in their own article