Difference between revisions of "WormBase-Caltech Weekly Calls"

From WormBaseWiki
Line 1: Line 1:
 +
= Previous Years =
 +
 
[[WormBase-Caltech_Weekly_Calls_2009|2009 Meetings]]
 
  
 +
[[WormBase-Caltech_Weekly_Calls_2011|2011 Meetings]]
 +
 +
[[WormBase-Caltech_Weekly_Calls_2012|2012 Meetings]]
 +
 +
[[WormBase-Caltech_Weekly_Calls_2013|2013 Meetings]]
 +
 +
[[WormBase-Caltech_Weekly_Calls_2014|2014 Meetings]]
 +
 +
[[WormBase-Caltech_Weekly_Calls_2015|2015 Meetings]]
 +
 +
[[WormBase-Caltech_Weekly_Calls_2016|2016 Meetings]]
 +
 +
[[WormBase-Caltech_Weekly_Calls_2017|2017 Meetings]]
 +
 +
[[WormBase-Caltech_Weekly_Calls_2018|2018 Meetings]]
 +
 +
 +
GoToMeeting link: https://www.gotomeet.me/wormbase1
 +
 +
 +
= 2019 Meetings =
 +
 +
[[WormBase-Caltech_Weekly_Calls_January_2019|January]]
 +
 +
[[WormBase-Caltech_Weekly_Calls_February_2019|February]]
  
==2011 Meetings==
+
[[WormBase-Caltech_Weekly_Calls_March_2019|March]]
  
[[WormBase-Caltech_Weekly_Calls_February_2011|February]]
+
[[WormBase-Caltech_Weekly_Calls_April_2019|April]]
  
 +
[[WormBase-Caltech_Weekly_Calls_May_2019|May]]
  
== March 3, 2011 ==
+
[[WormBase-Caltech_Weekly_Calls_June_2019|June]]
  
SUMMARY:
+
[[WormBase-Caltech_Weekly_Calls_July_2019|July]]
  
Every other month patch:
+
[[WormBase-Caltech_Weekly_Calls_August_2019|August]]
*We will try to generate an ACEDB patch every other month, starting with the Paper class data
 
*Need to coordinate with Juancarlos and Todd
 
  
Need to check .ACE patch files before upload:
 
*Will need to be vigilant about checking for errors and inconsistencies before sending to Todd to put up on website
 
*Curators need to check their own data for problems
 
*Wen will check for consistency between the different data types
 
  
Think about connecting website to Postgres:
+
== September 12, 2019 ==
*Think about showing Postgres data "immediately" on site
 
  
 +
=== Update on SVM pipeline ===
 +
* New SVM pipeline: more analysis and more parameter tuning
 +
* Avoiding precision (and F-value) as a measure, since both depend on the ratio of positives to negatives in the test set
 +
* In the example shown, a "dumb" machine starts out with precision above 0.6
 +
* G-value (Michael's invention); does not depend on distribution of sets
 +
* Applied to various data types
 +
* Analysis: 10-fold cross validation
 +
** Randomly select 10% pos and neg (without replacement) and repeat until all papers sampled
 +
* F-value changes over different p/n values; G-value does not (essentially flat)
 +
* Area Under the Curve (AUC): probability that a random positive scores higher than random negative
 +
* AUC values for many WB data types range from the upper 80s into the 90s (percent)
 +
* Ranjana: How many papers for a good training set? Michael: we don't know yet
 +
* Can't reproduce old training sets (for old SVM); provide Michael better training sets if you want improved SVM
 +
* If SVM still not good enough, Michael will work on deep neural networks (TensorFlow)
 +
* Michael can provide training sets he has used recently
  
DETAILS:
+
=== Clarifying definitions of "defective" and "deficient" for phenotypes ===
 +
* WB phenotype ontology has many "variant/abnormal" terms and distinct subclass terms for "defective/deficient"
 +
* Have tried to create a logical definition pattern for these terms, but the vagueness of the meaning of "defective" and how it is distinct from "abnormal" has stalled the process
 +
* What do we mean exactly by "defective" and how, specifically, is this distinct from "abnormal"?
 +
* Definitions include meanings or words:
 +
** "Variations in the ability"
 +
** "aberrant"
 +
** "defect"
 +
** "defective"
 +
** "defects"
 +
** "deficiency"
 +
** "deficient"
 +
** "disrupted"
 +
** "impaired"
 +
** "incompetent"
 +
** "ineffective"
 +
** "perturbation that disrupts"
 +
** Failure to execute the characteristic response = abnormal?
 +
** abnormal
 +
** abnormality leading to specific outcomes
 +
** fail to exhibit the same taxis behavior = abnormal?
 +
** failure
 +
** failure OR delayed
 +
** failure, slower OR late
 +
** failure/abnormal
 +
** reduced
 +
** slower
  
Delayed release cycle:
+
=== Citace upload ===
*Will require more work to prepare for more frequent release of certain data types
+
** Tuesday, Sep 24th
*Aside from Kimberly's data, most data types are not urgent (e.g. Expression pattern)
 
*What are the users feeling?
 
**Having data faster will help users; they don't ask, because they don't see it
 
*On-the-fly updating of website? Like Postgres?
 
*Since we use ACEDB, we have to patch WS with .ACE file, or rebuild whole thing
 
*Flat file Postgres database, replaced every night?
 
*Website calls Postgres directly for certain data types?
 
*Performing build without sequence is easy? Do everything without sequence?
 
*How to integrate sequence data with other data once they're decoupled through the patching process?
 
*We need .ACE patch files
 
*Concise description separate from most else (but connected to papers)
 
*Do papers first?
 
*Website can show anything
 
*If we have a lot of patches, will not have a check for data inconsistencies/conflicts
 
*Trial patch .ace files for papers first
 
*Juancarlos: Scripts that check differences between data dumps; scripts are data type specific
 
**Curators need to talk to Juancarlos about the importance of different data tags
 
*Paper .ACE file: Would include bibliographic info, journals, authors, genes associated from abstract or added manually
 
*One reason for more frequent releases: because we have first pass author forms; show them we add it quickly
 
**what will be added through the forms: expression patterns? RNAi (difficult?)?
 
*We should check patch before we send to Todd!!! Don't want to crash database
 
*How frequently to patch? Weekly? Daily? Check with Todd, how often he can load them?
 
*Cron job to create patch ACE files, send to curators to check for problems, then send to Todd
 
*Interdependency of data types; curators rely on other curators?
 
*Postgres directly to website? Todd would have to work it out
 
*New information flag on website? Toggle visibility?
 
*How do we know that the data do not conflict with each other?
 
*What are common problems? Dumper script goes bad, makes broken lines, empty fields
 
*Error catching mechanisms? More checks on postgres? Dump files?
 
*Data merging problems? What are the cases that are conflicts? Prevent them? Know beforehand?
 
*If we don't know, as long as it doesn't crash the database or fail to load, then OK
 
*Don't do -D stuff, maybe? No deletions? Skip typos?
 
*Always have to check ACE files anyway, but have to do every week (2 weeks?)
 
*We can try a patch every other month
 
*What can we do without the patch?
 
*Did SAB talk about changing to relational databases?
 
**Get website going as is first, and see if it matters?
 
**If people don't want to change data models, we can switch over to relational
 
**Separate panel on website directly from Postgres?
 
*Wen can check the data integration every other month for patch
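Some of the problems listed above (broken lines, empty fields, deletions) could be flagged automatically before a patch goes to Todd. The following is only a minimal sketch, assuming patch files are plain .ace text with double-quoted values and `//` comment lines; `check_ace` and the sample paper ID are hypothetical and not an existing Caltech script.

```python
def check_ace(text):
    """Return (line_number, message) pairs for suspicious lines in .ace text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines separate objects; '//' lines are treated as comments.
        if not stripped or stripped.startswith("//"):
            continue
        # Ignore escaped quotes, then require the remaining quotes to pair up.
        unescaped = stripped.replace('\\"', "")
        if unescaped.count('"') % 2:
            problems.append((number, "unbalanced quotes"))
        if '""' in unescaped:
            problems.append((number, "empty field"))
        # Deletions are reported so a curator can confirm each one by hand.
        if stripped.startswith("-D"):
            problems.append((number, "deletion"))
    return problems


if __name__ == "__main__":
    sample = (
        'Paper : "WBPaper00000001"\n'   # placeholder ID for illustration
        'Title "Broken title\n'
        'Author ""\n'
        '-D Remark "old"\n'
    )
    for number, message in check_ace(sample):
        print(number, message)
```

A check like this only catches syntax-level damage from a dumper script; consistency between data types would still need the curator and integration checks described above.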
 
  
 +
=== Strain to ID mapping ===
 +
* Waiting on Hinxton to send strain ID mapping file?
 +
* Hopefully we can all get that well before the upload deadline
 +
* Will do global replacement at time of citace upload (at least for now)
  
 +
=== New name server ===
 +
* When will this officially go live?
 +
* Will we now be able to request strain IDs through the server? Yes
  
== March 10th, 2011 ==
+
=== SObA Graphs ===
 +
* New graphs now live on site (Expression, Gene Ontology, Human Diseases, Phenotypes)
 +
* A lot of whitespace padding above and below graph; maybe trim? Trimming vertically would ultimately limit the view pane when a user wants to zoom in, so we should leave as is for now
 +
* Diff tool: Raymond and Juancarlos created a prototype diff tool (for comparing two genes, for example)
 +
** Paul: compared two genes that should be very similar, but there are a lot of differences; may reflect annotation coverage rather than biology
  
Release schedule and patches
 
*What is the appropriate frequency?
 
*Scheme: do what we're already doing, Wen merges into citace
 
*Excluding sequence related data
 
*Need to include Mary Ann's data (strains etc.)
 
*Daily update too frequent; maybe once per month/week
 
*Submit .ACE file to Todd with simple syntax; easily parsable; old description removed and new information added
 
*Make updates only in contrast to last WS, not previous patch/temporary upload
 
*ACEDB diff step only relative to WS
 
*Wen: Postgres can dump diff ace file; already have diff ace files for every data type at Caltech; integrate into citace
 
*Raymond: integration is important; we need to talk about how much work needs to be done by each approach
 
*Wen: consistency checks, backups, store each version?
 
*Raymond: citace 224 to 225 (for example), display done on class level
 
**Example: Gene page; only update information relevant to gene class to be displayed
 
*Do once per month: faster than currently because it doesn't have to go through the dev site
 
*If we update Citace to Citace, missing a lot of cross-references?
 
*Mock citace with Mary Ann's data? Becomes diff base; Mary Ann submits (non-sequence related) data directly to WBCIT
 
*Build low-connectivity ace at WBCIT? Add Mary Ann's data, remove RNAi
 
*Todd: important consideration: things added won't be available for search until formal release; weird things about diff; new reference associations with genes; a lot of duplications?
 
**Raymond: will look at it
 
*Todd: WBGene00000846, example, see how fast it loads, go from there; would like a single ACE file (concatenation of all individual ace files); would not happen on development; would have to happen on production releases; would take production database off line, clone it, and upload it to all production nodes
 
*Individual curators need to check their individual ace files for errors
 
*Frequency: monthly
 
*Todd: we should just run some tests first, to check feasibility
 
*Raymond: WBGene1, example, concise description has typo, WS225 has typo fixed from WS224, diff file shows:
 
**-D old_description
 
**new_description
 
*Load diff ace into original database?
 
*parallel display; unrelated to resident WS?
 
*Todd: producing two web pages for each object?
 
*Raymond: No, only changing relevant tags, etc.
 
*Todd: Two databases running at same time inefficient; include timestamps?
 
*Raymond: No, cannot include timestamps
 
*Wen: Send patch ace files to Todd
 
*Raymond: In conflict with versioning; how to show new data
 
*Wen: Call it "WS225.1"
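Raymond's example of a diff file (a `-D` line removing the old description, followed by the new description) can be sketched as a small generator that compares two snapshots of one class. This is an illustration under stated assumptions, not the Postgres dumper: `ace_patch` and `quote` are hypothetical helpers, the tag name `Concise_description` is used only as an example, and the exact paragraph layout is modeled on the `-D` example in these notes.

```python
def quote(value):
    """Quote a value for an .ace line, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def ace_patch(cls, old, new):
    """Build patch text from two {object_name: {tag: value}} snapshots.

    Only tags whose values differ from the old snapshot are written, so the
    patch is always relative to the last release, never to an earlier patch.
    """
    blocks = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name, {}), new.get(name, {})
        lines = []
        for tag in sorted(set(before) | set(after)):
            if before.get(tag) == after.get(tag):
                continue
            if tag in before:
                lines.append(f"-D {tag} {quote(before[tag])}")
            if tag in after:
                lines.append(f"{tag} {quote(after[tag])}")
        if lines:
            blocks.append("\n".join([f"{cls} : {quote(name)}"] + lines))
    # Objects are separated by a blank line.
    return "\n\n".join(blocks) + ("\n" if blocks else "")


if __name__ == "__main__":
    ws224 = {"WBGene1": {"Concise_description": "teh gene"}}
    ws225 = {"WBGene1": {"Concise_description": "the gene"}}
    print(ace_patch("Gene", ws224, ws225), end="")
```

Concatenating the output for every class would give the single ACE file Todd asked for; whether deletions should be emitted at all is the open question raised in the earlier discussion.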
 
  
Human Disease Relevance tag in Concise Description
+
== September 19, 2019 ==
*Ranjana: Sent out e-mail; human disease tag "Human Disease Relevance"; to clean up concise description form (old tags in form outdated); could be putting more information into concise description; OMIM human disease
 
*Raymond: make not just text field, but make entity field pointing to object; meant to be human readable, this may break up the concise description into OMIM-related and OMIM-non-related info; why parse the data into a tag?
 
*Paul S: OMIM descriptions as a separate tag
 
*Raymond: Rewrite concise description?
 
*Ranjana: No
 
*Paul S: Would you mention human disease relevance in concise description? yes, but if it's just a link out to OMIM, then separate out; OMIM may have changed since Erich wrote original script; check OMIM for new information and tags that may be able to get pulled out
 
*Ranjana: Michael Paulini can consolidate orthology information?
 
  
Karen: Transgene model
+
=== Strains ===
*A lot of changes to propose
+
* Need to wait for new strain IDs from Hinxton before running dumping scripts
*Deletions of tags; more coming
+
* Don't edit multi-ontology strain fields in OA for now!
*Other things in database connected to transgenes
+
* Juancarlos will map free text and ontology-name strain entries to strain IDs once we have the complete mapping file
*Many things in transgene objects that may be able to be parsed into different tags (new job for someone?)
+
* "Requested strain" field in Disease OA; not dumped, so don't need to worry about right now
*Strict nomenclature for transgene descriptor
 
*Clones present need to be parsed into clone class?
 
*Todd made Clone page
 
*Start with vectors/backbones and then work on specific plasmids
 
  
Gene class-phenotype connections and descriptions?
+
=== Alliance literature curation ===
 +
* Working group will be formed soon
 +
* Will work out general common pipelines for literature curation
  
GSA markup at Flybase
+
=== SObA Graph relations ===
*Flybase is not willing to fully QC all papers
+
* Currently only integrating over "is a", "part of" and "regulates"
*Do we push Flybase and/or provide a better tool to QC?
+
* Maybe we could provide users an option to specify which relations to include, or maybe just exclude "regulates"
*Are we worried about the GSA markup for flies not looking professional?
 
*People need to be willing to pay for the QC/curation; depends on database priorities
 
*CIT will spend more time on in-house development to make Fly GSA markup easier/more efficient
 
  
Putting SPELL on Amazon cloud?
+
=== Author First Pass ===
 +
* Putting together paper for AFP
 +
* Reviewing all user input for paper
 +
* Asking individual curators to check input

Revision as of 16:39, 19 September 2019

Previous Years

2009 Meetings

2011 Meetings

2012 Meetings

2013 Meetings

2014 Meetings

2015 Meetings

2016 Meetings

2017 Meetings

2018 Meetings


GoToMeeting link: https://www.gotomeet.me/wormbase1


2019 Meetings

January

February

March

April

May

June

July

August


September 12, 2019

Update on SVM pipeline

  • New SVM pipeline: more analysis and more parameter tuning
  • Avoiding precision (and F-value) as a measure, since both depend on the ratio of positives to negatives in the test set
  • In the example shown, a "dumb" machine starts out with precision above 0.6
  • G-value (Michael's invention); does not depend on distribution of sets
  • Applied to various data types
  • Analysis: 10-fold cross validation
    • Randomly select 10% pos and neg (without replacement) and repeat until all papers sampled
  • F-value changes over different p/n values; G-value does not (essentially flat)
  • Area Under the Curve (AUC): probability that a random positive scores higher than random negative
  • AUC values for many WB data types range from the upper 80s into the 90s (percent)
  • Ranjana: How many papers for a good training set? Michael: we don't know yet
  • Can't reproduce old training sets (for old SVM); provide Michael better training sets if you want improved SVM
  • If SVM still not good enough, Michael will work on deep neural networks (TensorFlow)
  • Michael can provide training sets he has used recently
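The points above about precision, AUC and 10-fold sampling can be illustrated with a short sketch. This is not Michael's pipeline, and it does not attempt the G-value, whose definition is not given in these notes; the function names and the toy scores are assumptions made for illustration.

```python
import random
from itertools import product


def dumb_precision(n_pos, n_neg):
    """Precision of a 'dumb' classifier that labels every paper positive."""
    # Every positive is a true positive and every negative a false positive,
    # so precision is just the fraction of positives in the test set.
    return n_pos / (n_pos + n_neg)


def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win, the usual convention for this pairwise estimate.
    """
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def ten_folds(pos_ids, neg_ids, k=10, seed=0):
    """Split positives and negatives into k folds, sampling without replacement."""
    rng = random.Random(seed)
    pos, neg = list(pos_ids), list(neg_ids)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Fold i takes every k-th item, so each paper lands in exactly one fold.
    return [(pos[i::k], neg[i::k]) for i in range(k)]


if __name__ == "__main__":
    # Precision of the dumb classifier tracks the positive/negative ratio.
    print(dumb_precision(60, 40))
    print(dumb_precision(10, 90))
    # Tripling the negatives leaves the pairwise AUC unchanged.
    pos, neg = [0.9, 0.8, 0.4], [0.7, 0.3]
    print(auc(pos, neg), auc(pos, neg * 3))
```

With 60 positives and 40 negatives the dumb classifier already has precision 0.6, which is why a measure that ignores the class ratio is preferred for comparing data types.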

Clarifying definitions of "defective" and "deficient" for phenotypes

  • WB phenotype ontology has many "variant/abnormal" terms and distinct subclass terms for "defective/deficient"
  • Have tried to create a logical definition pattern for these terms, but the vagueness of the meaning of "defective" and how it is distinct from "abnormal" has stalled the process
  • What do we mean exactly by "defective" and how, specifically, is this distinct from "abnormal"?
  • Definitions include meanings or words:
    • "Variations in the ability"
    • "aberrant"
    • "defect"
    • "defective"
    • "defects"
    • "deficiency"
    • "deficient"
    • "disrupted"
    • "impaired"
    • "incompetent"
    • "ineffective"
    • "perturbation that disrupts"
    • Failure to execute the characteristic response = abnormal?
    • abnormal
    • abnormality leading to specific outcomes
    • fail to exhibit the same taxis behavior = abnormal?
    • failure
    • failure OR delayed
    • failure, slower OR late
    • failure/abnormal
    • reduced
    • slower

Citace upload

    • Tuesday, Sep 24th

Strain to ID mapping

  • Waiting on Hinxton to send strain ID mapping file?
  • Hopefully we can all get that well before the upload deadline
  • Will do global replacement at time of citace upload (at least for now)
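The global replacement could look roughly like the sketch below. It assumes the Hinxton mapping file is tab-separated with one "strain name, strain ID" pair per line, which is a guess since the file has not arrived yet; `load_strain_map`, `replace_strains` and the IDs in the example are hypothetical placeholders.

```python
import re


def load_strain_map(lines):
    """Parse 'name<TAB>id' lines into a dict (the file layout is an assumption)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, strain_id = line.split("\t")
        mapping[name] = strain_id
    return mapping


def replace_strains(text, mapping):
    """Replace whole-token strain names with IDs.

    Returns the new text and the set of names that were actually replaced,
    so unmapped or unused names can be reviewed afterwards.
    """
    if not mapping:
        return text, set()
    # Longest names first so a short name cannot win over a longer one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(map(re.escape, names)) + r")(?![\w-])"
    )
    seen = set()

    def swap(match):
        seen.add(match.group(1))
        return mapping[match.group(1)]

    return pattern.sub(swap, text), seen


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    mapping = load_strain_map(["N2\tWBStrain00000001", "CB4856\tWBStrain00000002"])
    text = 'Strain "N2"\nStrain "N2A"\n'
    print(replace_strains(text, mapping))
```

Matching on whole tokens keeps a name such as N2 from being rewritten inside a longer, unmapped name; free-text entries would still need manual review.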

New name server

  • When will this officially go live?
  • Will we now be able to request strain IDs through the server? Yes

SObA Graphs

  • New graphs now live on site (Expression, Gene Ontology, Human Diseases, Phenotypes)
  • A lot of whitespace padding above and below graph; maybe trim? Trimming vertically would ultimately limit the view pane when a user wants to zoom in, so we should leave as is for now
  • Diff tool: Raymond and Juancarlos created a prototype diff tool (for comparing two genes, for example)
    • Paul: compared two genes that should be very similar, but there are a lot of differences; may reflect annotation coverage rather than biology


September 19, 2019

Strains

  • Need to wait for new strain IDs from Hinxton before running dumping scripts
  • Don't edit multi-ontology strain fields in OA for now!
  • Juancarlos will map free text and ontology-name strain entries to strain IDs once we have the complete mapping file
  • "Requested strain" field in Disease OA; not dumped, so don't need to worry about right now

Alliance literature curation

  • Working group will be formed soon
  • Will work out general common pipelines for literature curation

SObA Graph relations

  • Currently only integrating over "is a", "part of" and "regulates"
  • Maybe we could provide users an option to specify which relations to include, or maybe just exclude "regulates"
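Letting users choose which relations to integrate over amounts to computing ancestor closures along a selectable set of edge types. The sketch below is an illustration only, not the SObA code: the relation strings, the toy term names and both function names are assumptions.

```python
from collections import defaultdict


def ancestors(term, edges, relations):
    """Return every term reachable from `term` along edges of an allowed relation.

    `edges` is an iterable of (child, relation, parent) triples.
    """
    parents = defaultdict(list)
    for child, relation, parent in edges:
        if relation in relations:
            parents[child].append(parent)
    found, stack = set(), [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def rollup_counts(annotations, edges, relations):
    """Count annotations per term, propagating each one up the allowed relations."""
    counts = defaultdict(int)
    for term in annotations:
        for t in {term} | ancestors(term, edges, relations):
            counts[t] += 1
    return dict(counts)


if __name__ == "__main__":
    # Toy ontology: A is_a B, B part_of C, A regulates D.
    edges = [("A", "is_a", "B"), ("B", "part_of", "C"), ("A", "regulates", "D")]
    everything = {"is_a", "part_of", "regulates"}
    print(sorted(ancestors("A", edges, everything)))
    print(sorted(ancestors("A", edges, everything - {"regulates"})))
```

Excluding "regulates" simply drops those edges before the traversal, so a user option would only need to pass a different relation set.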

Author First Pass

  • Putting together paper for AFP
  • Reviewing all user input for paper
  • Asking individual curators to check input