WormBase-Caltech Weekly Calls February 2018

February 1, 2018

Automated gene descriptions - orthology

Some genes have human orthology mentioned in automated descriptions, even though the orthology has not been called in DIOPT
WormBase uses EnsemblCompara and other methods (not an aggregate method like DIOPT)
Orthology synchrony is a challenge; WormBase and FlyBase may need to pay special attention to orthology calls and discrepancies
DIOPT is purely automated; it does not consider other orthology evidence
We should be clear about how the orthology calls are made

Next upload

Exact date unclear
Probably end of March

SimpleMine issue

Redundant genes in input list are merged
Should SimpleMine provide an option to keep redundancies?
- Give option up front? Provide submission step to point out redundancies? Ask for choice?
We can default to show row-by-row correspondence, and display the number of redundant entries
Conclusion: Make an option for users to indicate if they want row-by-row correspondence or a merged list
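
A minimal sketch of the two output modes discussed above, not the SimpleMine implementation. The function name, the `keep_duplicates` flag, and the `annotations` lookup are all hypothetical, chosen only to illustrate row-by-row correspondence versus a merged list, with the redundant-entry count reported either way.

```python
from collections import Counter

def simplemine_rows(gene_names, annotations, keep_duplicates=True):
    """Return (rows, n_redundant) for a user-submitted gene list.

    gene_names      -- names in the order the user submitted them
    annotations     -- hypothetical lookup: gene name -> annotation string
    keep_duplicates -- True: one output row per input row (row-by-row);
                       False: repeated names collapsed to first occurrence
    """
    counts = Counter(gene_names)
    # Number of input rows that repeat an earlier name.
    n_redundant = sum(c - 1 for c in counts.values())

    if keep_duplicates:
        names = list(gene_names)
    else:
        # dict.fromkeys keeps first-occurrence order while dropping repeats.
        names = list(dict.fromkeys(gene_names))

    rows = [(name, annotations.get(name, "not found")) for name in names]
    return rows, n_redundant
```

Reporting `n_redundant` in both modes lets the interface show the number of redundant entries regardless of which option the user picks.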

Cell type expression

Waterston paper
40,000 random cells, clusters sequenced individually to a depth of 20,000 reads; ~1000 genes per cell; cluster data; make judgement call as to what cell types they likely are
For now, we can do a simple annotation: significantly expressed genes for each cell type
Supplemental table S5 for neurons
Maybe just ignore the hybrid calls like AQM/PVM, etc.
It may be good to isolate the single cell data from other expression data
We should annotate/capture the expression clusters
Would be good to be able to do enrichment analysis on the clusters; compare data sets
Data have not been placed in SPELL yet; Gary considered the data a work in progress
We can communicate with Waterston group; are they collecting more data?
Wen will take another look at the data
Gary W. concerned about the reported/assumed/inferred identity of the cells in the paper
Probably cannot curate to individual cells, but we can annotate to a higher level term
We want to annotate and display expression enrichment as well as presence/absence calls
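
One common way to do the cluster enrichment analysis mentioned above is a one-sided hypergeometric (over-representation) test. The sketch below assumes that choice; no test was specified on the call, and the function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(overlap, cluster_size, annotated_total, population):
    """P(X >= overlap) under the hypergeometric distribution.

    overlap         -- genes in the cluster that carry the annotation
    cluster_size    -- genes in the expression cluster
    annotated_total -- genes in the whole population with the annotation
    population      -- total genes considered (the background set)
    """
    upper = min(cluster_size, annotated_total)
    total = comb(population, cluster_size)
    # Sum the probability of seeing `overlap` or more annotated genes by chance.
    tail = sum(
        comb(annotated_total, i) * comb(population - annotated_total, cluster_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total
```

The same function could be used to compare two data sets by treating one cluster's gene list as the "annotation" for the other. Multiple-testing correction would still be needed when many clusters or terms are tested.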

February 8, 2018

Release schedule

Wen will ask Hinxton to update the published release schedule (for next data upload)

New York Worm Meeting

Wen and Kimberly will present a WormBase tutorial on March 24
Wen communicated to Oliver Hobert; suggested topics:
- Multi-gene (batch) search tools
- How does literature info get into WormBase? Curation process?
- Should we discuss completeness?

GO curation

New simple input form for Noctua, being developed at USC
Not very much GO curation happening at WB right now
Protein-2-GO pipeline
Do we have a good Phenotype-2-GO(Process) mapping pipeline? We have our old mappings; not very reliable; would need to spend more time expanding the worm phenotype ontology and GO to improve
Cellular component curation will come in from WB expression curation
Don't have pipeline for Interactions-2-GO
Textpresso Molecular Functions pipeline?
geneprod and catalyticact data types for molecular function pipeline
Textpresso can send molecular function annotations to Noctua
For high-level pathway curation; we should probably read WormBook chapters (or other reviews) and develop pathways (using non-experimental evidence codes)
We could potentially seed Noctua models from Reactome
We would like to have complete curation for major pathways for gene enrichment analysis
Roles of small molecules in Noctua models still being worked out

Phenotype curation

Chris has had community curation pipeline on back burner while updating Wiki and dealing with AGR, WormMine, etc.
Will get back to it soon; will resend the email requests (for newer papers) that went out over a year ago

Expression curation

Daniela getting back to expression curation after Micropublication stuff has quieted down

Gene regulation curation

April came across dataset involving regulation of siRNAs that don't seem to have gene objects in WB
May need to instantiate genes for these?

Physical interaction curation

SVM classification; do we flag as negative a paper that has protein interactions but none for C. elegans?
Can we generate a good SVM that only identifies WB-curatable papers?

Disease curation

Now curating the specific genetic entities involved in a disease model
Will also capture environmental conditions, treatments (e.g. ameliorates, exacerbates)
Curation in line with AGR standards
Evidence code needed for assertions that an animal is a model of disease, when the assertion is based on background knowledge and experimental evidence together
Evidence Code Ontology (ECO) is developing a new term to accommodate this
Disease curators can use new evidence code as well as any existing codes
Is there a definition of a "disease model"?
What are the minimal criteria for considering something a disease model?
WB and FB curators focus on cellular phenotype and relation to the disease

Expression cluster curation

27 papers in pipeline
Will then work on "single-cell" RNAseq
- Wen, Raymond, and David should discuss

April and May Worm Meetings

Midwest and Colorado meetings
Wen submitting abstracts
Wen and Kimberly can write up abstract template for New York meeting and send around to be modified for future meetings

WormBook

Published last version for legacy site

Papers

Daniel requested 13 (older) papers from Caltech library through inter-library process
Received more than half as images; would need optical character recognition (OCR) for Textpresso purposes
What is the state of the art of OCR now? How good is it? Can we ask Caltech library for the service?
Are these high priority papers? Need to check to see if worth processing

AGR

Disease working group setting up a face-to-face meeting
Variant working group may need a face-to-face meeting as well
Expression working group working out initial AGR site data display mockups
Interaction working group; we will want to incorporate miRNA/target interactions (RNA-RNA interactions); will look at miRBase

February 15, 2018

Model changes

Models freeze March 2nd
Will need to get model changes proposed and tested by then

Sys admin of Tazendra/Mangolassi

Raymond will discuss centralizing with Juancarlos
Need good documentation for forms, tools, etc.
Will be a push to put all code for tools and forms on GitHub

Tazendra forms, tools bug this week

There is a dependency on Mangolassi for some tools
Mangolassi went down and caused problems
Would be good to decouple the two machines

AGR

May not get an AGR all-hands face-to-face meeting before the summer
Working groups can decide to have face-to-face meeting
People should speak up if they have interest in visiting other MODs/sites; can be arranged
Consider what grant proposals could come out of such meetings/visits
Currently no ontology working group, no anatomy working group
Could establish a preliminary working group; reach out to relevant people
Anatomy working group issues may come up in expression working group
- Daniela will keep Raymond updated on relevant issues that come up with the expression group

Ontology Browser gene lists

Chris requesting change to gene list display from WOBr
https://github.com/WormBase/website/issues/6190
Should provide WBGene IDs, not just gene public names
- That was the original intent, but using WBGene IDs was, for some reason, causing issues when developing the tool; will need to revisit that issue to get WBGene IDs displayed

February 22, 2018

Making MOD data publicly available in a central location

Meets the NIH mandate for WB as a publicly-funded project and helps researchers get their data highlighted faster than waiting for the db build
- would put in filters to avoid releasing sensitive data or incomplete curation annotations
- would be good for journal hyperlinking project since it needs access to up-to-date data (see more about the project below)
- Central data repository for all data (MOD) files would be helpful to developers (and users)
- Does Caltech have an FTP site that could be used?

Journal Hyperlinking project goals

The hyperlinking project, in production since 2009, links bioentities in worm, yeast, and fly research articles to relevant databases. It requires the latest data from WormBase, SGD, and FlyBase, and could pull entities (names, IDs, synonyms) from a central repository

- MOD curators check link accuracy and check for missed links (needs ongoing FTE support)
- supporting this project is not in the remit of any of the MODs, and the project is not sustainable without outside funding, hence the search for funding outside of WB
- Since inception, project links bioentities in GSA papers (Genetics and G3)
- C. elegans genes, alleles, rearrangements, strains, clones, short phenotype names, and transgenes are linked in these papers
- Karen's (InSilico) grant goals are to
  - expand the pipeline to other journals, specifically eLife (then to PLoS)
  - expand to all AGR member MODs in addition to SGD and FlyBase. Also bring in PomBase
  - not planning on expanding linking beyond simple text recognition of known lexica and entities that follow a regex
  - SBIR commercialization plan is to extend entity identification to commercial reagents and collect data for subscription-based access from biotech suppliers.
  - need data from Postgres, which is not available in the geneace dump. Could possibly just dump all Postgres data into one place; Karen's developers could write scripts to process that data
    - Juancarlos will set up a URL that can be used to access the data; will set up a cronjob to run every day at 8pm
  - InSilico hyperlinks go through InSilico page to embedded i-frames of MOD entity page
    - allows trackability of link access, stats that will be given to the MODs
    - allows monitoring and resolution of links that go dead
    - allows splash pages for silent links
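
A minimal sketch of the "known lexica plus regex" style of entity recognition described above, not InSilico's actual pipeline. The function name and the shape of the `lexicon` argument are hypothetical; the only format taken as given is that WBGene IDs are "WBGene" followed by eight digits.

```python
import re

# WBGene IDs follow a fixed pattern: "WBGene" plus eight digits.
WBGENE_ID = re.compile(r"\bWBGene\d{8}\b")

def find_entities(text, lexicon):
    """Return sorted (start, end, matched_text, kind) tuples.

    lexicon -- hypothetical mapping of known entity name -> entity kind,
               e.g. built from the names and synonyms in a MOD data dump
    """
    hits = [(m.start(), m.end(), m.group(), "WBGene ID")
            for m in WBGENE_ID.finditer(text)]
    if lexicon:
        # Longest names first so a longer name wins over a name it contains.
        terms = sorted(lexicon, key=len, reverse=True)
        # Lookarounds stop matches inside longer hyphenated or alphanumeric tokens.
        pattern = re.compile(
            r"(?<![\w-])(?:" + "|".join(map(re.escape, terms)) + r")(?![\w-])"
        )
        hits.extend((m.start(), m.end(), m.group(), lexicon[m.group()])
                    for m in pattern.finditer(text))
    return sorted(hits)
```

Because the lexicon is passed in rather than hard-coded, the same function could be re-run against a freshly pulled entity list whenever the source data is updated.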

Alliance SAB meeting

SAB critical of:
- not being unified
- not being organized
Now everyone has committed
Concern still exists about autonomy of MODs
- Will each user community still be served effectively by the Alliance?
Organization is easier when all are committed
Maybe bring in a professional organizer/project manager (long term)
New aggressive timeline for progress
- April 23rd meeting; need to give material a week earlier
- Need year and a half plan; each working group will provide details
Only 2 full time Alliance staff; may need more on project; difficult for individuals to split time/effort
SAB member: curation involves expert decision making/analysis on issues, not just straightforward data acquisition
Maybe we would have better curation consistency if individual curators focused on particular topics; became experts for certain subject matter
Possibility to have Alliance all-hands call in Fall
Working groups can have face-to-face meetings; have travel/meeting budget until July 31st, when it resets

Automated gene descriptions

Difficult to handle genes with high information content; many ontology term annotations
How do we simplify descriptions? Using higher-level terms, slim terms? Gets tricky

Micropublications

Michael Elowitz tweeted out micropublication stuff
Received feedback on Twitter; worth looking at threads, comments

Community curation plan

Alliance need for community curation
Micropublications are a bit of a pilot for community curation
Need shared curation/submission forms; fit shared data models?

RNAi secondary targets

WormMine and WormBase gene/RNAi pages include secondary RNAi targets but WOBr and SObA do not
Should we include or exclude secondary targets? We want consistency across data sets
No gold standard RNAi target prediction algorithm
We should be transparent about primary/secondary status wherever we include secondary targets in display
Could be addressed with phenotype display proposal
Should probably remove secondary targets from bulk data sets
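
A minimal sketch of the last two points, assuming each RNAi-to-gene record already carries a primary/secondary label. The record layout and the `target_type` field name are hypothetical, not the WormBase data model.

```python
def rnai_targets_for_export(records, include_secondary=False):
    """Filter RNAi target records for a display or bulk export.

    records -- iterable of dicts with a hypothetical "target_type" key
               whose value is "primary" or "secondary"
    include_secondary -- False drops secondary targets (bulk data sets);
                         True keeps them, still labelled, for displays
    """
    exported = []
    for rec in records:
        if rec["target_type"] == "secondary" and not include_secondary:
            continue
        # Copy so callers cannot mutate the source records; the label is
        # kept on every row so primary/secondary status stays visible.
        exported.append(dict(rec))
    return exported
```

Using one shared filter for every output would also help with the consistency concern, since each tool would differ only in the flag it passes.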