Difference between revisions of "WormBase-Caltech Weekly Calls"

From WormBaseWiki

Revision as of 16:04, 18 April 2024

Previous Years

2009 Meetings

2011 Meetings

2012 Meetings

2013 Meetings

2014 Meetings

2015 Meetings

2016 Meetings

2017 Meetings

2018 Meetings

2019 Meetings

2020 Meetings

2021 Meetings

2022 Meetings

2023 Meetings

April 18th, 2024

  • NNC pipeline being switched off locally and moving into the Alliance ABC.

April 11th, 2024

  • Caltech WS293 ace files ready for the upload

April 4th, 2024

  • Continued discussion on sustainability
  • CZI, single cell RNAseq for Alliance -> anything happening will be a few months down the road
    • Data is still going to SPELL and enrichment analysis
    • Peter Roy asked about taking the expression profile of a condition and finding similar expression profiles (SPELL-like analysis), but SPELL cannot currently deal with scRNAseq data. Wen says it is possible (regarding each cell group as an experiment). Can try loading the data into SPELL. Does it improve the function of SPELL? Only 5-10 datasets. These data are a bit different from bulk RNAseq.
  • Textpresso: good to have a presentation for other MODs to show Textpresso capabilities? Yes. Maybe during sprint review
  • Michael's presentation on LLMs - Named Entity Recognition (NER)

March 14, 2024

TAGC debrief

February 22, 2024

NER with LLMs

  • Wrote scripts and configured an LLM for Named Entity Recognition. Trained an LLM on gene names and diseases. Works well so far (F1 ~ 98%, Accuracy ~ 99.9%)
  • Textpresso server is kaput. Services need to be transferred onto Alliance servers.
  • There are features on Textpresso, such as link to PDF, that are desirable to curators but should be blocked from public access.
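
The gap between the F1 and accuracy figures reported above is expected for token-level NER, where most tokens are not entities and true negatives dominate accuracy. A minimal sketch of the arithmetic, using made-up counts chosen only to reproduce numbers of that size (they are not the actual evaluation data, and `ner_metrics` is a hypothetical helper):

```python
def ner_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute token-level precision, recall, F1 and accuracy from counts.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives. Assumes tp > 0.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Accuracy includes true negatives, which are the bulk of tokens in NER.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only: 1,000 entity tokens among 40,000 tokens.
metrics = ner_metrics(tp=980, fp=20, fn=20, tn=38980)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 ignores true negatives, it is usually the more informative of the two numbers for judging entity recognition quality.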


February 15, 2024

Literature Migration to the Alliance ABC

Use Cases for Searches and Validation in the ABC (or, what are your common actions in the curation status form)?

Find papers with a high confidence NN classification for a given topic that have also been flagged positive by an author in a community curation pipeline and that haven’t been curated yet for that topic
  • Facet for topic
  • Facet for automatic assertion
    • neural network method
  • Facet for confidence level
    • High
  • Facet for manual assertion
    • author assertion
      • ACKnowledge method
    • professional biocurator assertion
      • curation tools method - NULL
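
The facet combination above amounts to a set operation: (high-confidence neural network positives) AND (author positives) minus (anything a biocurator has already asserted). A minimal sketch of that logic, assuming a flat list of assertion records; the `TopicAssertion` class, its field names, and the paper identifiers are hypothetical and do not reflect the actual ABC data model or API:

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TopicAssertion:
    """One topic flag on a paper (hypothetical record layout)."""
    paper_id: str
    topic: str
    asserted_by: str  # "neural_network", "author", or "biocurator"
    positive: bool
    confidence: Optional[str] = None  # only set for neural_network


def papers_to_curate(assertions: Iterable[TopicAssertion], topic: str) -> List[str]:
    """Papers with a high-confidence NN positive and an author positive
    for `topic`, but no biocurator assertion yet."""
    nn_high, author_positive, curated = set(), set(), set()
    for a in assertions:
        if a.topic != topic:
            continue
        if a.asserted_by == "neural_network":
            if a.positive and a.confidence == "high":
                nn_high.add(a.paper_id)
        elif a.asserted_by == "author":
            if a.positive:
                author_positive.add(a.paper_id)
        elif a.asserted_by == "biocurator":
            # Any curator assertion (positive or negative) counts as handled.
            curated.add(a.paper_id)
    return sorted((nn_high & author_positive) - curated)


example = [
    TopicAssertion("P1", "expression", "neural_network", True, "high"),
    TopicAssertion("P1", "expression", "author", True),
    TopicAssertion("P2", "expression", "neural_network", True, "high"),
    TopicAssertion("P2", "expression", "author", True),
    TopicAssertion("P2", "expression", "biocurator", True),
    TopicAssertion("P3", "expression", "neural_network", True, "high"),
    TopicAssertion("P4", "expression", "neural_network", True, "low"),
    TopicAssertion("P4", "expression", "author", True),
]
print(papers_to_curate(example, "expression"))
```

Only P1 satisfies all three facets in this example: P2 has already been curated, P3 lacks an author flag, and P4 has a low-confidence classification.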
Manually validate paper - topic flags without curating
  • Facet for topic
  • Facet for manual assertion
    • professional biocurator assertion
      • ABC - no data
View all topic and entity flags for a given paper and validate, if needed
  • Search ABC with paper identifier
  • Migrate to Topic and Entity Editor
  • View all associated data
  • Manually validate flags, if needed

PDF Storage

  • At the Alliance, PDFs will be stored in Amazon S3
  • We are not planning to formally store back-up copies elsewhere
  • Is this okay with everyone?

February 8, 2024

  • TAGC
    • Prominent announcement on the Alliance home page?
  • Fixed login on dockerized system (dev). Can everybody test their forms?

February 1, 2024

  • Paul will ask Natalia to take care of pending reimbursements
  • Dockerized system slow pages (OA and FPKMMine). Will monitor these pages in the future. Will look for timeouts in the nginx logs.
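
One way to look for timeouts is to count 504 (Gateway Timeout) responses per path in the nginx access log. The sketch below assumes the log uses nginx's default "combined" format; the actual log format, file location, and the example paths are assumptions, not details of the dockerized setup:

```python
import re
from collections import Counter
from typing import Iterable

# Matches the request and status fields of a "combined"-format access log line,
# e.g.: "GET /some/path?x=1 HTTP/1.1" 504 167
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def count_timeouts(lines: Iterable[str]) -> Counter:
    """Count 504 responses per request path (query string stripped)."""
    counts: Counter = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("status") == "504":
            path = match.group("path").split("?", 1)[0]
            counts[path] += 1
    return counts


# Hypothetical sample lines; in practice, pass an open log file instead.
sample = [
    '10.0.0.1 - - [01/Feb/2024:10:00:00 -0800] "GET /oa?datatype=app HTTP/1.1" 504 167 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [01/Feb/2024:10:00:05 -0800] "GET /oa?datatype=exp HTTP/1.1" 504 167 "-" "Mozilla/5.0"',
    '10.0.0.3 - - [01/Feb/2024:10:00:09 -0800] "GET /fpkmmine HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
for path, n in count_timeouts(sample).most_common():
    print(path, n)
```

The nginx error log is a complementary source, since upstream timeouts are also reported there as "upstream timed out" messages.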

January 25, 2024

Curator Info on Curation Forms

  • Saving curator info using cookies in dockerized forms. Can we deploy to prod?

ACKnowledge Author Request - WBPaper00066091

  • I am more than willing to assist; however, the task exceeds the capabilities of the normal flagging process.
  • The paper conducts an analysis of natural variations within 48 wild isolates. To enhance the reliability of the variant set, I utilized the latest variant calling methods along with a custom filtering approach. The resulting dataset comprises 1,957,683 unique variants identified using Clair3. Additionally, Sniffles2 was used to identify indels of >30 bp, which numbered in the thousands to tens of thousands for most wild isolates. It is worth noting that variants identified with Sniffles2 have less reliable nucleotide positions in the genome.
  • I am reaching out to inquire whether WormBase would be interested in incorporating this dataset. An argument in favor is the higher quality of my data. However, I am mindful of the potential substantial effort involved for WormBase, and it is unclear whether this aligns with your priorities.
  • Should WormBase decide to use my variant data set, I am more than willing to offer my assistance.

Update on NN Classification via the Alliance

  • Use of primary/not primary/not designated flag to filter papers
  • Secondary filter on papers with at least C. elegans as species
  • Finalize sources (i.e. evidence) for entity and topic tags on papers
  • Next NN classification scheduled for ~March
  • We decided to process all papers (even non-elegans species) and have filters on species after processing.
  • NNC html pages will show NNC values together with species.
  • Show all C. elegans papers first and other species in a separate bin.
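
The decision above (classify everything, filter on species afterwards) can be sketched as a simple partition applied to the classifier output. The record layout (`paper_id`, `nnc`, `species`) is a hypothetical illustration, not the actual NNC output format:

```python
from typing import Dict, List, Tuple

Paper = Dict[str, object]


def bin_by_species(
    papers: List[Paper], primary: str = "Caenorhabditis elegans"
) -> Tuple[List[Paper], List[Paper]]:
    """Split already-classified papers into a primary-species bin and an
    'other species' bin, preserving the input order within each bin."""
    primary_bin: List[Paper] = []
    other_bin: List[Paper] = []
    for paper in papers:
        # A paper goes to the primary bin if the species is among its tags.
        if primary in paper["species"]:
            primary_bin.append(paper)
        else:
            other_bin.append(paper)
    return primary_bin, other_bin


# Hypothetical classifier output: every paper is processed regardless of species.
classified = [
    {"paper_id": "A", "nnc": "high", "species": ["Caenorhabditis briggsae"]},
    {"paper_id": "B", "nnc": "medium", "species": ["Caenorhabditis elegans"]},
    {"paper_id": "C", "nnc": "high",
     "species": ["Caenorhabditis elegans", "Pristionchus pacificus"]},
]
elegans_papers, other_papers = bin_by_species(classified)
# Display order: C. elegans papers first, other species in a separate bin.
for paper in elegans_papers + other_papers:
    print(paper["paper_id"], paper["nnc"], ", ".join(paper["species"]))
```

Filtering after processing means no paper is dropped before classification, so the species filter can be changed later without rerunning the classifier.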

Travel Reimbursements

  • Still waiting on October travel reimbursement (Kimberly)
  • Still waiting on September and October travel reimbursements (Wen)

UniProt

  • Jae found some genes without UniProt IDs; the genes are present in UniProt but without WBGene IDs.
  • Wen reached out to Stavros and Chris to investigate the WormBase and AGR angles.
  • Stavros escalated the issue at the Hinxton standup.
  • Mark checked the build scripts and WS291 results, then contacted UniProt and is working with them to figure this out.

January 18, 2024

  • OA was showing different names highlighted when logging in to the OA; now fixed on staging


January 11, 2024

  • Duplicate function in OA was not working when using special characters. Valerio debugged it and it is now fixed.
    • Curators should make sure that, when pasting special characters, the duplicate function works
  • OA showing different names highlighted when logging in to the OA; Valerio will debug and check what IP address he sees
    • If you want to bookmark an OA url for your datatype and user, log on once, and bookmark that page (separately for prod and dev)
  • Chris tested the phenotype form on staging and production, and the data are still going to tazendra
    • Chris will check with Paulo. Once it is resolved we need to take everything that is on tazendra and put it on the cloud with different PGIDs
    • Raymond: simply set up forwarding at our end?
  • AI working group: Valerio is setting up a new OpenAI account (paid membership for ChatGPT4). We can also use Microsoft Edge Copilot (temporary?)
  • Chris is getting ready to deploy a 7.0.0 public release on February 7th. Carol wanted to push out monthly releases. This release will include WS291, and the next several releases will stay on WS291 until WS292 is available.
  • Valerio would like to use an alliancegenome.org email address for the OpenAI account
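  • New Alliance drive: https://drive.google.com/drive/folders/0AFkMHZOEQxolUk9PVA
    • Note: please move shared files that you own to the new Alliance Google Drive. Chris Mungall sent the following information: for more instructions, see the video and SOP here: https://agr-jira.atlassian.net/browse/SCRUM-925?focusedCommentId=40674
  • Alliance logo and 50 word description for TAGC: Wen will talk to the outreach WG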
  • Name server. Manuel working on this, Daniela and Karen will reach out to him and let him know that down the road micropublication would like to use the name server API to generate IDs in bulk
  • Karen asking about some erroneous IDs used in the name server. Stavros says that this is not a big deal because the "reason" is not populating the name server
  • It would be good to be able to have a form to capture additional fields for strains and alleles (see meeting minutes August 31st 2023. https://wiki.wormbase.org/index.php/WormBase-Caltech_Weekly_Calls_2023#August_31st.2C_2023). This may happen after Manuel is done with the authentication.
  • Michael: primary flag with Alliance. Kimberly talked about this with the blue team. They will start bringing that over for all papers and fix the remaining 271 items later.

January 4, 2024

  • ACKnowledge pipeline help desk question:
    • Help Desk: Question about Author Curation to Knowledgebase (Zeng Wanxin) [Thu 12/14/2023 5:48 AM]
  • Citace upload, current deadline: Tuesday January 9th
    • All processes (dumps, etc.) will happen on the cloud machine
    • Curators need to deposit their files in the appropriate locations for Wen
  • Micropublication pipeline
    • Ticketing system confusion
    • Karen and Kimberly paper ID pipeline; may need sorting out of logistics