Difference between revisions of "WormBase-Caltech Weekly Calls"

From WormBaseWiki
 
(84 intermediate revisions by 6 users not shown)
Line 22: Line 22:
  
 
[[WormBase-Caltech_Weekly_Calls_2020|2020 Meetings]]
 
[[WormBase-Caltech_Weekly_Calls_2020|2020 Meetings]]
 
  
 
= 2021 Meetings =
 
= 2021 Meetings =
Line 28: Line 27:
 
[[WormBase-Caltech_Weekly_Calls_January_2021|January]]
 
[[WormBase-Caltech_Weekly_Calls_January_2021|January]]
  
 +
[[WormBase-Caltech_Weekly_Calls_February_2021|February]]
  
== Feb 4th, 2021 ==
+
[[WormBase-Caltech_Weekly_Calls_March_2021|March]]
===How the "duplicate" function works in OAs with respect to object IDs (Ranjana and Juancarlos)===
 
*A word of caution: when you duplicate a row, for those OAs with Object IDs (eg., WBGenotype00000014) note that the object ID gets duplicated as well and does not advance to the next ID like the PGID does
 
*If you do use the "duplicate" function, remember to manually change the Object ID
 
* We can implement checks to make sure distinct annotations/objects don't share the same ID
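
A minimal sketch of such a check, assuming the annotations can be pulled out as (PGID, object ID) pairs. The row shape, the PGID values, and the function name are hypothetical and do not reflect the actual OA/Postgres schema:

```python
from collections import defaultdict


def find_shared_object_ids(rows):
    """Return {object_id: [pgids]} for object IDs used by more than one row.

    `rows` is an iterable of (pgid, object_id) pairs; this shape is an
    assumption for illustration only.
    """
    pgids_by_object = defaultdict(list)
    for pgid, object_id in rows:
        if object_id:  # skip rows that have no object ID yet
            pgids_by_object[object_id].append(pgid)
    return {
        object_id: sorted(pgids)
        for object_id, pgids in pgids_by_object.items()
        if len(pgids) > 1
    }


# Hypothetical rows: PGID 102 was created with "duplicate" and kept the old ID.
rows = [
    (101, "WBGenotype00000014"),
    (102, "WBGenotype00000014"),
    (103, "WBGenotype00000015"),
]
print(find_shared_object_ids(rows))  # {'WBGenotype00000014': [101, 102]}
```

A report like this could be run before each upload so that curators can correct the object ID on the duplicated row by hand.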
 
 
 
=== GAF Wiki and headers ===
 
* Any more comments about the Wiki page and the proposal? https://wiki.wormbase.org/index.php/WormBase_gene_association_file
 
  
=== Missing references in expression GAFs ===
 
* ~300 missing from anatomy association file and ~45 missing from development association file
 
* Daniela looking into missing refs; many are personal communications or very old papers
 
* Will change ?Expr_pattern model to possibly remove ?Author reference and add in a ?Person reference instead
 
** 399 objects in WS279 reference an author; Daniela will take a look
 
* Would be good to have some reference for those objects in the GAF file on the FTP site; could use WBPerson when ready
 
  
 +
== April 1, 2021 ==
  
== Feb 11th, 2021 ==
+
=== Antibodies ===
=== Alliance Literature Paper Tags ===
+
* Alignment of the antibody class to Alliance:
*What do we definitely want to transfer to the Alliance?
+
** Propose to move possible_pseudonym (192) and Other_animal (37) to remarks. Those tags are not currently used for curation.
*Alliance literature group [https://docs.google.com/spreadsheets/d/1d3Y73x1BFiARkbxrvPPX2tCh5rFeBQRoBMHOcaXmijA/edit#gid=1866989939 spreadsheet]
+
*** Other animal is sometimes used for older annotations, e.g. authors say that the antibodies were raised both  in rats and rabbits. Standard practice would create 2 records, one for the rat antibody and one for the rabbit.
*Current flags vs legacy flags
+
*** Possible pseudonym was used when a curator was not able to unambiguously assign a previous antibody to a record (we have an Other name -synonym- tag to capture unambiguous ones). When moving to remarks we can keep a controlled vocabulary for easy future parsing, e.g. “possible_pseudonym:”
*Can we map everything to the proposed hierarchy or do we need to add some more classes?
+
** Antigen field: currently separated into Protein, peptide, and other_antigen (e.g.: homogenate of early C.elegans embryos, sperm). Propose to use just one antigen field to capture antigen info.
*Kimberly will review existing tags/flags to sort out what we know we need and what is questionable
 
  
=== Personal communications in Expr_pattern ===
+
All changes proposed above were approved by the group
* 27 objects missing reference (personal communications)
 
** Even if we capture the WBPerson in the Person tag, how are we submitting these to Alliance? The evidence required by the expression JSON spec https://github.com/alliance-genome/agr_schemas/blob/master/ingest/expression/wildtypeExpressionModelAnnotation.json (and other specs) must be a publication, as defined by the publicationRef.json https://github.com/alliance-genome/agr_schemas/blob/master/ingest/publicationRef.json. If there's no PMID for a publication listed as evidence, a MOD ID will suffice for the "publicationId" but we have no WBPaperID created for such  objects.
 
** One way to solve this: Daniela can go over the list and see if the initial personal communication resulted in a publication later on. Examples are Expr181 (expression of cpl-1 in hypodermis and pharynx), communicated via email by Sarwar Hashmi in 2000, and Expr450 (expression of cpl-1 in hypodermis, intestine), communicated by Britton in 2001. The pattern was published by Hashmi and Britton in 2002 (WBPaper00005099). Daniela can then associate WBPaper00005099 with Expr181 and Expr450.
 
** The solution above still does not work for all cases. An example is the lad-2 personal communication from Oliver Hobert (2002), later published by Lishia Chen (2008). Removing Oliver’s personal communication would remove evidence of data provenance from the Hobert lab, unless Oliver published this in a paper that eluded our flagging system (e.g. flagged SVM negative).
 
** Daniela can go over the entire list and contact the authors for such cases.
 
** Are personal communications used in other classes?
 
** Action item: Daniela will add Persons in the person tag for such communications. Will request a model change to Hinxton. Will ask Magda to populate column 6 of the GAF file with Author data. Will add a request for the DQMs to allow Persons in the 'Evidence' in the JSON in addition to Papers
 
  
=== Author data in Expr_pattern ===
+
=== textpresso-dev clean up ===
* 399 Expression objects have the author tag populated. Most of them were submitted even before Wen started working on Expr_pattern.
+
* Michael has asked curators to assess what they have on textpresso-dev as it will not be around forever :-(
** out of 399 objects, we have 32 for which the authors partially match. One example is Expr60, which has Bauer as extra author in the .ace file. Bauer is not listed as author in the paper.
+
* is it okay to transfer data and files we want to keep to tazendra? and then to our own individual machines?
** should we keep the author info and store it in the Person tag? Even if we do, how are we submitting these to Alliance? And should we at all? This is legacy data
+
* Direct access may be possible via Caltech VPN
** Decision: we can remove the authors and add in the remarks the historic info
+
* Do we want to move content to AWS? May be complicated; it is still easy and cheap to maintain local file systems/machines
  
=== Date tag in Expr_pattern ===
+
=== Braun servers ===
* The date tag seems to be  populated for objects that have authors (above) to probably capture when the submission occurred.
+
* 3 servers stored in Braun server room; is there a new contact person for accessing these servers?
* In addition, Date is populated for a large-scale submission from Ian Hope (2006-03), later published.
+
* Mike Miranda replacement just getting settled; Paul will find out who is managing the server room and let Raymond know
* We can still keep this info as is for WB (currently stored in citace minus) but what are we going to do for the Alliance submission? The tag was last used in 2006 for the Hope study; before that it was used in the ‘90s (1990, 1998).
 
** We can get rid of date, too, and pull the info for the ones for which authors do not match
 
  
=== Proposed WormBase metrics page ===
+
=== Citace upload ===
* Inspired by MGI's stats page:
+
* Next Friday, April 9th, by end of the day
** http://www.informatics.jax.org/mgihome/homepages/stats/all_stats.shtml
+
* Wen will contact Paul Davis for the frozen WS280 models file
* Sibyl and Paulo are working on it. Prototype here: https://master.d25n59ij2csrbn.amplifyapp.com/
 
** Current prototype is C. elegans specific
 
* Chris is collecting ideas and queries here:
 
** https://docs.google.com/spreadsheets/d/1OeZuMRSHelVD7tGRIxEkCKOyDaBPN29wrGbNU3TGxKU/edit?usp=sharing
 
* Could eventually be used across the Alliance
 
  
  
== Feb 18th, 2021 ==
+
== April 8, 2021 ==
  
=== CenGen data ===
+
=== Braun server outage ===
* How can we incorporate the CenGen data into WormBase pages? i.e. provide users info:
+
* Raymond fixed; now Spica, wobr and wobr2 are back up
** Per gene: what cells express this gene?
 
** Per cell: what genes are expressed in this cell?
 
** May be derived from Eduardo's data processing
 
* CenGen has a weekly call: have invited Wen, Daniela, and Raymond
 
** Too much for all three to join?
 
* Good to establish healthy boundaries for responsibilities
 
* Do they want to collaborate or no?
 
* We can link to the main CenGen page; once gene-level data is available we can consume and make available
 
* Eduardo's tool is one WormBase tool for processing and providing CenGen data; will make them aware
 
* Ultimately this data (and its presentation) will need to get into the Alliance; may remain a WB-specific/portal feature for the near future
 
* Alaska? Probably won't be maintained
 
  
=== Cleaning up bounced emails to outreach@wormbase.org ===
+
=== Textpresso API ===
* Many unread messages (~140) in inbox
+
* Was down yesterday affecting WormiCloud; Michael has fixed
* Many of those are bounced emails from AFP pipeline and webinar announcements
+
* Valerio will learn how to manage the API for the future
* If relevant people could review those bounced emails and, as appropriate, add people or email addresses to the Omit list using the Omit Form CGI (http://tazendra.caltech.edu/~postgres/cgi-bin/omit_form.cgi) that would be appreciated.
 
  
 +
=== Grant opportunities ===
 +
* Possibilities to apply for supplements
 +
* May 15th deadline
 +
* Druggable genome project
 +
** Pharos: https://pharos.nih.gov/
 +
** could we contribute?
 +
* Visualization, tools, etc.
 +
* Automated person descriptions?
 +
* Automated descriptions for proteins, ion channels, druggable targets, etc.?
  
== Feb 25th, 2021 ==
+
=== New WS280 ONTOLOGY FTP directory ===
 +
* Changes requested here: https://github.com/WormBase/website/issues/7900
 +
* Here's the FTP URL: ftp://ftp.wormbase.org/pub/wormbase/releases/WS280/ONTOLOGY/
 +
* Known issues (Chris will report):
 +
** Ontology files are provided as ".gaf" in addition to ".obo"; we need to remove the ".gaf" OBO files
 +
** Some files are duplicated and/or have inappropriate file extensions
  
===Expr_pattern clean up===
+
=== Odd characters in Postgres ===
* Seldom populated  tags. Can we move the associated info into remarks and get rid of the tag? This is in view of the Alliance import
+
* Daniela and Juancarlos discovered some errors with respect to special characters pasted into the OA
** Protein_description 33 objects
+
* Daniela would like to automatically pull in micropublication text (e.g. figure captions) into Postgres
<pre>Example: Expr_pattern : "Expr450"
+
* We would need an automated way to convert special characters, such as the degree symbol °, into HTML entities such as &deg;
Gene "WBGene00000776"
+
* Juancarlos and Valerio will look into possibly switching from a Perl module to a Python module to handle special characters
Protein_description "CPL-1"
 
  
Expr_pattern : "Expr552"
 
Gene "WBGene00006528"
 
Protein_description "Tubulin alpha"</pre>
 
  
** Sequence 12 objects
+
== April 15, 2021 ==
<pre>Example: Expr_pattern : "Expr12"
 
      Gene "WBGene00003976"
 
      Sequence "Z28377|Z28375|Z28376"</pre>
 
  
** Laboratory 23 objects -> can infer via publication
+
=== Special characters in Postgres/OA ===
<pre>Example: Expr_pattern : "Expr87"
+
* Juancarlos working on/proposing a plan to store UTF-8 characters in Postgres and the OA which would then get converted, at dumping, to HTML entities (e.g. &alpha;) for the ACE files
        …
+
* There is still a bit of cleanup needed to fix or remove special characters (not necessarily UTF-8) that apparently got munged upon copy/pasting into the OA in the past
Laboratory "ML"
+
* Note: copy/paste from a PDF often works fine, but sometimes does not work as expected so manual intervention would be needed (e.g. entering Greek characters by hand in UTF-8 format)
Gene "WBGene00003012"</pre>
+
* Would copy/pasting from HTML be better than PDF?
 +
* For Person curation it would be good to be able to faithfully store and display appropriate foreign characters (e.g. Chinese characters, Danish characters, etc.)
 +
* Mangolassi script called "get_summary_characters.pl" located here: /home/postgres/work/pgpopulation/grg_generegulation/20200618_summary_characters
 +
** Juancarlos will modify script to take a data type code as an argument on the command line and return all Postgres tables (and their respective PGIDs) that have special characters, e.g.
 +
*** $ ./get_summary_characters.pl exp
 +
*** $ ./get_summary_characters.pl int
 +
*** $ ./get_summary_characters.pl grg
 +
** or could pass just the datatype + field (postgres table). e.g.
 +
*** $ ./get_summary_characters.pl pic_description
 +
** Juancarlos will email everyone once it's ready. Update: it's ready and the email has been sent. The script is at /home/postgres/work/pgpopulation/oa_general/20210411_unicode_html/get_summary_characters.pl. Symlink it into your own directory and run it from there; it creates its output files in the directory you run it from.
 +
* Action items:
 +
** Juancarlos will update the "get_summary_characters.pl" script as described above
 +
** Curators should use the "get_summary_characters.pl" to look for (potentially) bad characters in their OAs/Postgres tables
 +
** Need to perform bulk (automated) replacement of existing HTML entities into corresponding UTF-8 characters
 +
** Curators will need to work with Juancarlos for each OA to modify the dumper
 +
** Juancarlos will write (or append to existing) Postgres/OA dumping scripts to:
 +
*** 1) Convert UTF-8 characters to HTML entities in ACE files
 +
*** 2) Convert special quote and hyphen characters into simple versions that don't need special handling
  
 +
=== CeNGEN pictures ===
 +
* Model change went in to accommodate images from the CeNGEN project
 +
* Want gene page images for CeNGEN data; have the specifications for such images been worked out? Maybe not yet
 +
* Raymond and Daniela will work with data producers to acquire images when ready
  
* Empty tags. Can remove tags from WB Expression model?
+
=== Supplement opportunities ===
** Cell -> 0 objects
+
* Money available for software development to "harden" existing software
** Expressed_in -> 0 objects
+
* Might be possible to make Eduardo's single cell analysis tools more sustainable
** Protein -> 0 objects
+
* Could make WormiCloud adapted to Alliance?
** Pseudogene -> 0 objects
+
* Put Noctua on more stable production footing? (GO cannot apply as they are in final year of existing grant)
  
* Others
+
=== Student project for Textpresso ===
** Historical_gene -> 51 objects. How are historical gene tags treated for other classes in Alliance?
+
* Create tool to allow user to submit text and return a list of similar papers
** EPIC -> Ad hoc Tag for Murray, no correspondence with method ontology terms
+
* Use cases:
** Species -> what are we doing with non elegans annotations?
+
** curator wants an alert to find papers similar to what they've curated
** MovieURL -> 32 - Mohler -> move to  movies
+
** look for potential reviewers of a paper based on similar text content

Latest revision as of 19:34, 15 April 2021

Previous Years

2009 Meetings

2011 Meetings

2012 Meetings

2013 Meetings

2014 Meetings

2015 Meetings

2016 Meetings

2017 Meetings

2018 Meetings

2019 Meetings

2020 Meetings

2021 Meetings

January

February

March


April 1, 2021

Antibodies

  • Alignment of the antibody class to Alliance:
    • Propose to move possible_pseudonym (192) and Other_animal (37) to remarks. Those tags are not currently used for curation.
      • Other animal is sometimes used for older annotations, e.g. authors say that the antibodies were raised both in rats and rabbits. Standard practice would create 2 records, one for the rat antibody and one for the rabbit.
      • Possible pseudonym was used when a curator was not able to unambiguously assign a previous antibody to a record (we have an Other name -synonym- tag to capture unambiguous ones). When moving to remarks we can keep a controlled vocabulary for easy future parsing, e.g. “possible_pseudonym:”
    • Antigen field: currently separated into Protein, peptide, and other_antigen (e.g.: homogenate of early C.elegans embryos, sperm). Propose to use just one antigen field to capture antigen info.

All changes proposed above were approved by the group

textpresso-dev clean up

  • Michael has asked curators to assess what they have on textpresso-dev as it will not be around forever :-(
  • is it okay to transfer data and files we want to keep to tazendra? and then to our own individual machines?
  • Direct access may be possible via Caltech VPN
  • Do we want to move content to AWS? May be complicated; it is still easy and cheap to maintain local file systems/machines

Braun servers

  • 3 servers stored in Braun server room; is there a new contact person for accessing these servers?
  • Mike Miranda replacement just getting settled; Paul will find out who is managing the server room and let Raymond know

Citace upload

  • Next Friday, April 9th, by end of the day
  • Wen will contact Paul Davis for the frozen WS280 models file


April 8, 2021

Braun server outage

  • Raymond fixed; now Spica, wobr and wobr2 are back up

Textpresso API

  • Was down yesterday affecting WormiCloud; Michael has fixed
  • Valerio will learn how to manage the API for the future

Grant opportunities

  • Possibilities to apply for supplements
  • May 15th deadline
  • Druggable genome project
  • Visualization, tools, etc.
  • Automated person descriptions?
  • Automated descriptions for proteins, ion channels, druggable targets, etc.?

New WS280 ONTOLOGY FTP directory

Odd characters in Postgres

  • Daniela and Juancarlos discovered some errors with respect to special characters pasted into the OA
  • Daniela would like to automatically pull in micropublication text (e.g. figure captions) into Postgres
  • We would need an automated way to convert special characters, such as the degree symbol °, into HTML entities such as &deg;
  • Juancarlos and Valerio will look into possibly switching from a Perl module to a Python module to handle special characters
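
If a Python module is used, the conversion might look something like the sketch below. It relies only on the standard library's html.entities table; the function name is hypothetical, and this is not the existing Perl code or a decided approach:

```python
from html.entities import codepoint2name


def to_html_entities(text):
    """Replace each non-ASCII character with an HTML entity.

    Named entities are used where the standard table has one (e.g. &deg;);
    anything else falls back to a numeric character reference.
    """
    pieces = []
    for ch in text:
        codepoint = ord(ch)
        if codepoint < 128:
            pieces.append(ch)  # plain ASCII is left untouched
        elif codepoint in codepoint2name:
            pieces.append(f"&{codepoint2name[codepoint]};")
        else:
            pieces.append(f"&#{codepoint};")
    return "".join(pieces)


print(to_html_entities("grown at 20°C"))  # grown at 20&deg;C
```

Note that ASCII characters such as & and < pass through unchanged here, so text that already contains entities is not double-escaped.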


April 15, 2021

Special characters in Postgres/OA

  • Juancarlos working on/proposing a plan to store UTF-8 characters in Postgres and the OA which would then get converted, at dumping, to HTML entities (e.g. &alpha;) for the ACE files
  • There is still a bit of cleanup needed to fix or remove special characters (not necessarily UTF-8) that apparently got munged upon copy/pasting into the OA in the past
  • Note: copy/paste from a PDF often works fine, but sometimes does not work as expected so manual intervention would be needed (e.g. entering Greek characters by hand in UTF-8 format)
  • Would copy/pasting from HTML be better than PDF?
  • For Person curation it would be good to be able to faithfully store and display appropriate foreign characters (e.g. Chinese characters, Danish characters, etc.)
  • Mangolassi script called "get_summary_characters.pl" located here: /home/postgres/work/pgpopulation/grg_generegulation/20200618_summary_characters
    • Juancarlos will modify script to take a data type code as an argument on the command line and return all Postgres tables (and their respective PGIDs) that have special characters, e.g.
      • $ ./get_summary_characters.pl exp
      • $ ./get_summary_characters.pl int
      • $ ./get_summary_characters.pl grg
    • or could pass just the datatype + field (postgres table). e.g.
      • $ ./get_summary_characters.pl pic_description
    • Juancarlos will email everyone once it's ready. Update: it's ready and the email has been sent. The script is at /home/postgres/work/pgpopulation/oa_general/20210411_unicode_html/get_summary_characters.pl. Symlink it into your own directory and run it from there; it creates its output files in the directory you run it from.
  • Action items:
    • Juancarlos will update the "get_summary_characters.pl" script as described above
    • Curators should use the "get_summary_characters.pl" to look for (potentially) bad characters in their OAs/Postgres tables
    • Need to perform bulk (automated) replacement of existing HTML entities into corresponding UTF-8 characters
    • Curators will need to work with Juancarlos for each OA to modify the dumper
    • Juancarlos will write (or append to existing) Postgres/OA dumping scripts to:
      • 1) Convert UTF-8 characters to HTML entities in ACE files
      • 2) Convert special quote and hyphen characters into simple versions that don't need special handling
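
The cleanup steps in the action items could be sketched in Python roughly as follows. This is an illustration under assumptions, not the behavior of get_summary_characters.pl or of any existing dumper: the function names, the (table, PGID, text) row shape, and the PGID values are hypothetical, and the punctuation map covers only a few common characters.

```python
import html

# Curly quotes and assorted hyphens/dashes mapped to plain ASCII versions.
SIMPLE_PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2010": "-", "\u2011": "-",   # hyphen, non-breaking hyphen
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def find_special_characters(rows):
    """List (table, pgid, characters) for rows containing non-ASCII text."""
    hits = []
    for table, pgid, text in rows:
        special = sorted({ch for ch in text if ord(ch) > 127})
        if special:
            hits.append((table, pgid, special))
    return hits


def entities_to_utf8(text):
    """Bulk-replace HTML entities with the characters they stand for."""
    return html.unescape(text)


def simplify_punctuation(text):
    """Swap special quotes and hyphens for simple ASCII equivalents."""
    return text.translate(SIMPLE_PUNCTUATION)


rows = [
    ("pic_description", 1, "expressed at 25\u00b0C"),
    ("pic_description", 2, "plain ASCII text"),
]
print(find_special_characters(rows))
print(entities_to_utf8("&alpha;-tubulin"))
print(simplify_punctuation("\u201cdaf\u20112\u201d"))
```

One caveat: html.unescape also turns &amp;, &lt; and &gt; back into &, < and >, which may or may not be wanted for a given field.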

CeNGEN pictures

  • Model change went in to accommodate images from the CeNGEN project
  • Want gene page images for CeNGEN data; have the specifications for such images been worked out? Maybe not yet
  • Raymond and Daniela will work with data producers to acquire images when ready

Supplement opportunities

  • Money available for software development to "harden" existing software
  • Might be possible to make Eduardo's single cell analysis tools more sustainable
  • Could make WormiCloud adapted to Alliance?
  • Put Noctua on more stable production footing? (GO cannot apply as they are in final year of existing grant)

Student project for Textpresso

  • Create tool to allow user to submit text and return a list of similar papers
  • Use cases:
    • curator wants an alert to find papers similar to what they've curated
    • look for potential reviewers of a paper based on similar text content
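
To make the idea concrete, here is a small sketch of one common way to rank papers by text similarity (TF-IDF weighting with cosine similarity). It is not how Textpresso works and not a proposed design; the paper IDs and texts are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_similar(query, papers):
    """Return (paper_id, score) pairs, most similar to `query` first.

    `papers` maps a paper ID to its text.
    """
    counts = {pid: Counter(tokenize(text)) for pid, text in papers.items()}
    n_docs = len(counts)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    def weigh(term_counts):
        # Smoothed inverse document frequency: rarer terms weigh more.
        return {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        }

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vector = weigh(Counter(tokenize(query)))
    scored = [(pid, cosine(query_vector, weigh(c))) for pid, c in counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical paper IDs and texts.
papers = {
    "paper_A": "antibody staining of cpl-1 expression in the hypodermis",
    "paper_B": "single cell RNA sequencing of neurons",
    "paper_C": "ontology file formats and FTP releases",
}
ranked = rank_similar("cpl-1 expression in hypodermis and pharynx", papers)
print(ranked[0][0])  # paper_A
```

For the reviewer use case, the same ranking could be followed by a lookup of the authors of the top-scoring papers.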