Difference between revisions of "International C. elegans Meetings - UCLA"

From WormBaseWiki

Revision as of 18:29, 23 September 2019

Contact Information and File Specifications

  • Contact Anne Marie Mahoney at the Genetics Society of America (GSA). The GSA has worked with us in the past to get International Meeting abstracts into WormBase.
  • They will send us a parseable set of abstracts in which each abstract is a separate file, named according to its program number.
  • Also ask for a participants list, containing names, addresses, email addresses, and institutions, for Cecilia's records.
  • Juancarlos can parse the abstracts into mangolassi where we can proofread them before entering them into the editor on tazendra.

File Format

  • The International Meeting Abstracts are sent to us in a specific file format, with tag:value attributes that are readily parsed into the paper tables (pap_).
  • Each abstract is sent as an individual file, named according to the abstract number, e.g. 826C.txt.
  • Below is the list of tag names. Author and Institution tags are numbered sequentially, with the number increasing for each additional author or institution.
    • AbstractNo :
    • Title :
    • Author 1 :
    • Presenting Author :
    • Study Group :
    • Author 1 Affiliation :
    • Institution 1 :
    • Body of Abstract :
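
The tag list above can be read with a short parser. The following is a minimal Python sketch, not the parse.pl script itself. It assumes each field starts on its own line as "Tag : value" and that untagged lines continue the previous field; the exact layout of the GSA files is not shown here, and the sample record is invented for illustration.

```python
import re

# Assumed layout: each field starts on its own line as "Tag : value".
# Lines that do not start with a known tag are treated as a continuation
# of the previous field (for example, a multi-line abstract body).
TAG_PATTERN = re.compile(
    r"^(AbstractNo|Title|Author \d+ Affiliation|Author \d+|Presenting Author|"
    r"Study Group|Institution \d+|Body of Abstract)\s*:\s*(.*)$"
)


def parse_abstract(text):
    """Return a dict mapping each tag name to its (possibly multi-line) value."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = TAG_PATTERN.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # Continuation line: append to the field being read.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


# Invented sample record, for illustration only.
sample = """AbstractNo : 826C
Title : A hypothetical title
Author 1 : Jane Doe
Presenting Author : Jane Doe
Study Group : Hypothetical group
Author 1 Affiliation : 1
Institution 1 : Example University
Body of Abstract : First line of the abstract.
Second line of the abstract."""

record = parse_abstract(sample)
print(record["AbstractNo"], "-", record["Title"])
```

Keeping the fields in a dict keyed by the tag name makes it straightforward to collect the numbered Author and Institution tags afterwards, however many there are.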

Parsing Scripts

  • First, place the meeting abstracts in a folder on mangolassi.
    • For 2019 abstracts: /home/acedb/kimberly/meeting_abstracts/2019_iwm_meeting_abstracts/AbsFiles_2019/AbsFilesWormBase-20190719174941
  • In the postgres account, meeting abstracts are kept in directories named according to the meeting, under: /home/postgres/work/pgpopulation/pap_papers/abstracts
  • There are two scripts that we run for processing the abstracts:
    • unaccentHtml.pl - this script converts HTML character references (numeric codes and entity names) to text wherever possible, and converts accented characters to unaccented characters, for example:
      • æ represented as the HTML numeric code & # 230 ; (note that these symbols are strung together in the actual code)
      • ω represented as the HTML entity name & omega ; (note that these symbols are strung together in the actual code)
      • The script searches the text for each occurrence of the above types of HTML coding, and we then manually make sure we have a mapping for each one.
      • The script also strips out HTML markup for things like paragraphs, fonts, etc.
    • parse.pl - this script parses the abstract information into the corresponding paper tables
      • The parse.pl script has some values that need to be set for each specific meeting:
        • year
        • identifier prefix (e.g., wm2019)
  • The most recent versions of the scripts are here: /home/postgres/work/pgpopulation/pap_papers/abstracts/iwm2017
  • Note that we also run three scripts to populate species information for each abstract.
    • One script looks for string matches to species names in the text of the abstracts; a second associates species based on the species information for associated genes (gin_species).
    • Each of those two scripts generates a separate flat output file; a third script then processes both files to associate species and populate pap_species with the correct NCBI taxon ID.
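
The character conversion described for unaccentHtml.pl can be illustrated with a small Python sketch. This is not the Perl script; it only shows the same idea using the standard library. The MANUAL_MAP table and the sample string are hypothetical, standing in for the hand-checked mappings mentioned above.

```python
import html
import re
import unicodedata

# Hypothetical hand-maintained mapping for characters that have no
# accent-stripping decomposition; the real script's table is not shown here.
MANUAL_MAP = {"æ": "ae", "ω": "omega"}

# Numeric (&#230;), hexadecimal (&#xE6;) and named (&omega;) references.
ENTITY = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
# Opening or closing HTML tags such as <p>, </p>, <font size=2>.
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def list_entities(text):
    """Return the distinct HTML character references found in the text."""
    return sorted(set(ENTITY.findall(text)))


def unaccent_html(text):
    """Strip markup, decode character references, and remove accents."""
    text = TAG.sub("", text)
    text = html.unescape(text)
    out = []
    for ch in text:
        if ch in MANUAL_MAP:
            out.append(MANUAL_MAP[ch])
            continue
        # Split accented letters into base letter + combining marks,
        # then drop the marks. Other characters pass through unchanged.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


# Invented sample text, for illustration only.
sample = "<p>Andr&eacute; studied <font size=2>&omega;-3 fatty acids</font> and C&#230;sar</p>"
print(list_entities(sample))
print(unaccent_html(sample))
```

Running list_entities first gives the inventory of references to check by hand; any character that is neither decomposable nor in the manual table is left as-is, so it remains visible during proofreading.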

Proofreading

Things to check for in parsed files:

  • Does the abstract number in the identifier match the abstract number in the program?
  • Are author names correct? In particular, are there any foreign characters that were not translated correctly?
  • Does the text of the abstract start and stop at the right places?
  • Are symbols represented properly, e.g. 3'UTR?
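
Some of these checks can be pre-screened automatically before reading each abstract by eye. The Python sketch below is only an aid under stated assumptions: it assumes the identifier ends with the program number, and the function name, identifier format, and sample values are all hypothetical. It does not replace manual proofreading.

```python
import re

# Unconverted HTML character references such as &eacute; or &#230;.
ENTITY = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
# Field tags that should never appear inside the abstract text itself.
LEFTOVER_TAG = re.compile(
    r"(AbstractNo|Presenting Author|Study Group|Body of Abstract|"
    r"Institution \d+|Author \d+)\s*:"
)


def proofread_flags(identifier, program_number, authors, body):
    """Return a list of warnings for one parsed abstract."""
    flags = []
    # Assumption: the identifier ends with the program number.
    if not identifier.endswith(program_number):
        flags.append("identifier does not end with program number " + program_number)
    for name in authors:
        has_non_ascii = any(ord(c) > 127 for c in name)
        if ENTITY.search(name) or has_non_ascii or "?" in name:
            flags.append("check author name: " + name)
    if not body.strip():
        flags.append("abstract body is empty")
    elif LEFTOVER_TAG.search(body):
        flags.append("abstract body contains a field tag; check start and stop")
    if ENTITY.search(body):
        flags.append("abstract body contains an unconverted HTML reference")
    return flags


# Invented example; the identifier format shown is hypothetical.
flags = proofread_flags(
    "wm2019abs826C",
    "826C",
    ["Jane Doe", "Andr&eacute; Dupont"],
    "The 3'UTR was examined. Institution 2 : Example University",
)
for flag in flags:
    print(flag)
```

An empty list only means that none of these simple patterns fired; symbols such as 3'UTR still need to be checked against the original abstract.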


Back to Paper pipeline