Contact Information and File Specifications

Contact Anne Marie Mahoney at the Genetics Society of America (GSA). The GSA has worked with us in the past to get International Meeting abstracts into WormBase.

They will send us a parseable set of abstracts in which each abstract is a separate file, named according to its program number.

Also ask for a participants list, containing names, addresses, email addresses, and institutions, for Cecilia's records.

Juancarlos can parse the abstracts into mangolassi where we can proofread them before entering them into the editor on tazendra.

File Format

The International Meeting Abstracts are sent to us in a specific file format, with tag:value attributes that are readily parsed into the paper tables (pap_).

Each abstract is sent as an individual file, named according to the abstract number, e.g. 826C.txt

Below is the list of tag names. Authors and Institutions are listed sequentially, with the number in the tag name increasing for each additional entry (Author 1, Author 2, and so on) as needed.

- AbstractNo :
- Title :
- Author 1 :
- Presenting Author :
- Study Group :
- Author 1 Affiliation :
- Institution 1 :
- Body of Abstract :
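
To make the tag:value layout concrete, here is a minimal Python sketch of reading one abstract file into a dictionary. It is not the parse.pl script. It assumes one tag per line in the form `Tag : value` and assumes that untagged lines continue the previous value; the real files may differ. The function name `parse_abstract` and the sample text are made up for illustration.

```python
import re

# Tag names come from the list above. Numbered tags (Author 2, Institution 3, ...)
# are matched by \d+. "Author N Affiliation" is listed before "Author N" so the
# longer tag is tried first.
TAG_PATTERN = re.compile(
    r"^(AbstractNo|Title|Author \d+ Affiliation|Author \d+|Presenting Author|"
    r"Study Group|Institution \d+|Body of Abstract)\s*:\s*(.*)$"
)

def parse_abstract(text):
    """Return a dict mapping each tag found in the text to its value.

    Assumption: a line that does not start with a known tag continues the
    value of the most recent tag.
    """
    record = {}
    current = None
    for line in text.splitlines():
        match = TAG_PATTERN.match(line)
        if match:
            current = match.group(1)
            record[current] = match.group(2).strip()
        elif current is not None and line.strip():
            record[current] = (record[current] + " " + line.strip()).strip()
    return record

# Made-up sample content, not a real abstract.
sample = """AbstractNo : 826C
Title : An example title
Author 1 : A. Author
Author 2 : B. Author
Presenting Author : A. Author
Author 1 Affiliation : 1
Institution 1 : Example University
Body of Abstract : First line of the abstract
continues on a second line."""

record = parse_abstract(sample)
print(record["AbstractNo"], "|", record["Title"])
```

A dictionary keyed by tag keeps the numbered Author and Institution entries distinct, which makes it easy to check afterwards that none were skipped.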

Parsing Scripts

First, place the meeting abstracts in a folder on mangolassi:
- For 2021 abstracts: /home/acedb/kimberly/meeting_abstracts/2021_iwm_meeting_abstracts/AbsFiles_2021/AbsFilesWormBase-20210826124502/
In the postgres account, meeting abstracts are located in directories, named according to the meeting, here: /home/postgres/work/pgpopulation/pap_papers/abstracts
There are two scripts that we run for processing the abstracts:
- unaccentHtml.pl - this script converts HTML characters and names to text wherever possible, and converts accented characters to unaccented characters, for example:
  - æ represented in HTML characters as & # 230 ; (note that these symbols are strung together in the actual code)
  - ω represented in HTML names as & omega ; (note that these symbols are strung together in the actual code)
  - This script searches the text for each of the above types of HTML encoding; we then manually make sure we have a mapping for each one it finds
  - This script also strips out HTML for things like paragraph, font, etc.
- parse.pl - this script parses the abstract information into the corresponding paper tables
  - The parse.pl script has some values that need to be set for each specific meeting:
    - year
    - identifier prefix (e.g., wm2021)
The most recent versions of the scripts are here: /home/postgres/work/pgpopulation/pap_papers/abstracts/iwm2019
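
The kind of conversion unaccentHtml.pl performs can be sketched in Python as follows. This is an illustration only, not the actual script. The `MANUAL_MAP` entries and the function name `unaccent_html` are hypothetical, and the tag-stripping pattern is deliberately simple. The function also returns any characters it could not reduce to ASCII, which mirrors the manual step of checking that every character has a mapping.

```python
import html
import re
import unicodedata

# Hypothetical fallback mapping for characters that have no ASCII
# decomposition; the real script maintains its own mapping.
MANUAL_MAP = {"æ": "ae", "ω": "omega"}

def unaccent_html(text):
    """Strip HTML tags, decode entities, and reduce text to ASCII where possible.

    Returns (converted_text, unmapped), where unmapped is the set of
    non-ASCII characters that still need a manual mapping.
    """
    # Remove tags such as <p> or <font ...>. This simple pattern could also
    # remove ordinary text that happens to look like a tag.
    text = re.sub(r"</?[A-Za-z][^>]*>", "", text)
    # Decode numeric codes and named entities into the characters themselves.
    text = html.unescape(text)
    out = []
    unmapped = set()
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in MANUAL_MAP:
            out.append(MANUAL_MAP[ch])
        else:
            # NFKD splits an accented letter into base letter + combining
            # mark; dropping the combining mark leaves the base letter.
            decomposed = unicodedata.normalize("NFKD", ch)
            stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
            if stripped and all(ord(c) < 128 for c in stripped):
                out.append(stripped)
            else:
                out.append(ch)       # leave as-is and report it
                unmapped.add(ch)
    return "".join(out), unmapped

converted, unmapped = unaccent_html("<p>Caf&eacute; &#230; &omega; &beta;</p>")
print(converted, unmapped)
```

Characters such as æ have no accent to strip, so they only convert if an explicit mapping exists; anything left in `unmapped` is a candidate for a new mapping entry.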
Note that we also run three scripts to populate species information for each abstract.
- One script looks for string matches to species names in the text of the abstracts; another associates species based on the species information for associated genes (gin_species).
  - populate_pap_species_from_abstract_species.pl
  - populate_pap_species_from_gene_species.pl
  - populate_pap_species_from_flatfiles.pl*
  - populate_pap_species_from_flatfiles.pg
- Each of the first two scripts generates a separate output flat file; the third script then processes those files to associate species and populate pap_species with the correct NCBI taxon ID.
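
The string-matching idea behind the first species script can be sketched in Python as below. This does not reproduce populate_pap_species_from_abstract_species.pl. The two-entry species table is an illustrative subset, and `abbreviations` and `species_in_text` are hypothetical helper names. The matching is plain substring search, which is an assumption about how matches are found.

```python
# Illustrative subset only. The IDs are the NCBI taxon IDs for these two
# species; the real scripts work from their own species list.
SPECIES_TAXON = {
    "Caenorhabditis elegans": "6239",
    "Caenorhabditis briggsae": "6238",
}

def abbreviations(name):
    """Return the full binomial plus its 'C. elegans' style abbreviation."""
    genus, epithet = name.split(" ", 1)
    return [name, f"{genus[0]}. {epithet}"]

def species_in_text(text):
    """Return the sorted taxon IDs of every species named in the text."""
    found = set()
    for name, taxon in SPECIES_TAXON.items():
        if any(form in text for form in abbreviations(name)):
            found.add(taxon)
    return sorted(found)

print(species_in_text("We compared C. elegans and Caenorhabditis briggsae."))
```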

Proofreading

Things to check for in parsed files:

- Does the abstract number in the identifier match the abstract number in the program?
- Are author names correct, i.e. have any accented or other non-English characters been translated incorrectly?
- Does the text of the abstract start and stop at the right places?
- Are symbols represented properly, e.g. 3'UTR?
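
A few of these checks can be pre-screened automatically before reading each abstract by eye. The Python sketch below is only a possible aid, not part of the existing pipeline. The function name `proofreading_flags` and its arguments are hypothetical. It compares the AbstractNo value against the file name as a stand-in for the identifier check, and it cannot judge whether the abstract text starts and stops in the right place.

```python
import re
from pathlib import Path

def proofreading_flags(filename, abstract_no, authors, body):
    """Return a list of human-readable warnings for one parsed abstract."""
    flags = []
    # Files are named after the abstract number, e.g. 826C.txt.
    if Path(filename).stem != abstract_no.strip():
        flags.append("abstract number does not match file name")
    # Non-ASCII characters in names may indicate a missed conversion.
    for author in authors:
        if any(ord(ch) > 127 for ch in author):
            flags.append(f"non-ASCII character in author name: {author}")
    # Leftover entities such as numeric codes or entity names.
    if re.search(r"&#?\w+;", body):
        flags.append("unconverted HTML entity in abstract body")
    return flags

print(proofreading_flags("826C.txt", "827C", ["A. Author"], "The 3'UTR of the gene"))
```

An empty list means only that these mechanical checks passed; the abstract still needs to be read against the program.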

International C. elegans Meetings - UCLA
