International C. elegans Meetings - UCLA
Latest revision as of 13:33, 31 March 2022
Contact Information and File Specifications
- Contact Anne Marie Mahoney at the Genetics Society of America. The GSA has worked with us in the past to get International Meetings into WormBase.
- They will send us a parseable file of abstracts in which each abstract is a separate file, named according to its program number.
- Also ask for a participants list, containing names, addresses, email addresses, and institutions, for Cecilia's records.
- Juancarlos can parse the abstracts into mangolassi where we can proofread them before entering them into the editor on tazendra.
File Format
- The International Meeting Abstracts are sent to us in a specific file format, with tag:value attributes that are readily parsed into the paper tables (pap_).
- Each abstract is sent as an individual file, named according to the abstract number, e.g. 826C.txt
- Below is the list of tag names. Authors and institutions are listed sequentially, with increasing numbers as needed.
  - AbstractNo :
  - Title :
  - Author 1 :
  - Presenting Author :
  - Study Group :
  - Author 1 Affiliation :
  - Institution 1 :
  - Body of Abstract :
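The tag:value layout above can be read with a few lines of code. The following is a minimal Python sketch, not the actual parse.pl logic (which is Perl and is not reproduced here). The function name `parse_abstract`, the sample values, and the assumption that untagged lines continue the previous tag's value are all illustrative.

```python
import re

# Tag names come from the list above; numbered tags (Author 2, Institution 3, ...)
# are matched by pattern. Assumption: each tag starts at the beginning of a line.
TAG_PATTERN = re.compile(
    r"^(AbstractNo|Title|Author \d+ Affiliation|Author \d+|Presenting Author"
    r"|Study Group|Institution \d+|Body of Abstract)\s*:\s*(.*)$"
)


def parse_abstract(text):
    """Return a dict mapping each tag name to its value.

    Lines that do not start with a known tag are treated as a continuation
    of the most recent tag (typically the abstract body).
    """
    fields = {}
    current = None
    for line in text.splitlines():
        match = TAG_PATTERN.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current is not None and line.strip():
            separator = " " if fields[current] else ""
            fields[current] += separator + line.strip()
    return fields


# Made-up sample in the layout described above (not a real abstract).
sample = """AbstractNo : 826C
Title : An example title
Author 1 : Jane Doe
Presenting Author : Jane Doe
Study Group : Example group
Author 1 Affiliation : 1
Institution 1 : Example University
Body of Abstract : First line of the abstract.
Second line of the abstract.
"""
record = parse_abstract(sample)
```

Here `record["AbstractNo"]` holds the abstract number, which can be compared against the file name (e.g. 826C.txt) as an early sanity check.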
Parsing Scripts
- First, place the meeting abstracts in a folder on mangolassi.
  - For 2021 abstracts: /home/acedb/kimberly/meeting_abstracts/2021_iwm_meeting_abstracts/AbsFiles_2021/AbsFilesWormBase-20210826124502/
- In the postgres account, meeting abstracts are located in directories, named according to the meeting, here: /home/postgres/work/pgpopulation/pap_papers/abstracts
- There are two scripts that we run for processing the abstracts:
  - unaccentHtml.pl - this script converts HTML character references and entity names to text wherever possible, and converts accented characters to unaccented characters. For example:
    - æ, represented as the HTML numeric character reference &#230;
    - ω, represented as the HTML entity name &omega;
    - The script searches the text for each of these kinds of HTML coding; we then manually check that we have a mapping for each one found.
    - The script also strips out HTML markup for things like paragraphs and fonts.
  - parse.pl - this script parses the abstract information into the corresponding paper tables.
    - The parse.pl script has some values that need to be set for each specific meeting:
      - year
      - identifier prefix (e.g., wm2021)
- The most recent versions of the scripts are here: /home/postgres/work/pgpopulation/pap_papers/abstracts/iwm2019
- Note that we also run three scripts to populate species information for each abstract.
  - One script looks for string matches to species names in the text of the abstracts; another associates species based on the species information for associated genes (gin_species).
    - populate_pap_species_from_abstract_species.pl
    - populate_pap_species_from_gene_species.pl
    - populate_pap_species_from_flatfiles.pl*
    - populate_pap_species_from_flatfiles.pg
  - Each of those two scripts generates a separate output flat file, which is then processed by a third script to associate species and populate pap_species with the correct NCBI taxon ID.
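To illustrate the kind of conversion described for unaccentHtml.pl, here is a minimal Python sketch. It is not the Perl script itself. The names `find_entities`, `unaccent_html`, and `EXPLICIT_MAP`, the choice of tags to strip, and the replacement strings (for example ω to "omega") are assumptions made for illustration only.

```python
import html
import re
import unicodedata

# Characters with no canonical decomposition need an explicit mapping.
# These entries are illustrative assumptions, not the mapping used by unaccentHtml.pl.
EXPLICIT_MAP = {"æ": "ae", "Æ": "AE", "ø": "o", "Ø": "O", "ß": "ss", "ω": "omega"}

# Assumption: only simple presentational tags are stripped.
TAG_RE = re.compile(r"</?(?:p|font|br|span)\b[^>]*>", re.IGNORECASE)

# Numeric references such as &#230; and named entities such as &omega;
ENTITY_RE = re.compile(r"&#\d+;|&[A-Za-z][A-Za-z0-9]*;")


def find_entities(text):
    """List the distinct HTML entities in the text so each can be reviewed by hand."""
    return sorted(set(ENTITY_RE.findall(text)))


def unaccent_html(text):
    """Strip simple tags, decode HTML entities, and remove accents."""
    text = TAG_RE.sub("", text)
    text = html.unescape(text)  # &#230; -> æ, &omega; -> ω
    text = "".join(EXPLICIT_MAP.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks, e.g. the acute accent left over from é.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


cleaned = unaccent_html("<p>Caf&eacute; &#230; &omega;</p>")
found = find_entities("a &#230; b &omega; c &#230;")
```

Running `find_entities` over all abstracts first mirrors the manual step above: it gives a list of codes to check against the mapping before any text is converted.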
Proofreading
Things to check for in parsed files:
- Does the abstract number in the identifier match the abstract number in the program?
- Are author names correct, i.e., are there any accented or other non-English characters that were not converted correctly?
- Does the text of the abstract start and stop at the right places?
- Are symbols represented properly, e.g. 3'UTR?
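Some of these checks can be pre-screened automatically before manual proofreading. The Python sketch below is one hypothetical way to flag records for a closer look; it does not replace reading the parsed abstracts. The function name `proofread_flags`, the example identifier `wm2021ab826C`, and the assumption that an identifier ends with the abstract number are all illustrative and not taken from the pipeline.

```python
import re

# Matches unconverted HTML entities such as &#230; or &omega;
ENTITY_RE = re.compile(r"&#\d+;|&[A-Za-z][A-Za-z0-9]*;")


def proofread_flags(identifier, abstract_no, authors, body):
    """Return a list of human-readable warnings for one parsed abstract."""
    flags = []
    # Assumption: the identifier ends with the abstract number from the program.
    if not identifier.endswith(abstract_no):
        flags.append("identifier does not end with abstract number")
    for name in authors:
        # Leftover entities or non-ASCII characters suggest a missed conversion.
        if ENTITY_RE.search(name) or not name.isascii():
            flags.append(f"check author name: {name}")
    if ENTITY_RE.search(body):
        flags.append("unconverted HTML entity in body")
    if not body.strip():
        flags.append("empty abstract body")
    return flags


clean = proofread_flags("wm2021ab826C", "826C", ["Jane Doe"], "Text with 3'UTR.")
suspect = proofread_flags("wm2021ab12A", "826C", ["J&eacute;r&ocirc;me"], "")
```

An empty list means nothing was flagged automatically; whether the text starts and stops in the right places still has to be checked by eye.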
Back to Paper pipeline