Entering WBG Articles
Entering WBG Articles into Postgres
We would like to enter WBG articles into postgres as soon as they become available online so they can be incorporated into WormBase as WBPapers.
- Check text files of abstract titles, authors, and affiliations for special characters (e.g. vowels with accents) and convert them to standard characters before reading into the database, since ACeDB doesn't allow special characters.
- Articles can be read into the development database and checked using the paper editor on mangolassi.caltech.edu
- Parsing script: /home/postgres/work/pgpopulation/pap_papers/abstracts/wbg_sample/parse.pl
- To Juancarlos: Before running the parsing script, be sure to convert the file from DOS to Unix line endings with 'fromdos <filename>' -- J
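The two clean-up steps above (replacing accented characters and converting DOS line endings) could be sketched roughly as follows. This is only an illustration in Python, not part of parse.pl; the function names to_plain_ascii and non_ascii_lines are made up for this example. It assumes that stripping the accent from a letter is an acceptable replacement. Characters that have no unaccented decomposition are left untouched and only reported, so they can be fixed by hand.

```python
import unicodedata


def to_plain_ascii(text: str) -> str:
    """Normalize line endings and strip accents from letters.

    Hypothetical helper, not part of parse.pl.
    """
    # Same effect as 'fromdos': turn CRLF (and stray CR) into LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def non_ascii_lines(text: str):
    """Return (line number, line) pairs that still contain non-ASCII characters."""
    return [
        (number, line)
        for number, line in enumerate(text.split("\n"), start=1)
        if not line.isascii()
    ]


cleaned = to_plain_ascii("Institut Europ\u00e9en de Chimie et Biologie\r\n")
leftovers = non_ascii_lines(cleaned)  # lines that still need manual attention
```

Anything listed in leftovers would still need to be converted manually before the file is read into the database.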
General File Format
Here is a general file format for uploading to postgres, followed by a specific example. Note that the parsing script expects field names (tags) and field values (data) to be separated by tabs.
Note that for WBG articles, there are some values that do not change, e.g. Status (Valid), Type (Gazette_article), Journal (Worm Breeder's Gazette), and Primary_data (not_designated). We also need to flag these papers for author_person curation under the curation flags. Also, the identifiers are created according to the information in the volume, number, and page. These special fields are italicized below.
Identifier wbg.nn.n.nnnn (where n = a number)
Title First letter capitalized
Journal Worm Breeder's Gazette
Volume nn (where n = a number)
    (We currently don't have a field for number in postgres, so should we create Volume numbers that are 18.2 or 18.3 using information from the Number of the latest Gazette issue? Or ignore the number field entirely? --K.)
    We do, it's between pages and year, and it has an 18 for this set from Daniel's file. -- J
    Need to clarify: I meant that I didn't see a field specifically for Number. For example, if a journal was Volume 18 Number 2, I was thinking about the Number part. Does that make sense now? --K.
    Oh, I see, that's because the Volume is a text field, so it could say 18.2, for example. If you do this query: SELECT * FROM pap_volume WHERE pap_volume ~ '[^0-9]'; you'll see 3130 entries that already have a non-digit in them. So we could do it, but then we'd have to go through and separate what we already have. Well, potentially have to do that. -- J
    Yes, I think that for now we could add the Number (or Issue, to use PubMed-speak) in the form of 18.2 or 18.3 and then think about how to handle this going forward. --K
Year 20nn (where n = a number)
Month nn (where n = a number)
Page n or nn (where n = a number)
Author listed as Last Name, First Name//Last Name, First Name//Last Name, First Name etc.
Affiliation list all affiliations
    (How do we want to handle multiple affiliations? Comma separate? --K.)
    It's a multivalue field, so if the data file had them in separate lines with their own tag, it would presumably work, but I haven't tested that. The file, at a glance, has one affiliation entry per paper entry. -- J
    Okay, we will need to try adding multiple affiliations in separate lines. The first entry for the Volume 18 Number 2 issue of the WBG has three different affiliations. See http://www.wormbook.org/wbg/articles/volume-18-number-2/antibiotic-markers-for-rapid-selection-and-easy-maintenance-of-transgenic-nematodes-2/ --K.
    Sure, just edit Daniel's files and let me know when it's done so I can repopulate it. If you don't mind me just making another set in the sandbox, that'd be easier, but if you'd like me to repopulate the database from an old dump I can do that too (it just takes a while, though not too long). -- J
    If I'm understanding correctly, making another set in the sandbox should be fine. I'll just do the wbg18 query again and look at what we've got? --K.
URL - corresponds to the URL of the article in the PDF document
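As a rough illustration of the Identifier and Volume conventions above, here is a small Python sketch. It is not part of parse.pl. It reads the pattern wbg.nn.n.nnnn literally (two digits, one digit, four digits), which is an assumption, and it assumes the Number is appended to the Volume in the form 18.2 as proposed in the discussion. The helper names and the sample identifier wbg.18.2.0001 are hypothetical.

```python
import re

# Literal reading of "wbg.nn.n.nnnn"; the exact digit counts are an assumption.
IDENTIFIER_RE = re.compile(r"wbg\.\d{2}\.\d\.\d{4}")

# Volume with an optional ".Number" part, e.g. "18" or "18.2".
VOLUME_RE = re.compile(r"(\d+)(?:\.(\d+))?")


def is_valid_identifier(identifier: str) -> bool:
    """Check an identifier against the assumed wbg.nn.n.nnnn pattern."""
    return IDENTIFIER_RE.fullmatch(identifier) is not None


def split_volume(volume: str):
    """Split a Volume value into (volume, number).

    Returns (volume, None) when no number part is present, and None when
    the value does not look like "18" or "18.2" at all.
    """
    match = VOLUME_RE.fullmatch(volume.strip())
    if match is None:
        return None
    number = int(match.group(2)) if match.group(2) else None
    return int(match.group(1)), number


ok = is_valid_identifier("wbg.18.2.0001")  # hypothetical identifier
parts = split_volume("18.2")               # volume and number kept apart
```

A helper like split_volume could also be useful later for separating the existing pap_volume entries that already contain a non-digit, although values it returns None for would still need to be checked by hand.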
A specific example with all information displayed:
Title Antibiotic markers for rapid selection and easy maintenance of transgenic nematodes
Journal Worm Breeder's Gazette
Author Giordano-Santini, Rosina//Semple, Jennifer//Dupuy, Denis//Lehner, Ben
Affiliation Genome Regulation and Evolution, Inserm Unit 869, and Institut Europeen de Chimie et Biologie, Bordeaux, France
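To make the format concrete, here is a minimal Python sketch of how a record like the one above could be read. It is not the actual parse.pl, whose behavior may differ. It assumes one tag and one value per line separated by a tab, authors separated by //, repeated tags (such as several Affiliation lines) collected as multiple values, and the fixed WBG values filled in when they are missing. The names FIXED_FIELDS and parse_record are made up for this example, and the author_person curation flag is not handled here.

```python
from typing import Dict, Iterable, List

# Values that do not change for WBG articles (taken from the notes above).
FIXED_FIELDS = {
    "Status": "Valid",
    "Type": "Gazette_article",
    "Journal": "Worm Breeder's Gazette",
    "Primary_data": "not_designated",
}


def parse_record(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse tab-separated tag/value lines into a tag -> list of values map."""
    record: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        tag, sep, value = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab between tag and value: {line!r}")
        # Repeated tags accumulate, so multiple Affiliation lines are kept.
        record.setdefault(tag.strip(), []).append(value.strip())
    if "Author" in record:
        # Authors are listed as "Last, First//Last, First//..."
        record["Author"] = [
            name.strip()
            for entry in record["Author"]
            for name in entry.split("//")
            if name.strip()
        ]
    for tag, value in FIXED_FIELDS.items():
        record.setdefault(tag, [value])  # fill in only if absent
    return record


example = [
    "Title\tAntibiotic markers for rapid selection and easy maintenance of transgenic nematodes",
    "Journal\tWorm Breeder's Gazette",
    "Author\tGiordano-Santini, Rosina//Semple, Jennifer//Dupuy, Denis//Lehner, Ben",
    "Affiliation\tGenome Regulation and Evolution, Inserm Unit 869, and Institut Europeen de Chimie et Biologie, Bordeaux, France",
]
record = parse_record(example)
```

Keeping every field as a list means a single-valued field and a multivalue field such as Affiliation can be handled the same way downstream.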