Difference between revisions of "Entering WBG Articles"
Latest revision as of 21:04, 10 February 2012
Entering WBG Article into Postgres
We would like to enter WBG articles into postgres as soon as they become available online so they can be incorporated into WormBase as WBPapers.
- Check text files of abstract titles, authors, and affiliations for special characters (e.g. vowels with accents) and convert them to standard characters before reading into the database, since ACeDB doesn't allow special characters.
- Articles can be read into the development database and checked using the paper editor on mangolassi.caltech.edu.
- Parsing script: /home/postgres/work/pgpopulation/pap_papers/abstracts/wbg_sample/parse.pl
- To Juancarlos: Before running the parsing script, be sure to convert the file from DOS to Unix line endings with 'fromdos <filename>' -- J
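The two clean-up steps in the list above (replacing accented characters and converting DOS line endings) could be sketched in Python as follows. This is only an illustration and not part of parse.pl; the function names `to_plain_ascii` and `non_ascii_chars` are hypothetical, and the approach assumes that dropping the accent from a letter is an acceptable "standard character" substitute.

```python
import unicodedata


def to_plain_ascii(text: str) -> str:
    """Normalize line endings and strip accents from letters.

    Hypothetical helper for illustration; not part of parse.pl.
    """
    # Convert DOS (CRLF) and bare CR line endings to Unix (LF),
    # which is the effect 'fromdos' is used for above.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # NFKD decomposition splits an accented letter into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def non_ascii_chars(text: str) -> list:
    """Return the distinct non-ASCII characters still present in the text."""
    return sorted({ch for ch in text if ord(ch) > 127})


if __name__ == "__main__":
    raw = "Institut Europ\u00e9en de Chimie et Biologie\r\n"
    cleaned = to_plain_ascii(raw)
    print(repr(cleaned))
    # Anything listed here has no simple decomposition and needs manual review.
    print(non_ascii_chars(cleaned))
```

Characters that have no decomposition (for example the German sharp s) pass through unchanged, so checking the output of `non_ascii_chars` before loading the file is a reasonable last step.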
General File Format
Here is a general file format for uploading to postgres, followed by a specific example. Note that the parsing script expects field names (tags) and field values (data) to be separated by tabs.
Note that for WBG articles, there are some values that do not change, e.g. Status (Valid), Type (Gazette_article), Journal (Worm Breeder's Gazette), and Primary_data (not_designated). We also need to flag these papers for author_person curation under the curation flags. Also, the identifiers are created according to the information in the volume, number, and page. These special fields are italicized below.
Identifier wbg.nn.n.nnnn (where n = a number)
Status Valid
Title First letter capitalized
Journal Worm Breeder's Gazette
Volume nn (where n = a number)
  (We currently don't have a field for number in postgres, so should we create Volume numbers such as 18.2 or 18.3 using the Number of the latest Gazette issue? Or ignore the Number field entirely? --K.)
  We do; it's between pages and year, and it has an 18 for this set from Daniel's file. -- J
  Need to clarify: I meant that I didn't see a field specifically for Number. For example, if a journal was Volume 18 Number 2, I was thinking about the Number part. Does that make sense now? --K.
  Oh, I see. That's because the Volume is a text field, so it could say 18.2, for example. If you run the query SELECT * FROM pap_volume WHERE pap_volume ~ '[^0-9]'; you'll see 3130 entries that already contain a non-digit. So we could do it, but then we would potentially have to go through and separate what we already have. -- J
  Yes, I think that for now we could add the Number (or Issue, to use PubMed-speak) in the form of 18.2 or 18.3 and then think about how to handle this going forward. --K
Year 20nn (where n = a number)
Month nn (where n = a number)
Page n or nn (where n = a number)
Author listed as Last Name, First Name//Last Name, First Name//Last Name, First Name etc.
Affiliation list all affiliations
  (How do we want to handle multiple affiliations? Comma separate? --K.)
  It's a multivalue field, so if the data file had them on separate lines, each with its own tag, it would presumably work, but I haven't tested that. The file, at a glance, has one affiliation entry per paper entry. -- J
  Okay, we will need to try adding multiple affiliations on separate lines. The first entry for the Volume 18 Number 2 issue of the WBG has three different affiliations. See http://www.wormbook.org/wbg/articles/volume-18-number-2/antibiotic-markers-for-rapid-selection-and-easy-maintenance-of-transgenic-nematodes-2/ --K.
  Sure, just edit Daniel's files and let me know when it's done so I can repopulate it. If you don't mind me just making another set in the sandbox, that would be easier, but if you'd like me to repopulate the database from an old dump I can do that too (it just takes a while, though not too long). -- J
  If I'm understanding correctly, making another set in the sandbox should be fine. I'll just do the wbg18 query again and look at what we've got? --K.
URL - corresponds to the URL of the article in the PDF document
Type Gazette_article
Primary_data not_designated
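As a rough illustration of how the fixed and derived fields in the format above fit together, here is a minimal Python sketch. The names `FIXED_FIELDS`, `make_identifier`, and `make_volume` are hypothetical and do not come from parse.pl. The identifier layout follows the form wbg18.2.0002 (volume, number, page zero-padded to four digits, no dot after "wbg"), which is an assumption because the general pattern above is written as wbg.nn.n.nnnn. The Volume.Number form (e.g. 18.2) is the convention proposed in the Volume discussion.

```python
# Values that do not change for WBG articles, as listed above.
FIXED_FIELDS = {
    "Status": "Valid",
    "Type": "Gazette_article",
    "Journal": "Worm Breeder's Gazette",
    "Primary_data": "not_designated",
}


def make_identifier(volume: int, number: int, page: int) -> str:
    """Build an identifier such as 'wbg18.2.0002'.

    Assumption: page is zero-padded to four digits and there is
    no dot between 'wbg' and the volume.
    """
    return f"wbg{volume}.{number}.{page:04d}"


def make_volume(volume: int, number: int) -> str:
    """Combine volume and issue number into one text value, e.g. '18.2'."""
    return f"{volume}.{number}"


if __name__ == "__main__":
    record = dict(FIXED_FIELDS)
    record["Identifier"] = make_identifier(18, 2, 2)
    record["Volume"] = make_volume(18, 2)
    # Write each field as tag<TAB>value, the layout the parsing script expects.
    for tag, value in record.items():
        print(f"{tag}\t{value}")
```

Keeping the fixed values in one place makes it harder for a hand-edited file to drift from the expected Status, Type, Journal, and Primary_data entries.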
Specific Example
A specific example with all information displayed:
Identifier wbg18.2.0002
Status Valid
Title Antibiotic markers for rapid selection and easy maintenance of transgenic nematodes
Journal Worm Breeder's Gazette
Volume 18
Year 2010
Month 06
Page 2
Author Giordano-Santini, Rosina//Semple, Jennifer//Dupuy, Denis//Lehner, Ben
Affiliation Genome Regulation and Evolution, Inserm Unit 869, and Institut Europeen de Chimie et Biologie, Bordeaux, France
URL http://www.wormbook.org/wbg/volumes/volume-18-number-2/pdf/wbg-volume-18-number-2.02.pdf
Type Gazette_article
Primary_data not_designated
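To check a file before handing it to the parsing script, a record like the one above could be read with a short Python sketch such as the following. It is not a reimplementation of parse.pl; `parse_record` and `split_authors` are hypothetical names. It assumes one tab between tag and value, authors separated by "//", and repeated tags (such as several Affiliation lines) collected as multiple values, which the discussion above describes as untested.

```python
def parse_record(text: str) -> dict:
    """Parse tag<TAB>value lines into a dict mapping tag -> list of values."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tag, sep, value = line.partition("\t")
        if not sep:
            # The parsing script expects a tab between tag and value.
            raise ValueError(f"no tab between tag and value: {line!r}")
        # Repeated tags accumulate, so multivalue fields keep every entry.
        record.setdefault(tag.strip(), []).append(value.strip())
    return record


def split_authors(value: str) -> list:
    """Split 'Last, First//Last, First' into individual author names."""
    return [name.strip() for name in value.split("//") if name.strip()]


if __name__ == "__main__":
    sample = (
        "Identifier\twbg18.2.0002\n"
        "Status\tValid\n"
        "Author\tGiordano-Santini, Rosina//Semple, Jennifer//Dupuy, Denis//Lehner, Ben\n"
    )
    rec = parse_record(sample)
    print(rec["Identifier"][0])
    print(split_authors(rec["Author"][0]))
```

Raising an error on a line without a tab catches the most likely formatting mistake, spaces used in place of the tab separator, before the data reaches the database.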