Paper Pipeline - To Do List
Short-Term
Correct invalid PMIDs and their associated paper types - Kimberly
- How often do we want to check for invalid PMIDs?
- PubMed maintains a file of obsolete PMIDs on its FTP site: ftp://ftp.ncbi.nlm.nih.gov/pubmed/deleted_pmids.txt
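A minimal sketch of how our PMIDs could be compared against the obsolete-PMID file, whatever check frequency we settle on. It assumes the file lists one numeric PMID per line and that our PMIDs are available as strings; the function names and sample values are hypothetical, and downloading the file is left out.

```python
def parse_pmid_list(text):
    """Collect PMIDs from text that holds one numeric PMID per line (assumed format)."""
    pmids = set()
    for line in text.splitlines():
        token = line.strip()
        if token.isdigit():  # skip blank lines and anything that is not a bare number
            pmids.add(token)
    return pmids


def find_obsolete(our_pmids, deleted_pmids):
    """Return the PMIDs we hold that also appear in the deleted list, in numeric order."""
    return sorted(set(our_pmids) & set(deleted_pmids), key=int)


# Hypothetical sample data standing in for the downloaded file and our records.
deleted_text = "11111\n22222\n\n33333\n"
our_pmids = ["22222", "44444", "33333"]

obsolete = find_obsolete(our_pmids, parse_pmid_list(deleted_text))
print(obsolete)
```

Any PMID reported here would then need its paper type and other PubMed-derived fields reviewed by hand.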
Write up a summary of the weekly checking script - Kimberly and Juancarlos
- Runs every Sunday at 1 AM.
- Checks paper entries in postgres that are missing pages, title, type, or year.
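For the write-up, the core of the check can be illustrated as below. This is only a sketch of the logic described above, not the actual script: it works on in-memory dictionaries rather than postgres, and the field names, paper IDs, and the rule that empty or whitespace-only values count as missing are assumptions.

```python
REQUIRED_FIELDS = ("pages", "title", "type", "year")  # fields named in the summary above


def missing_fields(paper):
    """List the required fields that are absent, None, or blank for one paper."""
    return [f for f in REQUIRED_FIELDS if not str(paper.get(f) or "").strip()]


def incomplete_papers(papers):
    """Map each incomplete paper's ID to the fields it is missing."""
    report = {}
    for paper in papers:
        missing = missing_fields(paper)
        if missing:
            report[paper["id"]] = missing
    return report


# Hypothetical records; real entries would come from postgres.
papers = [
    {"id": "WBPaper00000001", "pages": "1-10", "title": "A", "type": "ARTICLE", "year": 2009},
    {"id": "WBPaper00000002", "pages": "", "title": "B", "type": None, "year": 2008},
    {"id": "WBPaper00000003", "title": "  ", "type": "REVIEW", "year": 2007},
]

report = incomplete_papers(papers)
print(report)
```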
Add a limited number of new paper types to allow for single type classification - Kimberly and Juancarlos
- From our conversion of papers with type OTHER, we found several cases where papers also had a single type, but that type is not on our current list of paper types:
- Type (number of papers)
- Interview (4)
- Lectures (2)
- Congresses (1)
- Interactive tutorial (1)
Finish documentation of current paper type mappings (PubMed vs postgres vs Journal) for informing SVMs and Textpresso searches - Caltech
- For CCC, I'd initially just searched through all paper types, but removing some paper types from the searches would more accurately reflect the real curation pipeline.
- Eliminating some types is obvious, e.g. MEETING and GAZETTE ABSTRACTS, REVIEWS, but what about the others, e.g. COMMENT, LETTERS, etc?
- It occurred to me that I don't really know what types of papers are included in some of the other paper types like COMMENT and LETTERS, so it'd be good to know so I can make a better decision about what to include.
- To start to address this, I looked at all papers in postgres with type COMMENT and checked both their PubMed paper type classification and what these papers actually are in the journals: Summary of Papers Labeled COMMENT
- This led to the conclusion that type COMMENT should NOT be included in the CCC pipeline.
- Additional postgres types that should be checked: NEWS, NOTE, LETTER, MONOGRAPH, EDITORIAL and perhaps a few others like CORRECTION and ERRATUM
- This process has also been informative for eventually converting to multiple paper types, since we can see which paper types get assigned in PubMed and whether there will be any problems if/when we convert to the multiple type model.
Decide how to handle upload of Genetics papers - Karen, Juancarlos, Kimberly
- At the point where Genetics approves a paper for publication and assigns a DOI, the DOI is to be entered into a CGI ticket form to generate a WBPaperID.
- In addition to the WBPaperID, the ticket form produces a password-protected URL for a journal first-pass form specific to GENETICS papers undergoing Textpresso mark-up, which Tracey sends to the author by e-mail. This form is a special version of the normal author first-pass form in that it specifically asks for object names, and the entered data is made available to Arun for mark-up.
- (Juancarlos needs to fill this part out) The DOI information for the paper was to be used to filter out that paper coming through the normal pipeline.
However, we hit a snag, as it seems that the journal releases to PubMed a Published Ahead of Print version of the paper very soon (1-7 days?) after the DOI is assigned. The ahead-of-print version is then available to the WBPaper pipeline, and will be entered and assigned a WBPaperID independently.
The immediate impact is that each of the two papers being marked up for the August print release of Genetics could be assigned at least two WBPaperIDs. We will deal with this manually for now.
For the future, there are many ways to solve this, but the two main issues that need to be addressed are:
- We need to maintain an unchanged, journal-provided identifier for the paper until the final print version is released, since the title, authors, or other details may change between the GENETICS XML, the ahead-of-print XML, and the print version. The identifier needs to be with the paper at the point Arun receives the XML.
- We need to avoid sending these authors the author first-pass form, since the form they should fill out is the journal first-pass form. Otherwise, they will not be prompted for specific objects that might be missing from Arun's markup files.
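One possible way to address both issues is to key papers on the DOI, so that a paper arriving through the normal pipeline reuses the WBPaperID created by the ticket form and is flagged for the journal first-pass form. The sketch below only illustrates that idea under stated assumptions: the class, the "ticket" and "pubmed" source labels, the sequential ID numbering, and the DOI are all hypothetical, and it presumes the ahead-of-print PubMed record carries the same DOI.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip common prefixes so equivalent forms compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


class PaperRegistry:
    """Hypothetical in-memory registry mapping DOIs to WBPaperIDs."""

    def __init__(self):
        self._records = {}
        self._counter = 0

    def register(self, doi, source):
        """Return (wbpaper_id, created); reuse the existing ID if the DOI is known."""
        key = normalize_doi(doi)
        record = self._records.get(key)
        if record is not None:
            return record["id"], False
        self._counter += 1
        record = {"id": "WBPaper%08d" % self._counter, "source": source}
        self._records[key] = record
        return record["id"], True

    def first_pass_form(self, doi):
        """Papers first entered through the ticket form get the journal form."""
        record = self._records[normalize_doi(doi)]
        return "journal" if record["source"] == "ticket" else "author"


registry = PaperRegistry()
# Placeholder DOI, not a real paper.
ticket_id, ticket_new = registry.register("10.1534/genetics.000.000000", "ticket")
pubmed_id, pubmed_new = registry.register("doi:10.1534/GENETICS.000.000000", "pubmed")
form = registry.first_pass_form("10.1534/genetics.000.000000")
print(ticket_id, pubmed_id, pubmed_new, form)
```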
Long-Term
Discuss timeline and implications for changing the Paper models and Paper editor to allow for multiple types - WormBase
- Which types should we include? What if we need to add more types later?
- Will any of the current paper type mappings be retired? If so, which ones?
- Do we want to include paper types that PubMed doesn't have, e.g. Book?
- How will we update our records? What should be done with history?
- Can we compare the PubMed and postgres mappings to find possible discrepancies before we overwrite the paper type in postgres? For example, if PubMed has something labeled as a Journal Article, but we have it in postgres as a Review, which paper type should be kept? How many of these cases will there be?
- If we want to add more types than exist in PubMed, can we distinguish how each type was added?
- Will this affect the web display for papers in any way?
- Current list of all PubMed types represented by papers previously classified as type OTHER in postgres:
journal article
comparative study
evaluation studies
review
english abstract
validation studies
case reports
in vitro
interview
letter
lectures
comment
historical article
controlled clinical trial
interactive tutorial
biography
randomized controlled trial
retracted publication
congresses
multicenter study
Decide if, and how, we want to run a script that cross-checks PubMed and postgres data - Caltech
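To help frame that decision, here is a rough sketch of what a type cross-check could look like. It assumes PubMed types and postgres types have already been fetched into dictionaries keyed by PMID; the mapping table, the postgres type names in it, and the sample PMIDs are hypothetical placeholders rather than our documented mappings.

```python
def find_type_discrepancies(pubmed_types, postgres_types, mapping):
    """Return (pmid, postgres_type, expected_types) where postgres disagrees with PubMed.

    pubmed_types: PMID -> list of PubMed publication types
    postgres_types: PMID -> single postgres paper type
    mapping: lower-cased PubMed type -> postgres type (unmapped types fall back to OTHER)
    """
    discrepancies = []
    for pmid, pg_type in sorted(postgres_types.items()):
        pm_types = pubmed_types.get(pmid)
        if pm_types is None:
            continue  # no PubMed record to compare against
        expected = {mapping.get(t.lower(), "OTHER") for t in pm_types}
        if pg_type not in expected:
            discrepancies.append((pmid, pg_type, sorted(expected)))
    return discrepancies


# Hypothetical mapping and records for illustration only.
mapping = {"journal article": "ARTICLE", "review": "REVIEW", "comment": "COMMENT"}
pubmed_types = {
    "100": ["Journal Article"],
    "200": ["Journal Article", "Review"],
    "300": ["Comment"],
}
postgres_types = {"100": "REVIEW", "200": "REVIEW", "300": "COMMENT", "400": "ARTICLE"}

discrepancies = find_type_discrepancies(pubmed_types, postgres_types, mapping)
print(discrepancies)
```

A report like this would also give a count of conflicting cases before any paper types are overwritten in postgres.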
Explore idea of initial binary paper classification in PubMed for Journal Articles - primary experimental data, no primary experimental data - WormBase, others