Missing PMIDs

From WormBaseWiki

email thread subject: XML of PubMed-indexed journals to filter search

Objectives:

1) Add missing PMIDs to WBPaper objects wherever possible

2) When papers have not been indexed by PubMed, add that information to postgres along with details of indexing; display on paper editor

3) When no PMID is found, check for possible bibliographic errors in WBPaper objects, make corrections if needed

4) Where possible, add doi's for papers not indexed by PubMed

5) If time permits, replace journal titles with standard NLM journal abbreviations


1) Add missing PMIDs to WBPaper objects wherever possible

As of 9/1/11, there were 997 WBPaper objects that did not have a PMID. We know that some paper objects in WB will likely never have a PMID, for example corrections and errata, but if a paper *does* have a PMID in PubMed that is missing from its WBPaper object, then we want to add that PMID wherever possible.

Strategy: Using XML file of journals currently indexed for MEDLINE, compile a list of WBPaper objects that lack a PMID but for which the corresponding journal is currently indexed for MEDLINE.
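
The strategy above could be sketched roughly as follows. This is only an illustration: the `IndexingSource` elements follow the journal XML excerpt quoted under objective 2 on this page, but the enclosing `JournalList`, `Journal` and `Title` element names, the sample records, and the paper dictionaries are assumptions, not the real file layout or the script that was actually run.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real journal XML may use different wrapper elements.
SAMPLE_XML = """
<JournalList>
  <Journal>
    <Title>Journal A</Title>
    <IndexingSourceList>
      <IndexingSource>
        <IndexingSourceName IndexingTreatment="Full" IndexingStatus="Currently-indexed">PubMed</IndexingSourceName>
        <Coverage>v1n1, July 2005-</Coverage>
      </IndexingSource>
    </IndexingSourceList>
  </Journal>
  <Journal>
    <Title>Journal B</Title>
    <IndexingSourceList/>
  </Journal>
</JournalList>
"""


def indexed_journals(xml_text, source="PubMed"):
    """Return {lowercased title: coverage} for journals currently indexed by `source`."""
    result = {}
    root = ET.fromstring(xml_text)
    for journal in root.iter("Journal"):
        title = (journal.findtext("Title") or "").strip()
        for src in journal.iter("IndexingSource"):
            name = src.find("IndexingSourceName")
            if name is None:
                continue
            is_source = (name.text or "").strip() == source
            if is_source and name.get("IndexingStatus") == "Currently-indexed":
                result[title.lower()] = (src.findtext("Coverage") or "").strip()
    return result


def papers_to_check(papers, indexed):
    """Keep papers that lack a PMID but whose journal is currently indexed."""
    return [
        p for p in papers
        if not p.get("pmid")
        and (p.get("journal") or "").strip().lower() in indexed
    ]


# Made-up paper records, for illustration only.
papers = [
    {"id": "WBPaper_A", "journal": "Journal A", "pmid": None},
    {"id": "WBPaper_B", "journal": "Journal B", "pmid": None},
    {"id": "WBPaper_C", "journal": None, "pmid": None},
    {"id": "WBPaper_D", "journal": "journal a", "pmid": "12345"},
]
indexed = indexed_journals(SAMPLE_XML)
candidates = papers_to_check(papers, indexed)
```

Matching on lowercased titles is a simplification; journal names in WBPaper objects may be abbreviated or spelled differently, so a real run would probably need to compare against abbreviations as well.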

Results:

189 WBPapers lacking a PMID were published in a journal indexed for MEDLINE.

 Daniel looked over this list and found: 82 papers (~43%) had a PMID; 79 papers (~42%) could still not be found in PubMed; 26 papers (~14%) correspond to published errata.

The 79 remaining papers that were not found will need to be checked to make sure the bibliographic data are correct.

Published errata are not indexed by PubMed; they won't get a PMID, but should always have an Erratum_for/Erratum_in relationship with the original paper.

645 WBPapers lacking a PMID also lack a Journal entry.

 To address these WBPapers we need to break this list of 645 papers down by Paper Type, e.g. Review, Book chapter, so that we can initially focus our efforts on tracking down journal titles where we're likely to find them. Book chapters, for example, will not have a journal title.

163 (?) WBPapers are published in a journal not indexed for MEDLINE.

 The question mark after this number reflects that this number is the difference between 997 and the sum of the other two 
 categories, but I haven't verified this number any other way.
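
As a quick check of the arithmetic, using only the counts given on this page (the variable names are just for illustration):

```python
total_without_pmid = 997
in_indexed_journal = 189
no_journal_entry = 645

# The third category is obtained by subtraction, not counted independently.
not_indexed = total_without_pmid - (in_indexed_journal + no_journal_entry)

# Breakdown of the 189 papers that Daniel reviewed, as listed above.
reviewed = {"had a PMID": 82, "not found in PubMed": 79, "published errata": 26}
percentages = {k: round(100 * v / in_indexed_journal) for k, v in reviewed.items()}

# Papers among the 189 that are not covered by the three listed categories.
unaccounted = in_indexed_journal - sum(reviewed.values())
```

This reproduces the 163 and the rounded percentages. It also shows that the three reviewed categories as listed add up to 187, so 2 of the 189 papers are not broken out in these notes.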

Next Steps:

For 189 papers that Daniel examined:

82 WBPapers with a PMID: add the identifier to the pap_identifier table

Kimberly sent tab-delimited text file list of 82 to Juancarlos for entry into postgres.

26 WBPapers are errata: don't add the PMID of the original article (PubMed doesn't index errata or corrections, so we don't want to give the errata a PMID. We should, however, enter the doi of the errata (see below) wherever we can.)

If we could get the list of papers published in journals NOT indexed for MEDLINE, then we could possibly add doi's or check the bibliographic record, etc (see below).


2) When papers have not been indexed by PubMed, add that information to postgres along with details of indexing; display on paper editor:

I'd like to be able to record in postgres and display in the paper editor the fact that someone has checked a WBPaper for a PMID and can't find it. That way, we'll know we looked and won't keep wondering why a paper doesn't have a PMID. It is possible that MEDLINE will index a paper long after it's come out; we can decide how often it makes sense to re-check PubMed or if in any of these cases it makes sense to contact PubMed and ask them to index a paper.

First we need to check the bibliographic record to make sure it is correct in postgres. Kimberly will do this. See #3 below.

Next steps:

Here's an idea for recording the outcome of 'not found' in postgres:

Create a new paper table, pap_pmid_check

Populate that table with the WBPaper ID, the result of the search ('not found'), and the Coverage information from the corresponding journal's XML file that pertains specifically to PubMed indexing:

 <IndexingSourceList>
     <IndexingSource>
         <IndexingSourceName IndexingTreatment="Full" IndexingStatus="Currently-indexed">PubMed</IndexingSourceName>
         <Coverage>v1n1, July 2005-</Coverage>
     </IndexingSource>
 </IndexingSourceList>

So in this example, the table would hold the WBPaper ID, the text string 'not found', and Coverage v1n1, July 2005-
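
A minimal sketch of what such a table might hold. The storage design had not been decided, so everything here is an assumption: the column names are hypothetical, `WBPaper00000000` is a placeholder ID, and an in-memory SQLite database stands in for postgres so that the example is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# The XML excerpt from the journal file, as quoted above.
SNIPPET = """<IndexingSourceList>
  <IndexingSource>
    <IndexingSourceName IndexingTreatment="Full" IndexingStatus="Currently-indexed">PubMed</IndexingSourceName>
    <Coverage>v1n1, July 2005-</Coverage>
  </IndexingSource>
</IndexingSourceList>"""


def pubmed_coverage(xml_text):
    """Return the Coverage text of the PubMed indexing source, or None."""
    root = ET.fromstring(xml_text)
    for src in root.iter("IndexingSource"):
        name = (src.findtext("IndexingSourceName") or "").strip()
        if name == "PubMed":
            return (src.findtext("Coverage") or "").strip()
    return None


# SQLite stand-in for postgres; column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pap_pmid_check ("
    " paper_id TEXT PRIMARY KEY,"
    " result TEXT NOT NULL,"
    " coverage TEXT)"
)
conn.execute(
    "INSERT INTO pap_pmid_check (paper_id, result, coverage) VALUES (?, ?, ?)",
    ("WBPaper00000000", "not found", pubmed_coverage(SNIPPET)),
)
row = conn.execute(
    "SELECT paper_id, result, coverage FROM pap_pmid_check"
).fetchone()
```

A date-checked or curator column could be added in the same way if we want to record when and by whom PubMed was last searched.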

The reason I'd like to store the coverage information is that it helps to indicate why a particular article from a journal may not have a PMID, e.g. if PubMed indexing for that journal began after the article was published.

We could display this information (not found, Coverage v1n1, July 2005-) in the paper editor in a field named pmid_check below the identifier field. We wouldn't have to dump this in the .ace file (unless Textpresso would like to have this information?). The field could be editable by a curator, though.

Still deciding on how to store this information in postgres. Probably will not dump it into .ace file, as lack of PMID is already being handled well on the WB web site.


3) When no PMID is found, check for possible bibliographic errors in WBPaper objects, make corrections if needed:

If a paper is published in a journal that *is* currently indexed for PubMed, then there are several possibilities I can think of:

a) as described above, the paper was published before the journal started being indexed for PubMed and they haven't retroactively indexed the older articles

b) the journal is normally indexed, but the particular article we have is of a type not typically indexed (see WBPaper00013369 and its accompanying pdf for an example)

c) there's something incorrect about our WBPaper entry - see WBPaper00000816 for an example. WBPaper00000816 is attributed to the journal Science, but if you look at the pdf, the article actually looks to be from the periodical Science85. We'll also need to correct the author entry for WBPaper00000816; somehow Stephen morphed into Sarah :-).

Next steps: For articles that should have been indexed (i.e. they fall within the correct time frame and are typically those that would be indexed) double-check the bibliographic data to see if there's a mistake in the WBPaper entry.

This will take a bit of time, but probably the first step would be to determine if the article was published within the correct indexing time frame, then if the article is of a type typically indexed, then look for bibliographical errors.
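
The order of checks described above could be sketched like this. It is a simplification under stated assumptions: comparison is by year only, the start year is taken to be the first four-digit year in the Coverage string, and the function names and return strings are made up for illustration.

```python
import re


def coverage_start_year(coverage):
    """Return the first four-digit year in a Coverage string, or None.

    Assumes that the first year mentioned is when indexing began.
    """
    match = re.search(r"\b(?:18|19|20)\d{2}\b", coverage or "")
    return int(match.group()) if match else None


def triage(pub_year, coverage, type_usually_indexed):
    """Suggest why a paper has no PMID, following the a/b/c cases above."""
    start = coverage_start_year(coverage)
    if start is not None and pub_year < start:
        return "published before indexing began"
    if not type_usually_indexed:
        return "article type not typically indexed"
    return "check bibliographic data"
```

For example, `triage(2003, "v1n1, July 2005-", True)` returns `"published before indexing began"`, while a 2008 research article from the same journal would fall through to `"check bibliographic data"`.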


4) Where possible, add doi's for papers not indexed by PubMed:

For WBPaper objects that don't have a PMID, it's probably especially critical to try to get a doi for that WBPaper if it's available.

Since we can't use PubMed records to help retrieve the doi's, this could take some manual effort unless we can come up with a way to get doi's semi-automatically from publishers' sites.

This might also be an opportunity to collect missing pdf's, see WBPaper00013320 for example.

Next steps:

From Daniela:

I have found a page on the CrossRef website that allows retrieving doi numbers from bibliographic data.

They say: "This interface is not intended for automated querying. If you would like to query CrossRef on an automated batch basis, please obtain an account on our system." So, with an account, we might be able to do it programmatically.

http://www.crossref.org/guestquery/
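
The guest query page is not meant for automated use. As an alternative sketch only: CrossRef also offers a public REST API at `https://api.crossref.org/works` that accepts free-text citations through the `query.bibliographic` parameter. This is not the interface used in the work described on this page, and the helper names below are made up for illustration. The network call is wrapped in a function and is not executed here.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.crossref.org/works"


def build_query_url(citation, rows=1):
    """Build a CrossRef REST API URL for a free-text citation."""
    params = {"query.bibliographic": citation, "rows": rows}
    return API + "?" + urllib.parse.urlencode(params)


def best_match(response):
    """Pull DOI, title and score of the top hit out of a parsed JSON response."""
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    top = items[0]
    titles = top.get("title") or [""]
    return {"doi": top.get("DOI"), "title": titles[0], "score": top.get("score")}


def lookup(citation):
    """Query CrossRef over the network and return the top hit, if any."""
    with urllib.request.urlopen(build_query_url(citation)) as resp:
        return best_match(json.load(resp))
```

The top hit is only a candidate; as discussed in the thread below, a curator would still need to confirm that the returned DOI belongs to the right paper.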

From Daniela:

Hi guys,

I registered with CrossRef for batch queries. I tried copying 3 random references from WormBase and fed them into the query box and it worked fine (see attached screenshot). It is very straightforward. Once I am back from vacation I will talk to Juancarlos so we can create a flat file containing the list of citations that do not have a PMID and for which we want to retrieve the doi, and we'll paste it into their browser. We'll see how many we can retrieve.

From Juancarlos:

I'm not sure if the issue is whether to add the ones (J - do you mean PMIDs or doi's here? It sounds like PMIDs. I would advocate parsing the file that Daniel sent and can help with figuring out how to write the parsing script. -k) that Daniel found or not. If they should be added, whoever finds them should add them through the paper editor as they find them (or however they want to do it). I added the last set of 30something manually because it was faster than writing a script and checking on the sandbox then tazendra then checking again. The current set of 80something is probably getting close to making a script worth it, assuming there are no mistakes entering data. (I still don't know if they should be added though)

If you can figure out how we can do this (Here, I'm assuming this means query the service that Daniela sent - k.), then I can try to do it programmatically, but again, be aware that a given title + author could match to different papers (abstracts ?) or title / journal abbreviations could be different. The point being, I'm not sure how much you can trust the results, or how much you want to check on the correctness of it. Also, who do we use as the curator for evidence ?

I don't think we'll be able to automatically enter any kind of identifier into the paper editor without a curator first verifying that the identifier is correct. So, we'll need to set up an efficient way for a curator to check the doi's. It makes sense to me to use the curator who checked the doi query results as evidence. -k

This is just to add DOI info to wbpapers ?

Yes. This is, I think, particularly important for WBPapers that don't have a corresponding PMID. -k
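
One way to make the curator check more efficient might be to compare the title we hold with the title that came back from the query, and to flag poor matches for closer attention. This is a sketch only: it uses Python's standard `difflib`, the 0.9 threshold is an arbitrary assumption, and it is meant to prioritize the curator's work, not to replace verification.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in title)
    return " ".join(cleaned.split())


def title_similarity(a, b):
    """Similarity ratio between 0 and 1 for two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def needs_close_review(our_title, returned_title, threshold=0.9):
    """True when the returned title differs noticeably from ours."""
    return title_similarity(our_title, returned_title) < threshold
```

Titles that differ only in case or punctuation score 1.0 and would not be flagged, whereas two different papers that merely share the words "Caenorhabditis elegans" would be.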


ACTION ITEMS:

Daniela requested from Juancarlos a list of papers with no PMID in order to retrieve the DOI via XREF. (11-28-11):

Our aim is to find the DOI number for papers that do not have a PMID.

We need a list of Papers that DO NOT have a PMID. For each paper we need the full citation, like in the examples below:

Yeh E, Kawano T, Ng S, Fetter R, Hung W, Wang Y, Zhen M. 2009. Caenorhabditis elegans innexins regulate active zone differentiation. J Neurosci 29:5207-17.

Altun ZF , Chen B , Wang ZW , Hall DH . 2009. High resolution map of Caenorhabditis elegans gap junction proteins. Dev Dyn 238:1936-50.

A text file with each reference separated from the next by a line break would be perfect.
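
A sketch of how such a file could be generated from paper records, one citation per line in the style of the first example above. The field names and helper functions are hypothetical; the real data live in postgres and may be structured differently.

```python
def format_citation(authors, year, title, journal, volume, pages):
    """Format one reference as 'Authors. Year. Title. Journal Vol:Pages.'"""
    author_str = ", ".join(a.strip() for a in authors)
    title = title.strip().rstrip(".")
    return f"{author_str}. {year}. {title}. {journal} {volume}:{pages}."


def write_citation_file(records, path):
    """Write one formatted citation per line to a plain text file."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(format_citation(**rec) + "\n")


# The first example citation above, expressed as a record.
record = {
    "authors": ["Yeh E", "Kawano T", "Ng S", "Fetter R", "Hung W", "Wang Y", "Zhen M"],
    "year": 2009,
    "title": "Caenorhabditis elegans innexins regulate active zone differentiation",
    "journal": "J Neurosci",
    "volume": "29",
    "pages": "5207-17",
}
line = format_citation(**record)
```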

RESULT: the output file with the list of objects having neither a PMID nor a DOI is located here: /home/postgres/work/pgpopulation/pap_papers/20111128_nopmid_nodoi/out

The list is separated by paper type.

List summary:
128 papers no type (possible to retrieve DOI)
416 Reviews (possible to retrieve DOI)
25 Published Errata (possible to retrieve DOI)
464 Journal Articles (possible to retrieve DOI)
1 Letter Comment (possible to retrieve DOI)
12 News (possible to retrieve DOI)
4 Other (possible to retrieve DOI)
142 Books (Books should have DOI but with the current formatting XREF gives a parsing error. Should look into it)
175 Book Chapters (possible to retrieve DOI only for 1 of them --included it in the list Boyle et al., 2008. 4984 37-47)
1 Editorial (crashed server should try again. Tried again 2x. Did not work--ignore)
17964 Meeting Abstracts (no DOI)
3005 Gazette Articles (no DOI)
1 Monograph (no DOI)
2 Letters (no DOI)
15 Wormbook Book Chapters (no DOI)

Pilot: We focused on research articles. We tried to retrieve DOIs for 40 papers in 2 ways: 1) copying and pasting the full reference from WormBase into the XREF site (manual approach), and 2) copying and pasting Juancarlos' output directly into the XREF site (automatic approach).

RESULTS:

  • Manual approach: 18 retrieved DOI; when no DOI -> 9 retrieved PMID. 2 had both DOI and PMID
  • Automatic approach: 15 retrieved DOI; when no DOI -> 9 retrieved PMID. 2 had both DOI and PMID

We will go with the automatic approach, as we lose only 3 papers. We can retrieve the remaining ones manually in a second step.

  • Trial on Reviews: pilot on 15 reviews: 10 retrieved DOI (5 of which also had a PMID)
  • Trial on Published Errata: pilot on 15 Published Errata: 7 retrieved DOI
  • Trial on Others: tried on 4: 1 retrieved DOI
  • Trial on News: tried on 12: 3 retrieved DOI
  • Trial on Letter Comment: tried on 1: 1 retrieved DOI
  • Trial on Book Chapters: tried on 15: 1 retrieved DOI
  • Trial on Others: tried on 15: 6 retrieved DOI


FINAL RESULTS: Out of 1223 objects (Book Chapters; Journal Articles; Letter, Comment; News; Other; Published_erratum; Review) we retrieved 594 dois and 174 PMIDs. We spot-checked 60 dois to see whether they corresponded to the right paper -> all ok.

RESULTS FOR PMID CHECKING: Checked 10 of the 174 PMIDs. They all look fine. Found one case, pmid15044786, where we will need to create another paper entry with the same PMID since the pmid refers to two letters (different authors) to the editor about the same article. This can be done manually. -kmv

PAPERS WITH PMID BUT NO DOI in PUBMED XML: I found two papers, pmid12628167 and pmid10564810, that have a pmid but there is no doi in the corresponding PubMed XML. For the first one, crossref found the doi, but for the second one it didn't. Perhaps it's also worth trying to get additional doi's from crossref for papers that do have a PMID, but no doi? -kmv

Dec 09 2011: Asked J to parse the DOI and PMIDs into the pap_identifier tables in postgres. File located on Tazendra: /home/acedb/draciti/DOI/doi_automatic_retrieval. dr After populating the tables, 16 citations did not match the source because there was a missing tab between references. Added them manually on Tazendra. In case anybody needs to see it for future reference, the list is on Daniela's computer lario:~ danielaraciti$/Users/danielaraciti/Desktop/Elsevier_linking/Retrieving DOIs/PMID_added_manually.rtf.
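
Since a missing tab was enough to cause mismatches, a small check before loading might help catch such lines early. This is a sketch under assumptions: the real file's column layout is not described here, so the two-field layout, the sample lines, and the function name are all hypothetical.

```python
def check_tab_delimited(lines, expected_fields=2):
    """Split lines on tabs; return (good rows, [(line number, bad line), ...])."""
    good, bad = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) == expected_fields and all(f.strip() for f in fields):
            good.append(fields)
        else:
            bad.append((lineno, line))
    return good, bad


# Made-up sample: the second line has a space where the tab should be.
sample = [
    "WBPaper_A\tdoi10.1000/a\n",
    "WBPaper_B doi10.1000/b\n",
    "\n",
]
good, bad = check_tab_delimited(sample)
```

Lines reported in `bad` could then be fixed by hand before the file is parsed into postgres.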

TODO next: ask J for the list of papers that do have a PMID but NO DOI

5) If time permits, replace journal titles with standard NLM journal abbreviations:

This is lower priority, but we could probably start by using the PubMed XML to replace journal names with the standard NLM abbreviations, since the abbreviation is what we typically try to use for WBPaper objects. For example, every instance of Annals of the New York Academy of Sciences would be replaced with 'Ann N Y Acad Sci'.
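
A minimal sketch of the replacement step, assuming that full-title/abbreviation pairs have already been extracted from the XML. The function names are made up, and the single pair shown is the example given above.

```python
def build_abbreviation_map(pairs):
    """Map whitespace-normalized, lowercased full titles to NLM abbreviations."""
    return {" ".join(full.lower().split()): abbr for full, abbr in pairs}


def standardize_journal(name, abbr_map):
    """Return the NLM abbreviation if the name is a known full title, else the name unchanged."""
    key = " ".join(name.lower().split())
    return abbr_map.get(key, name)


abbr_map = build_abbreviation_map(
    [("Annals of the New York Academy of Sciences", "Ann N Y Acad Sci")]
)
standardized = standardize_journal("Annals of the  New York Academy of Sciences", abbr_map)
```

Names that are not recognized, including ones that are already abbreviated, are returned unchanged, so they could be listed separately for a curator to look at.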

Next steps:

