This is a canonical data source description. The current dataSourceVersion described by this documentation is 1. The dataSource name for this data is Public Library of Science.

Coverage: All PLoS-published journals, regularly updated; currently to January 22, 2020

Copyright: Reserved by individual authors; see each article record

License: CC-BY 4.0

How we got it

This data is downloaded directly from PLoS via their open-source allofplos Python scraper. The scraper lets us download a complete copy of all PLoS journals and also supports incremental updates, so we regularly refresh our corpus of PLoS content.
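An incremental refresh can be pictured as a set difference between the articles the publisher currently lists and the articles already on disk. The sketch below is illustrative only: it does not call allofplos, and the function names, the one-XML-file-per-article layout, and the placeholder article IDs are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def local_article_ids(corpus_dir: Path) -> set:
    """Return IDs of articles already on disk (assumed: one <id>.xml per article)."""
    return {path.stem for path in corpus_dir.glob("*.xml")}


def plan_incremental_update(remote_ids, corpus_dir: Path) -> list:
    """Return, in sorted order, the remote IDs that are not yet stored locally."""
    return sorted(set(remote_ids) - local_article_ids(corpus_dir))


if __name__ == "__main__":
    # Placeholder IDs, not real articles.
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp)
        (corpus / "journal.pone.0000001.xml").write_text("<article/>")
        remote = ["journal.pone.0000001", "journal.pone.0000002"]
        print(plan_incremental_update(remote, corpus))
```

Only the IDs returned by `plan_incremental_update` would need to be fetched on a refresh; everything already in the corpus directory is left untouched.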


  • JATS XML to Canonical JSON: direct parsing from XML

  • PMIDs, PMCIDs, and PubMed Manuscript IDs: PubMed scraping

  • Keywords and Tags: PLoS does not use author-provided keywords. The "subject categories" visible on each article page are saved as tags.
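As a rough illustration of the XML-to-JSON step and the handling of subject categories described above, here is a minimal sketch using Python's standard library. It is not our production parser: the output field names (`doi`, `title`, `tags`), the `jats_to_record` function, the sample document with its placeholder DOI, and the assumption that the article-type label sits in a `subj-group` of type `heading` while the subject categories sit in the other groups are all made up for the example.

```python
import json
import xml.etree.ElementTree as ET

# Tiny hand-written JATS-like sample; the DOI is a placeholder.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1371/journal.pone.0000000</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Research Article</subject>
        </subj-group>
        <subj-group subj-group-type="Discipline-v3">
          <subject>Biology and life sciences</subject>
          <subj-group><subject>Genetics</subject></subj-group>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>An <italic>example</italic> title</article-title>
      </title-group>
    </article-meta>
  </front>
</article>"""


def jats_to_record(xml_text: str) -> dict:
    """Parse a JATS article string into a small JSON-ready dict."""
    root = ET.fromstring(xml_text)
    meta = root.find("./front/article-meta")
    if meta is None:
        raise ValueError("no <article-meta> element found")

    doi = meta.findtext("article-id[@pub-id-type='doi']")

    # itertext() flattens inline markup such as <italic> inside the title.
    title_el = meta.find("title-group/article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else None

    # Subject categories become tags; the article-type heading is skipped.
    tags = []
    for group in meta.findall("article-categories/subj-group"):
        if group.get("subj-group-type") == "heading":
            continue
        for subject in group.iter("subject"):
            text = "".join(subject.itertext()).strip()
            if text and text not in tags:
                tags.append(text)

    return {"doi": doi, "title": title, "tags": tags}


if __name__ == "__main__":
    print(json.dumps(jats_to_record(SAMPLE_JATS), indent=2))
```

Nested subject groups are flattened into a single de-duplicated list here; a real pipeline might instead preserve the category hierarchy.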


  • Data Source Version 1: complete rework of our prior PLoS data (none of which was kept), rebuilt from the new 'allofplos' data source in JATS XML