PLoS #
This is a canonical data source description, https://data.sciveyor.com/source/plos.
The current dataSourceVersion described by this documentation is 1. The dataSource name for this data is Public Library of Science.
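As a rough illustration of how these two values might be used, the sketch below selects records belonging to this data source. It assumes, hypothetically, that each record is a JSON-like dictionary carrying `dataSource` and `dataSourceVersion` as top-level keys; the sample records and the helper name `select_source` are invented for the example.

```python
# Hypothetical records; only the dataSource name and version come from this page.
records = [
    {"dataSource": "Public Library of Science", "dataSourceVersion": 1, "title": "A"},
    {"dataSource": "Some Other Source", "dataSourceVersion": 1, "title": "B"},
    {"dataSource": "Public Library of Science", "dataSourceVersion": 2, "title": "C"},
]


def select_source(records, name, version):
    """Return the records whose source name and version both match."""
    return [
        r for r in records
        if r.get("dataSource") == name and r.get("dataSourceVersion") == version
    ]


plos_v1 = select_source(records, "Public Library of Science", 1)
```

Matching on both fields keeps records produced under a different version of this description out of the result.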
Coverage: All PLoS-published journals, regularly updated; currently to July 23, 2021
Size: 297,090 articles
Copyright: Reserved by individual authors; see each article record
License: CC-BY 4.0
Credits: C.H. Pence
How we got it #
This data is downloaded directly from PLoS, via their open-source allofplos Python scraper. This scraper not only allows us to download a complete copy of all PLoS journals but also permits incremental updates, so we regularly refresh our corpus of PLoS content.
Processing #
- JATS XML to Canonical JSON: direct parsing from the JATS XML format
- PMIDs, PMCIDs, and PubMed Manuscript IDs: PubMed scraping
- Keywords and Tags: PLoS does not use author-provided keywords. The “subject categories” visible on each article page are saved as tags.
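The first and last steps above can be sketched with the Python standard library. This is a minimal illustration and not the actual conversion code: the sample XML is a small hand-written JATS fragment with a placeholder DOI, the output field names `doi`, `title`, and `tags` are assumptions, and so is the choice to read subject categories from `subj-group` elements typed `Discipline-v3`.

```python
import json
import xml.etree.ElementTree as ET

# Hand-written JATS fragment (hypothetical; real article files are far larger).
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1371/journal.pone.0000000</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Research Article</subject>
        </subj-group>
        <subj-group subj-group-type="Discipline-v3">
          <subject>Biology and life sciences</subject>
          <subj-group>
            <subject>Genetics</subject>
          </subj-group>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>An example title</article-title>
      </title-group>
    </article-meta>
  </front>
</article>"""


def jats_to_record(xml_text):
    """Parse a JATS article and return a small JSON-ready dictionary."""
    root = ET.fromstring(xml_text)
    meta = root.find("./front/article-meta")

    doi = meta.findtext("article-id[@pub-id-type='doi']")
    title_el = meta.find("title-group/article-title")
    # itertext() also picks up text nested in inline markup such as <italic>.
    title = "".join(title_el.itertext()).strip() if title_el is not None else None

    # Save subject categories as tags, in document order and without duplicates.
    tags = []
    path = "article-categories/subj-group[@subj-group-type='Discipline-v3']"
    for group in meta.findall(path):
        for subject in group.iter("subject"):
            text = (subject.text or "").strip()
            if text and text not in tags:
                tags.append(text)

    return {
        "dataSource": "Public Library of Science",
        "dataSourceVersion": 1,
        "doi": doi,      # assumed field name
        "title": title,  # assumed field name
        "tags": tags,    # assumed field name
    }


record = jats_to_record(SAMPLE_JATS)
print(json.dumps(record, indent=2))
```

The PubMed identifier lookup is left out here because it depends on an external service.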
Changelog #
- Data Source Version 1 (2021-08-03): complete rework of our prior PLoS data (none of which was kept), from the new allofplos data source in JATS XML.