This is a canonical data source description. The
dataSourceVersion described by this documentation is 6. The
dataSource name for this data is either
Nature Publishing Group (for our
initial PDFs obtained under license with NPG, publication dates prior to
Springer Nature (for all others).
Coverage: all to vol. 475, no. 7355 (2011-07-14)
Size: 364,109 articles
Copyright: Springer Nature
License: Springer Nature TDM Policy
How we got it
This data was downloaded directly, using a custom-built Python scraper for the Nature Publishing Group family of journals. (Scrapers age quickly; I believe that ours is no longer functional as of around 2018.)
As our data has been covered by the Springer Nature TDM Policy since 2015, we can now extend this corpus from 2011 to the present. We have yet to do so, but hope to implement a workflow for this very soon.
- PDF OCR to plain text: prior to 2011-07-14, [OCR process
- The older portions of Nature (from 1869–mid-2011) are some of the first files we obtained for the project. They transitioned through, at least, an early version of our Solr XML schema and then the final version of our Solr XML schema. Because the intermediate plain-text files were lost, this final Solr XML was the source of the plain-text used to construct our canonical JSON data. We are not aware of any further errors (beyond OCR error) introduced by this process.
- Basic bibliographic data: CrossRef
articles before mid-2011)
- Author affiliations as well as publication dates were added from PubMed data when available.
- Abstracts: Scraped from two data sources. If the lengths of the two abstracts are less than 20 characters apart from one another (i.e., they are roughly the same text), the PubMed abstract was used, as these have fewer formatting errors. Otherwise, the longer abstract was used.
- PMIDs, PMCIDs, and PubMed Manuscript IDs: PubMed scraping
- Keywords and Tags: No keywords are currently used by Nature, and no tags were extracted from the Nature bibliographic data.
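The abstract selection rule described in the list above can be sketched as follows. This is a minimal illustration, not our actual processing code: the function name `choose_abstract`, the parameter names, and the handling of a missing abstract are assumptions; only the 20-character threshold and the preference for PubMed come from the description.

```python
from typing import Optional

# Threshold from the description above: abstracts whose lengths differ by
# fewer than 20 characters are treated as roughly the same text.
LENGTH_TOLERANCE = 20


def choose_abstract(pubmed: Optional[str], other: Optional[str]) -> Optional[str]:
    """Pick one abstract from two scraped candidates (hypothetical helper)."""
    # Assumption: if only one source provided an abstract, use that one.
    if not pubmed:
        return other or None
    if not other:
        return pubmed
    # Near-identical lengths: prefer PubMed, which has fewer formatting errors.
    if abs(len(pubmed) - len(other)) < LENGTH_TOLERANCE:
        return pubmed
    # Otherwise the lengths differ by at least 20 characters; keep the longer.
    return pubmed if len(pubmed) > len(other) else other
```

Comparing lengths rather than content is a cheap proxy for "same text", so two different abstracts of similar length would also resolve to the PubMed version.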
- Data Source Version 6 (2021-02-26): Fixed a number of dates that did not parse correctly as ISO-8601. Validated against the JSON schema, fixing a number of errors (mostly around null field values). Corrected a number of duplicated DOI/ID values resulting from a metadata parsing error.
- Data Source Version 5 (2020-08-02): Upgraded to JSON schema v5.
- Data Source Version 4 (2020-07-21): Upgraded to JSON schema v4, adding
- Data Source Version 3 (2020-02-05): Empty externalIds and occasional empty authors values were detected. These have been removed.
- Data Source Version 2 (2020-01-23): A bug was detected in the parsing of our PMCID, PMMID, and PMID values. The bug was fixed, and we’ve re-run the PubMed extraction against the entire corpus to provide correct values for this metadata.
- Data Source Version 1: First parsing of the Nature data into canonical JSON format from our original XML source, adding abstracts, PMIDs, formatted names, etc.
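Two of the cleanups listed above (dropping empty externalIds and authors values, and checking that dates parse as ISO-8601) might look roughly like the sketch below. It is illustrative only: the helper names are hypothetical, the record is assumed to be a plain dictionary, and "removed" is interpreted here as deleting the empty key from the record rather than deleting the record.

```python
from datetime import date


def drop_empty_fields(record: dict) -> dict:
    """Return a copy of record without empty externalIds / authors values."""
    cleaned = dict(record)
    for key in ("externalIds", "authors"):
        # Empty lists, empty strings, and None all count as empty here.
        if key in cleaned and not cleaned[key]:
            del cleaned[key]
    return cleaned


def is_iso_date(value) -> bool:
    """True if value parses as an ISO-8601 calendar date via date.fromisoformat."""
    try:
        date.fromisoformat(value)
    except (TypeError, ValueError):
        return False
    return True
```

For example, `is_iso_date("2011-07-14")` is true while `is_iso_date("14 July 2011")` is false. Partial dates such as a bare year are rejected by this check, so a schema that allows them would need a looser test.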