This is a canonical data source description, https://data.sciveyor.com/source/jstor.

The current dataSourceVersion described by this documentation is 1. The dataSource name for this data is JSTOR.
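
As a minimal sketch, assuming each article record is available as a dictionary carrying the `dataSource` and `dataSourceVersion` fields named above, selecting the articles described by this page might look like the following. The `from_source` helper and the sample records are hypothetical, and storing the version as an integer is an assumption.

```python
def from_source(records, name="JSTOR", version=1):
    """Return the records whose dataSource and dataSourceVersion match."""
    return [
        r for r in records
        if r.get("dataSource") == name and r.get("dataSourceVersion") == version
    ]


# Hypothetical sample records; real documents carry many more fields.
sample = [
    {"id": "a1", "dataSource": "JSTOR", "dataSourceVersion": 1},
    {"id": "a2", "dataSource": "JSTOR", "dataSourceVersion": 2},
    {"id": "a3", "dataSource": "Other", "dataSourceVersion": 1},
]

matching = from_source(sample)
print([r["id"] for r in matching])
```

Filtering on both fields together keeps an analysis tied to the version of the data that this documentation describes.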

Coverage: see the list of journals below. Coverage generally runs through mid-2018, when the data was delivered, though the end date varies for individual journals because of embargoes and moving walls.
Size: 1,265,573 articles
Copyright: varies per article; see licensing fields
License: an agreement between the Sciveyor project and JSTOR DFR, now a paid project called “Constellate”

How we got it

This data was delivered to us directly by JSTOR, under the terms of our licensing agreement. We cannot extend our coverage without a further agreement.

Journals included

The following journals are available in this dataset. For each journal, you’ll find a title and an approximate number of articles available. Only journals with more than 100 articles available are listed here; searches may occasionally return articles from journals below that threshold. Note also that if you are attempting to collect all of a journal’s print run into a single dataset, you should search for alternative and variant titles, as some journals have changed title over time.
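
The advice about variant titles can be sketched as follows. This is an illustration only: the `journal` field name, the `collect_print_run` and `normalize` helpers, and the journal titles are all hypothetical, and the real records may name the journal field differently.

```python
def normalize(title):
    """Lowercase and collapse whitespace so title comparison is forgiving."""
    return " ".join(title.lower().split())


def collect_print_run(records, titles, field="journal"):
    """Return records whose journal title matches any of the given variants."""
    wanted = {normalize(t) for t in titles}
    return [r for r in records if normalize(r.get(field, "")) in wanted]


# Hypothetical records for one journal that changed title over time.
articles = [
    {"id": "a1", "journal": "Example Quarterly"},
    {"id": "a2", "journal": "The Example  Quarterly Review"},
    {"id": "a3", "journal": "Unrelated Journal"},
]

variants = ["Example Quarterly", "The Example Quarterly Review"]
print_run = collect_print_run(articles, variants)
print([r["id"] for r in print_run])
```

Listing every known variant explicitly, rather than matching on a substring, avoids pulling in unrelated journals that happen to share part of a title.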

Processing

  • OCR to plain text: Unknown, performed by JSTOR. From our communications with the folks at JSTOR, we know that they OCR text using whatever software they consider current “best practice” at the time. To our knowledge, however, they never re-OCR their back catalog, and they do not tell us which OCR package was used for each article.
  • Metadata: Provided directly by JSTOR.
  • PMIDs, PMCIDs, and PubMed Manuscript IDs: Obtained by scraping PubMed.
  • Keywords and Tags: Keywords, if present, are proper author-generated keywords. Tags are assorted, automatically generated JSTOR labels and categories, which among other things distinguish “normal” articles from book reviews.

Changelog

  • Data Source Version 1 (2021-07-02): First import of our JSTOR data into the new data format.