JSTOR Workflow

There is no scraping step for this data source, because we receive data dumps directly from JSTOR.

  1. Convert metadata from JSTOR XML to internal JSON:

    scripts/parse/jats_to_json \
      --data-source 'JSTOR' \
      --data-source-url 'https://data.sciveyor.com/source/jstor' \
      --data-source-version 1 \
      --default-license 'Copyright © JSTOR' \
      --default-license-url 'https://www.jstor.org/dfr' \
      --no-full-text \
      --log-file ~/jstor-parse.log \

    The metadata dumps that we get from JSTOR are stored in the JATS format, and our JATS conversion script has explicit support for detecting JSTOR files.

    We then review the log file for errors that matter. Much of its content is expected noise: numerous articles lack authors, and there are very few hits from PubMed or Crossref, because DOIs are relatively rare in the JSTOR corpus.

  2. Add ‘.txt’ full text to the JSON files:

    scripts/parse/jstor/add_txt $PATH

    This command strips the XML tags from the text files that JSTOR provides and adds the resulting full text to the JSON files.

  3. Validate files:

    mongo-tool/mongo-tool validate-files $PATH
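
As a rough illustration of the tag stripping described in step 2, the sketch below removes anything that looks like an XML tag from a line of text. It is not the code in `scripts/parse/jstor/add_txt`: the helper name `strip_xml_tags` and the sample input are hypothetical, the pattern only handles tags that open and close on the same line, and the step that merges the text into the JSON files is not shown.

```shell
#!/bin/sh
# Sketch only: approximates the tag-stripping half of step 2.
# strip_xml_tags is a hypothetical helper, not part of the repository.
strip_xml_tags() {
  # Delete every <...> span that starts and ends on the same line.
  # Tags split across lines and XML entities (&amp; etc.) are left untouched.
  sed -e 's/<[^>]*>//g'
}

# Made-up sample input; real JSTOR text files may be structured differently.
printf '%s\n' '<a>Hello <b>world</b></a>' | strip_xml_tags
```

A regular expression is enough for simple, well-behaved markup like the sample above; a real XML parser would be the safer choice if the input contains comments, CDATA sections, or tags that span several lines.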