PLoS

PLoS Workflow

The PLoS workflow differs from our other sources because the allofplos Python package, released by PLoS themselves, downloads the entire corpus into a single, massive folder and updates it in place. We therefore have to compare the files in that folder against the directory layout in which we normally store our parsed JSON output, so that only new or updated files are carried over.
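
To make that comparison concrete, here is a minimal sketch of the kind of check involved. It is an illustration only, not the logic of our scripts: the function name `list_new_or_updated` is hypothetical, and matching stored copies purely by file name is an assumption. It prints every file in a flat source folder that has no same-named copy anywhere under a destination tree, or whose stored copy is older.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of our scripts): print files in a flat
# source folder that are missing from, or newer than, a same-named file
# anywhere under a destination tree.
# Usage: list_new_or_updated SRC_DIR DEST_DIR EXTENSION
list_new_or_updated() {
  local src=$1 dest=$2 ext=$3 file match
  for file in "$src"/*"$ext"; do
    [ -e "$file" ] || continue   # the glob matched nothing
    # Look for a stored copy with the same name, at any depth.
    match=$(find "$dest" -type f -name "$(basename "$file")" -print -quit)
    # Report the file if there is no stored copy, or the source is newer.
    if [ -z "$match" ] || [ "$file" -nt "$match" ]; then
      printf '%s\n' "$file"
    fi
  done
}
```

Calling `list_new_or_updated /mnt/falstaff/Sciveyor/Source/PLoS /mnt/falstaff/Sciveyor/Content/PLoS .json` would print one path per candidate file. Running `find` once per file is far too slow for the full corpus, so treat this as a way to reason about the check rather than a replacement for the scripts used in the steps below.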

  1. Scrape the PLoS corpus using the allofplos package:

    cd ~/Development/Packages
    
    # If allofplos is not already installed, run:
    virtualenv venv-allofplos
    source venv-allofplos/bin/activate
    pip install allofplos
    
    # Otherwise, just activate the existing environment:
    source venv-allofplos/bin/activate
    
    # Then, to update the corpus, run:
    PLOS_CORPUS=/mnt/falstaff/Sciveyor/Source/PLoS python -m allofplos.update
    
  2. Convert metadata from PLoS XML to internal JSON:

    scripts/parse/jats-to-json \
      --data-source 'Public Library of Science' \
      --data-source-url 'https://data.sciveyor.com/source/plos' \
      --data-source-version 1 \
      --default-license 'CC-BY 4.0' \
      --default-license-url 'https://creativecommons.org/licenses/by/4.0/' \
      --log-file ~/plos-parse.log \
      /mnt/falstaff/Sciveyor/Source/PLoS
    

    Afterwards, we review the log file (~/plos-parse.log above) for errors that need action; not every logged message does.

  3. Move any newly generated reference files into place:

    For example, we’ve lately been doing:

    # For each of the three extensions .xml-refs, .pm-refs-json, and
    # .cr-refs-json, run:
    scripts/move/plos/from_source \
      --extension '.xml-refs' \
      --check /mnt/falstaff/Sciveyor/Extracted-References/PLoS \
      --dest /mnt/falstaff/Sciveyor/Extracted-References/temp \
      --move \
      /mnt/falstaff/Sciveyor/Source/PLoS
    
    # For each folder in temp (pone, pbio, etc.), run:
    scripts/move/in_hashed_directories \
      --ignore-chars 13 \
      --main-extension '.xml-refs' \
      --output /mnt/falstaff/Sciveyor/Extracted-References/PLoS/pone \
      /mnt/falstaff/Sciveyor/Extracted-References/temp/pone
    
  4. Copy newly generated JSON to its final folders:

    scripts/move/plos/from_source \
      --extension '.json' \
      --check /mnt/falstaff/Sciveyor/Content/PLoS \
      --dest /mnt/falstaff/Sciveyor/Content/temp \
      /mnt/falstaff/Sciveyor/Source/PLoS
    
    # For each folder in temp (pone, pbio, etc.), run:
    scripts/move/in_hashed_directories \
      --ignore-chars 13 \
      --main-extension '.json' \
      --output /mnt/falstaff/Sciveyor/Content/PLoS/pone \
      /mnt/falstaff/Sciveyor/Content/temp/pone
    
  5. Validate files:

    # In bash, ** only recurses when globstar is on (shopt -s globstar);
    # zsh recurses by default.
    mongo-tool/mongo-tool validate-files \
      /mnt/falstaff/Sciveyor/Content/PLoS/**/*.json
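
The "for each extension" and "for each folder in temp" instructions in steps 3 and 4 can be scripted as loops. The sketch below is one way to do that, under stated assumptions: the function names `move_refs_from_source` and `hash_all_journals` and the `DRY_RUN` variable are hypothetical, and only the script paths and flags already shown above are used. It assumes it is run from the same directory as the commands in those steps.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the commands from steps 3 and 4.
# Set DRY_RUN=1 to print each command instead of running it.

# Run from_source once per reference extension (step 3).
# Usage: move_refs_from_source SOURCE_DIR CHECK_DIR DEST_DIR
move_refs_from_source() {
  local source=$1 check=$2 dest=$3 ext
  for ext in .xml-refs .pm-refs-json .cr-refs-json; do
    ${DRY_RUN:+echo} scripts/move/plos/from_source \
      --extension "$ext" \
      --check "$check" \
      --dest "$dest" \
      --move \
      "$source"
  done
}

# Run in_hashed_directories once per journal folder (steps 3 and 4).
# Usage: hash_all_journals TEMP_DIR OUTPUT_ROOT MAIN_EXTENSION
hash_all_journals() {
  local temp=$1 output=$2 ext=$3 dir journal
  for dir in "$temp"/*/; do
    [ -d "$dir" ] || continue       # no journal folders present
    journal=$(basename "$dir")      # e.g. pone, pbio
    ${DRY_RUN:+echo} scripts/move/in_hashed_directories \
      --ignore-chars 13 \
      --main-extension "$ext" \
      --output "$output/$journal" \
      "$temp/$journal"
  done
}
```

A cautious way to use this is to run it first with `DRY_RUN=1`, for example `DRY_RUN=1 hash_all_journals /mnt/falstaff/Sciveyor/Content/temp /mnt/falstaff/Sciveyor/Content/PLoS .json`, and compare the printed commands against the ones in step 4 before running it for real.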