PLoS Workflow
The PLoS workflow is a little strange because the allofplos Python script,
released by PLoS themselves, downloads the corpus to a single, massive folder
that it can update automatically. That means we have to carefully compare the
updated files in that folder against the way in which we normally store our
parsed JSON output.
-
Scrape the PLoS corpus using the allofplos package:
    cd ~/Development/Packages
    # If allofplos is not already installed, run:
    virtualenv venv-allofplos
    source venv-allofplos/bin/activate
    pip install allofplos
    # Or otherwise, run:
    source venv-allofplos/bin/activate
    # Then, to update the corpus, run:
    PLOS_CORPUS=/mnt/falstaff/Sciveyor/Source/PLoS python -m allofplos.update
-
Convert metadata from PLoS XML to internal JSON:
    scripts/parse/jats-to-json \
      --data-source 'Public Library of Science' \
      --data-source-url 'https://data.sciveyor.com/source/plos' \
      --data-source-version 1 \
      --default-license 'CC-BY 4.0' \
      --default-license-url 'https://creativecommons.org/licenses/by/4.0/' \
      --log-file ~/plos-parse.log \
      /mnt/falstaff/Sciveyor/Source/PLoS
We then check through the log file for any errors that actually matter.
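As a starting point for that check, the log can be filtered before reading it in full. The sketch below is only an illustration: `log_errors` is a hypothetical helper, and matching on the word "error" is an assumption, since the log format written by jats-to-json is not described here.

```shell
# Hypothetical helper: print the lines of a log file that mention "error"
# (case-insensitive), prefixed with their line numbers.
# The search pattern is an assumption about the log format.
log_errors() {
  log_file=$1
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file" >&2
    return 1
  fi
  # grep exits non-zero when nothing matches; treat that as success.
  grep -in 'error' "$log_file" || true
}

# Usage: log_errors ~/plos-parse.log | less
```

Deciding which of the reported errors actually matter remains a manual step.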
-
Deal with any newly generated reference files. For example, we’ve lately been
doing:
    # For each of the three extensions .xml-refs, .pm-refs-json, and
    # .cr-refs-json, run:
    scripts/move/plos/from_source \
      --extension '.xml-refs' \
      --check /mnt/falstaff/Sciveyor/Extracted-References/PLoS \
      --dest /mnt/falstaff/Sciveyor/Extracted-References/temp \
      --move \
      /mnt/falstaff/Sciveyor/Source/PLoS
    # For each folder in temp (pone, pbio, etc.), run:
    scripts/move/in_hashed_directories \
      --ignore-chars 13 \
      --main-extension '.xml-refs' \
      --output /mnt/falstaff/Sciveyor/Extracted-References/PLoS/pone \
      /mnt/falstaff/Sciveyor/Extracted-References/temp/pone
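The "for each of the three extensions" instruction can be written as a loop. The sketch below prints the from_source commands instead of running them, so they can be reviewed first. It assumes that only `--extension` changes between runs; the function name and the two variables are introduced here purely for illustration.

```shell
# Sketch: print (rather than run) the from_source command for each of the
# three reference extensions. Assumes only --extension differs between runs.
refs_root=/mnt/falstaff/Sciveyor/Extracted-References
source_dir=/mnt/falstaff/Sciveyor/Source/PLoS

print_from_source_commands() {
  for ext in .xml-refs .pm-refs-json .cr-refs-json; do
    printf "scripts/move/plos/from_source --extension '%s' --check %s --dest %s --move %s\n" \
      "$ext" "$refs_root/PLoS" "$refs_root/temp" "$source_dir"
  done
}

print_from_source_commands
```

Once the printed commands look right, piping the output to `sh` from the directory that contains `scripts/` would execute them.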
-
Copy newly generated JSON to the final folders:
    scripts/move/plos/from_source \
      --extension '.json' \
      --check /mnt/falstaff/Sciveyor/Content/PLoS \
      --dest /mnt/falstaff/Sciveyor/Content/temp \
      /mnt/falstaff/Sciveyor/Source/PLoS
    # For each folder in temp (pone, pbio, etc.), run:
    scripts/move/in_hashed_directories \
      --ignore-chars 13 \
      --main-extension '.json' \
      --output /mnt/falstaff/Sciveyor/Content/PLoS/pone \
      /mnt/falstaff/Sciveyor/Content/temp/pone
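The "for each folder in temp" step can also be looped rather than typed once per journal. This is a minimal sketch, assuming the same flags apply to every folder: `hash_each_folder` and the `RUN` switch are hypothetical names, and by default the commands are only echoed so that nothing is moved by accident.

```shell
# Sketch: call in_hashed_directories once per journal folder found in temp.
# RUN is a hypothetical switch: it defaults to "echo" (preview only);
# set RUN= (empty) in the environment to actually execute the commands.
RUN=${RUN-echo}

hash_each_folder() {
  temp_dir=$1      # e.g. /mnt/falstaff/Sciveyor/Content/temp
  output_dir=$2    # e.g. /mnt/falstaff/Sciveyor/Content/PLoS
  extension=$3     # e.g. .json
  for folder in "$temp_dir"/*/; do
    [ -d "$folder" ] || continue   # the glob matched nothing
    name=$(basename "$folder")
    $RUN scripts/move/in_hashed_directories \
      --ignore-chars 13 \
      --main-extension "$extension" \
      --output "$output_dir/$name" \
      "$temp_dir/$name"
  done
}

# Usage:
# hash_each_folder /mnt/falstaff/Sciveyor/Content/temp \
#   /mnt/falstaff/Sciveyor/Content/PLoS .json
```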
-
Validate files:
    mongo-tool/mongo-tool validate-files \
      /mnt/falstaff/Sciveyor/Content/PLoS/**/*.json
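How the `**` pattern expands depends on the shell. zsh treats it as recursive by default, whereas bash does so only after `shopt -s globstar` and otherwise matches a single directory level. With a corpus this large, the expanded file list may also exceed the system's argument-length limit. A possible alternative, sketched below, lets `find` collect the files and hand them over in batches. `validate_tree` and the `VALIDATE` variable are hypothetical, and it is an assumption that validate-files gives equivalent results when invoked several times on subsets of the files.

```shell
# Sketch: find every .json file under a directory and pass the files to the
# validator in batches. VALIDATE is a hypothetical override that defaults to
# the command used in this step.
VALIDATE=${VALIDATE:-mongo-tool/mongo-tool validate-files}

validate_tree() {
  # $VALIDATE is intentionally unquoted so it splits into command + arguments.
  find "$1" -type f -name '*.json' -exec $VALIDATE {} +
}

# Usage: validate_tree /mnt/falstaff/Sciveyor/Content/PLoS
```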