Like every digital humanities project, we have a number of unfinished data-management tasks. We track them here so that the public can see the status of our corpus.
Document the data from PLoS, JSTOR, and the Complexity journals, and convert all of it into canonical JSON
Build proper workflows for updating PLoS and the Complexity journals (this will involve standardizing and publishing the scrapers as well)
Other data sources that we would like to investigate soon:
Funded grant proposal databases (NSF, ERC, others?)
After TREE processing, more journals from the Elsevier TDM API