Bundle first community-facing release of arXMLiv data
I have finally made my way to the GitLab issues, and will keep everyone up to date via issues in this repository.
I have started a "bundle dataset" job over the current CorTeX data available on mercury, which will contain unvetted HTML5 files up to and including the 08.2017 arXiv data. The intended structure is:
- Three top-level directories: no_problem, warning and error
- Each directory contains a flat list of `.html` files, each of which is the result of running latexml 0.8.2 over the respective arXiv article
- The filenames are the original arXiv identifiers of the papers, for easy cross-referencing with the original sources
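Given that layout, traversing the bundle should be straightforward. A minimal sketch (the local path `arxmliv-08-2017` is a hypothetical download location, not a fixed name from the release):

```python
import os

# Hypothetical path to the unpacked bundle; adjust to your local copy.
BUNDLE_ROOT = "arxmliv-08-2017"
SEVERITIES = ["no_problem", "warning", "error"]

def iter_articles(root=BUNDLE_ROOT):
    """Yield (severity, arxiv_id, path) for every HTML file in the bundle.

    The arXiv identifier is recovered directly from the filename, since the
    bundle uses a flat `<severity>/<arxiv_id>.html` layout.
    """
    for severity in SEVERITIES:
        directory = os.path.join(root, severity)
        for filename in sorted(os.listdir(directory)):
            if filename.endswith(".html"):
                arxiv_id = filename[: -len(".html")]
                yield severity, arxiv_id, os.path.join(directory, filename)
```

This also illustrates the intended cross-referencing: the yielded `arxiv_id` can be looked up directly against the original arXiv sources.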
For this first release I will not bundle any auxiliary files (notably figures), to keep the size small while we figure out the best practices for external groups to reuse our data. I expect the final bundle will be in the 100-200 GB range once archived.
I intend to provide a web page with:
- An exhaustive description of how the data was produced, what it covers, and the intent behind distributing it
- The MD5 hash of the archive file, to enable verifying download integrity
- An explanation of the 3 severities included, and recommendations on their use
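For the download-integrity check, verifying the published MD5 hash could look like the following sketch (the archive name and hash placeholder are assumptions for illustration, not the actual release values):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a (potentially very large) file,
    reading in chunks so the whole archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (hypothetical archive name; compare against the hash on the web page):
# md5_of_file("arxmliv-08-2017.zip") == "<published MD5 hash>"
```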
I also want to regenerate and provide the GloVe data over this dataset, as showcased in my AITP talk a year ago and requested by Josef Urban.
Happily, the utilities I had written the last time I generated the dataset dumps are still fully operational, which makes things easy. I should have more to report very soon.
I would need to learn more about GitLab's large-file storage (LFS) capabilities, to see if I can easily add the data there - in what I assume would be a dedicated password-protected repository? Given the size of the dumps, the ideal scenario is to have the option to download them one by one, independently of each other.