Bundle first community-facing release of arXMLiv data

I have finally made my way to the GitLab issues and will keep everyone up to date via issues in this repository.

I have started a "bundle dataset" job over the current cortex data available on mercury, which will contain unvetted HTML5 files covering arXiv data up to and including 08.2017. The intended structure is:

  • Three top-level directories: no_problem, warning, and error
  • Each directory contains a flat list of .html files, each the result of running LaTeXML 0.8.2 over the respective arXiv article.
  • The filenames are the original arXiv identifiers of the papers, for easy cross-referencing with the original sources.

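To make the intended layout concrete, here is a minimal sketch of how a consumer might tally the bundle once unpacked. The root directory name is up to the consumer; only the three severity directory names come from the structure above:

```python
import os

# The three top-level severity directories described in the release plan.
SEVERITIES = ("no_problem", "warning", "error")

def count_by_severity(root):
    """Count the .html files under each of the three severity directories.

    `root` is the directory where the archive was unpacked (hypothetical name).
    """
    counts = {}
    for severity in SEVERITIES:
        directory = os.path.join(root, severity)
        counts[severity] = sum(
            1 for name in os.listdir(directory) if name.endswith(".html")
        )
    return counts
```

A quick sanity check like this should also make it easy to confirm an unpacked copy matches the published per-severity document counts.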
For this first release I will not bundle any auxiliary files (notably figures), to keep the size small while we figure out the best practices for external groups to reuse our data. I expect the final bundle will be in the 100-200 GB range once archived.

I intend to provide a web page with:

  • An exhaustive description of how the data was produced, what it covers, and the intent behind distributing it
  • The MD5 hash of the archive file, to enable verifying download integrity
  • An explanation of the 3 severities included, and recommendations on their use

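For the integrity check, `md5sum <archive>` on the command line is enough; as a portable alternative, a short script can compute the digest of an archive this size without loading it into memory (the chunk size here is an arbitrary choice):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks.

    Chunked reading keeps memory usage flat even for a 100+ GB archive.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string can then be compared against the hash published on the web page.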
I also want to regenerate and provide the GloVe data over this dataset as showcased in my AITP talk a year ago, as requested by Josef Urban.

Happily, the utilities I had written the last time I generated the dataset dumps are still fully operational, which makes things easy. I should have more to report very soon.

I need to learn more about GitLab's large-file-storage (LFS) capabilities, to see whether I can easily host the data there, presumably in a dedicated password-protected repository. Given the size of the dumps, the ideal scenario would let users download the archives one by one, independently of each other.
