- The dataset ships as three separate archive bundles, split by LaTeXML conversion severity.
- For more information on severity, see [LaTeXML manual: Error Codes](https://math.nist.gov/~BMiller/LaTeXML/manual/errorcodes/)
- The HTML sources total `318 GB` packaged, and `2.83 TB` unpacked.
- Unpacking the full dataset also requires 2.18 million free inodes (check with `df -ih .`).
- This dataset is HTML-only and does not include images.
| subset ID | number of documents | size archived | size unpacked |
| :--- | ---: | ---: | ---: |
| no\_problem | 366,232 | 20 GB | 155 GB |
| warning | 1,304,052 | 216 GB | 1,989 GB |
| error | 500,515 | 82 GB | 753 GB |
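
Before unpacking, it may help to confirm that the target filesystem has enough free space and inodes. The following is a minimal sketch, not part of the dataset tooling: the helper names `required_bytes` and `check_target` are hypothetical, the sizes are copied from the table above, and a GB is assumed to mean 1024³ bytes (which matches the 2.83 TB total). It relies on `os.statvfs`, so it only runs on POSIX systems.

```python
import os
import shutil

# Unpacked sizes in GB, copied from the subset table above.
UNPACKED_GB = {"no_problem": 155, "warning": 1989, "error": 753}
# Inode requirement stated above for the full dataset; used here as a
# conservative bound even when only some subsets are selected.
REQUIRED_INODES = 2_180_000


def required_bytes(subsets):
    """Unpacked size in bytes for the chosen subsets (assumes 1 GB = 1024**3 bytes)."""
    return sum(UNPACKED_GB[name] for name in subsets) * 1024**3


def check_target(path, subsets=tuple(UNPACKED_GB)):
    """Report whether `path` has enough free bytes and inodes for `subsets`."""
    free_bytes = shutil.disk_usage(path).free
    # f_favail: inodes available to unprivileged users. Some filesystems
    # report 0 here because they allocate inodes dynamically.
    free_inodes = os.statvfs(path).f_favail
    return {
        "enough_space": free_bytes >= required_bytes(subsets),
        "enough_inodes": free_inodes >= REQUIRED_INODES,
    }


if __name__ == "__main__":
    print(check_target("."))
```

Calling `check_target("/data", subsets=["no_problem"])` would check only the smallest bundle; the path `/data` is just a placeholder.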
### Download and License
- The dataset is licensed under [C-UDA-1.0](https://github.com/microsoft/Computational-Use-of-Data-Agreement/blob/a28ca06f6f8ecac0b5856ca6179ac49e55f00104/C-UDA-1.0.md).
- To receive a download link, please submit the [License Agreement form (click here)](https://docs.google.com/forms/d/e/1FAIpQLSd3fK-HcS3XUlWlzRt5cGHnAV-pXk4rddirH-E3TpleRnwtsg/viewform?usp=sf_link).
### Description
This is the first public release of the ar5iv dataset generated by the [KWARC](https://kwarc.info/) research group.
It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to the start of April 2024.
As of April 2024, the HTML provided here also seeds the live [ar5iv Lab](https://ar5iv.labs.arxiv.org/) site, maintained by the same author.
For articles with multiple published versions, the underlying TeX sources are the newest ones available, updated as of February 22, 2024.
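
Since the documents are HTML5 with embedded MathML, a quick sanity check on an unpacked file is to count its `<math>` elements. The sketch below uses only the Python standard library and works on a string of HTML; the names `MathCounter` and `count_math` are hypothetical, and how files are laid out inside the archives is not assumed here.

```python
from html.parser import HTMLParser


class MathCounter(HTMLParser):
    """Count <math> start tags in an HTML5 document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this handler.
        if tag == "math":
            self.count += 1


def count_math(html_text):
    """Return the number of MathML <math> elements found in `html_text`."""
    parser = MathCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


if __name__ == "__main__":
    sample = "<p>Let <math><mi>x</mi></math> be given.</p>"
    print(count_math(sample))
```

To apply it to a file, read the file as text first and pass the contents to `count_math`.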