layout: page
title: ar5iv 04.2024 - An HTML5 dataset for arXiv.org

Release
- This page documents: ar5iv 04.2024 (latest)
Contents
- 2,170,799 HTML documents
- Three separate archive bundles, separated by LaTeXML conversion severity
- For more information on severity, see LaTeXML manual: Error Codes
- The HTML sources total 318 GB packaged, and 2.83 TB unpacked.
  - you also need 2.18 million free inodes to unpack the full data (check via `df -ih .`)
- This dataset is HTML-only and does not include images.
subset ID | number of documents | size archived | size unpacked |
---|---|---|---|
no_problem | 366,232 | 20 GB | 155 GB |
warning | 1,304,052 | 216 GB | 1989 GB |
error | 500,515 | 82 GB | 753 GB |
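Before unpacking, it may help to confirm that the target filesystem has enough room. The following is a minimal Python sketch of the same check that `df -ih .` performs by hand, using the 2.83 TB and 2.18 million inode figures stated above. The names `check_capacity`, `REQUIRED_BYTES`, and `REQUIRED_INODES` are illustrative and not part of the dataset. Binary units are assumed, since the 2,897 GB total of the table matches 2.83 TB only at 1024 GB per TB.

```python
import os

# Requirements quoted in the Contents list above.
# Assumption: "TB" is a binary unit (1 TB = 1024**4 bytes).
REQUIRED_BYTES = int(2.83 * 1024**4)
REQUIRED_INODES = 2_180_000


def check_capacity(path=".", need_bytes=REQUIRED_BYTES, need_inodes=REQUIRED_INODES):
    """Return (enough_space, enough_inodes) for the filesystem holding `path`."""
    st = os.statvfs(path)  # POSIX only; not available on Windows
    free_bytes = st.f_bavail * st.f_frsize  # space available to unprivileged users
    if st.f_files == 0:
        # Some filesystems allocate inodes dynamically and report 0 here;
        # treat that as "no fixed inode limit".
        inodes_ok = True
    else:
        inodes_ok = st.f_favail >= need_inodes
    return free_bytes >= need_bytes, inodes_ok


if __name__ == "__main__":
    space_ok, inodes_ok = check_capacity(".")
    print(f"disk space sufficient: {space_ok}")
    print(f"inodes sufficient:     {inodes_ok}")
```

To unpack only one subset, pass the smaller figures from the table instead, for example `check_capacity(".", need_bytes=155 * 1024**3, need_inodes=366_232)` for `no_problem`.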
Download and License
- The dataset is licensed under C-UDA-1.0.
- To receive a download link, please submit the License Agreement form (click here).
Description
This is the first public release of the ar5iv dataset generated by the KWARC research group. It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to the start of April 2024.
As of April 2024, the HTML provided here also seeds the live ar5iv Lab site, maintained by the same author.
For articles with multiple published versions, the underlying TeX sources are the newest ones available, updated as of February 22, 2024.
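Because the bundles are zip archives, the HTML5+MathML documents can also be read directly from an archive without unpacking it, which avoids the inode cost. The sketch below is a hedged illustration. This page does not describe the internal layout of the archives, so the code assumes only that documents are stored as members whose names end in `.html` and that they are UTF-8 encoded. The helper names `iter_html_documents` and `count_documents_with_math` are hypothetical.

```python
import zipfile


def iter_html_documents(archive_path):
    """Yield (member_name, html_text) for every .html member of a zip archive.

    Assumption: documents are stored as UTF-8 files ending in ".html";
    the real archive layout is not specified on this page.
    """
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".html"):
                continue
            with archive.open(info) as member:
                yield info.filename, member.read().decode("utf-8", errors="replace")


def count_documents_with_math(archive_path):
    """Return (total_documents, documents_containing_a_math_element)."""
    total = 0
    with_math = 0
    for _name, html in iter_html_documents(archive_path):
        total += 1
        if "<math" in html:  # crude textual test for a MathML element
            with_math += 1
    return total, with_math
```

A call such as `count_documents_with_math("ar5iv-04-2024-no-problem.zip")` streams one document at a time, so memory use stays bounded by the largest single document rather than by the size of the bundle.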
MD5 file integrity
6ffa80fa273f29716527db36e1841abf ar5iv-04-2024-no-problem.zip
51582b218f55286e5fe08431eb5e299d ar5iv-04-2024-warnings.zip
9178d9635085a657956402077b4f8301 ar5iv-04-2024-errors.zip
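The checksums above can be compared against the downloaded files with a few lines of Python. This is a minimal sketch using the standard `hashlib` module. The checksum table is copied from the list above, while the function names `md5_of_file` and `verify` are illustrative. Files are read in chunks because the bundles are tens to hundreds of gigabytes in size.

```python
import hashlib

# Expected checksums, copied from the list above.
EXPECTED_MD5 = {
    "ar5iv-04-2024-no-problem.zip": "6ffa80fa273f29716527db36e1841abf",
    "ar5iv-04-2024-warnings.zip": "51582b218f55286e5fe08431eb5e299d",
    "ar5iv-04-2024-errors.zip": "9178d9635085a657956402077b4f8301",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected_hex.lower()
```

For example, `verify("ar5iv-04-2024-errors.zip", EXPECTED_MD5["ar5iv-04-2024-errors.zip"])` returns `True` only for an intact download of that bundle.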
Citing this Resource
pure bibTeX
@MISC{SML:ar5iv:04:2024,
author = {Deyan Ginev},
title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2024} }
bibTeX for the bibLaTeX package (preferred)
@online{SML:ar5iv:04:2024,
author = {Deyan Ginev},
title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
url = {https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2024} }
EndNote
%0 Generic
%T ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2024
%I hosted at https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
%F SML:ar5iv:04:2024b
%O SIGMathLing – Special Interest Group on Math Linguistics
Generated via
About
This release is part of the arXMLiv project at the KWARC research group. We are also the team which created and maintains the ar5iv Lab.
The dataset is distributed through hosting provided by the University of Erlangen-Nuremberg (FAU).
Author: Deyan Ginev