Commit 5e74febd authored by Deyan Ginev

release ar5iv-04-2024 dataset

parent cdb7478b
Merge request: !16 (ar5iv-04.2024 dataset)
public/ar5iv-04-2024-c-uda.png

31.1 KiB

---
layout: page
title: ar5iv 04.2024 - An HTML5 dataset for arXiv.org
---
<img src="/public/ar5iv-04-2024-c-uda.png" width="800px">
### Release
- This page documents: ar5iv 04.2024 (latest)
### Contents
- 2,170,799 HTML documents
- Three separate archive bundles, split by LaTeXML conversion severity
- For more information on severity, see [LaTeXML manual: Error Codes](https://math.nist.gov/~BMiller/LaTeXML/manual/errorcodes/)
- The HTML sources total `318 GB` packaged, and `2.83 TB` unpacked.
- You also need 2.18 million free inodes to unpack the full dataset (check with `df -ih .`)
- This dataset is HTML-only and does not include images.
| subset ID | number of documents | size archived | size unpacked |
| :--- | ---: | ---: | ---: |
| no\_problem | 366,232 | 20 GB | 155 GB |
| warning | 1,304,052 | 216 GB | 1,989 GB |
| error | 500,515 | 82 GB | 753 GB |
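
Before unpacking, it can help to confirm that the target filesystem has enough free space and inodes. The following is a minimal sketch, not part of the release tooling: it assumes a Unix-like system (`os.statvfs` is not available on Windows), and it assumes the sizes in the table are binary units (1 GB = 1024³ bytes), which is the conservative reading. The names `check_capacity`, `UNPACKED_BYTES` and `REQUIRED_INODES` are illustrative.

```python
import os
import shutil

# Unpacked sizes from the table above, summed over all three subsets.
# Assumption: "GB" is treated as 1024**3 bytes (the conservative reading).
GIB = 1024 ** 3
UNPACKED_BYTES = (155 + 1989 + 753) * GIB

# "2.18 million free inodes" for the full dataset, as stated above.
REQUIRED_INODES = 2_180_000


def check_capacity(target_dir=".", need_bytes=UNPACKED_BYTES,
                   need_inodes=REQUIRED_INODES):
    """Compare free bytes and free inodes under target_dir with the requirements."""
    free_bytes = shutil.disk_usage(target_dir).free
    # f_favail: free inodes available to unprivileged users (Unix only).
    free_inodes = os.statvfs(target_dir).f_favail
    return {
        "free_bytes": free_bytes,
        "free_inodes": free_inodes,
        "enough_space": free_bytes >= need_bytes,
        "enough_inodes": free_inodes >= need_inodes,
    }


if __name__ == "__main__":
    report = check_capacity(".")
    print(f"free space:  {report['free_bytes'] / GIB:,.0f} GB "
          f"(enough: {report['enough_space']})")
    print(f"free inodes: {report['free_inodes']:,} "
          f"(enough: {report['enough_inodes']})")
```

Some filesystems allocate inodes dynamically and may report zero here, in which case the inode result is not meaningful and only the space check applies. To check for a single subset, pass its unpacked size as `need_bytes`.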
### Download and License
- The dataset is licensed under [C-UDA-1.0](https://github.com/microsoft/Computational-Use-of-Data-Agreement/blob/a28ca06f6f8ecac0b5856ca6179ac49e55f00104/C-UDA-1.0.md).
- To receive a download link, please submit the [License Agreement form (click here)](https://docs.google.com/forms/d/e/1FAIpQLSd3fK-HcS3XUlWlzRt5cGHnAV-pXk4rddirH-E3TpleRnwtsg/viewform?usp=sf_link).
### Description
This is the first public release of the ar5iv dataset generated by the [KWARC](https://kwarc.info/) research group.
It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to the start of April 2024.
As of April 2024, the HTML provided here also seeds the live [ar5iv Lab](https://ar5iv.labs.arxiv.org/) site, maintained by the same author.
For articles with multiple published versions, the underlying TeX sources are the newest ones available, updated as of February 22, 2024.
### MD5 file integrity
```
6ffa80fa273f29716527db36e1841abf ar5iv-04-2024-no-problem.zip
51582b218f55286e5fe08431eb5e299d ar5iv-04-2024-warnings.zip
9178d9635085a657956402077b4f8301 ar5iv-04-2024-errors.zip
```
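
The checksums above can be verified after download with a short script. This is a minimal sketch using only the Python standard library; it assumes the three archives sit in the current directory under the file names listed above, and the helper names `md5_of_file` and `verify_archives` are illustrative.

```python
import hashlib
from pathlib import Path

# Expected digests, copied from the listing above.
EXPECTED_MD5 = {
    "ar5iv-04-2024-no-problem.zip": "6ffa80fa273f29716527db36e1841abf",
    "ar5iv-04-2024-warnings.zip": "51582b218f55286e5fe08431eb5e299d",
    "ar5iv-04-2024-errors.zip": "9178d9635085a657956402077b4f8301",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory="."):
    """Map each archive name to True (match), False (mismatch) or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = (md5_of_file(path) == expected)
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify_archives().items():
        print(f"{name}: {labels[status]}")
```

Reading in 1 MiB chunks keeps memory use flat even for the 216 GB archive. A `MISMATCH` usually indicates an incomplete or corrupted download, so the affected archive should be fetched again.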
### Citing this Resource
#### pure bibTeX
```
@MISC{SML:ar5iv:04:2024,
author = {Deyan Ginev},
title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2024} }
```
#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:ar5iv:04:2024,
author = {Deyan Ginev},
title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
url = {https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2024} }
```
#### EndNote
```
%0 Generic
%T ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2024
%I hosted at https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
%F SML:ar5iv:04:2024
%O SIGMathLing – Special Interest Group on Math Linguistics
```
### Generated via
- [LaTeXML 0.8.8](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.8)
- [CorTeX 0.4.5](https://github.com/dginev/CorTeX/releases/tag/0.4.5)
- [latexml-plugin-cortex 2.2](https://github.com/dginev/LaTeXML-Plugin-Cortex/releases/tag/2.2.0)
### About
This release is part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group.
We are also the team that created and maintains the [ar5iv Lab](https://ar5iv.labs.arxiv.org/).
The dataset is distributed through hosting provided by the University of Erlangen-Nuremberg (FAU).
Author: Deyan Ginev