-[SIGMathLing members](/member/) only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!
### Description
This is the fourth public release of the arXMLiv dataset generated by the [KWARC](https://kwarc.info/) research group.
It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to and including the end of 2020. It offers a 15% increase in available articles over our 08.2019 release.
The release also provides the associated conversion metadata under `meta/grouped_by_severity.zip`. The severity information lets you filter documents by whether the latexml process completed cleanly, with warnings, or with recoverable errors.
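
As a rough illustration of severity-based filtering, the Python sketch below reads a zip archive and collects the entries listed in each member file. The internal layout of `meta/grouped_by_severity.zip` is not described here, so the sketch assumes (hypothetically) that the archive holds one UTF-8 text file per severity level, with one document per line. The helper names `read_severity_listings` and `select_by_severity`, and the member names used in the demo, are illustrative only. Inspect the real archive before relying on any of this.

```python
import io
import zipfile
from typing import BinaryIO, Dict, List, Union


def read_severity_listings(archive: Union[str, BinaryIO]) -> Dict[str, List[str]]:
    """Map each file inside the archive to its non-empty lines.

    ASSUMPTION: every member is a UTF-8 text file with one document per line.
    The real layout of grouped_by_severity.zip may differ.
    """
    listings: Dict[str, List[str]] = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            text = zf.read(name).decode("utf-8", errors="replace")
            listings[name] = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return listings


def select_by_severity(listings: Dict[str, List[str]], keyword: str) -> List[str]:
    """Collect documents from every member whose file name contains `keyword`."""
    selected: List[str] = []
    for name, documents in listings.items():
        if keyword in name:
            selected.extend(documents)
    return selected


if __name__ == "__main__":
    # Build a tiny in-memory archive with HYPOTHETICAL member names,
    # so the sketch can be run without the real dataset.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as demo:
        demo.writestr("warning.txt", "paper-a.html\npaper-b.html\n")
        demo.writestr("error.txt", "paper-c.html\n")
    buffer.seek(0)
    print(select_by_severity(read_severity_listings(buffer), "warning"))
```

With the real archive, you would pass its path to `read_severity_listings` and choose the keyword after checking the actual member names.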
A unique feature of the arXMLiv generation process is latexml's cross-referenced and lexematized MathML representation for math syntax. Scroll to the bottom of the page for an example snippet.
This version of the dataset has had minimal manual quality control, and we offer no additional warranty beyond the latexml severity reported.
### Citing this Resource
#### pure bibTeX
```
@MISC{SML:arXMLiv:2020,
author = {Deyan Ginev},
title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = 2020}
```
#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:arXMLiv:2020,
author = {Deyan Ginev},
title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},