
Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Commits on Source (3)
public/ar5iv-04-2024-c-uda.png

31.1 KiB

---
layout: page
title: ar5iv 04.2024 - An HTML5 dataset for arXiv.org
---
<img src="/public/ar5iv-04-2024-c-uda.png" width="800px">
### Release
- This page documents: ar5iv 04.2024 (latest)
### Contents
- 2,170,799 HTML documents
- Three separate archive bundles, separated by LaTeXML conversion severity
- For more information on severity, see [LaTeXML manual: Error Codes](https://math.nist.gov/~BMiller/LaTeXML/manual/errorcodes/)
- The HTML sources total `318 GB` packaged, and `2.83 TB` unpacked.
- You also need about 2.18 million free inodes to unpack the full dataset (check with `df -ih .`).
- This dataset is HTML-only and does not include images.
| subset ID | number of documents | size archived | size unpacked |
| :--- | ---: | ---: | ---: |
| no\_problem | 366,232 | 20 GB | 155 GB |
| warning | 1,304,052 | 216 GB | 1,989 GB |
| error | 500,515 | 82 GB | 753 GB |
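
Before unpacking, it may help to confirm that the target filesystem has enough free space and free inodes for the subsets you plan to extract. The following is a minimal sketch using the figures from the table above. It assumes a POSIX system (`os.statvfs` is not available on Windows), treats the table's "GB" as binary gigabytes, and counts one inode per document, which is a lower bound because directories also consume inodes. The names `requirements` and `has_capacity` are illustrative and not part of any dataset tooling.

```python
import os

# Unpacked sizes in GB and document counts, copied from the table above.
UNPACKED_GB = {"no_problem": 155, "warning": 1989, "error": 753}
DOCUMENTS = {"no_problem": 366_232, "warning": 1_304_052, "error": 500_515}

GIB = 1024 ** 3  # assumption: the table's "GB" are binary gigabytes


def requirements(subsets):
    """Return (bytes, inodes) needed to unpack the given subsets.

    The inode figure counts one inode per HTML document, so it is a
    lower bound: directories created during unpacking need inodes too.
    """
    need_bytes = sum(UNPACKED_GB[name] for name in subsets) * GIB
    need_inodes = sum(DOCUMENTS[name] for name in subsets)
    return need_bytes, need_inodes


def has_capacity(path, subsets):
    """Check free space and free inodes on the filesystem holding `path`."""
    stats = os.statvfs(path)  # POSIX only
    free_bytes = stats.f_bavail * stats.f_frsize
    # Caveat: some filesystems report zero inode counts; this sketch
    # does not special-case them and would return False there.
    free_inodes = stats.f_favail
    need_bytes, need_inodes = requirements(subsets)
    return free_bytes >= need_bytes and free_inodes >= need_inodes
```

For example, `has_capacity(".", ["no_problem"])` reports whether the current directory's filesystem could hold the smallest subset.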
### Download and License
- The dataset is licensed under [C-UDA-1.0](https://github.com/microsoft/Computational-Use-of-Data-Agreement/blob/a28ca06f6f8ecac0b5856ca6179ac49e55f00104/C-UDA-1.0.md).
- To receive a download link, please submit the [License Agreement form (click here)](https://docs.google.com/forms/d/e/1FAIpQLSd3fK-HcS3XUlWlzRt5cGHnAV-pXk4rddirH-E3TpleRnwtsg/viewform?usp=sf_link).
### Description
This is the first public release of the ar5iv dataset generated by the [KWARC](https://kwarc.info/) research group.
It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to the start of April 2024.
As of April 2024, the HTML provided here also seeds the live [ar5iv Lab](https://ar5iv.labs.arxiv.org/) site, maintained by the same author.
For articles with multiple published versions, the underlying TeX sources are the newest ones available, updated as of February 22, 2024.
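
Given the inode requirement for a full unpack, one option is to read documents directly from a downloaded bundle without extracting it. The sketch below uses Python's standard `zipfile` module. It assumes that each bundle stores its documents as `.html` members, which this page does not specify, so the filter may need adjusting to the actual archive layout. The helper names `iter_html_documents` and `count_math_elements` are hypothetical, and the MathML count is a rough substring match rather than a real HTML parse.

```python
import zipfile


def iter_html_documents(archive):
    """Yield (member_name, html_text) for each .html member of a zip archive.

    `archive` may be a filesystem path or a binary file-like object.
    Assumption: documents are stored as .html members of the bundle.
    """
    with zipfile.ZipFile(archive) as bundle:
        for info in bundle.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".html"):
                continue
            with bundle.open(info) as member:
                text = member.read().decode("utf-8", errors="replace")
            yield info.filename, text


def count_math_elements(html_text):
    """Roughly count MathML formulas by counting '<math' occurrences."""
    return html_text.count("<math")
```

A loop such as `for name, html in iter_html_documents("ar5iv-04-2024-no-problem.zip"): ...` would then stream one document at a time, keeping memory use bounded by the largest single document.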
### MD5 file integrity
```
6ffa80fa273f29716527db36e1841abf ar5iv-04-2024-no-problem.zip
51582b218f55286e5fe08431eb5e299d ar5iv-04-2024-warnings.zip
9178d9635085a657956402077b4f8301 ar5iv-04-2024-errors.zip
```
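
A downloaded bundle can be checked against the digests listed above with a few lines of Python. This is a minimal sketch: it reads the file in chunks so that the largest archive does not need to fit in memory, and it assumes the downloaded files keep the names shown in the listing. The function names `md5_of_file` and `verify` are illustrative.

```python
import hashlib
import os

# Expected digests, copied from the listing above.
EXPECTED_MD5 = {
    "ar5iv-04-2024-no-problem.zip": "6ffa80fa273f29716527db36e1841abf",
    "ar5iv-04-2024-warnings.zip": "51582b218f55286e5fe08431eb5e299d",
    "ar5iv-04-2024-errors.zip": "9178d9635085a657956402077b4f8301",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Compare a file's MD5 against the expected digest for its file name.

    Raises KeyError if the file name is not one of the three bundles.
    """
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected
```

Calling `verify("ar5iv-04-2024-errors.zip")` returns `True` only when the local file matches the published digest. On most Linux systems, `md5sum` on the same file should print the same value.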
### Citing this Resource
#### pure bibTeX
```
@MISC{SML:ar5iv:04:2024,
author = {Deyan Ginev},
title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2024} }
```
#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:ar5iv:04:2024,
author = {Deyan Ginev},
title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
url = {https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2024} }
```
#### EndNote
```
%0 Generic
%T ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2024
%I hosted at https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
%F SML:ar5iv:04:2024b
%O SIGMathLing – Special Interest Group on Math Linguistics
```
### Generated via
- [LaTeXML 0.8.8](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.8)
- [CorTeX 0.4.5](https://github.com/dginev/CorTeX/releases/tag/0.4.5)
- [latexml-plugin-cortex 2.2](https://github.com/dginev/LaTeXML-Plugin-Cortex/releases/tag/2.2.0)
### About
This release is part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group.
We are also the team which created and maintains the [ar5iv Lab](https://ar5iv.labs.arxiv.org/).
The dataset is distributed through hosting provided by the University of Erlangen-Nuremberg (FAU).
Author: Deyan Ginev
......@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
### Release
- This page documents: 08.2017
- Latest: [2020](/resources/arxmliv-dataset-2020/)
- Latest: [04.2024](/resources/ar5iv-dataset-2024/)
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
......
......@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
### Release
- This page documents: 08.2018
- Latest: [2020](/resources/arxmliv-dataset-2020/)
- Latest: [04.2024](/resources/ar5iv-dataset-2024/)
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
......
......@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
### Release
- This page documents: 08.2019
- Latest: [2020](/resources/arxmliv-dataset-2020/)
- Latest: [04.2024](/resources/ar5iv-dataset-2024/)
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
......
......@@ -4,7 +4,8 @@ title: arXMLiv 2020 - An HTML5 dataset for arXiv.org
---
### Release
- This page documents: arxmliv 2020 (latest)
- This page documents: arxmliv 2020
- Latest: [04.2024](/resources/ar5iv-dataset-2024/)
### Contents
- 1,581,037 HTML5 documents
......
......@@ -2,6 +2,10 @@
layout: page
title: SIGMathLing - arXMLiv Project Datasets and Resources
---
## 2024
1. [ar5iv corpus, 2024 release](/resources/ar5iv-dataset-2024/)
## 2020
1. [arXMLiv corpus, 2020 release](/resources/arxmliv-dataset-2020/)
......
......@@ -3,6 +3,7 @@ layout: page
title: SIGMathLing - Datasets and Resources
---
## Resources hosted on the SIGMathLing Repository
1. [ar5iv corpus, 04.2024 release](/resources/ar5iv-dataset-2024/)
1. [argot dataset 2021](/resources/argot-dataset-2021/)
1. [arXMLiv corpus 2020](/resources/arxmliv-dataset-2020/)
1. [arXMLiv corpus, 08.2019 release](/resources/arxmliv-dataset-082019/)
......