diff --git a/public/ar5iv-04-2024-c-uda.png b/public/ar5iv-04-2024-c-uda.png
new file mode 100644
index 0000000000000000000000000000000000000000..cbd66c06a11d2f755b5e9360f7a2291756fafd7b
Binary files /dev/null and b/public/ar5iv-04-2024-c-uda.png differ
diff --git a/resources/ar5iv-dataset-2024.md b/resources/ar5iv-dataset-2024.md
new file mode 100644
index 0000000000000000000000000000000000000000..060db597f631624602742d55b7d79129e9a50cb3
--- /dev/null
+++ b/resources/ar5iv-dataset-2024.md
@@ -0,0 +1,92 @@
+---
+layout: page
+title: ar5iv 04.2024 - An HTML5 dataset for arXiv.org
+---
+
+<img src="/public/ar5iv-04-2024-c-uda.png" width="800px">
+
+### Release
+ - This page documents: ar5iv 04.2024 (latest)
+
+### Contents
+ - 2,170,799 HTML documents
+ - Three archive bundles, split by LaTeXML conversion severity
+   - For more information on severity, see [LaTeXML manual: Error Codes](https://math.nist.gov/~BMiller/LaTeXML/manual/errorcodes/)
+ - The HTML sources total `318 GB` packaged, and `2.83 TB` unpacked.
+   - Unpacking the full dataset also requires about 2.18 million free inodes (check via `df -ih .`)
+ - This dataset is HTML-only and does not include images.
+
+
+ | subset ID | number of documents | size archived | size unpacked |
+ | :--- | ---: | ---: | ---: |
+ | no\_problem | 366,232 | 20 GB | 155 GB |
+ | warning | 1,304,052 | 216 GB | 1,989 GB |
+ | error | 500,515 | 82 GB | 753 GB |
+
+
+### Download and License
+ - The dataset is licensed under [C-UDA-1.0](https://github.com/microsoft/Computational-Use-of-Data-Agreement/blob/a28ca06f6f8ecac0b5856ca6179ac49e55f00104/C-UDA-1.0.md).
+ - To receive a download link, please submit the [License Agreement form (click here)](https://docs.google.com/forms/d/e/1FAIpQLSd3fK-HcS3XUlWlzRt5cGHnAV-pXk4rddirH-E3TpleRnwtsg/viewform?usp=sf_link).
+
+### Description
+
+This is the first public release of the ar5iv dataset generated by the [KWARC](https://kwarc.info/) research group.
+It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to the start of April 2024.
+
+As of April 2024, the HTML provided here also seeds the live [ar5iv Lab](https://ar5iv.labs.arxiv.org/) site, maintained by the same author.
+
+For articles with multiple published versions, the underlying TeX sources are the newest ones available, updated as of February 22, 2024.
+
+### MD5 file integrity
+
+```
+6ffa80fa273f29716527db36e1841abf  ar5iv-04-2024-no-problem.zip
+51582b218f55286e5fe08431eb5e299d  ar5iv-04-2024-warnings.zip
+9178d9635085a657956402077b4f8301  ar5iv-04-2024-errors.zip
+```
+
+### Citing this Resource
+
+#### pure bibTeX
+```
+@MISC{SML:ar5iv:04:2024,
+  author = {Deyan Ginev},
+  title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
+  howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/}},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = {2024} }
+```
+
+#### bibTeX for the bibLaTeX package (preferred)
+```
+@online{SML:ar5iv:04:2024,
+  author = {Deyan Ginev},
+  title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
+  url = {https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = {2024} }
+```
+
+#### EndNote
+```
+%0 Generic
+%T ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org
+%A Ginev, Deyan
+%D 2024
+%I hosted at https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
+%F SML:ar5iv:04:2024b
+%O SIGMathLing – Special Interest Group on Math Linguistics
+```
+
+### Generated via
+ - [LaTeXML 0.8.8](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.8),
+ - [CorTeX 0.4.5](https://github.com/dginev/CorTeX/releases/tag/0.4.5),
+ - [latexml-plugin-cortex 2.2](https://github.com/dginev/LaTeXML-Plugin-Cortex/releases/tag/2.2.0)
+
+### About
+This release is part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group.
+We are also the team that created and maintains the [ar5iv Lab](https://ar5iv.labs.arxiv.org/).
+
+The dataset is distributed through hosting provided by the University of Erlangen-Nuremberg (FAU).
+
+Author: Deyan Ginev
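
The MD5 values listed under "MD5 file integrity" can be checked once the archives are downloaded. The following is a minimal Python sketch, not part of the dataset tooling. It assumes the three archives sit in one directory under the file names shown in that listing; the helper names `md5_of_file` and `verify_archives` are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path

# Checksums copied from the "MD5 file integrity" listing on the page.
EXPECTED_MD5 = {
    "ar5iv-04-2024-no-problem.zip": "6ffa80fa273f29716527db36e1841abf",
    "ar5iv-04-2024-warnings.zip": "51582b218f55286e5fe08431eb5e299d",
    "ar5iv-04-2024-errors.zip": "9178d9635085a657956402077b4f8301",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so large archives are never fully loaded."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory="."):
    """Map each archive name to True (match), False (mismatch) or None (absent)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name  # assumed download location
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_archives().items():
        print(f"{name}: {labels[ok]}")
```

Reading in chunks matters here because the largest bundle is 216 GB archived; a mismatch usually means the download was truncated and should be repeated before unpacking.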