diff --git a/public/ar5iv-04-2024-c-uda.png b/public/ar5iv-04-2024-c-uda.png new file mode 100644 index 0000000000000000000000000000000000000000..cbd66c06a11d2f755b5e9360f7a2291756fafd7b Binary files /dev/null and b/public/ar5iv-04-2024-c-uda.png differ diff --git a/resources/ar5iv-dataset-2024.md b/resources/ar5iv-dataset-2024.md new file mode 100644 index 0000000000000000000000000000000000000000..060db597f631624602742d55b7d79129e9a50cb3 --- /dev/null +++ b/resources/ar5iv-dataset-2024.md @@ -0,0 +1,92 @@ +--- +layout: page +title: ar5iv 04.2024 - An HTML5 dataset for arXiv.org +--- + +<img src="/public/ar5iv-04-2024-c-uda.png" width="800px"> + +### Release + - This page documents: ar5iv 04.2024 (latest) + +### Contents + - 2,170,799 HTML documents + - Three separate archive bundles, separated by LaTeXML conversion severity + - For more information on severity, see [LaTeXML manual: Error Codes](https://math.nist.gov/~BMiller/LaTeXML/manual/errorcodes/) + - The HTML sources total `318 GB` packaged, and `2.83 TB` unpacked. + - you also need 2.18 million free inodes to unpack the full data (check via `df -ih .`) + - This dataset is HTML-only and does not include images. + + + | subset ID | number of documents | size archived | size unpacked | + | :--- | ---: | ---: | ---: | + | no\_problem | 366,232 | 20 GB | 155 GB | + | warning | 1,304,052 | 216 GB | 1989 GB | + | error | 500,515 | 82 GB | 753 GB | + + +### Download and License + - The dataset is licensed under [C-UDA-1.0](https://github.com/microsoft/Computational-Use-of-Data-Agreement/blob/a28ca06f6f8ecac0b5856ca6179ac49e55f00104/C-UDA-1.0.md). + - To receive a download link, please submit the [License Agreement form (click here)](https://docs.google.com/forms/d/e/1FAIpQLSd3fK-HcS3XUlWlzRt5cGHnAV-pXk4rddirH-E3TpleRnwtsg/viewform?usp=sf_link). + +### Description + +This is the first public release of the ar5iv dataset generated by the [KWARC](https://kwarc.info/) research group. 
+It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to the start of April 2024. + +As of April 2024, the HTML provided here also seeds the live [ar5iv Lab](https://ar5iv.labs.arxiv.org/) site, maintained by the same author. + +For articles with multiple published versions, the underlying TeX sources are the newest ones available, updated as of February 22, 2024. + +### MD5 file integrity + +``` +6ffa80fa273f29716527db36e1841abf ar5iv-04-2024-no-problem.zip +51582b218f55286e5fe08431eb5e299d ar5iv-04-2024-warnings.zip +9178d9635085a657956402077b4f8301 ar5iv-04-2024-errors.zip +``` + +### Citing this Resource + +#### pure bibTeX +``` +@MISC{SML:ar5iv:04:2024, + author = {Deyan Ginev}, + title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org}, + howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/}}, + note = {SIGMathLing -- Special Interest Group on Math Linguistics}, + year = {2024} } +``` + +#### bibTeX for the bibLaTeX package (preferred) +``` +@online{SML:ar5iv:04:2024, + author = {Deyan Ginev}, + title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org}, + url = {https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/}, + note = {SIGMathLing -- Special Interest Group on Math Linguistics}, + year = {2024} } +``` + +#### EndNote +``` +%0 Generic +%T ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org +%A Ginev, Deyan +%D 2024 +%I hosted at https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/ +%F SML:ar5iv:04:2024b +%O SIGMathLing – Special Interest Group on Math Linguistics +``` + +### Generated via + - [LaTeXML 0.8.8](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.8), + - [CorTeX 0.4.5](https://github.com/dginev/CorTeX/releases/tag/0.4.5), + - [latexml-plugin-cortex 2.2](https://github.com/dginev/LaTeXML-Plugin-Cortex/releases/tag/2.2.0) + +### About +This release is part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) 
project at the [KWARC](https://kwarc.info/) research group. +We are also the team which created and maintains the [ar5iv Lab](https://ar5iv.labs.arxiv.org/). + +The dataset is distributed through hosting provided by the University of Erlangen-Nuremberg (FAU). + +Author: Deyan Ginev diff --git a/resources/arxmliv-dataset-082017.md b/resources/arxmliv-dataset-082017.md index 7a4da03009702e0007f649327b74a9ccdb187b4b..b552e25fd27122fd3dadf20e55a80fde33191a8c 100644 --- a/resources/arxmliv-dataset-082017.md +++ b/resources/arxmliv-dataset-082017.md @@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR ### Release - This page documents: 08.2017 - - Latest: [2020](/resources/arxmliv-dataset-2020/) + - Latest: [04.2024](/resources/ar5iv-dataset-2024/) ### Accessibility and License The content of this Dataset is licensed to [SIGMathLing members](/member/) for research diff --git a/resources/arxmliv-dataset-082018.md b/resources/arxmliv-dataset-082018.md index 43d133ad62a0cfb4a483d3f6bdf9773152f59456..7d8020e04713d04352f6109d1697054375d34979 100644 --- a/resources/arxmliv-dataset-082018.md +++ b/resources/arxmliv-dataset-082018.md @@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR ### Release - This page documents: 08.2018 - - Latest: [2020](/resources/arxmliv-dataset-2020/) + - Latest: [04.2024](/resources/ar5iv-dataset-2024/) ### Accessibility and License The content of this Dataset is licensed to [SIGMathLing members](/member/) for research diff --git a/resources/arxmliv-dataset-082019.md b/resources/arxmliv-dataset-082019.md index 22e33297bfc93e42324e8e832cc1274d49db4561..f26e0a12a489e721b915e62537152b32c1b88791 100644 --- a/resources/arxmliv-dataset-082019.md +++ b/resources/arxmliv-dataset-082019.md @@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR ### Release - This page documents: 08.2019 - - Latest: 
[2020](/resources/arxmliv-dataset-2020/) + - Latest: [04.2024](/resources/ar5iv-dataset-2024/) ### Accessibility and License The content of this Dataset is licensed to [SIGMathLing members](/member/) for research diff --git a/resources/arxmliv-dataset-2020.md b/resources/arxmliv-dataset-2020.md index ef3eb33c7033d67e3b937cc371cbb7865b220013..2466b13e4882e3c54eba59b220ef211195a24d9d 100644 --- a/resources/arxmliv-dataset-2020.md +++ b/resources/arxmliv-dataset-2020.md @@ -4,7 +4,8 @@ title: arXMLiv 2020 - An HTML5 dataset for arXiv.org --- ### Release - - This page documents: arxmliv 2020 (latest) + - This page documents: arxmliv 2020 + - Latest: [04.2024](/resources/ar5iv-dataset-2024/) ### Contents - 1,581,037 HTML5 documents diff --git a/resources/arxmliv.md b/resources/arxmliv.md index 04d076efadfb0e2f6ec5345fc811ad750728e194..1f9c43eac35317d789129972ea0493d51cbea9b3 100644 --- a/resources/arxmliv.md +++ b/resources/arxmliv.md @@ -2,6 +2,10 @@ layout: page title: SIGMathLing - arXMLiv Project Datasets and Resources --- + +## 2024 + 1. [ar5iv corpus, 2024 release](/resources/ar5iv-dataset-2024/) + ## 2020 1. [arXMLiv corpus, 2020 release](/resources/arxmliv-dataset-2020/) diff --git a/resources/index.md b/resources/index.md index 31d9a318b207b7f5e8e5514fed37b812677df64e..9ba293c2d10333437c761e5cef84e443ef8069c0 100644 --- a/resources/index.md +++ b/resources/index.md @@ -3,6 +3,7 @@ layout: page title: SIGMathLing - Datasets and Resources --- ## Resources hosted on the SIGMathLing Repository + 1. [ar5iv corpus, 04.2024 release](/resources/ar5iv-dataset-2024/) 1. [argot dataset 2021](/resources/argot-dataset-2021/) 1. [arXMLiv corpus 2020](/resources/arxmliv-dataset-2020/) 1. [arXMLiv corpus, 08.2019 release](/resources/arxmliv-dataset-082019/)