diff --git a/resources/arxmliv-dataset-082017.md b/resources/arxmliv-dataset-082017.md
index 7231db01f14075e2099ff9a716ef7b0ed38724a1..332d24b1c6ad44e38f7d2ac9ee30da357ddf2eb9 100644
--- a/resources/arxmliv-dataset-082017.md
+++ b/resources/arxmliv-dataset-082017.md
@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
 
 ### Release
  - This page documents: 08.2017
- - Latest: [08.2019](/resources/arxmliv-dataset-082019/)
+ - Latest: [2020](/resources/arxmliv-dataset-2020/)
 
 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
diff --git a/resources/arxmliv-dataset-082018.md b/resources/arxmliv-dataset-082018.md
index 0606ec789d180ca7659e9dc8f685a582d32a6da1..7601fcbbf9b3fa7a917ed6b0f572141cc269f09a 100644
--- a/resources/arxmliv-dataset-082018.md
+++ b/resources/arxmliv-dataset-082018.md
@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
 
 ### Release
  - This page documents: 08.2018
- - Latest: [08.2019](/resources/arxmliv-dataset-082019/)
+ - Latest: [2020](/resources/arxmliv-dataset-2020/)
 
 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
diff --git a/resources/arxmliv-dataset-082019.md b/resources/arxmliv-dataset-082019.md
index 830303254cab54cd4ae55416109d1d8ed8602fe1..6df240a32edab8582505b3c14c930b9a94079ce2 100644
--- a/resources/arxmliv-dataset-082019.md
+++ b/resources/arxmliv-dataset-082019.md
@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
 
 ### Release
  - This page documents: 08.2019
- - Latest: [08.2019](/resources/arxmliv-dataset-082019/)
+ - Latest: [2020](/resources/arxmliv-dataset-2020/)
 
 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
diff --git a/resources/arxmliv-dataset-2020.md b/resources/arxmliv-dataset-2020.md
new file mode 100644
index 0000000000000000000000000000000000000000..fcb86ce9f5d8682b4fe0754cdc16035dedad3ba4
--- /dev/null
+++ b/resources/arxmliv-dataset-2020.md
@@ -0,0 +1,105 @@
+---
+layout: page
+title: arXMLiv 2020 - An HTML5 dataset for arXiv.org
+---
+
+### Release
+ - This page documents: arxmliv 2020 (latest)
+
+### Contents
+ - 1,581,037 HTML5 documents
+ - 354 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
+    - e.g. `2012` stands for December 2020, and **not** for the year 2012.
+ - The HTML sources total `236 GB` packaged, and `2.1 TB` unpacked.
+    - you also need 1.6 million free inodes to unpack the full data (check via `df -ih .`)
+
+### Download
+ - [Download link](https://gl.kwarc.info/SIGMathLing/dataset-arxmliv-2020)
+ - [SIGMathLing members](/member/) only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!
+
+### Description
+
+This is the fourth public release of the arXMLiv dataset generated by the [KWARC](https://kwarc.info/) research group.
+It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to and including the end of 2020. It offers a 15% increase in available articles over our 08.2019 release.
+
+The release also provides the associated conversion metadata under `meta/grouped_by_severity.zip`. The severity information allows filtering by whether the latexml process completed cleanly, with warnings or with recoverable errors.
+
+A unique feature of the arXMLiv generation process is latexml's cross-referenced and lexematized MathML representation for math syntax. Scroll to the bottom of the page for an example snippet.
+
+This version of the dataset has had minimal manual quality control, and we offer no additional warranty beyond the latexml severity reported.
+
+### Citing this Resource
+
+#### pure bibTeX
+```
+@MISC{SML:arXMLiv:2020,
+  author = {Deyan Ginev},
+  title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
+  howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/}},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2020}
+```
+
+#### bibTeX for the bibLaTeX package (preferred)
+```
+@online{SML:arXMLiv:2020,
+  author = {Deyan Ginev},
+  title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
+  url = {https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2020}
+```
+
+#### EndNote
+```
+%0 Generic
+%T arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org
+%A Ginev, Deyan
+%D 2020
+%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
+%F SML:arXMLiv:2020b
+%O SIGMathLing – Special Interest Group on Math Linguistics
+```
+
+### Accessibility and License
+The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
+and tool development purposes.
+
+Access is restricted to [SIGMathLing members](/member/) under the
+[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
+articles, the right of distribution was only given (or assumed) to arXiv itself.
+
+### Generated via
+ - [LaTeXML 0.8.5](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.5),
+ - [CorTeX 0.4.3](https://github.com/dginev/CorTeX/releases/tag/0.4.3)
+ - [latexml-plugin-cortex 1.1](https://hub.docker.com/repository/docker/dginev/latexml-plugin-cortex)
+
+### About
+Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group. Author: Deyan Ginev
+
+### Appendix
+**MathML formula example:**
+
+```xml
+<math id="Sx2.p1.1.m1.1" class="ltx_Math" alttext="\mathbb{E}_{x}" display="inline">
+  <semantics id="Sx2.p1.1.m1.1a">
+    <msub id="Sx2.p1.1.m1.1.1" xref="Sx2.p1.1.m1.1.1.cmml">
+      <mi id="Sx2.p1.1.m1.1.1.2" xref="Sx2.p1.1.m1.1.1.2.cmml">𝔼</mi>
+      <mi id="Sx2.p1.1.m1.1.1.3" xref="Sx2.p1.1.m1.1.1.3.cmml">x</mi>
+    </msub>
+    <annotation-xml encoding="MathML-Content" id="Sx2.p1.1.m1.1b">
+      <apply id="Sx2.p1.1.m1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">
+        <csymbol cd="ambiguous" id="Sx2.p1.1.m1.1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">subscript</csymbol>
+        <ci id="Sx2.p1.1.m1.1.1.2.cmml" xref="Sx2.p1.1.m1.1.1.2">𝔼</ci>
+        <ci id="Sx2.p1.1.m1.1.1.3.cmml" xref="Sx2.p1.1.m1.1.1.3">𝑥</ci>
+      </apply>
+    </annotation-xml>
+    <annotation encoding="application/x-tex" id="Sx2.p1.1.m1.1c">
+      \mathbb{E}_{x}
+    </annotation>
+    <annotation encoding="application/x-llamapun" id="Sx2.p1.1.m1.1d">
+      blackboard_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
+    </annotation>
+  </semantics>
+</math>
+```
diff --git a/resources/arxmliv.md b/resources/arxmliv.md
index 82ad9a1f7e43d0a17284d1286eeb02400f475969..04d076efadfb0e2f6ec5345fc811ad750728e194 100644
--- a/resources/arxmliv.md
+++ b/resources/arxmliv.md
@@ -2,6 +2,8 @@
 layout: page
 title: SIGMathLing - arXMLiv Project Datasets and Resources
 ---
+## 2020
+ 1. [arXMLiv corpus, 2020 release](/resources/arxmliv-dataset-2020/)
 ## 2019
  1. [arXMLiv corpus, 08.2019 release](/resources/arxmliv-dataset-082019/)
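
Note on the `yymm` archive naming scheme described in the new 2020 page: the two-digit year is easy to misread, so a small decoder may help anyone scripting over the 354 archives. The following is a minimal sketch, not part of the dataset tooling. The function name `parse_yymm` is hypothetical, the optional `.zip` suffix is an assumption about how the archive files are named, and the century pivot assumes that arXiv identifiers begin in 1991, so `91`-`99` map to the 1990s and everything else to the 2000s.

```python
import re
from typing import Tuple


def parse_yymm(name: str) -> Tuple[int, int]:
    """Decode an arXiv-style `yymm` archive name into (year, month).

    Hypothetical helper: accepts a bare name such as "2012" or, by
    assumption, a file name such as "2012.zip".
    """
    match = re.fullmatch(r"(\d{2})(\d{2})(?:\.zip)?", name)
    if match is None:
        raise ValueError(f"not a yymm archive name: {name!r}")
    yy, mm = int(match.group(1)), int(match.group(2))
    if not 1 <= mm <= 12:
        raise ValueError(f"invalid month in archive name: {name!r}")
    # Assumption: arXiv started in 1991, so yy >= 91 means 19yy,
    # and any smaller value means 20yy.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return year, mm


if __name__ == "__main__":
    # `2012` is December 2020, not the year 2012.
    print(parse_yymm("2012"))
```

The pivot would need revisiting once identifiers reach the 2090s, which is outside the range of this release.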