From 0951163b92eb4b7cccfef99e8348fc8cdba479a9 Mon Sep 17 00:00:00 2001
From: Deyan Ginev <deyan.ginev@gmail.com>
Date: Sun, 24 Jan 2021 15:15:39 -0500
Subject: [PATCH] announce arxmliv 2020 release

---
 resources/arxmliv-dataset-082017.md |   2 +-
 resources/arxmliv-dataset-082018.md |   2 +-
 resources/arxmliv-dataset-082019.md |   2 +-
 resources/arxmliv-dataset-2020.md   | 105 ++++++++++++++++++++++++++++
 resources/arxmliv.md                |   2 +
 5 files changed, 110 insertions(+), 3 deletions(-)
 create mode 100644 resources/arxmliv-dataset-2020.md

diff --git a/resources/arxmliv-dataset-082017.md b/resources/arxmliv-dataset-082017.md
index 7231db0..332d24b 100644
--- a/resources/arxmliv-dataset-082017.md
+++ b/resources/arxmliv-dataset-082017.md
@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
 
 ### Release
  - This page documents: 08.2017
- - Latest: [08.2019](/resources/arxmliv-dataset-082019/)
+ - Latest: [2020](/resources/arxmliv-dataset-2020/)
 
 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
diff --git a/resources/arxmliv-dataset-082018.md b/resources/arxmliv-dataset-082018.md
index 0606ec7..7601fcb 100644
--- a/resources/arxmliv-dataset-082018.md
+++ b/resources/arxmliv-dataset-082018.md
@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
 
 ### Release
  - This page documents: 08.2018
- - Latest: [08.2019](/resources/arxmliv-dataset-082019/)
+ - Latest: [2020](/resources/arxmliv-dataset-2020/)
 
 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
diff --git a/resources/arxmliv-dataset-082019.md b/resources/arxmliv-dataset-082019.md
index 8303032..6df240a 100644
--- a/resources/arxmliv-dataset-082019.md
+++ b/resources/arxmliv-dataset-082019.md
@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
 
 ### Release
  - This page documents: 08.2019
- - Latest: [08.2019](/resources/arxmliv-dataset-082019/)
+ - Latest: [2020](/resources/arxmliv-dataset-2020/)
 
 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
diff --git a/resources/arxmliv-dataset-2020.md b/resources/arxmliv-dataset-2020.md
new file mode 100644
index 0000000..fcb86ce
--- /dev/null
+++ b/resources/arxmliv-dataset-2020.md
@@ -0,0 +1,105 @@
+---
+layout: page
+title: arXMLiv 2020 - An HTML5 dataset for arXiv.org
+---
+
+### Release
+ - This page documents: arxmliv 2020 (latest)
+
+### Contents
+ - 1,581,037 HTML5 documents
+ - 354 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
+   - e.g. `2012` stands for December 2020, and **not** for the year 2012.
+ - The HTML sources total `236 GB` packaged, and `2.1 TB` unpacked.
+   - you also need 1.6 million free inodes to unpack the full data (check via `df -ih .`)
+
+### Download
+ - [Download link](https://gl.kwarc.info/SIGMathLing/dataset-arxmliv-2020)
+ - [SIGMathLing members](/member/) only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!
+
+### Description
+
+This is the fourth public release of the arXMLiv dataset generated by the [KWARC](https://kwarc.info/) research group.
+It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to and including the end of 2020. It offers a 15% increase in available articles over our 08.2019 release.
+
+The release also provides the associated conversion metadata under `meta/grouped_by_severity.zip`. The severity information allows filtering by whether the latexml process completed cleanly, with warnings or with recoverable errors.
+
+A unique feature of the arXMLiv generation process is latexml's cross-referenced and lexematized MathML representation for math syntax. Scroll to the bottom of the page for an example snippet.
+
+This version of the dataset has had minimal manual quality control, and we offer no additional warranty beyond the latexml severity reported.
+
+### Citing this Resource
+
+#### pure bibTeX
+```
+@MISC{SML:arXMLiv:2020,
+  author = {Deyan Ginev},
+  title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
+  howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/}},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2020}
+```
+
+#### bibTeX for the bibLaTeX package (preferred)
+```
+@online{SML:arXMLiv:2020,
+  author = {Deyan Ginev},
+  title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
+  url = {https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2020}
+```
+
+#### EndNote
+```
+%0 Generic
+%T arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org
+%A Ginev, Deyan
+%D 2020
+%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
+%F SML:arXMLiv:2020b
+%O SIGMathLing – Special Interest Group on Math Linguistics
+```
+
+### Accessibility and License
+The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
+and tool development purposes.
+
+Access is restricted to [SIGMathLing members](/member/) under the
+[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
+articles, the right of distribution was only given (or assumed) to arXiv itself.
+
+### Generated via
+ - [LaTeXML 0.8.5](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.5),
+ - [CorTeX 0.4.3](https://github.com/dginev/CorTeX/releases/tag/0.4.3)
+ - [latexml-plugin-cortex 1.1](https://hub.docker.com/repository/docker/dginev/latexml-plugin-cortex)
+
+### About
+Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group. Author: Deyan Ginev
+
+### Appendix
+**MathML formula example:**
+
+```xml
+<math id="Sx2.p1.1.m1.1" class="ltx_Math" alttext="\mathbb{E}_{x}" display="inline">
+  <semantics id="Sx2.p1.1.m1.1a">
+    <msub id="Sx2.p1.1.m1.1.1" xref="Sx2.p1.1.m1.1.1.cmml">
+      <mi id="Sx2.p1.1.m1.1.1.2" xref="Sx2.p1.1.m1.1.1.2.cmml">𝔼</mi>
+      <mi id="Sx2.p1.1.m1.1.1.3" xref="Sx2.p1.1.m1.1.1.3.cmml">x</mi>
+    </msub>
+    <annotation-xml encoding="MathML-Content" id="Sx2.p1.1.m1.1b">
+      <apply id="Sx2.p1.1.m1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">
+        <csymbol cd="ambiguous" id="Sx2.p1.1.m1.1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">subscript</csymbol>
+        <ci id="Sx2.p1.1.m1.1.1.2.cmml" xref="Sx2.p1.1.m1.1.1.2">𝔼</ci>
+        <ci id="Sx2.p1.1.m1.1.1.3.cmml" xref="Sx2.p1.1.m1.1.1.3">𝑥</ci>
+      </apply>
+    </annotation-xml>
+    <annotation encoding="application/x-tex" id="Sx2.p1.1.m1.1c">
+      \mathbb{E}_{x}
+    </annotation>
+    <annotation encoding="application/x-llamapun" id="Sx2.p1.1.m1.1d">
+      blackboard_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
+    </annotation>
+  </semantics>
+</math>
+```
diff --git a/resources/arxmliv.md b/resources/arxmliv.md
index 82ad9a1..04d076e 100644
--- a/resources/arxmliv.md
+++ b/resources/arxmliv.md
@@ -2,6 +2,8 @@
 layout: page
 title: SIGMathLing - arXMLiv Project Datasets and Resources
 ---
+## 2020
+ 1. [arXMLiv corpus, 2020 release](/resources/arxmliv-dataset-2020/)
 ## 2019
  1. [arXMLiv corpus, 08.2019 release](/resources/arxmliv-dataset-082019/)
 
-- 
GitLab