Commit 0951163b authored by Deyan Ginev

announce arxmliv 2020 release

parent fbb22ec3
This commit is part of merge request !13, "Announce arxmliv 2020 release".
@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
 ### Release
 - This page documents: 08.2017
-- Latest: [08.2019](/resources/arxmliv-dataset-082019/)
+- Latest: [2020](/resources/arxmliv-dataset-2020/)
 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
...
@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
 ### Release
 - This page documents: 08.2018
-- Latest: [08.2019](/resources/arxmliv-dataset-082019/)
+- Latest: [2020](/resources/arxmliv-dataset-2020/)
 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
...
@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
 ### Release
 - This page documents: 08.2019
-- Latest: [08.2019](/resources/arxmliv-dataset-082019/)
+- Latest: [2020](/resources/arxmliv-dataset-2020/)
 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
...
---
layout: page
title: arXMLiv 2020 - An HTML5 dataset for arXiv.org
---
### Release
- This page documents: arxmliv 2020 (latest)
### Contents
- 1,581,037 HTML5 documents
- 354 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
  - e.g. `2012` stands for December 2020, and **not** for the year 2012.
- The HTML sources total `236 GB` packaged, and `2.1 TB` unpacked.
  - you also need 1.6 million free inodes to unpack the full data (check via `df -ih .`)
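
The `yymm` naming and the inode requirement above can be checked programmatically. The following is a minimal Python sketch, not part of the release tooling: the function names are hypothetical, the `.zip` suffix handling is an assumption about the archive file names, and the century pivot assumes that two-digit years 91 to 99 belong to the 1990s. `os.statvfs` is only available on Unix-like systems, and some filesystems report `0` for inode counts.

```python
import os

# Figure quoted in the contents list above.
REQUIRED_INODES = 1_600_000


def parse_yymm(archive_name: str) -> tuple[int, int]:
    """Decode an arXiv-style `yymm` archive name into (year, month)."""
    # Assumption: archives may be named like "2012" or "2012.zip".
    stem = archive_name.removesuffix(".zip")
    if len(stem) != 4 or not (stem.isascii() and stem.isdigit()):
        raise ValueError(f"not a yymm name: {archive_name!r}")
    yy, mm = int(stem[:2]), int(stem[2:])
    if not 1 <= mm <= 12:
        raise ValueError(f"invalid month in {archive_name!r}")
    # Assumption: 91-99 map to the 1990s, everything else to the 2000s.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return year, mm


def free_inodes(path: str = ".") -> int:
    """Return the number of inodes available to the current user at `path`."""
    return os.statvfs(path).f_favail


if __name__ == "__main__":
    print(parse_yymm("2012"))  # (2020, 12): December 2020, not the year 2012
    print("enough inodes:", free_inodes(".") >= REQUIRED_INODES)
```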
### Download
- [Download link](https://gl.kwarc.info/SIGMathLing/dataset-arxmliv-2020)
- [SIGMathLing members](/member/) only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!
### Description
This is the fourth public release of the arXMLiv dataset generated by the [KWARC](https://kwarc.info/) research group.
It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to and including the end of 2020. It offers a 15% increase in available articles over our 08.2019 release.
The release also provides the associated conversion metadata under `meta/grouped_by_severity.zip`. The severity information lets you filter documents by whether the latexml process completed cleanly, with warnings, or with recoverable errors.
A unique feature of the arXMLiv generation process is latexml's cross-referenced and lexematized MathML representation for math syntax. Scroll to the bottom of the page for an example snippet.
This version of the dataset has had minimal manual quality control, and we offer no additional warranty beyond the latexml severity reported.
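
The internal layout of `meta/grouped_by_severity.zip` is not described on this page, so a reasonable first step is to list its members before writing any severity filter. The sketch below uses only Python's standard `zipfile` module; the helper name is hypothetical and nothing is assumed about the file names inside the archive.

```python
import zipfile


def list_archive_members(source) -> list[str]:
    """Return the names of all regular files in a ZIP archive.

    `source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]


# Example usage (path taken from the description above):
# for name in list_archive_members("meta/grouped_by_severity.zip"):
#     print(name)
```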
### Citing this Resource
#### pure bibTeX
```
@MISC{SML:arXMLiv:2020,
author = {Deyan Ginev},
title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = 2020}
```
#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:arXMLiv:2020,
author = {Deyan Ginev},
title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
url = {https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = 2020}
```
#### EndNote
```
%0 Generic
%T arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2020
%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
%F SML:arXMLiv:2020
%O SIGMathLing – Special Interest Group on Math Linguistics
```
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
articles, the right of distribution was only given (or assumed) to arXiv itself.
### Generated via
- [LaTeXML 0.8.5](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.5)
- [CorTeX 0.4.3](https://github.com/dginev/CorTeX/releases/tag/0.4.3)
- [latexml-plugin-cortex 1.1](https://hub.docker.com/repository/docker/dginev/latexml-plugin-cortex)
### About
Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group. Author: Deyan Ginev
### Appendix
**MathML formula example:**
```xml
<math id="Sx2.p1.1.m1.1" class="ltx_Math" alttext="\mathbb{E}_{x}" display="inline">
<semantics id="Sx2.p1.1.m1.1a">
<msub id="Sx2.p1.1.m1.1.1" xref="Sx2.p1.1.m1.1.1.cmml">
<mi id="Sx2.p1.1.m1.1.1.2" xref="Sx2.p1.1.m1.1.1.2.cmml">𝔼</mi>
<mi id="Sx2.p1.1.m1.1.1.3" xref="Sx2.p1.1.m1.1.1.3.cmml">x</mi>
</msub>
<annotation-xml encoding="MathML-Content" id="Sx2.p1.1.m1.1b">
<apply id="Sx2.p1.1.m1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">
<csymbol cd="ambiguous" id="Sx2.p1.1.m1.1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">subscript</csymbol>
<ci id="Sx2.p1.1.m1.1.1.2.cmml" xref="Sx2.p1.1.m1.1.1.2">𝔼</ci>
<ci id="Sx2.p1.1.m1.1.1.3.cmml" xref="Sx2.p1.1.m1.1.1.3">𝑥</ci>
</apply>
</annotation-xml>
<annotation encoding="application/x-tex" id="Sx2.p1.1.m1.1c">
\mathbb{E}_{x}
</annotation>
<annotation encoding="application/x-llamapun" id="Sx2.p1.1.m1.1d">
blackboard_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
</annotation>
</semantics>
</math>
```
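
As an illustration of how the cross-references and lexemes in the example above might be consumed, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The helper names are hypothetical. The snippet is parsed standalone without a namespace; when the formula is extracted from a full HTML5 document, the element names may be namespace-qualified depending on the parser, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# The formula from the appendix, embedded verbatim as a raw string.
SNIPPET = r"""<math id="Sx2.p1.1.m1.1" class="ltx_Math" alttext="\mathbb{E}_{x}" display="inline">
<semantics id="Sx2.p1.1.m1.1a">
<msub id="Sx2.p1.1.m1.1.1" xref="Sx2.p1.1.m1.1.1.cmml">
<mi id="Sx2.p1.1.m1.1.1.2" xref="Sx2.p1.1.m1.1.1.2.cmml">𝔼</mi>
<mi id="Sx2.p1.1.m1.1.1.3" xref="Sx2.p1.1.m1.1.1.3.cmml">x</mi>
</msub>
<annotation-xml encoding="MathML-Content" id="Sx2.p1.1.m1.1b">
<apply id="Sx2.p1.1.m1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">
<csymbol cd="ambiguous" id="Sx2.p1.1.m1.1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">subscript</csymbol>
<ci id="Sx2.p1.1.m1.1.1.2.cmml" xref="Sx2.p1.1.m1.1.1.2">𝔼</ci>
<ci id="Sx2.p1.1.m1.1.1.3.cmml" xref="Sx2.p1.1.m1.1.1.3">𝑥</ci>
</apply>
</annotation-xml>
<annotation encoding="application/x-tex" id="Sx2.p1.1.m1.1c">
\mathbb{E}_{x}
</annotation>
<annotation encoding="application/x-llamapun" id="Sx2.p1.1.m1.1d">
blackboard_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
</annotation>
</semantics>
</math>"""


def read_annotations(math: ET.Element) -> dict[str, str]:
    """Map each <annotation> encoding to its whitespace-trimmed text."""
    return {
        ann.get("encoding"): (ann.text or "").strip()
        for ann in math.iter("annotation")
    }


def content_to_presentation(math: ET.Element) -> dict[str, ET.Element]:
    """Follow `xref` from each content <ci> node to its presentation node."""
    by_id = {el.get("id"): el for el in math.iter() if el.get("id")}
    links = {}
    for ci in math.iter("ci"):
        target = by_id.get(ci.get("xref"))
        if target is not None:  # childless elements are falsy, so compare to None
            links[ci.get("id")] = target
    return links


math = ET.fromstring(SNIPPET)
annotations = read_annotations(math)
lexemes = annotations["application/x-llamapun"].split()
links = content_to_presentation(math)

print(annotations["application/x-tex"])
print(lexemes)
for content_id, node in links.items():
    print(content_id, "->", node.tag, node.get("id"))
```

The `application/x-llamapun` annotation carries the lexematized form of the formula as space-separated tokens, and the `xref` attributes connect each Content MathML node to its Presentation MathML counterpart, so the two trees can be traversed in parallel.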
@@ -2,6 +2,8 @@
 layout: page
 title: SIGMathLing - arXMLiv Project Datasets and Resources
 ---
+## 2020
+1. [arXMLiv corpus, 2020 release](/resources/arxmliv-dataset-2020/)
 ## 2019
 1. [arXMLiv corpus, 08.2019 release](/resources/arxmliv-dataset-082019/)
...