Commit d206b7b7 authored by Deyan Ginev

Merge branch 'arxmliv-2020' into 'master'

Announce arxmliv 2020 release

See merge request !13
parents fbb22ec3 0951163b
Pipeline #2964 passed with stage in 1 minute and 15 seconds
......@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
### Release
- This page documents: 08.2017
- Latest: [08.2019](/resources/arxmliv-dataset-082019/)
- Latest: [2020](/resources/arxmliv-dataset-2020/)
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
......
......@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
### Release
- This page documents: 08.2018
- Latest: [08.2019](/resources/arxmliv-dataset-082019/)
- Latest: [2020](/resources/arxmliv-dataset-2020/)
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
......
......@@ -9,7 +9,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
### Release
- This page documents: 08.2019
- Latest: [08.2019](/resources/arxmliv-dataset-082019/)
- Latest: [2020](/resources/arxmliv-dataset-2020/)
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
......
---
layout: page
title: arXMLiv 2020 - An HTML5 dataset for arXiv.org
---
### Release
- This page documents: arxmliv 2020 (latest)
### Contents
- 1,581,037 HTML5 documents
- 354 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
  - e.g. `2012` stands for December 2020, and **not** for the year 2012.
- The HTML sources total `236 GB` packaged, and `2.1 TB` unpacked.
  - You also need 1.6 million free inodes to unpack the full data (check via `df -ih .`).
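
The `yymm` naming scheme is easy to misread, so here is a minimal Python sketch of decoding such a label into a calendar year and month. The function name `decode_yymm` is hypothetical, and the century pivot (labels `91` to `99` map to the 1990s, everything else to the 2000s) is an assumption based on arXiv having started in 1991; it is not part of the release tooling.

```python
from typing import Tuple


def decode_yymm(label: str) -> Tuple[int, int]:
    """Decode an arXiv-style `yymm` label into (year, month).

    Assumption: two-digit years 91..99 belong to the 1990s,
    all other values belong to the 2000s.
    """
    if len(label) != 4 or not (label.isascii() and label.isdigit()):
        raise ValueError(f"expected four digits, got {label!r}")
    yy, mm = int(label[:2]), int(label[2:])
    if not 1 <= mm <= 12:
        raise ValueError(f"invalid month in {label!r}")
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return year, mm


if __name__ == "__main__":
    # `2012` is December 2020, not the year 2012.
    print(decode_yymm("2012"))
```

A helper like this can be used to sort or filter archives chronologically, since a plain string sort would place the 1990s labels after the 2000s ones.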
### Download
- [Download link](https://gl.kwarc.info/SIGMathLing/dataset-arxmliv-2020)
- [SIGMathLing members](/member/) only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!
### Description
This is the fourth public release of the arXMLiv dataset generated by the [KWARC](https://kwarc.info/) research group.
It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to and including the end of 2020. It offers a 15% increase in available articles over our 08.2019 release.
The release also provides the associated conversion metadata under `meta/grouped_by_severity.zip`. The severity information makes it possible to filter documents by whether the latexml process completed cleanly, with warnings, or with recoverable errors.
A unique feature of the arXMLiv generation process is latexml's cross-referenced and lexematized MathML representation for math syntax. Scroll to the bottom of the page for an example snippet.
This version of the dataset has had minimal manual quality control, and we offer no additional warranty beyond the latexml severity reported.
### Citing this Resource
#### pure bibTeX
```
@MISC{SML:arXMLiv:2020,
author = {Deyan Ginev},
title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = 2020}
```
#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:arXMLiv:2020,
author = {Deyan Ginev},
title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
url = {https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = 2020}
```
#### EndNote
```
%0 Generic
%T arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2020
%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
%F SML:arXMLiv:2020b
%O SIGMathLing – Special Interest Group on Math Linguistics
```
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/), because for most [arXiv](http://arxiv.org)
articles the right of distribution was only given (or assumed) to arXiv itself.
### Generated via
- [LaTeXML 0.8.5](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.5)
- [CorTeX 0.4.3](https://github.com/dginev/CorTeX/releases/tag/0.4.3)
- [latexml-plugin-cortex 1.1](https://hub.docker.com/repository/docker/dginev/latexml-plugin-cortex)
### About
Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group. Author: Deyan Ginev
### Appendix
**MathML formula example:**
```xml
<math id="Sx2.p1.1.m1.1" class="ltx_Math" alttext="\mathbb{E}_{x}" display="inline">
<semantics id="Sx2.p1.1.m1.1a">
<msub id="Sx2.p1.1.m1.1.1" xref="Sx2.p1.1.m1.1.1.cmml">
<mi id="Sx2.p1.1.m1.1.1.2" xref="Sx2.p1.1.m1.1.1.2.cmml">𝔼</mi>
<mi id="Sx2.p1.1.m1.1.1.3" xref="Sx2.p1.1.m1.1.1.3.cmml">x</mi>
</msub>
<annotation-xml encoding="MathML-Content" id="Sx2.p1.1.m1.1b">
<apply id="Sx2.p1.1.m1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">
<csymbol cd="ambiguous" id="Sx2.p1.1.m1.1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">subscript</csymbol>
<ci id="Sx2.p1.1.m1.1.1.2.cmml" xref="Sx2.p1.1.m1.1.1.2">𝔼</ci>
<ci id="Sx2.p1.1.m1.1.1.3.cmml" xref="Sx2.p1.1.m1.1.1.3">𝑥</ci>
</apply>
</annotation-xml>
<annotation encoding="application/x-tex" id="Sx2.p1.1.m1.1c">
\mathbb{E}_{x}
</annotation>
<annotation encoding="application/x-llamapun" id="Sx2.p1.1.m1.1d">
blackboard_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
</annotation>
</semantics>
</math>
```
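
As an illustration of how the cross-referenced representation might be consumed, the following Python sketch parses the example formula with the standard library, reads the TeX annotation, and follows each `xref` attribute to the element it points at. The helper names `tex_source` and `xref_targets` are hypothetical, and the snippet is assumed to carry no XML namespace, as in the example; documents parsed as XHTML may need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# The formula from the example above, embedded for a self-contained demo.
SNIPPET = r"""<math id="Sx2.p1.1.m1.1" class="ltx_Math" alttext="\mathbb{E}_{x}" display="inline">
<semantics id="Sx2.p1.1.m1.1a">
<msub id="Sx2.p1.1.m1.1.1" xref="Sx2.p1.1.m1.1.1.cmml">
<mi id="Sx2.p1.1.m1.1.1.2" xref="Sx2.p1.1.m1.1.1.2.cmml">𝔼</mi>
<mi id="Sx2.p1.1.m1.1.1.3" xref="Sx2.p1.1.m1.1.1.3.cmml">x</mi>
</msub>
<annotation-xml encoding="MathML-Content" id="Sx2.p1.1.m1.1b">
<apply id="Sx2.p1.1.m1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">
<csymbol cd="ambiguous" id="Sx2.p1.1.m1.1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">subscript</csymbol>
<ci id="Sx2.p1.1.m1.1.1.2.cmml" xref="Sx2.p1.1.m1.1.1.2">𝔼</ci>
<ci id="Sx2.p1.1.m1.1.1.3.cmml" xref="Sx2.p1.1.m1.1.1.3">𝑥</ci>
</apply>
</annotation-xml>
<annotation encoding="application/x-tex" id="Sx2.p1.1.m1.1c">
\mathbb{E}_{x}
</annotation>
</semantics>
</math>"""


def tex_source(math_el):
    """Return the TeX annotation of a <math> element, or None if absent."""
    node = math_el.find(".//annotation[@encoding='application/x-tex']")
    if node is None or node.text is None:
        return None
    return node.text.strip()


def xref_targets(math_el):
    """Map each element id to the element its xref attribute points at."""
    by_id = {el.get("id"): el for el in math_el.iter() if el.get("id")}
    return {
        el.get("id"): by_id[el.get("xref")]
        for el in math_el.iter()
        if el.get("id") and el.get("xref") in by_id
    }


root = ET.fromstring(SNIPPET)

if __name__ == "__main__":
    print(tex_source(root))
    for source_id, target in xref_targets(root).items():
        print(source_id, "->", target.tag, target.get("id"))
```

The id-to-element dictionary keeps the lookup linear in the size of the formula, which matters when iterating over many formulas per document.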
......@@ -2,6 +2,8 @@
layout: page
title: SIGMathLing - arXMLiv Project Datasets and Resources
---
## 2020
1. [arXMLiv corpus, 2020 release](/resources/arxmliv-dataset-2020/)
## 2019
1. [arXMLiv corpus, 08.2019 release](/resources/arxmliv-dataset-082019/)
......