Document 08.2017 arXMLiv dataset release

Merged Deyan Ginev requested to merge arxmliv-release into master
---
layout: post
title: First Data Set (1.1 Million scientific HTML5 documents from arXiv)
---
SIGMathLing has published its first data set, which also serves as a template for future data
sets. Its content is licensed to [SIGMathLing members](/member/) for research
and tool development purposes, subject to the [SIGMathLing Non-Disclosure Agreement](/nda/).
This collection of 1.1 million HTML5 documents
was developed as part of the [arXMLiv](https://kwarc.info/systems/arXMLiv/) project at
the [KWARC](https://kwarc.info/) research group. It was created by converting the
[arXiv collection of scientific preprints up to August 2017](http://arxiv.org) via
[LaTeXML](https://github.com/brucemiller/LaTeXML) using the
[CorTeX corpus management system](https://github.com/dginev/CorTeX).
Details can be found on the [SIGMathLing Resource page](/resources/arxmliv/).