Commit 84d41370 authored 5 years ago by Michael Kohlhase
--- a/_posts/2018-09-13.md
+++ b/_posts/2018-09-13.md
+---
+layout: post
+title: 2018 arXiv Datasets (1.2 Million scientific HTML5 documents from arXiv and token models)
+---
+SIGMathLing has published the second set of arXiv data sets. The content of these data
+sets is licensed to [SIGMathLing members](/member/) for research
+and tool development purposes subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).
+
+This collection of 1.2 million HTML5 documents
+was developed as part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at
+the [KWARC](https://kwarc.info/) research group. It was created by converting the
+[arXiv collection of scientific preprints up to August 2018](http://arxiv.org) with
+[LaTeXML](https://github.com/brucemiller/LaTeXML), using the
+[CorTeX corpus management system](https://github.com/dginev/CorTeX).
+
+The token models are generated from this document collection via the
+[LLaMaPuN 0.2](https://github.com/KWARC/llamapun/releases/tag/0.2.0) and
+[GloVe 1.2](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
+libraries. 
+
+Details can be found on the [SIGMathLing Resource page](/resources/).
+
+