Commit 84d41370 authored by Michael Kohlhase

more

parent d63bd576
---
layout: post
title: 2018 arXiv Datasets (1.2 Million scientific HTML5 documents from arXiv and token models)
---
SIGMathLing has published the second set of arXiv data sets. The content of these data
sets is licensed to [SIGMathLing members](/member/) for research
and tool development purposes, subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).
This collection of 1.2 million HTML5 documents
has been developed as part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at
the [KWARC](https://kwarc.info/) research group. It was created by converting the
[arXiv collection of scientific preprints up to August 2018](http://arxiv.org) via
[LaTeXML](https://github.com/brucemiller/LaTeXML) using the
[CorTeX corpus management system](https://github.com/dginev/CorTeX).
The token models are generated from this document collection via the
[LLaMaPuN 0.2](https://github.com/KWARC/llamapun/releases/tag/0.2.0) and
[GloVe 1.2](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
libraries.
Details can be found on the [SIGMathLing Resource page](/resources/).
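
As a rough illustration of how such token models might be consumed, the sketch below parses embeddings in GloVe's plain-text layout (one token per line, followed by its space-separated vector components) and compares two tokens by cosine similarity. It assumes the token models are distributed in that layout, which this post does not state; the function names and the sample tokens and values are hypothetical and stand in for the real data.

```python
import io
import math


def load_glove_text(lines):
    """Parse GloVe-style text lines into a {token: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical in-memory sample; a real run would pass an opened vectors file.
sample = io.StringIO("theorem 1.0 0.0\nlemma 0.8 0.6\n")
vectors = load_glove_text(sample)
similarity = cosine(vectors["theorem"], vectors["lemma"])
```

To read an actual file, pass the handle from `open(path, encoding="utf-8")` to `load_glove_text`. The path depends on how the data set is packaged; see the resource page for the authoritative layout.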