Commit 84d41370 authored by Michael Kohlhase

more

parent d63bd576
Pipeline #1694 passed
---
layout: post
title: 2018 arXiv Datasets (1.2 Million scientific HTML5 documents from arXiv and token models)
---
SIGMathLing has published the second set of arXiv data sets. The content of these data
sets is licensed to [SIGMathLing members](/member/) for research
and tool development purposes, subject to the [SIGMathLing Non-Disclosure Agreement](/nda/).
This collection of 1.2 million HTML5 documents
has been developed as part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at
the [KWARC](https://kwarc.info/) research group. It was created by converting the
[arXiv collection of scientific preprints until August 2018](http://arxiv.org) via
[LaTeXML](https://github.com/brucemiller/LaTeXML) using the
[CorTeX corpus management system](https://github.com/dginev/CorTeX).
The token models are generated from this document collection via the
[LLaMaPuN 0.2](https://github.com/KWARC/llamapun/releases/tag/0.2.0) and
[GloVe 1.2](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
libraries.
Details can be found on the [SIGMathLing Resource page](/resources/).
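
As a rough illustration of how a token model of this kind might be consumed, the sketch below parses vectors in GloVe's plain-text output format (one token per line, followed by its space-separated components) and compares two tokens by cosine similarity. It assumes the published models use that text format, which this post does not state; the helper names `parse_glove_lines` and `cosine` and the three-token sample are hypothetical and are not part of the data set.

```python
import math
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe-style text vectors: a token, then its components."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample standing in for a real vectors file (assumption).
sample = [
    "the 0.1 0.2 0.3",
    "lemma 0.2 0.1 0.0",
    "theorem 0.4 0.2 0.0",
]
vectors = parse_glove_lines(sample)
similarity = cosine(vectors["lemma"], vectors["theorem"])
```

For a real file, the same function can be given an open file handle in place of the `sample` list, since it only iterates over lines.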