Commit 84d41370 authored by Michael Kohlhase
parent d63bd576
layout: post
title: 2018 arXiv Datasets (1.2 Million scientific HTML5 documents from arXiv and token models)
SIGMathLing has published the second set of arXiv data sets. The content of these data
sets is licensed to [SIGMathLing members](/member/) for research
and tool development purposes, subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).
This collection of 1.2 million HTML5 documents
has been developed as part of the arXMLiv project at
the KWARC research group. It was created by converting the
arXiv collection of scientific preprints up to August 2018 via
LaTeXML using the
CorTeX corpus management system.
The token models are generated from this document collection via
LLaMaPuN 0.2 and
GloVe 1.2.
Details can be found on the [SIGMathLing Resource page](/resources/).
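
As a rough illustration of how such a token model might be consumed, here is a minimal sketch that assumes the embeddings are distributed in GloVe's standard plain-text output (one token per line, followed by its space-separated vector components). That format, the helper names `load_glove_text` and `cosine`, and the toy three-dimensional vectors are assumptions for illustration only; the actual file names and dimensionality are described on the resource page.

```python
import io
import math


def load_glove_text(handle):
    """Parse GloVe-style text vectors: a token, then its floats, per line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a vectors file (hypothetical values, not taken from the data set).
sample = io.StringIO("theorem 0.1 0.2 0.3\nlemma 0.1 0.2 0.25\nthe 0.9 -0.4 0.0\n")
vectors = load_glove_text(sample)
print(cosine(vectors["theorem"], vectors["lemma"]))
```

With a real vectors file, the `io.StringIO` stand-in would be replaced by an `open(...)` call on the downloaded file.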