Commit 2646770a authored by Michael Kohlhase

two data sets actually

parent 5a3f2844
Pipeline #550 passed with stage
in 24 seconds
---
layout: post
-title: First Data Set (1.1 Million scientific HTML5 documents from arXiv)
+title: First Data Sets (1.1 Million scientific HTML5 documents from arXiv and token models)
---
-SIGMathLing has published a first data set, which also acts as a template for future data
-sets. The content of this data set is licensed to [SIGMathLing members](/member/) for research
+SIGMathLing has published the first data sets. They also act as templates for future data
+sets. The content of these data sets are licensed to [SIGMathLing members](/member/) for research
and tool development purposes subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).
This collection of 1.1 Million HTML5 documents
@@ -13,6 +13,11 @@ the [KWARC](https://kwarc.info/) research group. It was created by converting t
[LaTeXML](https://github.com/brucemiller/LaTeXML) using the
[CorTeX corpus management system](https://github.com/dginev/CorTeX).
Details can be found on the [SIGMathLing Resource page](/resources/arxmliv/).
+The token models are generated from this document collection via the
+[LLaMaPuN](https://github.com/KWARC/llamapun/releases/tag/0.1) and
+[GloVe](https://github.com/stanfordnlp/GloVe/tree/765074642a6544e47849bb85d8dc2e11e44c2922)
+libraries.
+Details can be found on the [SIGMathLing Resource page](/resources/).