Commit 2646770a authored by Michael Kohlhase

two data sets actually

parent 5a3f2844
---
layout: post
-title: First Data Set (1.1 Million scientific HTML5 documents from arXiv)
+title: First Data Sets (1.1 Million scientific HTML5 documents from arXiv and token models)
---
-SIGMathLing has published a first data set, which also acts as a template for future data
-sets. The content of this data set is licensed to [SIGMathLing members](/member/) for research
+SIGMathLing has published the first data sets. They also act as templates for future data
+sets. The content of these data sets is licensed to [SIGMathLing members](/member/) for research
and tool development purposes subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).
This collection of 1.1 Million HTML5 documents
@@ -13,6 +13,11 @@ the [KWARC](https://kwarc.info/) research group. It was created by converting t
[LaTeXML](https://github.com/brucemiller/LaTeXML) using the
[CorTeX corpus management system](https://github.com/dginev/CorTeX).
Details can be found on the [SIGMathLing Resource page](/resources/arxmliv/).
+The token models are generated from this document collection via the
+[LLaMaPuN](https://github.com/KWARC/llamapun/releases/tag/0.1) and
+[GloVe](https://github.com/stanfordnlp/GloVe/tree/765074642a6544e47849bb85d8dc2e11e44c2922)
+libraries.
+Details can be found on the [SIGMathLing Resource page](/resources/).