From 2646770a7a3c5671612b7d37a5a1d0cbb08e5c80 Mon Sep 17 00:00:00 2001
From: Michael Kohlhase <michael.kohlhase@fau.de>
Date: Wed, 24 Jan 2018 20:06:17 +0100
Subject: [PATCH] two data sets actually

---
 _posts/2018-01-24-dataset.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/_posts/2018-01-24-dataset.md b/_posts/2018-01-24-dataset.md
index 43ebcab..f23ef47 100644
--- a/_posts/2018-01-24-dataset.md
+++ b/_posts/2018-01-24-dataset.md
@@ -1,9 +1,9 @@
 ---
 layout: post
-title: First Data Set (1.1 Million scientific HTML5 documents from arXiv)
+title: First Data Sets (1.1 Million scientific HTML5 documents from arXiv and token models)
 ---
-SIGMathLing has published a first data set, which also acts as a template for future data
-sets. The content of this data set is licensed to [SIGMathLing members](/member/) for research
+SIGMathLing has published the first data sets. They also act as templates for future data
+sets. The content of these data sets are licensed to [SIGMathLing members](/member/) for research
 and tool development purposes subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).
 
 This collection of 1.1 Million HTML5 documents
@@ -13,6 +13,11 @@ the [KWARC](https://kwarc.info/) research group. It was created by converting t
 [LaTeXML](https://github.com/brucemiller/LaTeXML) using
 the [CorTeX corpus management system](https://github.com/dginev/CorTeX).
 
-Details can be found on the [SIGMathLing Resource page](/resources/arxmliv/).
+The token models are generated from this document collection via the
+[LLaMaPuN](https://github.com/KWARC/llamapun/releases/tag/0.1) and
+[GloVe](https://github.com/stanfordnlp/GloVe/tree/765074642a6544e47849bb85d8dc2e11e44c2922)
+libraries.
+
+Details can be found on the [SIGMathLing Resource page](/resources/).
 
 
-- 
GitLab