two data sets actually

Commit 2646770a authored 7 years ago by Michael Kohlhase
--- a/_posts/2018-01-24-dataset.md
+++ b/_posts/2018-01-24-dataset.md
 ---
 layout: post
-title: First Data Set (1.1 Million scientific HTML5 documents from arXiv)
+title: First Data Sets (1.1 Million scientific HTML5 documents from arXiv and token models)
 ---
-SIGMathLing has published a first data set, which also acts as a template for future data
-sets. The content of this data set is licensed to [SIGMathLing members](/member/) for research
+SIGMathLing has published the first data sets. They also act as templates for future data
+sets. The content of these data sets is licensed to [SIGMathLing members](/member/) for research
 and tool development purposes subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).

 This collection of 1.1 Million HTML5 documents
@@ -13,6 +13,11 @@ the [KWARC](https://kwarc.info/) research group.  It was created by converting t
 [LaTeXML](https://github.com/brucemiller/LaTeXML) using the
 [CorTeX corpus management system](https://github.com/dginev/CorTeX).

-Details can be found on the [SIGMathLing Resource page](/resources/arxmliv/).
+The token models are generated from this document collection via the
+[LLaMaPuN](https://github.com/KWARC/llamapun/releases/tag/0.1) and
+[GloVe](https://github.com/stanfordnlp/GloVe/tree/765074642a6544e47849bb85d8dc2e11e44c2922)
+libraries. 
+
+Details can be found on the [SIGMathLing Resource page](/resources/).