diff --git a/_posts/2018-01-24-dataset.md b/_posts/2018-01-24-dataset.md
index 43ebcab5d98eb61918ec160d119f1425e730196a..f23ef4764bb8fb0bbd952d069d4b22a7aed0a5e3 100644
--- a/_posts/2018-01-24-dataset.md
+++ b/_posts/2018-01-24-dataset.md
@@ -1,9 +1,9 @@
 ---
 layout: post
-title: First Data Set (1.1 Million scientific HTML5 documents from arXiv)
+title: First Data Sets (1.1 Million scientific HTML5 documents from arXiv and token models)
 ---
-SIGMathLing has published a first data set, which also acts as a template for future data
-sets. The content of this data set is licensed to [SIGMathLing members](/member/) for research
+SIGMathLing has published the first data sets. They also act as templates for future data
+sets. The content of these data sets are licensed to [SIGMathLing members](/member/) for research
 and tool development purposes subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).
 
 This collection of 1.1 Million HTML5 documents
@@ -13,6 +13,11 @@ the [KWARC](https://kwarc.info/) research group. It was created by converting t
 [LaTeXML](https://github.com/brucemiller/LaTeXML) using the
 [CorTeX corpus management system](https://github.com/dginev/CorTeX).
 
-Details can be found on the [SIGMathLing Resource page](/resources/arxmliv/).
+The token models are generated from this document collection via the
+[LLaMaPuN](https://github.com/KWARC/llamapun/releases/tag/0.1) and
+[GloVe](https://github.com/stanfordnlp/GloVe/tree/765074642a6544e47849bb85d8dc2e11e44c2922)
+libraries.
+
+Details can be found on the [SIGMathLing Resource page](/resources/).