From b2df67ad533823ad1e2f3278d7c2e9dc44616cbe Mon Sep 17 00:00:00 2001
From: Deyan Ginev <deyan.ginev@gmail.com>
Date: Thu, 19 Sep 2019 09:25:12 -0400
Subject: [PATCH] some omissions in updating from 2018 embedding page

---
 resources/arxmliv-embeddings-082019.md | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/resources/arxmliv-embeddings-082019.md b/resources/arxmliv-embeddings-082019.md
index f068cca..d959fe2 100644
--- a/resources/arxmliv-embeddings-082019.md
+++ b/resources/arxmliv-embeddings-082019.md
@@ -20,16 +20,10 @@ Access is restricted to [SIGMathLing members](/member/) under the
 articles, the right of distribution was only given (or assumed) to arXiv itself.
 
 ### Contents
- - An 11.5 billion token model for the arXMLiv 08.2019 dataset, including subformula lexemes
+ - An 15.2 billion token model for the arXMLiv 08.2019 dataset, including subformula lexemes
    - `token_model.zip`
  - 300 dimensional GloVe word embeddings for the arXMLiv 08.2019 dataset
-   - `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip`
- - 300d GloVe word embeddings for individual subsets
-   - `glove.subsets.zip`
- - Embeddings and vocabulary with math lexemes omitted
-   - `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
-     - added on July 20, 2019
-     - used as a control when evaluating the contribution of formula lexemes
+   - `glove.arxmliv.15B.300d.zip` and `vocab.arxmliv.zip`
  - the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082019/)
 
 #### Token Model Statistics
@@ -65,7 +59,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
 ([SIGMathLing members](/member/) only)
 
 ### Generated via
- - [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
+ - [llamapun 0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3),
  - [GloVe 1.2, 2019](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
 
 ### Generation Parameters
-- 
GitLab