diff --git a/resources/arxmliv-embeddings-082019.md b/resources/arxmliv-embeddings-082019.md
index f068cca0a48fb6c3bd990a409dd8df1c0ac1ce78..d959fe2942e92c0ef24a6a3fa35d9ccc70db5245 100644
--- a/resources/arxmliv-embeddings-082019.md
+++ b/resources/arxmliv-embeddings-082019.md
@@ -20,16 +20,10 @@ Access is restricted to [SIGMathLing members](/member/) under
 articles, the right of distribution was only given (or assumed) to arXiv itself.
 
 ### Contents
-  - An 11.5 billion token model for the arXMLiv 08.2019 dataset, including subformula lexemes
+  - An 15.2 billion token model for the arXMLiv 08.2019 dataset, including subformula lexemes
     - `token_model.zip`
   - 300 dimensional GloVe word embeddings for the arXMLiv 08.2019 dataset
-    - `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip`
-  - 300d GloVe word embeddings for individual subsets
-    - `glove.subsets.zip`
-  - Embeddings and vocabulary with math lexemes omitted
-    - `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
-    - added on July 20, 2019
-    - used as a control when evaluating the contribution of formula lexemes
+    - `glove.arxmliv.15B.300d.zip` and `vocab.arxmliv.zip`
   - the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082019/)
 
 #### Token Model Statistics
@@ -65,7 +59,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
 ([SIGMathLing members](/member/) only)
 
 ### Generated via
-  - [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
+  - [llamapun 0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3),
   - [GloVe 1.2, 2019](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
 
 ### Generation Parameters