Commit b2df67ad authored by Deyan Ginev

some omissions in updating from 2018 embedding page

parent 84be8ca3
Pipeline #1732 passed with stage in 2 minutes and 19 seconds
@@ -20,16 +20,10 @@ Access is restricted to [SIGMathLing members](/member/) under the
 articles, the right of distribution was only given (or assumed) to arXiv itself.
 ### Contents
-- An 11.5 billion token model for the arXMLiv 08.2019 dataset, including subformula lexemes
+- An 15.2 billion token model for the arXMLiv 08.2019 dataset, including subformula lexemes
 - `token_model.zip`
 - 300 dimensional GloVe word embeddings for the arXMLiv 08.2019 dataset
-- `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip`
-- 300d GloVe word embeddings for individual subsets
-- `glove.subsets.zip`
-- Embeddings and vocabulary with math lexemes omitted
-- `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
-- added on July 20, 2019
-- used as a control when evaluating the contribution of formula lexemes
+- `glove.arxmliv.15B.300d.zip` and `vocab.arxmliv.zip`
 - the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082019/)
 #### Token Model Statistics
@@ -65,7 +59,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
 ([SIGMathLing members](/member/) only)
 ### Generated via
-- [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
+- [llamapun 0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3),
 - [GloVe 1.2, 2019](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
 ### Generation Parameters