Commit c09ed9ec authored by Deyan Ginev

Update embeddings description with "nomath" controls

parent fe466e65
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
articles, the right of distribution was only given (or assumed) to arXiv itself.
- An 11.5 billion token model for the arXMLiv 08.2018 dataset, including subformula lexemes
- `token_model.zip`
- 300 dimensional GloVe word embeddings for the arXMLiv 08.2018 dataset
- `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip`
- 300d GloVe word embeddings for individual subsets
- `glove.subsets.zip`
- Embeddings and vocabulary with math lexemes omitted
- `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
- added on July 20, 2019
- used as a control when evaluating the contribution of formula lexemes
- the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082018/)
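The embeddings and vocabulary files use GloVe's plain-text vector format: one token per line, followed by its space-separated vector components. A minimal loading sketch (the helper name `load_glove` is ours, not part of the distribution):

```python
import numpy as np

def load_glove(lines):
    """Parse GloVe's plain-text vector format: one token per line,
    followed by its space-separated vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Typical use with the files from this resource:
# with open("glove.arxmliv.11B.300d.txt", encoding="utf-8") as fh:
#     embeddings = load_glove(fh)  # each value is a 300d vector
```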
#### Token Model Statistics
| subset | documents | paragraphs | sentences |
| ---------- | --------: | ---------: | ----------: |
| no_problem | 137,864 | 4,646,203 | 21,533,963 |
| warning | 705,095 | 45,797,794 | 183,246,777 |
| error | 389,225 | 26,759,524 | 99,641,978 |
| complete | 1,232,184 | 77,203,521 | 304,422,718 |
| subset | words | formulas | inline cite | numeric literals |
| ---------- | ------------: | ----------: | ----------: | ---------------: |
| no_problem | 430,217,995 | 20,910,732 | 3,709,520 | 11,177,753 |
| warning | 3,175,663,430 | 281,832,412 | 25,337,574 | 83,606,897 |
| error | 1,731,971,035 | 153,186,264 | 13,145,561 | 43,399,720 |
| complete | 5,337,852,460 | 455,929,408 | 42,192,655 | 138,184,370 |
#### GloVe Model Statistics
| subset     | tokens         | unique words | unique words (freq 5+) |
| ---------- | -------------: | -----------: | ----------------------: |
| no_problem | 622,968,267 | 715,433 | 219,304 |
| warning | 7,203,536,205 | 3,478,235 | 666,317 |
| error | 3,691,805,321 | 2,444,532 | 574,467 |
| complete | 11,518,309,793 | 5,285,379 | 1,000,295 |
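The "freq 5+" column reflects GloVe's minimum-count cutoff when building the vocabulary: tokens seen fewer than five times are dropped before training. An illustrative sketch of that thresholding (not the actual build tooling, which is GloVe's `vocab_count` utility):

```python
from collections import Counter

def thresholded_vocab(tokens, min_count=5):
    """Keep only tokens occurring at least min_count times,
    mirroring the "freq 5+" cutoff in the table above."""
    counts = Counter(tokens)
    return {tok: n for tok, n in counts.items() if n >= min_count}
```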
### Citing this Resource
Please cite the main dataset when using the word embeddings, as they are generated from it.
### Download
[Download link](https://gl.kwarc.info/SIGMathLing/embeddings-arXMLiv-08-2018)
([SIGMathLing members](/member/) only)
### Generated via
- [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
- [GloVe 1.2, 2018](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
### Generation Parameters
* the token model is distributed as 3 subsets (no_problem, warning, error); the complete model is derived via:
```
cat token_model_no_problem.txt \
token_model_warning.txt \
    token_model_error.txt > token_model_complete.txt
```
* Highest score: "gram3-comparative.txt", 72.75% (969/1332)
4. complete
* Total accuracy: 35.48% (6298/17750)
* Highest score: "gram3-comparative.txt", 76.65% (1021/1332)
5. demo baseline: text8 demo (first 100M characters of Wikipedia)
* Total accuracy: 23.62% (4211/17827)
* Highest score: "gram6-nationality-adjective.txt", 58.65% (892/1521)
**Evaluation note:** These built-in evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline on the non-expert tasks in the default GloVe suite.
A set of test cases tailored to scientific discourse would be needed to evaluate the arXiv-based models competitively.
#### Measuring word analogy
In a cloned GloVe repository, start via:
```
python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
1. `abelian` is to `group` as `disjoint` is to `?`
* Top hit: `italic_b`, cosine distance `0.902608`
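The analogy queries above rely on vector arithmetic: `word_analogy.py` searches for the vocabulary word whose vector is most similar to `vec(b) - vec(a) + vec(c)`, excluding the query words themselves. A minimal re-implementation sketch over a dict of vectors (function name is ours):

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Return the word nearest (by cosine similarity) to
    vec(b) - vec(a) + vec(c), excluding the query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, float("-inf")
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the query words, as the GloVe script does
        sim = float(np.dot(vec, target) / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word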
#### Nearest word vectors
In a cloned GloVe repository, start via:
```
python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
1. **lattice**
```
Word: lattice  Position in vocabulary: 488
...
```
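`distance.py` ranks the whole vocabulary by cosine similarity to the query word, as in the truncated `lattice` listing above. A minimal sketch of that ranking (helper name is ours):

```python
import numpy as np

def nearest_words(vectors, query, k=5):
    """Rank vocabulary words by cosine similarity to the query
    word's vector and return the k closest, excluding the query."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scored = [(float(np.dot(vec, q) / np.linalg.norm(vec)), word)
              for word, vec in vectors.items() if word != query]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]
```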