Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group.
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
articles, the right of distribution was only given (or assumed) to arXiv itself.
- An 11.5 billion token model for the arXMLiv 08.2018 dataset, including subformula lexemes
- `token_model.zip`
- 300 dimensional GloVe word embeddings for the arXMLiv 08.2018 dataset
- `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip`
- 300d GloVe word embeddings for individual subsets
- `glove.subsets.zip`
- Embeddings and vocabulary with math lexemes omitted
- `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
- added on July 20, 2019
- used as a control when evaluating the contribution of formula lexemes
- the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082018/)
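Once unzipped, the embeddings follow GloVe's plain-text layout: one word per line, followed by its 300 space-separated coordinates. A minimal loader sketch (the file name in the usage comment is assumed from the archive listing above):

```python
def load_glove(path, limit=None):
    """Parse GloVe's plain-text format: one line per word, 'word v1 v2 ... v300'."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break  # stop early for quick experiments on a huge file
            word, *coords = line.rstrip("\n").split(" ")
            vectors[word] = [float(x) for x in coords]
    return vectors

# e.g. vectors = load_glove("glove.arxmliv.11B.300d.txt", limit=100_000)
```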
#### Token Model Statistics
| subset | documents | paragraphs | sentences |
| ---------- | --------: | ---------: | ----------: |
| no_problem | 137,864 | 4,646,203 | 21,533,963 |
| warning | 705,095 | 45,797,794 | 183,246,777 |
| error | 389,225 | 26,759,524 | 99,641,978 |
| complete | 1,232,184 | 77,203,521 | 304,422,718 |
| subset | words | formulas | inline cite | numeric literals |
| ---------- | ------------: | ----------: | ----------: | ---------------: |
| no_problem | 430,217,995 | 20,910,732 | 3,709,520 | 11,177,753 |
| warning | 3,175,663,430 | 281,832,412 | 25,337,574 | 83,606,897 |
| error | 1,731,971,035 | 153,186,264 | 13,145,561 | 43,399,720 |
| complete | 5,337,852,460 | 455,929,408 | 42,192,655 | 138,184,370 |
#### GloVe Model Statistics
| subset     | tokens         | unique words | unique words (freq 5+) |
| ---------- | -------------: | -----------: | ----------------------: |
| no_problem | 622,968,267 | 715,433 | 219,304 |
| warning | 7,203,536,205 | 3,478,235 | 666,317 |
| error | 3,691,805,321 | 2,444,532 | 574,467 |
| complete | 11,518,309,793 | 5,285,379 | 1,000,295 |
### Citing this Resource
Please cite the main dataset when using the word embeddings, as they are generated from it.
### Download
[Download link](https://gl.kwarc.info/SIGMathLing/embeddings-arXMLiv-08-2018)
([SIGMathLing members](/member/) only)
### Generated via
- [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
- [GloVe 1.2, 2018](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
### Generation Parameters
* the token model is distributed as 3 subsets (no_problem, warning, and error); the complete model is derived via:
```
cat token_model_no_problem.txt \
token_model_warning.txt \
    token_model_error.txt > token_model_complete.txt
```
* Highest score: "gram3-comparative.txt", 72.75% (969/1332)
4. complete
* Total accuracy: 35.48% (6298/17750)
* Highest score: "gram3-comparative.txt", 76.65% (1021/1332)
5. demo baseline: text8 demo (first 100M characters of Wikipedia)
* Total accuracy: 23.62% (4211/17827)
* Highest score: "gram6-nationality-adjective.txt", 58.65% (892/1521)
**Evaluation note:** These in-built evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline on the non-expert tasks in the default GloVe suite.
One would need a set of test cases tailored to scientific discourse to evaluate the arXiv-based models competitively.
#### Measuring word analogy
In a cloned GloVe repository, start via:
```
python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
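The analogy query this script answers can be sketched in plain Python; the sketch below assumes the vectors are already loaded into a dict (e.g. via a loader for the plain-text GloVe format), and the toy vocabulary in the usage is purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c, topn=3):
    """Answer 'a is to b as c is to ?' by ranking cosine similarity to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    ranked = sorted(
        ((cosine(vec, target), word)
         for word, vec in vectors.items()
         if word not in (a, b, c)),  # exclude the query words themselves
        reverse=True,
    )
    return ranked[:topn]
```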
1. `abelian` is to `group` as `disjoint` is to `?`
* Top hit: `italic_b`, cosine distance `0.902608`
#### Nearest word vectors
In a cloned GloVe repository, start via:
```
python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
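The cosine ranking that `distance.py` performs can be sketched as follows, assuming the vectors are already loaded into a dict; the toy vocabulary in the test is illustrative only:

```python
import math

def nearest(vectors, query, topn=5):
    """Rank the vocabulary by cosine similarity to the query word's vector."""
    q = vectors[query]
    qn = math.sqrt(sum(x * x for x in q))
    scored = []
    for word, vec in vectors.items():
        if word == query:
            continue  # skip the query word itself
        vn = math.sqrt(sum(x * x for x in vec))
        sim = sum(a * b for a, b in zip(q, vec)) / (qn * vn)
        scored.append((sim, word))
    scored.sort(reverse=True)
    return scored[:topn]
```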
1. **lattice**
```
Word: lattice Position in vocabulary: 488
```