Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group.
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/), since for most [arXiv](http://arxiv.org)
articles the right of distribution was only given (or assumed) to arXiv itself.
- 300d GloVe word embeddings for the complete dataset: `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip` (see the loading sketch below)
- 300d GloVe word embeddings for the individual subsets: `glove.subsets.zip`
- Embeddings and vocabulary with math lexemes omitted: `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
  - added on July 20, 2019
  - used as a control when evaluating the contribution of formula lexemes
- the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082018/)
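For orientation, here is a minimal Python loading sketch, assuming the unpacked archives follow the standard GloVe text formats: vectors as `word v1 ... v300` per line, vocabulary as `word frequency` per line. The file names are taken from the archives above; treat this as a sketch, not part of the distribution.

```
import numpy as np

def load_glove(vectors_path="glove.arxmliv.11B.300d.txt"):
    """Load GloVe vectors from the standard text format: word v1 ... v300."""
    vectors = {}
    with open(vectors_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def load_vocab(vocab_path="vocab.arxmliv.txt"):
    """Load the vocabulary: one `word frequency` pair per line."""
    vocab = {}
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip().rsplit(" ", 1)
            vocab[word] = int(count)
    return vocab

vectors = load_glove()
print(vectors["lattice"].shape)  # (300,); `lattice` appears in the examples below
```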
#### Token Model Statistics
| subset     | documents | paragraphs | sentences   |
| ---------- | --------: | ---------: | ----------: |
| no_problem | 137,864   | 4,646,203  | 21,533,963  |
| warning    | 705,095   | 45,797,794 | 183,246,777 |
| error      | 389,225   | 26,759,524 | 99,641,978  |
| complete   | 1,232,184 | 77,203,521 | 304,422,718 |
| subset     | words         | formulas    | inline cite | numeric literals |
| ---------- | ------------: | ----------: | ----------: | ---------------: |
| no_problem | 430,217,995   | 20,910,732  | 3,709,520   | 11,177,753       |
| warning    | 3,175,663,430 | 281,832,412 | 25,337,574  | 83,606,897       |
| error      | 1,731,971,035 | 153,186,264 | 13,145,561  | 43,399,720       |
| complete   | 5,337,852,460 | 455,929,408 | 42,192,655  | 138,184,370      |
#### GloVe Model Statistics
| subset     | tokens         | unique words | unique words (freq 5+) |
| ---------- | -------------: | -----------: | ----------------------: |
| no_problem | 622,968,267    | 715,433      | 219,304                 |
| warning    | 7,203,536,205  | 3,478,235    | 666,317                 |
| error      | 3,691,805,321  | 2,444,532    | 574,467                 |
| complete   | 11,518,309,793 | 5,285,379    | 1,000,295               |
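The "freq 5+" column counts vocabulary entries seen at least five times, matching GloVe's `vocab_count -min-count 5` cutoff (presumably the threshold used in training here). A small sketch for reproducing those two counts from a vocabulary file, assuming the `word frequency` per-line format:

```
def count_frequent(vocab_path="vocab.arxmliv.txt", min_count=5):
    """Tally all vocabulary entries and those at or above a frequency cutoff."""
    total = frequent = 0
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            _word, count = line.rstrip().rsplit(" ", 1)
            total += 1
            if int(count) >= min_count:
                frequent += 1
    return total, frequent

# For the complete subset this should report 5,285,379 and 1,000,295 (table above).
print(count_frequent())
```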
### Citing this Resource
Please cite the main dataset when using the word embeddings, as they are generated from it.
### Download
[Download link](https://gl.kwarc.info/SIGMathLing/embeddings-arXMLiv-08-2018)
([SIGMathLing members](/member/) only)
### Generated via
- [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0)
- [GloVe 1.2, 2018](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
### Generation Parameters
* token model distributed as 3 subsets (no_problem, warning, and error); the complete model is derived via:
```
cat token_model_no_problem.txt \
    token_model_warning.txt \
    token_model_error.txt > token_model_complete.txt
```
3. error
   * Highest score: "gram3-comparative.txt", 72.75% (969/1332)
4. complete
   * Total accuracy: 35.48% (6298/17750)
   * Highest score: "gram3-comparative.txt", 76.65% (1021/1332)
5. demo baseline: text8 demo (first 100M characters of Wikipedia)
   * Total accuracy: 23.62% (4211/17827)
   * Highest score: "gram6-nationality-adjective.txt", 58.65% (892/1521)
**Evaluation note:** These in-built evaluation runs are provided as a sanity check that the generated GloVe models clear a basic baseline on the non-expert tasks in the default GloVe suite.
A set of test cases tailored to scientific discourse would be needed to evaluate the arXiv-based models competitively.
#### Measuring word analogy
In a cloned GloVe repository, start via:
```
python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
1. `abelian` is to `group` as `disjoint` is to `?`
   * Top hit: `italic_b`, cosine distance `0.902608`
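The analogy task answers "`a` is to `b` as `c` is to `?`" by ranking words against the offset vector `b - a + c` under cosine similarity. A minimal re-implementation sketch of that arithmetic, reusing the hypothetical `load_glove` helper from the loading sketch above:

```
import numpy as np

def analogy(vectors, a, b, c, topn=3):
    """Rank words by cosine similarity to b - a + c, excluding the query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    scored = [
        (float(np.dot(vec, target) / np.linalg.norm(vec)), word)
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(scored, reverse=True)[:topn]

# vectors = load_glove("glove.arxmliv.11B.300d.txt")  # see the loading sketch above
# print(analogy(vectors, "abelian", "group", "disjoint"))
```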
#### Nearest word vectors
In a cloned GloVe repository, start via:
```
python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
1. **lattice**
```
Word: lattice  Position in vocabulary: 488
```
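`distance.py` effectively performs a cosine ranking of the query word against the whole vocabulary; a minimal sketch of the same lookup, again reusing the hypothetical `load_glove` helper from the loading sketch above:

```
import numpy as np

def nearest(vectors, query, topn=10):
    """Rank all other words by cosine similarity to the query word's vector."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scored = [
        (float(np.dot(vec, q) / np.linalg.norm(vec)), word)
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, reverse=True)[:topn]

# vectors = load_glove("glove.arxmliv.11B.300d.txt")  # see the loading sketch above
# for score, word in nearest(vectors, "lattice"):
#     print("%s\t%.6f" % (word, score))
```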