diff --git a/resources/arxmliv-embeddings-082018.md b/resources/arxmliv-embeddings-082018.md
index e3637bcedf735c2a5b5734bcfb2e28fbe042623a..360812ec8065b169d085f02f343d026ca3fd0232 100644
--- a/resources/arxmliv-embeddings-082018.md
+++ b/resources/arxmliv-embeddings-082018.md
@@ -12,7 +12,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR

 ### Accessibility and License
 The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
-and tool development purposes.
+and tool development purposes.

 Access is restricted to [SIGMathLing members](/member/) under the
 [SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
@@ -22,35 +22,39 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.
  - An 11.5 billion token model for the arXMLiv 08.2018 dataset, including subformula lexemes
    - `token_model.zip`
  - 300 dimensional GloVe word embeddings for the arXMLiv 08.2018 dataset
-   - `glove.arxmliv.5B.300d.zip` and `vocab.arxmliv.zip`
+   - `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip`
  - 300d GloVe word embeddings for individual subsets
    - `glove.subsets.zip`
+ - Embeddings and vocabulary with math lexemes omitted
+   - `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
+   - added on July 20, 2019
+   - used as a control when evaluating the contribution of formula lexemes
  - the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082018/)

 #### Token Model Statistics

-subset | documents | paragraphs | sentences |
------------|----------:|-----------:|------------:|
-no_problem | 137,864 | 4,646,203 | 21,533,963 |
-warning | 705,095 | 45,797,794 | 183,246,777 |
-error | 389,225 | 26,759,524 | 99,641,978 |
-complete | 1,232,184 | 77,203,521 | 304,422,718 |
+| subset     | documents | paragraphs |   sentences |
+| ---------- | --------: | ---------: | ----------: |
+| no_problem |   137,864 |  4,646,203 |  21,533,963 |
+| warning    |   705,095 | 45,797,794 | 183,246,777 |
+| error      |   389,225 | 26,759,524 |  99,641,978 |
+| complete   | 1,232,184 | 77,203,521 | 304,422,718 |

-subset | words | formulas | inline cite | numeric literals |
------------|--------------:|-----------: |------------:|-----------------:|
-no_problem | 430,217,995 | 20,910,732 | 3,709,520 | 11,177,753 |
-warning | 3,175,663,430 | 281,832,412 | 25,337,574 | 83,606,897 |
-error | 1,731,971,035 | 153,186,264 | 13,145,561 | 43,399,720 |
-complete | 5,337,852,460 | 455,929,408 | 42,192,655 | 138,184,370 |
+| subset     |         words |    formulas | inline cite | numeric literals |
+| ---------- | ------------: | ----------: | ----------: | ---------------: |
+| no_problem |   430,217,995 |  20,910,732 |   3,709,520 |       11,177,753 |
+| warning    | 3,175,663,430 | 281,832,412 |  25,337,574 |       83,606,897 |
+| error      | 1,731,971,035 | 153,186,264 |  13,145,561 |       43,399,720 |
+| complete   | 5,337,852,460 | 455,929,408 |  42,192,655 |      138,184,370 |

 #### GloVe Model Statistics

-subset | tokens | unique words | unique words (freq 5+ )
------------|--------------: |-------------:|-----------------------:
-no_problem | 622,968,267 | 715,433 | 219,304
-warning | 7,203,536,205 | 3,478,235 | 666,317
-error | 3,691,805,321 | 2,444,532 | 574,467
-complete | 11,518,309,793 | 5,285,379 | 1,000,295
+| subset     |         tokens | unique words | unique words (freq 5+ ) |
+| ---------- | -------------: | -----------: | ----------------------: |
+| no_problem |    622,968,267 |      715,433 |                 219,304 |
+| warning    |  7,203,536,205 |    3,478,235 |                 666,317 |
+| error      |  3,691,805,321 |    2,444,532 |                 574,467 |
+| complete   | 11,518,309,793 |    5,285,379 |               1,000,295 |

 ### Citing this Resource

@@ -59,14 +63,14 @@ Please cite the main dataset when using the word embeddings, as they are generat

 ### Download
 [Download link](https://gl.kwarc.info/SIGMathLing/embeddings-arXMLiv-08-2018) ([SIGMathLing members](/member/) only)
-
+
 ### Generated via
- - [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
+ - [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
  - [GloVe 1.2, 2018](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)

 ### Generation Parameters
 * token model distributed as 3 subsets - no_problem, warning and error. complete model is derived via:
-
+
 ```
 cat token_model_no_problem.txt \
     token_model_warning.txt \
@@ -102,20 +106,20 @@ Please cite the main dataset when using the word embeddings, as they are generat
    * Highest score: "gram3-comparative.txt", 72.75% (969/1332)

 4. complete
-   * Total accuracy: 35.48% (6298/17750)
+   * Total accuracy: 35.48% (6298/17750)
    * Highest score: "gram3-comparative.txt", 76.65% (1021/1332)

 5. demo baseline: text8 demo (first 100M characters of Wikipedia)
    * Total accuracy: 23.62% (4211/17827)
    * Highest score: "gram6-nationality-adjective.txt", 58.65% (892/1521)

-**Evaluation note:** These in-built evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline against the non-expert tasks in the default GloVe suite.
+**Evaluation note:** These in-built evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline against the non-expert tasks in the default GloVe suite.
 One would need a scientific discourse tailored set of test cases to evaluate the arXiv-based models competitively.

 #### Measuring word analogy
 In a cloned GloVe repository, start via:
 ```
-python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
+python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
 ```

 1. `abelian` is to `group` as `disjoint` is to `?`
@@ -137,13 +141,13 @@ python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_fil
    * Top hit: `italic_b`, cosine distance `0.902608`


-#### Nearest word vectors
+#### Nearest word vectors
 In a cloned GloVe repository, start via:
 ```
-python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
+python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
 ```

-1. **lattice**
+1. **lattice**
 ```
 Word: lattice  Position in vocabulary: 488