diff --git a/resources/arxmliv-embeddings-082018.md b/resources/arxmliv-embeddings-082018.md
index c61d164f1d2ac1d97f9f35be3d52f31efbad7445..9dffa2ad077c6c716215ef5656f08efbfe571bbf 100644
--- a/resources/arxmliv-embeddings-082018.md
+++ b/resources/arxmliv-embeddings-082018.md
@@ -18,7 +18,7 @@ Access is restricted to [SIGMathLing members](/member/) under the
 articles, the right of distribution was only given (or assumed) to arXiv itself.
 
 ### Contents
- - A 5 billion token model for the arXMLiv 08.2018 dataset
+ - An 11.5 billion token model for the arXMLiv 08.2018 dataset, including subformula lexemes
   - `token_model.zip`
 - 300 dimensional GloVe word embeddings for the arXMLiv 08.2018 dataset
-  - `glove.arxmliv.5B.300d.zip` and `vocab.arxmliv.zip`
+  - `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip`
@@ -30,26 +30,26 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.
 subset     | documents | paragraphs | sentences   |
 -----------|----------:|-----------:|------------:|
-no_problem |           |            |             |
-warning    |           |            |             |
-error      |           |            |             |
-complete   |           |            |             |
-
-subset     | words         |formulas    | inline cite | numeric literals |
------------|--------------:|-----------:|------------:|-----------------:|
-no_problem |               |            |             |                  |
-warning    |               |            |             |                  |
-error      |               |            |             |                  |
-complete   |               |            |             |                  |
+no_problem | 137,864   | 4,646,203  | 21,533,963  |
+warning    | 705,095   | 45,797,794 | 183,246,777 |
+error      | 389,225   | 26,759,524 | 99,641,978  |
+complete   | 1,232,184 | 77,203,521 | 304,422,718 |
+
+subset     | words         | formulas    | inline cite | numeric literals |
+-----------|--------------:|------------:|------------:|-----------------:|
+no_problem | 430,217,995   | 20,910,732  | 3,709,520   | 11,177,753       |
+warning    | 3,175,663,430 | 281,832,412 | 25,337,574  | 83,606,897       |
+error      | 1,731,971,035 | 153,186,264 | 13,145,561  | 43,399,720       |
+complete   | 5,337,852,460 | 455,929,408 | 42,192,655  | 138,184,370      |
 
 #### GloVe Model Statistics
 
-subset     | tokens        | unique words | unique words (freq 5+ )
------------|--------------:|-------------:|-----------------------:
-no_problem |               |              |
-warning    |               |              |
-error      |               |              |
-complete   |               |              |
+subset     | tokens         | unique words | unique words (freq 5+)
+-----------|---------------:|-------------:|-----------------------:
+no_problem | 622,968,267    | 715,433      | 219,304
+warning    | 7,203,536,205  | 3,478,235    | 666,317
+error      | 3,691,805,321  | 2,444,532    | 574,467
+complete   | 11,518,309,793 | 5,285,379    | 1,000,295
 
 ### Citing this Resource
@@ -60,7 +60,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
 ([SIGMathLing members](/member/) only)
 
 ### Generated via
- - [llamapun 0.2](https://github.com/KWARC/llamapun/releases/tag/0.2),
+ - [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
 - [GloVe 1.2, 2018](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
 
 ### Generation Parameters
@@ -73,12 +73,29 @@ Please cite the main dataset when using the word embeddings, as they are generat
 ```
-* [llamapun v0.2](https://github.com/KWARC/llamapun/releases/tag/0.2), `corpus_token_model` example used for token model extraction
-  * used llamapun math-aware sentence and word tokenization
-  * processed logical paragraphs only (excluded non-textual modalities, e.g. tables, figures and their captions, bibliographies)
-  * marked up formulas replaced with `mathformula` token
+* [llamapun v0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0), `corpus_token_model` example used for token model extraction
+  * processed logical paragraphs only (excluded non-textual modalities, e.g. tables, figures and their captions, bibliographies, as well as abstracts)
+  * excluded paragraphs containing latexml errors (marked via an `ltx_ERROR` HTML class)
+  * used llamapun math-aware sentence and word tokenization, with subformula math lexemes
   * marked up inline citations replaced with `citationelement` token
-  * numeric literals replaced with `NUM` token
-  * ignored sentences with unnaturally long words (>30 characters) - almost always due to latexml conversion errors
+  * numeric literals replaced with `NUM` token (both in text and formulas)
+  * ignored sentences with unnaturally long words (>25 characters), almost always due to latexml conversion errors
 * [GloVe repository at sha 07d59d](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
   * build/vocab_count -min-count 5
   * build/cooccur -memory 32.0 -window-size 15
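+
+The `-min-count 5` threshold above also determines the released vocabulary:
+`vocab.arxmliv.txt` lists one `word count` pair per line (the standard GloVe
+`vocab_count` output), and only words of frequency 5+ receive vectors. A
+minimal Python 3 sanity check over that file (our own illustration, not part
+of the release):
+
+```python
+# Verify the vocabulary respects GloVe's -min-count 5 cutoff.
+counts = {}
+with open("vocab.arxmliv.txt", encoding="utf-8") as f:
+    for line in f:
+        word, count = line.rsplit(" ", 1)
+        counts[word] = int(count)
+
+print("vocabulary size:", len(counts))             # 1,000,295 for `complete`
+print("minimum frequency:", min(counts.values()))  # expected >= 5
+```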
@@ -89,24 +89,24 @@ Please cite the main dataset when using the word embeddings, as they are generat
 
 #### GloVe in-built evaluation (non-expert tasks e.g. language, relationships, geography)
 1. no_problem
-   * Total accuracy:
-   * Highest score:
+   * Total accuracy: 27.47% (4028/14663)
+   * Highest score: "gram3-comparative.txt", 75.30% (1003/1332)
 
 2. warning
-   * Total accuracy:
-   * Highest score:
+   * Total accuracy: 31.97% (5351/16736)
+   * Highest score: "gram3-comparative.txt", 77.40% (1031/1332)
 
 3. error
-   * Total accuracy:
-   * Highest score:
+   * Total accuracy: 29.67% (4910/16549)
+   * Highest score: "gram3-comparative.txt", 72.75% (969/1332)
 
 4. complete
-   * Total accuracy:
-   * Highest score:
+   * Total accuracy: 35.48% (6298/17750)
+   * Highest score: "gram3-comparative.txt", 76.65% (1021/1332)
 
 5. demo baseline: text8 demo (first 100M characters of Wikipedia)
-   * Total accuracy:
-   * Highest score:
+   * Total accuracy: 23.62% (4211/17827)
+   * Highest score: "gram6-nationality-adjective.txt", 58.65% (892/1521)
 
-**Evaluation note:** These in-built evlauation runs are provided as a sanity check that the generated GloVe models pass a basic baseline against the non-expert tasks in the default GloVe suite.
-One would need a scienctific discourse tailored set of test cases to evaluate the arXiv-based models competitively.
+**Evaluation note:** These in-built evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline on the non-expert tasks in the default GloVe suite.
+One would need a set of test cases tailored to scientific discourse to evaluate the arXiv-based models competitively.
@@ -114,37 +114,236 @@ One would need a scienctific discourse tailored set of test cases to evaluate th
 
 #### Measuring word analogy
 In a cloned GloVe repository, start via:
 ```
-python eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
+python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
 ```
 
 1. `abelian` is to `group` as `disjoint` is to `?`
-   * Top hit:
+   * Top hit: `union`, cosine distance `0.653377`
 
 2. `convex` is to `concave` as `positive` is to `?`
-   * Top hit:
+   * Top hit: `negative`, cosine distance `0.786877`
 
-3. `finite` is to `infinte` as `abelian` is to `?`
-   * Top hit:
+3. `finite` is to `infinite` as `abelian` is to `?`
+   * Top hit: `nonabelian`, cosine distance `0.676042`
 
 4. `quantum` is to `classical` as `bottom` is to `?`
-   * Top hit:
+   * Top hit: `top`, cosine distance `0.734896`
 
 5. `eq` is to `proves` as `figure` is to `?`
-   * Top hit:
+   * Top hit: `showing`, cosine distance `0.668502`
+
+6. `italic_x` is to `italic_y` as `italic_a` is to `?`
+   * Top hit: `italic_b`, cosine distance `0.902608`
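+
+The analogy queries above boil down to vector arithmetic over the embeddings:
+`x` is to `y` as `a` is to the vocabulary word closest to `y - x + a`. A rough
+Python 3 sketch of that computation (assuming `numpy`; the in-line loading and
+the `analogy` helper are our own illustration, not part of the release):
+
+```python
+import numpy as np
+
+# Load the released vectors into a {word: unit vector} dict;
+# each line of the .txt model is a word followed by 300 floats.
+vectors = {}
+with open("glove.arxmliv.11B.300d.txt", encoding="utf-8") as f:
+    for line in f:
+        parts = line.rstrip().split(" ")
+        v = np.asarray(parts[1:], dtype=np.float32)
+        vectors[parts[0]] = v / np.linalg.norm(v)
+
+def analogy(x, y, a):
+    """Answer "x is to y as a is to ?" via the nearest word to y - x + a."""
+    target = vectors[y] - vectors[x] + vectors[a]
+    target /= np.linalg.norm(target)
+    candidates = (w for w in vectors if w not in (x, y, a))
+    return max(candidates, key=lambda w: float(vectors[w] @ target))
+
+print(analogy("abelian", "group", "disjoint"))  # expected top hit: union
+```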
 
 #### Nearest word vectors
 In a cloned GloVe repository, start via:
 ```
-python eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
+python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
 ```
 
 1. **lattice**
+```
+Word: lattice  Position in vocabulary: 488
+
+                Word       Cosine distance
+
+---------------------------------------------------------
+
+            lattices              0.853103
+
+          triangular              0.637767
+
+           honeycomb              0.626426
+
+             crystal              0.624397
+
+              finite              0.614720
+
+             spacing              0.603067
+```
 
 2. **entanglement**
+```
+Word: entanglement  Position in vocabulary: 1568
+
+                Word       Cosine distance
+
+---------------------------------------------------------
+
+           entangled              0.780425
+
+        multipartite              0.730968
+
+         concurrence              0.691708
+
+          negativity              0.649595
+
+          tripartite              0.647623
+
+             quantum              0.640395
+
+            fidelity              0.640285
+
+       teleportation              0.616797
+
+             discord              0.613752
+
+             entropy              0.612341
+
+           bipartite              0.608034
+
+           coherence              0.606859
+
+         nonlocality              0.601337
+```
 
 3. **forgetful**
+```
+Word: forgetful  Position in vocabulary: 11740
+
+                Word       Cosine distance
+
+---------------------------------------------------------
+
+             functor              0.723472
+
+            functors              0.656184
+
+            morphism              0.598965
+```
 
 4. **eigenvalue**
+```
+Word: eigenvalue  Position in vocabulary: 1448
+
+                Word       Cosine distance
+
+---------------------------------------------------------
+
+         eigenvalues              0.893073
+
+         eigenvector              0.768380
+
+        eigenvectors              0.765241
+
+       eigenfunction              0.754222
+
+      eigenfunctions              0.686141
+
+          eigenspace              0.666098
+
+               eigen              0.641422
+
+              matrix              0.616723
+
+           eigenmode              0.613117
+
+          eigenstate              0.612188
+
+           laplacian              0.611396
+
+             largest              0.606122
+
+            smallest              0.605342
+
+          eigenmodes              0.604839
+```
 
 5. **riemannian**
+```
+Word: riemannian  Position in vocabulary: 2285
+
+                Word       Cosine distance
+
+---------------------------------------------------------
+
+           manifolds              0.765827
+
+            manifold              0.760806
+
+              metric              0.719817
+
+             finsler              0.687826
+
+           curvature              0.676100
+
+               ricci              0.664770
+
+             metrics              0.660804
+
+          riemmanian              0.651666
+
+           euclidean              0.644686
+
+          noncompact              0.643878
+
+         conformally              0.638984
+
+           riemanian              0.633814
+
+              kahler              0.632680
+
+             endowed              0.622035
+
+         submanifold              0.613868
+
+        submanifolds              0.612716
+
+            geodesic              0.604488
+```
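+
+The nearest-neighbor listings can likewise be reproduced outside the eval
+scripts. A short Python 3 sketch (again assuming `numpy`; the `nearest`
+helper is our own illustration, not part of the release):
+
+```python
+import numpy as np
+
+# Build a row-normalized embedding matrix, so one matrix-vector
+# product scores a query word against the entire vocabulary.
+words, rows = [], []
+with open("glove.arxmliv.11B.300d.txt", encoding="utf-8") as f:
+    for line in f:
+        parts = line.rstrip().split(" ")
+        words.append(parts[0])
+        rows.append(np.asarray(parts[1:], dtype=np.float32))
+matrix = np.vstack(rows)
+matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
+index = {w: i for i, w in enumerate(words)}
+
+def nearest(word, topn=10):
+    """Return the topn closest vocabulary words by cosine similarity."""
+    sims = matrix @ matrix[index[word]]
+    ranked = np.argsort(-sims)
+    return [(words[i], float(sims[i])) for i in ranked[1:topn + 1]]
+
+print(nearest("lattice", topn=6))  # expected top hit: lattices
+```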