diff --git a/resources/arxmliv-embeddings-082019.md b/resources/arxmliv-embeddings-082019.md
index b14e2d6af97d804e4c33691c538ee95a12175aae..11a21b554832f9ec399e14b615b54d7d27757cc9 100644
--- a/resources/arxmliv-embeddings-082019.md
+++ b/resources/arxmliv-embeddings-082019.md
@@ -48,7 +48,7 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.
 
 | subset     |         tokens | unique words | unique words (freq 5+ ) |
 | ---------- | -------------: | -----------: | ----------------------: |
-| complete   | 15,214,964,673 |    2,868,070 |               1,013,106 |
+| complete   | 15,192,564,807 |    2,782,667 |                 989,136 |
 
 ### Citing this Resource
 
@@ -59,7 +59,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
 ([SIGMathLing members](/member/) only)
 
 ### Generated via
- - [llamapun 0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3),
+ - [llamapun 0.3.4](https://github.com/KWARC/llamapun/releases/tag/0.3.4),
 - [GloVe 1.2, 2019](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
 
 ### Generation Parameters
@@ -71,7 +71,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
   token_model_error.txt > token_model_complete.txt
   ```
- * [llamapun v0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3), `corpus_token_model` example used for token model extraction
+ * [llamapun v0.3.4](https://github.com/KWARC/llamapun/releases/tag/0.3.4), `corpus_token_model` example used for token model extraction
   * processed logical paragraphs, abstracts, captions and keywords, ignore all other content (e.g. tables, bibliography, others)
   * excluded paragraphs containing latexml errors (marked via a `ltx_ERROR` HTML class); also excluded when words over 25 characters were encountered.
   * used llamapun math-aware word tokenization, with sub-formula math lexemes (improved robustness since 2018)
@@ -80,6 +80,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
   * internal references replaced with `ref` token (e.g. `Figure ref`)
   * textual punctuation is included as-is, while mathematical punctuation is annotated via the latexml-generated lexemes.
   * words are downcased, while math content is kept cased, to mitigate lexical ambiguity.
+  * an update was provided on Sep 29, 2019, improving tokenization of apostrophe-including constructs such as `'s`.
 * [GloVe repository at sha 07d59d](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
   * build/vocab_count -min-count 5
@@ -91,8 +92,8 @@ Please cite the main dataset when using the word embeddings, as they are generat
 #### GloVe in-built evaluation (non-expert tasks e.g. language, relationships, geography)
 
 1. NEW; 2019 model
-   * Total accuracy: 38.30% (7017/18322)
-   * Highest score: "gram3-comparative.txt", 78.60% (1047/1332)
+   * Total accuracy: 37.76% (7017/18322)
+   * Highest score: "gram3-comparative.txt", 77.33% (1047/1332)
 
 2. 2018 [GloVe embeddings](/resources/arxmliv-embeddings-082018/)
 
@@ -110,23 +111,23 @@ python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_fil
 ```
 
 1. `abelian` is to `group` as `disjoint` is to `?`
-   * Top hit: `union`, cosine distance `0.618853`
+   * Top hit: `union`, cosine distance `0.649029`
 2. `convex` is to `concave` as `positive` is to `?`
-   * Top hit: `negative`, cosine distance `0.806679`
+   * Top hit: `negative`, cosine distance `0.812031`
 3. `finite` is to `infinte` as `abelian` is to `?`
-   * Top hit: `nonabelian`, cosine distance `0.698089`
+   * Top hit: `nonabelian`, cosine distance `0.689419`
 4. `quantum` is to `classical` as `bottom` is to `?`
-   * Top hit: `middle`, cosine distance `0.769180`
-   * Close second: `top`, cosine distance `0.765937`
+   * Top hit: `middle`, cosine distance `0.770132`
+   * Close second: `top`, cosine distance `0.758245`
 5. `eq` is to `proves` as `figure` is to `?`
-   * Top hit: `showing`, cosine distance `0.689938`
+   * Top hit: `shows`, cosine distance `0.675003`
 6. `italic_x` is to `italic_y` as `italic_a` is to `?`
-   * Top hit: `italic_b`, cosine distance `0.915467`
+   * Top hit: `italic_b`, cosine distance `0.912827`
 
 #### Nearest word vectors
@@ -139,81 +140,92 @@ python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file gl
 ```
 Word: lattice  Position in vocabulary: 515
-            Word       Cosine distance
+            Word       Cosine distance
 ---------------------------------------------------------
-        lattices        0.865888
+        lattices        0.860839
-       honeycomb        0.677004
+          finite        0.662110
-          finite        0.650216
+       honeycomb        0.657155
-      triangular        0.632165
+         crystal        0.635061
-         crystal        0.627800
+      triangular        0.632298
-      sublattice        0.619792
+         spacing        0.619840
-           cubic        0.609822
+          square        0.613936
+      sublattice        0.612161
+       hexagonal        0.606321
+      hypercubic        0.606101
+          latter        0.602747
+        symmetry        0.601192
+           cubic        0.601035
 ```
 
 2. **entanglement**
 ```
-Word: entanglement  Position in vocabulary: 1603
+Word: entanglement  Position in vocabulary: 1605
-            Word       Cosine distance
+            Word       Cosine distance
 ---------------------------------------------------------
-       entangled        0.803443
+       entangled        0.795067
+    multipartite        0.745711
-    multipartite        0.744602
+     concurrence        0.708164
-      negativity        0.698730
+      negativity        0.695089
-     concurrence        0.693703
+         quantum        0.666254
-      tripartite        0.669840
+      tripartite        0.653771
-         discord        0.660572
+        fidelity        0.651990
-        fidelity        0.657391
+   teleportation        0.639430
-         quantum        0.655452
+     nonlocality        0.626717
-   teleportation        0.628923
+         discord        0.622995
-          qubits        0.627504
+           qubit        0.622836
-       bipartite        0.622791
+       bipartite        0.614907
-      entangling        0.621139
+          qubits        0.613029
-     nonlocality        0.619905
+         entropy        0.612276
-           qubit        0.615623
-         entropy        0.601869
 ```
 
 3. **forgetful**
 ```
-Word: forgetful  Position in vocabulary: 12259
+Word: forgetful  Position in vocabulary: 12229
-            Word       Cosine distance
+            Word       Cosine distance
 ---------------------------------------------------------
-         functor        0.749501
+         functor        0.731004
-        functors        0.686806
+        functors        0.667090
-        morphism        0.632394
+       morphisms        0.605955
-       morphisms        0.610589
+        morphism        0.604947
 ```
@@ -221,80 +233,88 @@ python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file gl
 ```
 Word: eigenvalue  Position in vocabulary: 1527
-            Word       Cosine distance
+            Word       Cosine distance
 ---------------------------------------------------------
-     eigenvalues        0.903885
+     eigenvalues        0.894346
-     eigenvector        0.781512
+     eigenvector        0.775584
-    eigenvectors        0.774260
+   eigenfunction        0.772961
-   eigenfunction        0.751316
+    eigenvectors        0.762914
-  eigenfunctions        0.707166
+  eigenfunctions        0.700270
-      eigenspace        0.683321
+      eigenspace        0.686408
-           eigen        0.657366
+           eigen        0.664881
-       laplacian        0.649859
+       laplacian        0.646244
-          matrix        0.645466
+      eigenstate        0.629338
-       eigenmode        0.628024
+       eigenmode        0.626229
-        operator        0.620245
+         largest        0.620355
-      eigenmodes        0.610912
+          matrix        0.618085
-         largest        0.607076
+      eigenmodes        0.605928
+        operator        0.602806
+        smallest        0.600443
-     eigenstates        0.603603
 ```
 
 5. **riemannian**
 ```
-Word: riemannian  Position in vocabulary: 2285
+Word: riemannian  Position in vocabulary: 2428
-            Word       Cosine distance
+            Word       Cosine distance
 ---------------------------------------------------------
-        manifold        0.771125
+       manifolds        0.780788
+        manifold        0.771704
+          metric        0.725227
+         finsler        0.686441
-       manifolds        0.770408
+           ricci        0.678393
-          metric        0.709820
+       curvature        0.677207
-         finsler        0.699053
+         metrics        0.660825
-       curvature        0.672640
+       euclidean        0.659125
-           ricci        0.667813
+      noncompact        0.647109
-      riemmanian        0.661929
+     conformally        0.643647
-       euclidean        0.645167
+      riemmanian        0.641671
-         metrics        0.641648
+     submanifold        0.632707
-     submanifold        0.638131
+          kahler        0.623857
-          kahler        0.635828
+        geodesic        0.621973
-       riemanian        0.626252
+    submanifolds        0.617170
-      noncompact        0.623363
+         endowed        0.616036
-        geodesic        0.620316
+       riemanian        0.608523
-    submanifolds        0.613058
+      hyperbolic        0.603709
-         endowed        0.608804
+      submersion        0.600120
-       foliation        0.601818
 ```
\ No newline at end of file
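The evaluation numbers touched by this diff all come from GloVe's bundled scripts (`eval/python/distance.py` and `eval/python/word_analogy.py`), which simply rank the vocabulary by cosine similarity to a query vector (for analogies, to `b - a + c`). A minimal sketch of that ranking, for readers without a GloVe checkout; the toy 3-dimensional vectors below are illustrative stand-ins, not values from the actual model file:

```python
import numpy as np

def nearest(word, vectors, k=5):
    """Rank all other words by cosine similarity to `word` (as distance.py does)."""
    v = vectors[word] / np.linalg.norm(vectors[word])
    scored = [(w, float(np.dot(v, u / np.linalg.norm(u))))
              for w, u in vectors.items() if w != word]
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:k]

def analogy(a, b, c, vectors, k=1):
    """a : b :: c : ?  -- rank words by cosine similarity to (b - a + c)."""
    t = vectors[b] - vectors[a] + vectors[c]
    t = t / np.linalg.norm(t)
    scored = [(w, float(np.dot(t, u / np.linalg.norm(u))))
              for w, u in vectors.items() if w not in (a, b, c)]
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for the real vocab/vectors files; the real model
# stores one "word v1 v2 ... vN" text line per vocabulary entry.
toy = {
    "finite":     np.array([ 1.0, 0.1, 0.0]),
    "infinite":   np.array([-1.0, 0.1, 0.0]),
    "abelian":    np.array([ 0.9, 0.2, 0.3]),
    "nonabelian": np.array([-0.9, 0.2, 0.3]),
}
print(analogy("finite", "infinite", "abelian", toy))  # top hit: nonabelian
```

Note that the scripts report these cosine similarities under the label "cosine distance", which is the convention the examples in this page follow as well.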