Commit 175ee721 authored by Deyan Ginev

updated 2019 embeddings to handle apostrophe tokenization better

parent 5fad79dc
......@@ -48,7 +48,7 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.
| subset | tokens | unique words | unique words (freq 5+ ) |
| ---------- | -------------: | -----------: | ----------------------: |
| complete | 15,214,964,673 | 2,868,070 | 1,013,106 |
| complete | 15,192,564,807 | 2,782,667 | 989,136 |
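The three columns of this table can be reproduced from any token stream. Below is a minimal sketch, assuming that "freq 5+" means a word occurs at least five times (which matches the `-min-count 5` vocabulary cutoff used in this resource). The function name `vocab_stats` and the toy token list are illustrative assumptions, not part of the original pipeline.

```python
from collections import Counter

def vocab_stats(tokens, min_count=5):
    """Return (total tokens, unique words, unique words with count >= min_count)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    unique = len(counts)
    frequent = sum(1 for c in counts.values() if c >= min_count)
    return total, unique, frequent

# Toy token stream, for illustration only (not corpus data).
sample = ["lattice"] * 6 + ["functor"] * 5 + ["kahler"] * 2 + ["finsler"]
print(vocab_stats(sample))
```

On the full corpus the counting would need to stream over the token model file rather than hold a list in memory, but the arithmetic is the same.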
### Citing this Resource
......@@ -59,7 +59,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
([SIGMathLing members](/member/) only)
### Generated via
- [llamapun 0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3),
- [llamapun 0.3.4](https://github.com/KWARC/llamapun/releases/tag/0.3.4),
- [GloVe 1.2, 2019](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
### Generation Parameters
......@@ -71,7 +71,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
token_model_error.txt > token_model_complete.txt
```
* [llamapun v0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3), `corpus_token_model` example used for token model extraction
* [llamapun v0.3.4](https://github.com/KWARC/llamapun/releases/tag/0.3.4), `corpus_token_model` example used for token model extraction
* processed logical paragraphs, abstracts, captions and keywords; ignored all other content (e.g. tables, bibliography)
* excluded paragraphs containing latexml errors (marked via a `ltx_ERROR` HTML class); also excluded paragraphs containing words longer than 25 characters.
* used llamapun math-aware word tokenization, with sub-formula math lexemes (improved robustness since 2018)
......@@ -80,6 +80,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
* internal references replaced with `ref` token (e.g. `Figure ref`)
* textual punctuation is included as-is, while mathematical punctuation is annotated via the latexml-generated lexemes.
* words are downcased, while math content is kept cased, to mitigate lexical ambiguity.
* an update was provided on Sep 29, 2019, improving tokenization of apostrophe-including constructs such as `'s`.
* [GloVe repository at sha 07d59d](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
* build/vocab_count -min-count 5
......@@ -91,8 +92,8 @@ Please cite the main dataset when using the word embeddings, as they are generat
#### GloVe in-built evaluation (non-expert tasks e.g. language, relationships, geography)
1. NEW; 2019 model
* Total accuracy: 38.30% (7017/18322)
* Highest score: "gram3-comparative.txt", 78.60% (1047/1332)
* Total accuracy: 37.76% (7017/18322)
* Highest score: "gram3-comparative.txt", 77.33% (1047/1332)
2. 2018 [GloVe embeddings](/resources/arxmliv-embeddings-082018/)
......@@ -110,23 +111,23 @@ python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_fil
```
1. `abelian` is to `group` as `disjoint` is to `?`
* Top hit: `union`, cosine distance `0.618853`
* Top hit: `union`, cosine distance `0.649029`
2. `convex` is to `concave` as `positive` is to `?`
* Top hit: `negative`, cosine distance `0.806679`
* Top hit: `negative`, cosine distance `0.812031`
3. `finite` is to `infinte` as `abelian` is to `?`
* Top hit: `nonabelian`, cosine distance `0.698089`
* Top hit: `nonabelian`, cosine distance `0.689419`
4. `quantum` is to `classical` as `bottom` is to `?`
* Top hit: `middle`, cosine distance `0.769180`
* Close second: `top`, cosine distance `0.765937`
* Top hit: `middle`, cosine distance `0.770132`
* Close second: `top`, cosine distance `0.758245`
5. `eq` is to `proves` as `figure` is to `?`
* Top hit: `showing`, cosine distance `0.689938`
* Top hit: `shows`, cosine distance `0.675003`
6. `italic_x` is to `italic_y` as `italic_a` is to `?`
* Top hit: `italic_b`, cosine distance `0.915467`
* Top hit: `italic_b`, cosine distance `0.912827`
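The analogy queries above follow the usual vector-offset pattern: for "`a` is to `b` as `c` is to `?`", candidates are ranked by cosine similarity to `b - a + c`, with the three query words excluded. The sketch below illustrates that pattern in plain Python. It is not the `word_analogy.py` script itself, and the helper names (`unit`, `analogy`) and the two-dimensional toy vectors are assumptions made for illustration.

```python
import math

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def analogy(vectors, a, b, c, topn=1):
    """Rank words d for 'a is to b as c is to d' by cosine similarity to b - a + c."""
    unit_vecs = {word: unit(vec) for word, vec in vectors.items()}
    offset = [vb - va + vc
              for va, vb, vc in zip(unit_vecs[a], unit_vecs[b], unit_vecs[c])]
    target = unit(offset)
    # Score every word except the query words themselves.
    scored = [
        (word, sum(x * y for x, y in zip(vec, target)))
        for word, vec in unit_vecs.items()
        if word not in (a, b, c)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical 2-d vectors, for illustration only.
toy = {
    "convex": [1.0, 1.0],
    "concave": [1.0, -1.0],
    "positive": [2.0, 1.0],
    "negative": [2.0, -1.0],
    "lattice": [-1.0, 0.5],
}
print(analogy(toy, "convex", "concave", "positive"))
```

With real embeddings, `vectors` would be loaded from the GloVe text output; higher scores mean closer matches, as in the listings above.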
#### Nearest word vectors
......@@ -139,81 +140,92 @@ python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file gl
```
Word: lattice Position in vocabulary: 515
Word Cosine distance
Word Cosine distance
---------------------------------------------------------
lattices 0.865888
lattices 0.860839
honeycomb 0.677004
finite 0.662110
finite 0.650216
honeycomb 0.657155
triangular 0.632165
crystal 0.635061
crystal 0.627800
triangular 0.632298
sublattice 0.619792
spacing 0.619840
cubic 0.609822
square 0.613936
sublattice 0.612161
hexagonal 0.606321
hypercubic 0.606101
latter 0.602747
symmetry 0.601192
cubic 0.601035
```
2. **entanglement**
```
Word: entanglement Position in vocabulary: 1603
Word: entanglement Position in vocabulary: 1605
Word Cosine distance
Word Cosine distance
---------------------------------------------------------
entangled 0.803443
entangled 0.795067
multipartite 0.745711
multipartite 0.744602
concurrence 0.708164
negativity 0.698730
negativity 0.695089
concurrence 0.693703
quantum 0.666254
tripartite 0.669840
tripartite 0.653771
discord 0.660572
fidelity 0.651990
fidelity 0.657391
teleportation 0.639430
quantum 0.655452
nonlocality 0.626717
teleportation 0.628923
discord 0.622995
qubits 0.627504
qubit 0.622836
bipartite 0.622791
bipartite 0.614907
entangling 0.621139
qubits 0.613029
nonlocality 0.619905
entropy 0.612276
qubit 0.615623
entropy 0.601869
```
3. **forgetful**
```
Word: forgetful Position in vocabulary: 12259
Word: forgetful Position in vocabulary: 12229
Word Cosine distance
Word Cosine distance
---------------------------------------------------------
functor 0.749501
functor 0.731004
functors 0.686806
functors 0.667090
morphism 0.632394
morphisms 0.605955
morphisms 0.610589
morphism 0.604947
```
......@@ -221,80 +233,88 @@ python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file gl
```
Word: eigenvalue Position in vocabulary: 1527
Word Cosine distance
Word Cosine distance
---------------------------------------------------------
eigenvalues 0.903885
eigenvalues 0.894346
eigenvector 0.781512
eigenvector 0.775584
eigenvectors 0.774260
eigenfunction 0.772961
eigenfunction 0.751316
eigenvectors 0.762914
eigenfunctions 0.707166
eigenfunctions 0.700270
eigenspace 0.683321
eigenspace 0.686408
eigen 0.657366
eigen 0.664881
laplacian 0.649859
laplacian 0.646244
matrix 0.645466
eigenstate 0.629338
eigenmode 0.628024
eigenmode 0.626229
operator 0.620245
largest 0.620355
eigenmodes 0.610912
matrix 0.618085
largest 0.607076
eigenmodes 0.605928
operator 0.602806
smallest 0.600443
eigenstates 0.603603
```
5. **riemannian**
```
Word: riemannian Position in vocabulary: 2285
Word: riemannian Position in vocabulary: 2428
Word Cosine distance
Word Cosine distance
---------------------------------------------------------
manifold 0.771125
manifolds 0.780788
manifold 0.771704
metric 0.725227
finsler 0.686441
manifolds 0.770408
ricci 0.678393
metric 0.709820
curvature 0.677207
finsler 0.699053
metrics 0.660825
curvature 0.672640
euclidean 0.659125
ricci 0.667813
noncompact 0.647109
riemmanian 0.661929
conformally 0.643647
euclidean 0.645167
riemmanian 0.641671
metrics 0.641648
submanifold 0.632707
submanifold 0.638131
kahler 0.623857
kahler 0.635828
geodesic 0.621973
riemanian 0.626252
submanifolds 0.617170
noncompact 0.623363
endowed 0.616036
geodesic 0.620316
riemanian 0.608523
submanifolds 0.613058
hyperbolic 0.603709
endowed 0.608804
submersion 0.600120
foliation 0.601818
```
\ No newline at end of file
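The nearest-word listings above rank vocabulary entries by their cosine score against the query word, where a higher value means a closer neighbour. As a rough sketch of that lookup, the snippet below parses vectors in the plain-text GloVe layout (one word per line followed by its space-separated components) and returns the top matches. The function names and the toy vectors are hypothetical; this is not the `distance.py` script.

```python
import math

def parse_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def nearest(vectors, word, topn=5):
    """Return the topn words most similar to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical 2-d vectors, for illustration only.
toy_lines = [
    "lattice 1.0 0.0",
    "lattices 0.9 0.1",
    "functor 0.0 1.0",
    "crystal 0.7 0.4",
]
toy_vectors = parse_vectors(toy_lines)
for neighbour, score in nearest(toy_vectors, "lattice", topn=2):
    print(f"{neighbour:>12} {score:.6f}")
```

For a vocabulary of roughly a million words, a vectorised implementation over a single normalised matrix would be far faster than this per-word loop.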