updated 2019 embeddings handling apostrophe tokenization better

Commit 175ee721 authored Sep 29, 2019 by Deyan Ginev
--- a/resources/arxmliv-embeddings-082019.md
+++ b/resources/arxmliv-embeddings-082019.md
@@ -48,7 +48,7 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.

 | subset     |         tokens | unique words | unique words (freq 5+ ) |
 | ---------- | -------------: | -----------: | ----------------------: |
-| complete   | 15,214,964,673 |   2,868,070  |              1,013,106  |
+| complete   | 15,192,564,807 |   2,782,667  |                989,136  |

 ### Citing this Resource

@@ -59,7 +59,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
  ([SIGMathLing members](/member/) only)

 ### Generated via
- - [llamapun 0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3),
+ - [llamapun 0.3.4](https://github.com/KWARC/llamapun/releases/tag/0.3.4),
 - [GloVe 1.2, 2019](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)

 ### Generation Parameters
@@ -71,7 +71,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
          token_model_error.txt > token_model_complete.txt
    ```

- * [llamapun v0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3), `corpus_token_model` example used for token model extraction
+ * [llamapun v0.3.4](https://github.com/KWARC/llamapun/releases/tag/0.3.4), `corpus_token_model` example used for token model extraction
   * processed logical paragraphs, abstracts, captions and keywords, ignore all other content (e.g. tables, bibliography, others)
   * excluded paragraphs containing latexml errors (marked via a `ltx_ERROR` HTML class); also excluded when words over 25 characters were encountered.
   * used llamapun math-aware word tokenization, with sub-formula math lexemes (improved robustness since 2018)
@@ -80,6 +80,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
   * internal references replaced with `ref` token (e.g. `Figure ref`)
   * textual punctuation is included as-is, while mathematical punctuation is annotated via the latexml-generated lexemes.
   * words are downcased, while math content is kept cased, to mitigate lexical ambiguity.
+   * an update was provided on Sep 29, 2019, improving tokenization of apostrophe-including constructs such as `'s`.

 * [GloVe repository at sha 07d59d](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
   * build/vocab_count -min-count 5
@@ -91,8 +92,8 @@ Please cite the main dataset when using the word embeddings, as they are generat

 #### GloVe in-built evaluation (non-expert tasks e.g. language, relationships, geography)
 1. NEW; 2019 model
-  * Total accuracy: 38.30%  (7017/18322)
-  * Highest score: "gram3-comparative.txt", 78.60% (1047/1332)
+  * Total accuracy: 37.76%  (7017/18322)
+  * Highest score: "gram3-comparative.txt", 77.33% (1047/1332)


 2. 2018 [GloVe embeddings](/resources/arxmliv-embeddings-082018/)
@@ -110,23 +111,23 @@ python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_fil
 ```

 1. `abelian` is to `group` as `disjoint` is to `?`
-  * Top hit: `union`, cosine distance	`0.618853`
+  * Top hit: `union`, cosine distance	`0.649029`

 2. `convex` is to `concave` as `positive` is to `?`
-  * Top hit: `negative`, cosine distance `0.806679`
+  * Top hit: `negative`, cosine distance `0.812031`

 3. `finite` is to `infinte` as `abelian` is to `?`
-  * Top hit: `nonabelian`, cosine distance `0.698089`
+  * Top hit: `nonabelian`, cosine distance `0.689419`

 4. `quantum` is to `classical` as `bottom` is to `?`
-  * Top hit: `middle`, cosine distance `0.769180`
-  * Close second: `top`, cosine distance `0.765937`
+  * Top hit: `middle`, cosine distance `0.770132`
+  * Close second: `top`, cosine distance `0.758245`

 5. `eq` is to `proves` as `figure` is to `?`
-  * Top hit: `showing`, cosine distance `0.689938`
+  * Top hit: `shows`, cosine distance `0.675003`

 6. `italic_x` is to `italic_y` as `italic_a` is to `?`
-  * Top hit: `italic_b`, cosine distance `0.915467`
+  * Top hit: `italic_b`, cosine distance `0.912827`


 #### Nearest word vectors
@@ -143,77 +144,88 @@ python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file gl

    ---------------------------------------------------------

-                              lattices		0.865888
+                           lattices		0.860839

-                              honeycomb		0.677004
+                             finite		0.662110

-                                finite		0.650216
+                          honeycomb		0.657155

-                            triangular		0.632165
+                            crystal		0.635061

-                                crystal		0.627800
+                         triangular		0.632298

-                            sublattice		0.619792
+                            spacing		0.619840

-                                  cubic		0.609822
+                             square		0.613936

+                         sublattice		0.612161
+
+                          hexagonal		0.606321
+
+                         hypercubic		0.606101
+
+                             latter		0.602747
+
+                           symmetry		0.601192
+
+                              cubic		0.601035
    ```

 2. **entanglement**
    ```
-    Word: entanglement  Position in vocabulary: 1603
+    Word: entanglement  Position in vocabulary: 1605

                           Word       Cosine distance

    ---------------------------------------------------------

-                            entangled		0.803443
+                          entangled		0.795067
+
+                       multipartite		0.745711

-                          multipartite		0.744602
+                        concurrence		0.708164

-                            negativity		0.698730
+                         negativity		0.695089

-                            concurrence		0.693703
+                            quantum		0.666254

-                            tripartite		0.669840
+                         tripartite		0.653771

-                                discord		0.660572
+                           fidelity		0.651990

-                              fidelity		0.657391
+                      teleportation		0.639430

-                                quantum		0.655452
+                        nonlocality		0.626717

-                          teleportation		0.628923
+                            discord		0.622995

-                                qubits		0.627504
+                              qubit		0.622836

-                              bipartite		0.622791
+                          bipartite		0.614907

-                            entangling		0.621139
+                             qubits		0.613029

-                            nonlocality		0.619905
+                            entropy		0.612276

-                                  qubit		0.615623

-                                entropy		0.601869

    ```

 3. **forgetful**
    ```
-    Word: forgetful  Position in vocabulary: 12259
+    Word: forgetful  Position in vocabulary: 12229

                           Word       Cosine distance

    ---------------------------------------------------------

-                                functor		0.749501
+                            functor		0.731004

-                              functors		0.686806
+                           functors		0.667090

-                              morphism		0.632394
+                          morphisms		0.605955

-                              morphisms		0.610589
+                           morphism		0.604947

    ```

@@ -225,76 +237,84 @@ python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file gl

    ---------------------------------------------------------

-                            eigenvalues		0.903885
+                        eigenvalues		0.894346

-                            eigenvector		0.781512
+                        eigenvector		0.775584

-                          eigenvectors		0.774260
+                      eigenfunction		0.772961

-                          eigenfunction		0.751316
+                      eigenvectors		0.762914

-                        eigenfunctions		0.707166
+                    eigenfunctions		0.700270

-                            eigenspace		0.683321
+                        eigenspace		0.686408

-                                  eigen		0.657366
+                              eigen		0.664881

-                              laplacian		0.649859
+                          laplacian		0.646244

-                                matrix		0.645466
+                        eigenstate		0.629338

-                              eigenmode		0.628024
+                          eigenmode		0.626229

-                              operator		0.620245
+                            largest		0.620355

-                            eigenmodes		0.610912
+                            matrix		0.618085

-                                largest		0.607076
+                        eigenmodes		0.605928
+
+                          operator		0.602806
+
+                          smallest		0.600443

-                            eigenstates		0.603603

    ```

 5. **riemannian**
    ```
-    Word: riemannian  Position in vocabulary: 2285
+    Word: riemannian  Position in vocabulary: 2428

                           Word       Cosine distance

    ---------------------------------------------------------

-                              manifold		0.771125
+                          manifolds		0.780788
+
+                           manifold		0.771704
+
+                             metric		0.725227
+
+                            finsler		0.686441

-                              manifolds		0.770408
+                              ricci		0.678393

-                                metric		0.709820
+                          curvature		0.677207

-                                finsler		0.699053
+                            metrics		0.660825

-                              curvature		0.672640
+                          euclidean		0.659125

-                                  ricci		0.667813
+                         noncompact		0.647109

-                            riemmanian		0.661929
+                        conformally		0.643647

-                              euclidean		0.645167
+                         riemmanian		0.641671

-                                metrics		0.641648
+                        submanifold		0.632707

-                            submanifold		0.638131
+                             kahler		0.623857

-                                kahler		0.635828
+                           geodesic		0.621973

-                              riemanian		0.626252
+                       submanifolds		0.617170

-                            noncompact		0.623363
+                            endowed		0.616036

-                              geodesic		0.620316
+                          riemanian		0.608523

-                          submanifolds		0.613058
+                         hyperbolic		0.603709

-                                endowed		0.608804
+                         submersion		0.600120

-                              foliation		0.601818

    ```
\ No newline at end of file
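
The scores changed in this commit come from GloVe's `eval/python/distance.py` and `eval/python/word_analogy.py`. As far as I can tell, both rank vocabulary words by the dot product of length-normalized vectors, so the column labeled "Cosine distance" behaves as a cosine similarity (higher means closer). The sketch below is a minimal, dependency-free illustration of that ranking, not the GloVe scripts themselves. The function names (`parse_vectors`, `nearest`, `analogy`) and the two-dimensional toy vectors are assumptions made up for this example and do not come from the arXMLiv embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def parse_vectors(lines):
    """Parse GloVe text format: one `word v1 v2 ... vn` entry per line."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = normalize([float(x) for x in parts[1:]])
    return vectors


def cosine(a, b):
    """Dot product of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank(vectors, target, exclude, topn):
    """Return the topn (word, similarity) pairs closest to target."""
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def nearest(vectors, word, topn=5):
    """Nearest neighbours of a single word, excluding the word itself."""
    return rank(vectors, vectors[word], {word}, topn)


def analogy(vectors, a, b, c, topn=1):
    """`a` is to `b` as `c` is to `?`: rank words by similarity to b - a + c."""
    target = normalize(
        [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    )
    return rank(vectors, target, {a, b, c}, topn)


# Toy vectors invented for illustration only (hypothetical values).
TOY_LINES = [
    "lattice 1.0 0.1",
    "lattices 0.9 0.2",
    "functor 0.1 1.0",
    "functors 0.2 0.9",
    "finite 0.7 0.7",
]

toy_vectors = parse_vectors(TOY_LINES)
print(nearest(toy_vectors, "lattice", topn=2))
print(analogy(toy_vectors, "lattice", "lattices", "functor"))
```

With the real embeddings, the same functions could be fed the lines of the GloVe vectors text file instead of `TOY_LINES`. A pure-Python scan over a vocabulary of roughly one million words would be slow, so a matrix-based implementation is the practical choice there.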