arxmliv embeddings 08.2018, now available

Commit 44c98cdf authored 6 years ago by Deyan Ginev
--- a/resources/arxmliv-embeddings-082018.md
+++ b/resources/arxmliv-embeddings-082018.md
@@ -18,7 +18,7 @@ Access is restricted to  [SIGMathLing members](/member/) under the
 articles, the right of distribution was only given (or assumed) to arXiv itself.

 ### Contents
-  - A 5 billion token model for the arXMLiv 08.2018 dataset
+  - An 11.5 billion token model for the arXMLiv 08.2018 dataset, including subformula lexemes
    - `token_model.zip`
  - 300 dimensional GloVe word embeddings for the arXMLiv 08.2018 dataset
    - `glove.arxmliv.5B.300d.zip` and `vocab.arxmliv.zip`
@@ -30,26 +30,26 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.

 subset     | documents | paragraphs | sentences   |
 -----------|----------:|-----------:|------------:|
-no_problem | |  | |
-warning    | |  | |
-error      | |  | |
-complete   | |  | |
+no_problem | 137,864   | 4,646,203  | 21,533,963  |
+warning    | 705,095   | 45,797,794 | 183,246,777 |
+error      | 389,225   | 26,759,524 | 99,641,978  |
+complete   | 1,232,184 | 77,203,521 | 304,422,718 |

 subset     | words         | formulas    | inline cite | numeric literals |
 -----------|--------------:|-----------: |------------:|-----------------:|
-no_problem |  |  |  |       |
-warning    |  | |   |       |
-error      |  | |   |       |
-complete   |  | |   |       |
+no_problem | 430,217,995   | 20,910,732  | 3,709,520   | 11,177,753       |
+warning    | 3,175,663,430 | 281,832,412 | 25,337,574  | 83,606,897       |
+error      | 1,731,971,035 | 153,186,264 | 13,145,561  | 43,399,720       |
+complete   | 5,337,852,460 | 455,929,408 | 42,192,655  | 138,184,370      |

 #### GloVe Model Statistics

 subset     | tokens         | unique words | unique words (freq 5+ )
 -----------|--------------: |-------------:|-----------------------:
-no_problem |  |     |
-warning    |  |     |
-error      |  |     |
-complete   |  |     |
+no_problem | 622,968,267    | 715,433      | 219,304
+warning    | 7,203,536,205  | 3,478,235    | 666,317
+error      | 3,691,805,321  | 2,444,532    | 574,467
+complete   | 11,518,309,793 | 5,285,379    | 1,000,295

 ### Citing this Resource

@@ -60,7 +60,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
  ([SIGMathLing members](/member/) only)
 
 ### Generated via
- - [llamapun 0.2](https://github.com/KWARC/llamapun/releases/tag/0.2), 
+ - [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0), 
 - [GloVe 1.2, 2018](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)

 ### Generation Parameters
@@ -73,12 +73,12 @@ Please cite the main dataset when using the word embeddings, as they are generat
    ```

 * [llamapun v0.2](https://github.com/KWARC/llamapun/releases/tag/0.2), `corpus_token_model` example used for token model extraction
-   * used llamapun math-aware sentence and word tokenization
-   * processed logical paragraphs only (excluded non-textual modalities, e.g. tables, figures and their captions, bibliographies)
-   * marked up formulas replaced with `mathformula` token
+   * processed logical paragraphs only (excluded non-textual modalities, e.g. tables, figures and their captions, bibliographies, as well as abstracts)
+   * excluded paragraphs containing latexml errors (marked via a `ltx_ERROR` HTML class)
+   * used llamapun math-aware sentence and word tokenization, with subformula math lexemes
   * marked up inline citations replaced with `citationelement` token
-   * numeric literals replaced with `NUM` token
-   * ignored sentences with unnaturally long words (>30 characters) - almost always due to latexml conversion errors
+   * numeric literals replaced with `NUM` token (both in text and formulas)
+   * ignored sentences with unnaturally long words (>25 characters) - almost always due to latexml conversion errors
 * [GloVe repository at sha 07d59d](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
   * build/vocab_count -min-count 5
   * build/cooccur -memory 32.0 -window-size 15
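The two word-level normalization rules listed above (numeric literals replaced with `NUM`, sentences with words longer than 25 characters dropped) can be illustrated with a minimal Python sketch. This is not the llamapun implementation, which is the `corpus_token_model` example in the llamapun repository; the function name `normalize_sentence` and the numeric-literal pattern are assumptions made for illustration only.

```python
import re

# Assumption: a simple pattern for numeric literals such as 42, 3.14 or 1,000.
# llamapun's own notion of a numeric literal may differ.
NUMERIC = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")

# Threshold stated in the generation parameters (words > 25 characters).
MAX_WORD_LENGTH = 25


def normalize_sentence(words):
    """Return the normalized word list, or None if the sentence is ignored.

    `normalize_sentence` is a hypothetical helper, not a llamapun function.
    """
    # Ignore the whole sentence if any word is unnaturally long.
    if any(len(word) > MAX_WORD_LENGTH for word in words):
        return None
    # Replace numeric literals with the NUM token, keep other words as-is.
    return ["NUM" if NUMERIC.match(word) else word for word in words]
```

For example, `normalize_sentence(["we", "have", "42", "cases"])` yields `["we", "have", "NUM", "cases"]`, while a sentence containing a 26-character word is dropped entirely.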
@@ -89,24 +89,24 @@ Please cite the main dataset when using the word embeddings, as they are generat

 #### GloVe in-built evaluation (non-expert tasks e.g. language, relationships, geography)
 1. no_problem
-  * Total accuracy:
-  * Highest score: 
+  * Total accuracy: 27.47%  (4028/14663)
+  * Highest score: "gram3-comparative.txt", 75.30% (1003/1332)

 2. warning
-  * Total accuracy:
-  * Highest score: 
+  * Total accuracy: 31.97%  (5351/16736)
+  * Highest score: "gram3-comparative.txt", 77.40% (1031/1332)

 3. error
-  * Total accuracy:
-  * Highest score: 
+  * Total accuracy: 29.67%  (4910/16549)
+  * Highest score: "gram3-comparative.txt", 72.75% (969/1332)

 4. complete
-  * Total accuracy:
-  * Highest score: 
+  * Total accuracy: 35.48%  (6298/17750) 
+  * Highest score: "gram3-comparative.txt", 76.65% (1021/1332)

 5. demo baseline: text8 demo (first 100M characters of Wikipedia)
-  * Total accuracy: 
-  * Highest score: 
+  * Total accuracy: 23.62%  (4211/17827)
+  * Highest score: "gram6-nationality-adjective.txt", 58.65% (892/1521)

 **Evaluation note:** These in-built evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline against the non-expert tasks in the default GloVe suite. 
 One would need a set of test cases tailored to scientific discourse to evaluate the arXiv-based models competitively.
@@ -114,37 +114,187 @@ One would need a scienctific discourse tailored set of test cases to evaluate th
 #### Measuring word analogy
 In a cloned GloVe repository, start via:
 ```
-python eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt 
+python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt 
 ```

 1. `abelian` is to `group` as `disjoint` is to `?`
-  * Top hit: 
+  * Top hit: `union`, cosine distance `0.653377`

 2. `convex` is to `concave` as `positive` is to `?`
-  * Top hit: 
+  * Top hit: `negative`, cosine distance `0.786877`

 3. `finite` is to `infinite` as `abelian` is to `?`
-  * Top hit: 
+  * Top hit: `nonabelian`, cosine distance `0.676042`

 4. `quantum` is to `classical` as `bottom` is to `?`
-  * Top hit: 
+  * Top hit: `top`, cosine distance `0.734896`

 5. `eq` is to `proves` as `figure` is to `?`
-  * Top hit: 
+  * Top hit: `showing`, cosine distance `0.668502`
+
+6. `italic_x` is to `italic_y` as `italic_a` is to `?`
+  * Top hit: `italic_b`, cosine distance `0.902608`
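The analogy queries above follow the usual vector-offset scheme: for "`a` is to `b` as `c` is to `?`", the candidate closest to `b - a + c` is returned, with the three input words excluded. The sketch below is a dependency-free illustration on toy vectors, not the GloVe `word_analogy.py` script itself; the function names and the toy data are assumptions. The score it returns is the cosine similarity of unit vectors (higher means closer), which appears to be what the listings above report under the label "cosine distance".

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def analogy(vectors, a, b, c):
    """Answer "`a` is to `b` as `c` is to ?" via the offset b - a + c.

    `vectors` maps each word to a list of floats (hypothetical toy input).
    Returns the best candidate and its cosine similarity to the target.
    """
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    target = normalize(
        [vb - va + vc for va, vb, vc in zip(unit[a], unit[b], unit[c])]
    )
    best_word, best_score = None, -1.0
    for word, vec in unit.items():
        if word in (a, b, c):  # the query words are never valid answers
            continue
        score = sum(x * y for x, y in zip(target, vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    "convex": [1.0, 0.0, 1.0],
    "concave": [1.0, 0.0, -1.0],
    "positive": [0.0, 1.0, 1.0],
    "negative": [0.0, 1.0, -1.0],
    "lattice": [1.0, 1.0, 0.0],
}
word, score = analogy(toy, "convex", "concave", "positive")
```

On this toy data the answer is `negative`, since the offset from `convex` to `concave` flips the third coordinate.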


 #### Nearest word vectors 
 In a cloned GloVe repository, start via:
 ```
-python eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt 
+python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt 
 ```

 1. **lattice** 

+```
+Word: lattice  Position in vocabulary: 488
+
+                               Word       Cosine distance
+
+---------------------------------------------------------
+
+                           lattices		0.853103
+
+                         triangular		0.637767
+
+                          honeycomb		0.626426
+
+                            crystal		0.624397
+
+                             finite		0.614720
+
+                            spacing		0.603067
+```
+
 2. **entanglement**

+```
+Word: entanglement  Position in vocabulary: 1568
+
+                               Word       Cosine distance
+
+---------------------------------------------------------
+
+                          entangled		0.780425
+
+                       multipartite		0.730968
+
+                        concurrence		0.691708
+
+                         negativity		0.649595
+
+                         tripartite		0.647623
+
+                            quantum		0.640395
+
+                           fidelity		0.640285
+
+                      teleportation		0.616797
+
+                            discord		0.613752
+
+                            entropy		0.612341
+
+                          bipartite		0.608034
+
+                          coherence		0.606859
+
+                        nonlocality		0.601337
+```
+
 3. **forgetful**

+```
+Word: forgetful  Position in vocabulary: 11740
+
+                               Word       Cosine distance
+
+---------------------------------------------------------
+
+                            functor		0.723472
+
+                           functors		0.656184
+
+                           morphism		0.598965
+```
+
 4. **eigenvalue**

+```
+Word: eigenvalue  Position in vocabulary: 1448
+
+                               Word       Cosine distance
+
+---------------------------------------------------------
+
+                        eigenvalues		0.893073
+
+                        eigenvector		0.768380
+
+                       eigenvectors		0.765241
+
+                      eigenfunction		0.754222
+
+                     eigenfunctions		0.686141
+
+                         eigenspace		0.666098
+
+                              eigen		0.641422
+
+                             matrix		0.616723
+
+                          eigenmode		0.613117
+
+                         eigenstate		0.612188
+
+                          laplacian		0.611396
+
+                            largest		0.606122
+
+                           smallest		0.605342
+
+                         eigenmodes		0.604839
+
+```
+
 5. **riemannian**
+
+```
+Word: riemannian  Position in vocabulary: 2285
+
+                               Word       Cosine distance
+
+---------------------------------------------------------
+
+                          manifolds		0.765827
+
+                           manifold		0.760806
+
+                             metric		0.719817
+
+                            finsler		0.687826
+
+                          curvature		0.676100
+
+                              ricci		0.664770
+
+                            metrics		0.660804
+
+                         riemmanian		0.651666
+
+                          euclidean		0.644686
+
+                         noncompact		0.643878
+
+                        conformally		0.638984
+
+                          riemanian		0.633814
+
+                             kahler		0.632680
+
+                            endowed		0.622035
+
+                        submanifold		0.613868
+
+                       submanifolds		0.612716
+
+                           geodesic		0.604488
+```
\ No newline at end of file
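
As a rough illustration of how such neighbor lists can be reproduced without the GloVe scripts, the sketch below parses the GloVe text format (one `word v1 v2 ... vN` entry per line) and ranks words by cosine similarity to a query. The function names `load_vectors` and `nearest` are assumptions made for this example, and the pure-Python loops are meant for clarity; a 300-dimensional model with a million-word vocabulary would call for a vectorized implementation.

```python
import math


def load_vectors(lines):
    """Parse GloVe text format: one `word v1 v2 ... vN` entry per line."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:  # skip blank or malformed lines
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def nearest(vectors, query, top_n=10):
    """Return the `top_n` words most similar to `query`, best first.

    Similarity is the cosine of the angle between the two vectors.
    Assumes no vector is all zeros.
    """
    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    q = unit(vectors[query])
    scores = [
        (sum(x * y for x, y in zip(q, unit(vec))), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scores.sort(reverse=True)
    return [(word, score) for score, word in scores[:top_n]]


# Toy 2-dimensional entries, invented for illustration only.
sample = ["lattice 1.0 0.0", "lattices 0.9 0.1", "entropy 0.0 1.0"]
neighbors = nearest(load_vectors(sample), "lattice", top_n=2)
```

With the real model, the same functions could be applied to an open file handle, e.g. `load_vectors(open("glove.arxmliv.11B.300d.txt", encoding="utf-8"))`, using the vectors file named in the command above.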