Update embeddings description with "nomath" controls

Merged Deyan Ginev requested to merge update-arxmliv-embeddings into master
@@ -12,7 +12,7 @@ Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWAR
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/), since for most [arXiv](http://arxiv.org)
@@ -22,35 +22,39 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.
- An 11.5 billion token model for the arXMLiv 08.2018 dataset, including subformula lexemes
  - `token_model.zip`
- 300 dimensional GloVe word embeddings for the arXMLiv 08.2018 dataset
  - `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip` (see the loading sketch after this list)
- 300d GloVe word embeddings for individual subsets
  - `glove.subsets.zip`
- Embeddings and vocabulary with math lexemes omitted
  - `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
  - added on July 20, 2019
  - used as a control when evaluating the contribution of formula lexemes
- the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082018/)
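For orientation, here is a minimal Python sketch for reading the vectors into memory. It assumes the archive unpacks to the `glove.arxmliv.11B.300d.txt` file referenced in the evaluation commands further below, and that the file uses GloVe's plain-text layout (one word per line, followed by its 300 float components); verify both against the actual download.

```python
import numpy as np

def load_glove(vectors_path="glove.arxmliv.11B.300d.txt"):
    """Read GloVe's plain-text output: one line per word,
    the word followed by its space-separated float components."""
    embeddings = {}
    with open(vectors_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

vectors = load_glove()
print(len(vectors), vectors["lattice"].shape)  # vocabulary size, (300,)
```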
#### Token Model Statistics
| subset | documents | paragraphs | sentences |
| ---------- | --------: | ---------: | ----------: |
| no_problem | 137,864 | 4,646,203 | 21,533,963 |
| warning | 705,095 | 45,797,794 | 183,246,777 |
| error | 389,225 | 26,759,524 | 99,641,978 |
| complete | 1,232,184 | 77,203,521 | 304,422,718 |
| subset | words | formulas | inline cite | numeric literals |
| ---------- | ------------: | ----------: | ----------: | ---------------: |
| no_problem | 430,217,995 | 20,910,732 | 3,709,520 | 11,177,753 |
| warning | 3,175,663,430 | 281,832,412 | 25,337,574 | 83,606,897 |
| error | 1,731,971,035 | 153,186,264 | 13,145,561 | 43,399,720 |
| complete | 5,337,852,460 | 455,929,408 | 42,192,655 | 138,184,370 |
#### GloVe Model Statistics
| subset | tokens | unique words | unique words (freq 5+ ) |
| ---------- | -------------: | -----------: | ----------------------: |
| no_problem | 622,968,267 | 715,433 | 219,304 |
| warning | 7,203,536,205 | 3,478,235 | 666,317 |
| error | 3,691,805,321 | 2,444,532 | 574,467 |
| complete | 11,518,309,793 | 5,285,379 | 1,000,295 |
### Citing this Resource
@@ -59,14 +63,14 @@ Please cite the main dataset when using the word embeddings, as they are generat
### Download
[Download link](https://gl.kwarc.info/SIGMathLing/embeddings-arXMLiv-08-2018)
([SIGMathLing members](/member/) only)
### Generated via
- [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
- [GloVe 1.2, 2018](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
### Generation Parameters
* the token model is distributed as 3 subsets - no_problem, warning and error; the complete model is derived via:
```
cat token_model_no_problem.txt \
token_model_warning.txt \
@@ -102,20 +106,20 @@ Please cite the main dataset when using the word embeddings, as they are generat
* Highest score: "gram3-comparative.txt", 72.75% (969/1332)
4. complete
* Total accuracy: 35.48% (6298/17750)
* Highest score: "gram3-comparative.txt", 76.65% (1021/1332)
5. demo baseline: text8 demo (first 100M characters of Wikipedia)
* Total accuracy: 23.62% (4211/17827)
* Highest score: "gram6-nationality-adjective.txt", 58.65% (892/1521)
**Evaluation note:** These built-in evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline on the non-expert tasks in the default GloVe suite.
A set of test cases tailored to scientific discourse would be needed to evaluate the arXiv-based models competitively.
#### Measuring word analogy
In a cloned GloVe repository, start via:
```
python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
1. `abelian` is to `group` as `disjoint` is to `?`
@@ -137,13 +141,13 @@ python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_fil
* Top hit: `italic_b`, cosine distance `0.902608`
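These queries follow the standard GloVe analogy arithmetic: the answer is the vocabulary word whose vector is closest, by cosine similarity, to `vec(b) - vec(a) + vec(c)`. A minimal sketch of that lookup, reusing the hypothetical `load_glove` helper from the loading example above rather than the official `word_analogy.py` implementation:

```python
import numpy as np

def closest(embeddings, query, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    q = query / np.linalg.norm(query)
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = float(np.dot(vec, q) / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

def analogy(embeddings, a, b, c):
    """Solve 'a is to b as c is to ?' via b - a + c."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    return closest(embeddings, query, exclude={a, b, c})

vectors = load_glove()  # helper from the loading sketch above
print(analogy(vectors, "abelian", "group", "disjoint"))
```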
#### Nearest word vectors
In a cloned GloVe repository, start via:
```
python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
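The same neighbour lookup can also be reproduced directly on the loaded vectors; a compact cosine-similarity ranking sketch (again assuming the hypothetical `load_glove` helper from the loading example, not the behaviour of `distance.py` itself). A sample `distance.py` session follows below.

```python
import numpy as np

def nearest(embeddings, word, topn=5):
    """Rank all other vocabulary words by cosine similarity to `word`."""
    target = embeddings[word] / np.linalg.norm(embeddings[word])
    scored = [
        (float(np.dot(vec, target) / np.linalg.norm(vec)), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scored, reverse=True)[:topn]

vectors = load_glove()  # helper from the loading sketch above
print(nearest(vectors, "lattice"))
```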
1. **lattice**
```
Word: lattice Position in vocabulary: 488
Loading