Commit c09ed9ec authored by Deyan Ginev

Update embeddings description with "nomath" controls

parent fe466e65
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
articles, the right of distribution was only given (or assumed) to arXiv itself.
- An 11.5 billion token model for the arXMLiv 08.2018 dataset, including subformula lexemes
- `token_model.zip`
- 300 dimensional GloVe word embeddings for the arXMLiv 08.2018 dataset
- `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip`
- 300d GloVe word embeddings for individual subsets
- `glove.subsets.zip`
- Embeddings and vocabulary with math lexemes omitted
- `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
- added on July 20, 2019
- used as a control when evaluating the contribution of formula lexemes
- the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082018/)
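The embeddings and vocabulary files use GloVe's plain-text vector format: one token per line, followed by its space-separated vector components. A minimal loading sketch (the helper name `load_glove` is ours, not part of the distribution):

```python
import numpy as np

def load_glove(lines):
    """Parse GloVe's plain-text vector format: one token per line,
    followed by its space-separated vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Typical use with the files from this resource:
# with open("glove.arxmliv.11B.300d.txt", encoding="utf-8") as fh:
#     embeddings = load_glove(fh)  # each value is a 300d vector
```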
#### Token Model Statistics
| subset | documents | paragraphs | sentences |
| ---------- | --------: | ---------: | ----------: |
| no_problem | 137,864 | 4,646,203 | 21,533,963 |
| warning | 705,095 | 45,797,794 | 183,246,777 |
| error | 389,225 | 26,759,524 | 99,641,978 |
| complete | 1,232,184 | 77,203,521 | 304,422,718 |
| subset | words | formulas | inline cite | numeric literals |
| ---------- | ------------: | ----------: | ----------: | ---------------: |
| no_problem | 430,217,995 | 20,910,732 | 3,709,520 | 11,177,753 |
| warning | 3,175,663,430 | 281,832,412 | 25,337,574 | 83,606,897 |
| error | 1,731,971,035 | 153,186,264 | 13,145,561 | 43,399,720 |
| complete | 5,337,852,460 | 455,929,408 | 42,192,655 | 138,184,370 |
#### GloVe Model Statistics
| subset     | tokens         | unique words | unique words (freq 5+) |
| ---------- | -------------: | -----------: | ----------------------: |
| no_problem | 622,968,267 | 715,433 | 219,304 |
| warning | 7,203,536,205 | 3,478,235 | 666,317 |
| error | 3,691,805,321 | 2,444,532 | 574,467 |
| complete | 11,518,309,793 | 5,285,379 | 1,000,295 |
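The "freq 5+" column reflects GloVe's minimum-count cutoff when building the vocabulary: tokens seen fewer than five times are dropped before training. An illustrative sketch of that thresholding (not the actual build tooling, which is GloVe's `vocab_count` utility):

```python
from collections import Counter

def thresholded_vocab(tokens, min_count=5):
    """Keep only tokens occurring at least min_count times,
    mirroring the "freq 5+" cutoff in the table above."""
    counts = Counter(tokens)
    return {tok: n for tok, n in counts.items() if n >= min_count}
```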
### Citing this Resource
Please cite the main dataset when using the word embeddings, as they are generated from it.
### Download
[Download link](https://gl.kwarc.info/SIGMathLing/embeddings-arXMLiv-08-2018)
([SIGMathLing members](/member/) only)
### Generated via
- [llamapun 0.2.0](https://github.com/KWARC/llamapun/releases/tag/0.2.0),
- [GloVe 1.2, 2018](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
### Generation Parameters
* the token model is distributed as 3 subsets (no_problem, warning, error); the complete model is derived via:
```
cat token_model_no_problem.txt \
token_model_warning.txt \
    token_model_error.txt > token_model_complete.txt
```
* Highest score: "gram3-comparative.txt", 72.75% (969/1332)
4. complete
* Total accuracy: 35.48% (6298/17750)
* Highest score: "gram3-comparative.txt", 76.65% (1021/1332)
5. demo baseline: text8 demo (first 100M characters of Wikipedia)
* Total accuracy: 23.62% (4211/17827)
* Highest score: "gram6-nationality-adjective.txt", 58.65% (892/1521)
**Evaluation note:** These built-in evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline on the non-expert tasks in the default GloVe suite.
A set of test cases tailored to scientific discourse would be needed to evaluate the arXiv-based models competitively.
#### Measuring word analogy
In a cloned GloVe repository, start via:
```
python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
1. `abelian` is to `group` as `disjoint` is to `?`
* Top hit: `italic_b`, cosine distance `0.902608`
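The analogy queries above rely on vector arithmetic: `word_analogy.py` searches for the vocabulary word whose vector is most similar to `vec(b) - vec(a) + vec(c)`, excluding the query words themselves. A minimal re-implementation sketch over a dict of vectors (function name is ours):

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Return the word nearest (by cosine similarity) to
    vec(b) - vec(a) + vec(c), excluding the query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, float("-inf")
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the query words, as the GloVe script does
        sim = float(np.dot(vec, target) / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word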
#### Nearest word vectors
In a cloned GloVe repository, start via:
```
python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.11B.300d.txt
```
1. **lattice**
```
Word: lattice  Position in vocabulary: 488
...
```
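`distance.py` ranks the whole vocabulary by cosine similarity to the query word, as in the truncated `lattice` listing above. A minimal sketch of that ranking (helper name is ours):

```python
import numpy as np

def nearest_words(vectors, query, k=5):
    """Rank vocabulary words by cosine similarity to the query
    word's vector and return the k closest, excluding the query."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scored = [(float(np.dot(vec, q) / np.linalg.norm(vec)), word)
              for word, vec in vectors.items() if word != query]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]
```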