adding embeddings, slight reorganization

Commit fdf229c5 authored 7 years ago by Deyan Ginev
--- a/resources/arxmliv.md
+++ b/resources/arxmliv.md
 ---
 layout: page
-title: SIGMathLing - An HTML5 dataset for arXiv.org 
+title: arXMLiv 08.2017 - An HTML5 dataset for arXiv.org 
 ---
 Part of the [arXMLiv](https://kwarc.info/systems/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group
@@ -17,19 +17,22 @@ Access is restricted to  [SIGMathLing members](/member/) under the
 [SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
 articles, the right of distribution was only given (or assumed) to arXiv itself.
-### Generated via
- - [LaTeXML 0.8.2](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.2), 
- - [CorTeX 0.2](https://github.com/dginev/CorTeX/releases/tag/0.2.0)
 ### Contents
  - 1,088,370 HTML5 documents
  - Three separate archive bundles separated by LaTeXML conversion severity
+  - derivative **word embeddings** and a **token model** are available separately [here](/resources/arxmliv-embeddings-082017/)
-| subset ID | file name | MD5                                | number of documents | size archived | size unpacked |
+  | subset ID  | number of documents | size archived | size unpacked |
-| ---                            | ---                                | ---                 | ---           | ---           |
+  | :---       | ---:                | ---:          | ---:          |
-| no_problems| arXMLiv_08_2017_no_problem.zip | `036945755c7cc75ea1577cf04ca4fead` | 112,088             | 5 GB          | 37 GB         |
+  | no\_problem| 112,088             | 5 GB          | 37 GB         |
-| warning| arXMLiv_08_2017_warning.zip    | `c0d5c1baf626225b48264510ac4c6bd5` | 574,638             | 71 GB         | 595 GB        | 
+  | warning    | 574,638             | 71 GB         | 595 GB        | 
-| error| arXMLiv_08_2017_error.zip      | `2f4e60b993d85d30523b064c19e45733` | 401,644             | 50 GB         | 421 GB        |
+  | error      | 401,644             | 50 GB         | 421 GB        |
+  | subset file name                 | MD5                                 |
+  | :---                             | :---                                |
+  | `arXMLiv_08_2017_no_problem.zip` | `036945755c7cc75ea1577cf04ca4fead`  |
+  | `arXMLiv_08_2017_warning.zip`    | `c0d5c1baf626225b48264510ac4c6bd5`  | 
+  | `arXMLiv_08_2017_error.zip`      | `2f4e60b993d85d30523b064c19e45733`  |
 ### Description
@@ -92,3 +95,6 @@ concrete identifier.
  [Download link](https://gl.kwarc.info/SIGMathLing/dataset-arXMLiv-08-2017)
  ([SIGMathLing members](/member/) only)
+### Generated via
+ - [LaTeXML 0.8.2](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.2), 
+ - [CorTeX 0.2](https://github.com/dginev/CorTeX/releases/tag/0.2.0)
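
The MD5 table added to `resources/arxmliv.md` above lends itself to a quick integrity check after download. The following is a minimal sketch, assuming the three archives sit in one local directory; the helper names `md5_of_file` and `verify_archives` are hypothetical and not part of any arXMLiv tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the MD5 table in resources/arxmliv.md.
EXPECTED_MD5 = {
    "arXMLiv_08_2017_no_problem.zip": "036945755c7cc75ea1577cf04ca4fead",
    "arXMLiv_08_2017_warning.zip": "c0d5c1baf626225b48264510ac4c6bd5",
    "arXMLiv_08_2017_error.zip": "2f4e60b993d85d30523b064c19e45733",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte archives never sit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == checksum if path.is_file() else None
    return results
```

Calling `verify_archives("/path/to/downloads")` (the path is a placeholder) returns one entry per archive, so a partially downloaded set shows up as `None` values rather than as an exception.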
--- a/resources/arxmliv-embeddings-082017.md
+++ b/resources/arxmliv-embeddings-082017.md
+---
+layout: page
+title: arXMLiv 08.2017 - Word Embeddings; Token Model
+---
+Part of the [arXMLiv](https://kwarc.info/systems/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group
+### Author
+  Deyan Ginev
+### Current release
+ - 08.2017
+### Accessibility and License
+The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
+and tool development purposes. 
+Access is restricted to  [SIGMathLing members](/member/) under the
+[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
+articles, the right of distribution was only given (or assumed) to arXiv itself.
+### Contents
+  - A 5 billion token model for the arXMLiv 08.2017 dataset
+    - `token_model.zip`
+  - 300-dimensional GloVe word embeddings for the arXMLiv 08.2017 dataset
+    - `glove.arxmliv.5B.300d.zip` and `vocab.arxmliv.zip`
+  - subset word embeddings
+    - `glove.subset.zip`
+  - the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082017/)
+#### Token Model Statistics
+subset     | documents | paragraphs | sentences   |
+-----------|----------:|-----------:|------------:|
+no_problem |   112,088 |  3,760,015 |  17,684,762 |
+warning    |   574,638 | 35,215,866 | 144,166,524 |
+error      |   401,644 | 28,555,173 | 111,798,273 |
+complete   | 1,088,370 | 67,531,054 | 273,649,559 |
+subset     | words         |formulas    | inline cite | numeric literals |
+-----------|--------------:|-----------:|------------:|-----------------:|
+no_problem |   355,253,671 |17,020,161  |  2,991,053  |   9,913,009      |
+warning    | 2,514,340,590 |219,167,820 | 20,163,304  |  65,294,846      |
+error      | 1,946,207,151 |169,247,016 | 14,458,082  |  51,730,645      |
+complete   | 4,815,801,412 |405,434,997 | 37,612,439  | 126,938,500      |
+#### GloVe Model Statistics
+subset     | tokens        | unique words | unique words (freq 5+ )
+-----------|--------------:|-------------:|-----------------------:
+no_problem |   384,951,086 |      490,134 | 170,615 
+warning    | 2,817,734,902 |    1,200,887 | 422,524
+error      | 2,180,119,361 |    1,889,392 | 518,609
+complete   | 5,382,805,349 |    2,573,974 | 746,673
+### Citing this Resource
+Please cite the main dataset when using the word embeddings, as they are generated and distributed jointly. [Instructions here](/resources/arxmliv-dataset-082017/#citing-this-resource)
+### Download
+  [Download link](https://gl.kwarc.info/SIGMathLing/embeddings-arXMLiv-08-2017)
+  ([SIGMathLing members](/member/) only)
+### Generated via
+ - [llamapun 0.1](https://github.com/KWARC/llamapun/releases/tag/0.1), 
+ - [GloVe 1.2](https://github.com/stanfordnlp/GloVe/tree/765074642a6544e47849bb85d8dc2e11e44c2922)
+### Generation Parameters
+ * The token model is distributed as 3 subsets: no_problem, warning and error. The complete model is derived by concatenating them:
+    ```
+      cat token_model_no_problem.txt \
+          token_model_warning.txt \
+          token_model_error.txt > token_model_complete.txt
+    ```
+ * [llamapun v0.1](https://github.com/KWARC/llamapun/releases/tag/0.1), `corpus_token_model` example used for token model extraction
+   * used llamapun math-aware sentence and word tokenization
+   * processed logical paragraphs only (excluded non-textual modalities, e.g. tables, figures and their captions, bibliographies)
+   * marked up formulas replaced with `mathformula` token
+   * marked up inline citations replaced with `citationelement` token
+   * numeric literals replaced with `NUM` token
+   * ignored sentences with unnaturally long words (>30 characters), which are almost always due to LaTeXML conversion errors
+ * [GloVe repository at sha 76507](https://github.com/stanfordnlp/GloVe/tree/765074642a6544e47849bb85d8dc2e11e44c2922)
+   * build/vocab_count -min-count 5
+   * build/cooccur -memory 32.0 -window-size 15
+   * build/shuffle -memory 32.0
+   * build/glove -threads 16 -x-max 100 -iter 25 -vector-size 300 -binary 2
+### Examples and baselines
+#### GloVe in-built evaluation (non-expert tasks e.g. language, relationships, geography)
+1. no_problem
+  * Total accuracy: 26.49%  (3665/13833)
+  * Highest score: "gram3-comparative.txt", 75.83% (1010/1332)
+2. warning
+  * Total accuracy: 31.16%  (4989/16013)
+  * Highest score: "gram3-comparative.txt", 75.45% (1005/1332)
+3. error
+  * Total accuracy: 29.63%  (4997/16867)
+  * Highest score: "gram3-comparative.txt", 76.58% (1020/1332)
+4. complete
+  * Total accuracy: 32.86%  (5770/17562)
+  * Highest score: "gram3-comparative.txt", 78.53% (1046/1332)
+5. demo baseline: text8 demo (first 100M characters of Wikipedia)
+  * Total accuracy: 23.91%  (4262/17827)
+  * Highest score: "capital-common-countries.txt", 62.65% (317/506)
+**Evaluation note:** These in-built evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline against the non-expert tasks in the default GloVe suite. 
+One would need a set of test cases tailored to scientific discourse to evaluate the arXiv-based models competitively.
+#### Measuring word analogy
+In a cloned GloVe repository, start via:
+```
+python eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt 
+```
+1. `abelian` is to `group` as `disjoint` is to `?`
+  * Top hit: `union`, cosine distance `0.644784`
+2. `convex` is to `concave` as `positive` is to `?`
+  * Top hit: `negative`, cosine distance `0.802866`
+3. `finite` is to `infinite` as `abelian` is to `?`
+  * Top hit: `nonabelian`, cosine distance `0.664235`
+4. `quantum` is to `classical` as `bottom` is to `?`
+  * Top hit: `top`, cosine distance `0.719843`
+5. `eq` is to `proves` as `figure` is to `?`
+  * Top hit: `shows`, cosine distance `0.674743`
+#### Nearest word vectors 
+In a cloned GloVe repository, start via:
+```
+python eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt 
+```
+1. **lattice** 
+    ```
+    Word: lattice  Position in vocabulary: 311
+                                   Word   Cosine distance
+    -----------------------------------------------------
+                               lattices		0.811057
+                              honeycomb		0.657262
+                                 finite		0.625146
+                             triangular		0.608218
+                                spacing		0.605435
+    ```
+2. **entanglement**
+    ```
+    Word: entanglement  Position in vocabulary: 1293
+                                   Word   Cosine distance
+    -----------------------------------------------------
+                              entangled		0.763964
+                           multipartite		0.730231
+                               fidelity		0.653443
+                            concurrence		0.652454
+                          environemtnal		0.646705
+                             negativity		0.646165
+                                quantum		0.639032
+                                discord		0.624222
+                            nonlocality		0.610661
+                             tripartite		0.609896
+    ```
+3. **forgetful**
+    ```
+    Word: forgetful  Position in vocabulary: 10697
+                                   Word   Cosine distance
+    -----------------------------------------------------
+                                functor		0.723019
+                               functors		0.653969
+                               morphism		0.626222
+    ```
+4. **eigenvalue**
+    ```
+    Word: eigenvalue  Position in vocabulary: 1212
+                                   Word   Cosine distance
+    -----------------------------------------------------
+                            eigenvalues		0.878527
+                            eigenvector		0.766371
+                          eigenfunction		0.761923
+                           eigenvectors		0.747451
+                         eigenfunctions		0.707346
+                             eigenspace		0.661539
+                          corresponding		0.629746
+                              laplacian		0.627187
+                               operator		0.627130
+                                  eigen		0.620933
+    ```
+5. **riemannian**
+    ```
+    Word: riemannian  Position in vocabulary: 2026
+                                   Word   Cosine distance
+    -----------------------------------------------------
+                               manifold		0.766196
+                              manifolds		0.745785
+                                 metric		0.714120
+                              curvature		0.672975
+                                metrics		0.670006
+                                finsler		0.665079
+                                  ricci		0.657058
+                              euclidean		0.650198
+                                endowed		0.626307
+                             riemmanian		0.621626
+                              riemanian		0.618022
+    ```
\ No newline at end of file
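
The nearest-vector and analogy listings in the new embeddings page can be approximated without the GloVe scripts. The sketch below assumes the unpacked embeddings use the usual GloVe text layout (one word per line, followed by its space-separated vector components). The function names are hypothetical, and the pure-Python loops are for illustration only; for the full 300-dimensional model a NumPy matrix would be far more practical.

```python
import math


def load_glove_text(lines):
    """Parse GloVe text-format lines ('word v1 v2 ...') into {word: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def unit(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine(u, v):
    """Cosine similarity of two vectors of equal dimension."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def nearest(vectors, word, top_n=10):
    """Rank all other vocabulary words by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(w, cosine(target, vec)) for w, vec in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


def analogy(vectors, a, b, c, top_n=1):
    """'a is to b as c is to ?': rank words by similarity to b - a + c."""
    ua, ub, uc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    query = [y - x + z for x, y, z in zip(ua, ub, uc)]
    scored = [(w, cosine(query, vec)) for w, vec in vectors.items()
              if w not in (a, b, c)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
```

With the real files, `load_glove_text(open("glove.arxmliv.5B.300d.txt", encoding="utf-8"))` would build the dictionary, after which `nearest(vectors, "lattice")` corresponds to the first listing above. Exact scores may differ slightly from the GloVe scripts, which apply their own normalization.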
--- a/resources/index.md
+++ b/resources/index.md
@@ -3,6 +3,8 @@ layout: page
 title: SIGMathLing - Datasets and Resources
 ---
- 1. [arXMLiv corpus, 08.2017 release](/resources/arxmliv/)
+ 1. [arXMLiv corpus, 08.2017 release](/resources/arxmliv-dataset-082017/)
+ 2. [arXMLiv word embeddings, 08.2017 release](/resources/arxmliv-embeddings-082017/)
 Additional resources are en route, see the [plan](/technical/) for details.
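
For readers of the new embeddings entry, the token-level rules listed under Generation Parameters in `resources/arxmliv-embeddings-082017.md` (numeric literals become `NUM`, sentences containing words longer than 30 characters are dropped) can be illustrated as follows. This is a rough sketch and not the llamapun implementation: the regular expression for numeric literals and the function name are assumptions, and the input is taken to be an already tokenized sentence in which formulas and citations have been replaced upstream by `mathformula` and `citationelement`.

```python
import re

MAX_WORD_LENGTH = 30  # threshold stated in the generation parameters

# Assumed shape of a numeric literal: optional sign, digits, and optional
# '.' or ',' separated digit groups (e.g. 42, -3.14, 1,000).
NUMERIC_LITERAL = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def normalize_sentence(words):
    """Return the normalized token list, or None if the sentence is ignored."""
    if any(len(word) > MAX_WORD_LENGTH for word in words):
        return None  # unnaturally long word: likely a conversion error
    return ["NUM" if NUMERIC_LITERAL.fullmatch(word) else word for word in words]
```

A sentence such as `["we", "obtain", "42", "solutions", "of", "mathformula"]` keeps its placeholder tokens and has only the literal `42` rewritten to `NUM`.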