Commit fdf229c5 authored by Deyan Ginev

adding embeddings, slight reorganization

parent 87593b42
---
layout: page
title: SIGMathLing - An HTML5 dataset for arXiv.org
title: arXMLiv 08.2017 - An HTML5 dataset for arXiv.org
---
Part of the [arXMLiv](https://kwarc.info/systems/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group
......@@ -17,19 +17,22 @@ Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
articles, the right of distribution was only given (or assumed) to arXiv itself.
### Generated via
- [LaTeXML 0.8.2](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.2),
- [CorTeX 0.2](https://github.com/dginev/CorTeX/releases/tag/0.2.0)
### Contents
- 1,088,370 HTML5 documents
- Three separate archive bundles separated by LaTeXML conversion severity
- derivative **word embeddings** and a **token model** are available separately [here](/resources/arxmliv-embeddings-082017/)
| subset ID | file name | MD5 | number of documents | size archived | size unpacked |
| --- | --- | --- | --- | --- |
| no_problems| arXMLiv_08_2017_no_problem.zip | `036945755c7cc75ea1577cf04ca4fead` | 112,088 | 5 GB | 37 GB |
| warning| arXMLiv_08_2017_warning.zip | `c0d5c1baf626225b48264510ac4c6bd5` | 574,638 | 71 GB | 595 GB |
| error| arXMLiv_08_2017_error.zip | `2f4e60b993d85d30523b064c19e45733` | 401,644 | 50 GB | 421 GB |
| subset ID | number of documents | size archived | size unpacked |
| :--- | ---: | ---: | ---: |
| no\_problem| 112,088 | 5 GB | 37 GB |
| warning | 574,638 | 71 GB | 595 GB |
| error | 401,644 | 50 GB | 421 GB |
| subset file name | MD5 |
| :--- | :--- |
| `arXMLiv_08_2017_no_problem.zip` | `036945755c7cc75ea1577cf04ca4fead` |
| `arXMLiv_08_2017_warning.zip` | `c0d5c1baf626225b48264510ac4c6bd5` |
| `arXMLiv_08_2017_error.zip` | `2f4e60b993d85d30523b064c19e45733` |
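
After downloading, the archives can be checked against the MD5 values in the table. The following is a minimal sketch, assuming the archives sit in the current directory under the file names listed above; the helper names `md5_of` and `verify` are illustrative and not part of the dataset tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the table above.
EXPECTED_MD5 = {
    "arXMLiv_08_2017_no_problem.zip": "036945755c7cc75ea1577cf04ca4fead",
    "arXMLiv_08_2017_warning.zip": "c0d5c1baf626225b48264510ac4c6bd5",
    "arXMLiv_08_2017_error.zip": "2f4e60b993d85d30523b064c19e45733",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each archive found in `directory` to True (match) or False (mismatch)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():  # archives that are not present are skipped
            results[name] = md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Reading in chunks matters here because the largest archive is 71 GB and would not fit in memory if read at once.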
### Description
......@@ -92,3 +95,6 @@ concrete identifier.
[Download link](https://gl.kwarc.info/SIGMathLing/dataset-arXMLiv-08-2017)
([SIGMathLing members](/member/) only)
### Generated via
- [LaTeXML 0.8.2](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.2),
- [CorTeX 0.2](https://github.com/dginev/CorTeX/releases/tag/0.2.0)
---
layout: page
title: arXMLiv 08.2017 - Word Embeddings; Token Model
---
Part of the [arXMLiv](https://kwarc.info/systems/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group
### Author
Deyan Ginev
### Current release
- 08.2017
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
articles, the right of distribution was only given (or assumed) to arXiv itself.
### Contents
- A 5 billion token model for the arXMLiv 08.2017 dataset
- `glove.arxmliv.5B.300d.zip` and `vocab.arxmliv.zip`
- 300 dimensional GloVe word embeddings for the arXMLiv 08.2017 dataset
- `token_model.zip`
- subset word embeddings
- `glove.subset.zip`
- the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082017/)
#### Token Model Statistics
subset | documents | paragraphs | sentences |
-----------|----------:|-----------:|------------:|
no_problem | 112,088 | 3,760,015 | 17,684,762 |
warning | 574,638 | 35,215,866 | 144,166,524 |
error | 401,644 | 28,555,173 | 111,798,273 |
complete | 1,088,370 | 67,531,054 | 273,649,559 |
subset | words |formulas | inline cite | numeric literals |
-----------|--------------:|-----------:|------------:|-----------------:|
no_problem | 355,253,671 |17,020,161 | 2,991,053 | 9,913,009 |
warning | 2,514,340,590 |219,167,820 | 20,163,304 | 65,294,846 |
error | 1,946,207,151 |169,247,016 | 14,458,082 | 51,730,645 |
complete | 4,815,801,412 |405,434,997 | 37,612,439 | 126,938,500 |
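
As a quick consistency check, each `complete` figure in the token model tables should equal the sum of the three subset figures. The sketch below copies the numbers from the tables above and compares the column sums; the variable names are illustrative.

```python
# Token model statistics copied from the two tables above.
COLUMNS = ("documents", "paragraphs", "sentences", "words",
           "formulas", "inline cite", "numeric literals")
SUBSETS = {
    "no_problem": (112_088, 3_760_015, 17_684_762, 355_253_671,
                   17_020_161, 2_991_053, 9_913_009),
    "warning": (574_638, 35_215_866, 144_166_524, 2_514_340_590,
                219_167_820, 20_163_304, 65_294_846),
    "error": (401_644, 28_555_173, 111_798_273, 1_946_207_151,
              169_247_016, 14_458_082, 51_730_645),
}
COMPLETE = (1_088_370, 67_531_054, 273_649_559, 4_815_801_412,
            405_434_997, 37_612_439, 126_938_500)

# Sum each column across the three subsets.
totals = tuple(sum(column) for column in zip(*SUBSETS.values()))

for name, total, reported in zip(COLUMNS, totals, COMPLETE):
    status = "matches" if total == reported else "differs from"
    print(f"{name}: subset sum {total:,} {status} reported {reported:,}")
```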
#### GloVe Model Statistics
subset | tokens | unique words | unique words (freq 5+ )
-----------|--------------:|-------------:|-----------------------:
no_problem | 384,951,086 | 490,134 | 170,615
warning | 2,817,734,902 | 1,200,887 | 422,524
error | 2,180,119,361 | 1,889,392 | 518,609
complete | 5,382,805,349 | 2,573,974 | 746,673
### Citing this Resource
Please cite the main dataset when using the word embeddings, as they are generated and distributed jointly. [Instructions here](/resources/arxmliv-dataset-082017/#citing-this-resource)
### Download
[Download link](https://gl.kwarc.info/SIGMathLing/embeddings-arXMLiv-08-2017)
([SIGMathLing members](/member/) only)
### Generated via
- [llamapun 0.1](https://github.com/KWARC/llamapun/releases/tag/0.1),
- [GloVe 1.2](https://github.com/stanfordnlp/GloVe/tree/765074642a6544e47849bb85d8dc2e11e44c2922)
### Generation Parameters
* the token model is distributed as 3 subsets: no_problem, warning and error. The complete model is derived via:
```
cat token_model_no_problem.txt \
token_model_warning.txt \
token_model_error.txt > token_model_complete.txt
```
* [llamapun v0.1](https://github.com/KWARC/llamapun/releases/tag/0.1), `corpus_token_model` example used for token model extraction
* used llamapun math-aware sentence and word tokenization
* processed logical paragraphs only (excluded non-textual modalities, e.g. tables, figures and their captions, bibliographies)
* marked up formulas replaced with `mathformula` token
* marked up inline citations replaced with `citationelement` token
* numeric literals replaced with `NUM` token
    * ignored sentences with unnaturally long words (>30 characters), which are almost always due to LaTeXML conversion errors
* [GloVe repository at sha 76507](https://github.com/stanfordnlp/GloVe/tree/765074642a6544e47849bb85d8dc2e11e44c2922)
* build/vocab_count -min-count 5
* build/cooccur -memory 32.0 -window-size 15
* build/shuffle -memory 32.0
* build/glove -threads 16 -x-max 100 -iter 25 -vector-size 300 -binary 2
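
The numeric-literal replacement and the long-word filter listed above can be approximated in a few lines. This is a rough sketch, not llamapun's implementation: it assumes sentences arrive already tokenized into words, the regular expression for a "numeric literal" is a guess, and the function names are hypothetical. The `mathformula` and `citationelement` substitutions operate on document markup and are not reproduced here.

```python
import re

# Assumption: a numeric literal is an optionally signed integer or decimal.
# llamapun's actual rule may be broader or narrower.
NUMERIC = re.compile(r"[+-]?\d+(?:[.,]\d+)*")
MAX_WORD_LENGTH = 30  # threshold stated in the list above


def normalize_sentence(words):
    """Return the normalized word list, or None if the sentence is dropped."""
    if any(len(word) > MAX_WORD_LENGTH for word in words):
        return None  # likely a conversion error, so skip the whole sentence
    return ["NUM" if NUMERIC.fullmatch(word) else word for word in words]


def normalize_corpus(sentences):
    """Yield one space-joined line per kept sentence."""
    for words in sentences:
        kept = normalize_sentence(words)
        if kept is not None:
            yield " ".join(kept)


if __name__ == "__main__":
    sample = [["we", "use", "42", "samples"], ["x" * 31, "is", "noise"]]
    for line in normalize_corpus(sample):
        print(line)
```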
### Examples and baselines
#### GloVe in-built evaluation (non-expert tasks e.g. language, relationships, geography)
1. no_problem
* Total accuracy: 26.49% (3665/13833)
* Highest score: "gram3-comparative.txt", 75.83% (1010/1332)
2. warning
* Total accuracy: 31.16% (4989/16013)
* Highest score: "gram3-comparative.txt", 75.45% (1005/1332)
3. error
* Total accuracy: 29.63% (4997/16867)
* Highest score: "gram3-comparative.txt", 76.58% (1020/1332)
4. complete
* Total accuracy: 32.86% (5770/17562)
* Highest score: "gram3-comparative.txt", 78.53% (1046/1332)
5. demo baseline: text8 demo (first 100M characters of Wikipedia)
* Total accuracy: 23.91% (4262/17827)
* Highest score: "capital-common-countries.txt", 62.65% (317/506)
**Evaluation note:** These in-built evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline against the non-expert tasks in the default GloVe suite.
One would need a set of test cases tailored to scientific discourse to evaluate the arXiv-based models competitively.
#### Measuring word analogy
In a cloned GloVe repository, start via:
```
python eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
```
1. `abelian` is to `group` as `disjoint` is to `?`
* Top hit: `union`, cosine distance `0.644784`
2. `convex` is to `concave` as `positive` is to `?`
* Top hit: `negative`, cosine distance `0.802866`
3. `finite` is to `infinte` as `abelian` is to `?`
* Top hit: `nonabelian`, cosine distance `0.664235`
4. `quantum` is to `classical` as `bottom` is to `?`
* Top hit: `top`, cosine distance `0.719843`
5. `eq` is to `proves` as `figure` is to `?`
* Top hit: `shows`, cosine distance `0.674743`
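
For readers who want to see the arithmetic behind such queries, here is a minimal sketch of the usual vector-offset approach: rank all words by cosine similarity to `b - a + c` and exclude the query words. It is an approximation of what an analogy script does, not a copy of the GloVe code; the `analogy` function and the toy 2-d vectors are invented for illustration.

```python
import numpy as np


def analogy(vectors, a, b, c, top_n=1):
    """Answer "a is to b as c is to ?" by ranking words near b - a + c."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words], dtype=float)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # unit-length rows
    index = {w: i for i, w in enumerate(words)}

    target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
    target /= np.linalg.norm(target)

    scores = matrix @ target  # cosine similarity to every word
    for w in (a, b, c):
        scores[index[w]] = -np.inf  # never return a query word
    best = np.argsort(-scores)[:top_n]
    return [(words[i], float(scores[i])) for i in best]


# Toy 2-d vectors, invented for illustration only.
toy = {
    "convex": [1.0, 1.0],
    "concave": [1.0, -1.0],
    "positive": [2.0, 1.0],
    "negative": [2.0, -1.0],
    "lattice": [-1.0, 0.2],
}
print(analogy(toy, "convex", "concave", "positive"))
```

With the real embeddings, `vectors` would be built from `glove.arxmliv.5B.300d.txt` instead of the toy dictionary.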
#### Nearest word vectors
In a cloned GloVe repository, start via:
```
python eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
```
1. **lattice**
```
Word: lattice Position in vocabulary: 311
Word Cosine distance
-----------------------------------------------------
lattices 0.811057
honeycomb 0.657262
finite 0.625146
triangular 0.608218
spacing 0.605435
```
2. **entanglement**
```
Word: entanglement Position in vocabulary: 1293
Word Cosine distance
-----------------------------------------------------
entangled 0.763964
multipartite 0.730231
fidelity 0.653443
concurrence 0.652454
environemtnal 0.646705
negativity 0.646165
quantum 0.639032
discord 0.624222
nonlocality 0.610661
tripartite 0.609896
```
3. **forgetful**
```
Word: forgetful Position in vocabulary: 10697
Word Cosine distance
-----------------------------------------------------
functor 0.723019
functors 0.653969
morphism 0.626222
```
4. **eigenvalue**
```
Word: eigenvalue Position in vocabulary: 1212
Word Cosine distance
-----------------------------------------------------
eigenvalues 0.878527
eigenvector 0.766371
eigenfunction 0.761923
eigenvectors 0.747451
eigenfunctions 0.707346
eigenspace 0.661539
corresponding 0.629746
laplacian 0.627187
operator 0.627130
eigen 0.620933
```
5. **riemannian**
```
Word: riemannian Position in vocabulary: 2026
Word Cosine distance
-----------------------------------------------------
manifold 0.766196
manifolds 0.745785
metric 0.714120
curvature 0.672975
metrics 0.670006
finsler 0.665079
ricci 0.657058
euclidean 0.650198
endowed 0.626307
riemmanian 0.621626
riemanian 0.618022
```
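
The same kind of lookup can be done outside the GloVe repository. The sketch below assumes the embeddings are in GloVe's plain-text format (one word per line, followed by its space-separated vector components); `load_glove` and `nearest` are hypothetical helper names, and the `limit` parameter is only a convenience for keeping memory use down.

```python
import numpy as np


def load_glove(path, limit=None):
    """Read GloVe's text format: one word per line followed by its vector."""
    words, rows = [], []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if limit is not None and line_number >= limit:
                break
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            rows.append(np.array([float(x) for x in parts[1:]], dtype=np.float32))
    return words, np.vstack(rows)


def nearest(words, matrix, query, top_n=10):
    """Return the top_n words with the highest cosine similarity to `query`."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    position = words.index(query)
    scores = unit @ unit[position]
    scores[position] = -np.inf  # exclude the query word itself
    best = np.argsort(-scores)[:top_n]
    return [(words[i], float(scores[i])) for i in best]
```

A call such as `words, matrix = load_glove("glove.arxmliv.5B.300d.txt", limit=100_000)` followed by `nearest(words, matrix, "lattice")` would then list neighbours of `lattice` among the first 100,000 rows of the file.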
\ No newline at end of file
......@@ -3,6 +3,8 @@ layout: page
title: SIGMathLing - Datasets and Resources
---
1. [arXMLiv corpus, 08.2017 release](/resources/arxmliv/)
1. [arXMLiv corpus, 08.2017 release](/resources/arxmliv-dataset-082017/)
2. [arXMLiv word embeddings, 08.2017 release](/resources/arxmliv-embeddings-082017)
Additional resources are en route, see the [plan](/technical/) for details.