---
layout: page
title: arXMLiv 08.2017 - Word Embeddings; Token Model
---
Part of the arXMLiv project at the KWARC research group
Author
- Deyan Ginev
Release
- This page documents: 08.2017
- Latest: 08.2019
Accessibility and License
The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.
Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure Agreement, since for most arXiv articles the right of distribution was granted (or assumed) only to arXiv itself.
Contents
- A 5 billion token model for the arXMLiv 08.2017 dataset: `token_model.zip`
- 300-dimensional GloVe word embeddings for the arXMLiv 08.2017 dataset: `glove.arxmliv.5B.300d.zip` and `vocab.arxmliv.zip`
- 300d GloVe word embeddings for the individual subsets: `glove.subsets.zip`
- the main arXMLiv dataset is available separately here
Token Model Statistics
subset | documents | paragraphs | sentences |
---|---|---|---|
no_problem | 112,088 | 3,760,015 | 17,684,762 |
warning | 574,638 | 35,215,866 | 144,166,524 |
error | 401,644 | 28,555,173 | 111,798,273 |
complete | 1,088,370 | 67,531,054 | 273,649,559 |
subset | words | formulas | inline cite | numeric literals |
---|---|---|---|---|
no_problem | 355,253,671 | 17,020,161 | 2,991,053 | 9,913,009 |
warning | 2,514,340,590 | 219,167,820 | 20,163,304 | 65,294,846 |
error | 1,946,207,151 | 169,247,016 | 14,458,082 | 51,730,645 |
complete | 4,815,801,412 | 405,434,997 | 37,612,439 | 126,938,500 |
GloVe Model Statistics
subset | tokens | unique words | unique words (freq ≥ 5) |
---|---|---|---|
no_problem | 384,951,086 | 490,134 | 170,615 |
warning | 2,817,734,902 | 1,200,887 | 422,524 |
error | 2,180,119,361 | 1,889,392 | 518,609 |
complete | 5,382,805,349 | 2,573,974 | 746,673 |
Citing this Resource
Please cite the main dataset when using the word embeddings, as they are generated and distributed jointly. Instructions here
Download
Download link (SIGMathLing members only)
Generated via
Generation Parameters
- token model distributed as 3 subsets - no_problem, warning and error; the complete model is derived via:

  ```
  cat token_model_no_problem.txt \
      token_model_warning.txt \
      token_model_error.txt > token_model_complete.txt
  ```

- llamapun v0.1, `corpus_token_model` example used for token model extraction (a rough sketch of the normalization follows this list)
  - used llamapun math-aware sentence and word tokenization
  - processed logical paragraphs only (excluded non-textual modalities, e.g. tables, figures and their captions, bibliographies)
  - marked up formulas replaced with the `mathformula` token
  - marked up inline citations replaced with the `citationelement` token
  - numeric literals replaced with the `NUM` token
  - ignored sentences with unnaturally long words (>30 characters), almost always due to latexml conversion errors
- GloVe parameters:
  - `build/vocab_count -min-count 5`
  - `build/cooccur -memory 32.0 -window-size 15`
  - `build/shuffle -memory 32.0`
  - `build/glove -threads 16 -x-max 100 -iter 25 -vector-size 300 -binary 2`
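To make the normalization concrete, here is a minimal plain-text sketch of the token substitutions. The real pipeline runs llamapun's math-aware tokenizer over the arXMLiv HTML5 documents, where formulas and citations are explicit markup; the function and regular expression below are illustrative assumptions, and only the `mathformula`, `citationelement` and `NUM` token names and the 30-character cutoff are taken from the parameters above.

```python
import re

MAX_WORD_LEN = 30  # sentences containing longer "words" are discarded (latexml artifacts)
NUMERIC = re.compile(r"[+-]?\d[\d.,]*\Z")  # rough stand-in for a numeric-literal check

def normalize_sentence(sentence):
    """Return the normalized token list, or None if the sentence is rejected."""
    normalized = []
    for token in sentence.split():
        if len(token) > MAX_WORD_LEN:
            return None                       # unnaturally long word: drop the whole sentence
        if NUMERIC.match(token):
            normalized.append("NUM")          # numeric literals collapse to a single token
        else:
            normalized.append(token.lower())  # lower-cased, matching the vocabulary entries below
    return normalized

# formula and citation markup is assumed to be already replaced by placeholder tokens here
print(normalize_sentence("The group mathformula has order 42 , see citationelement ."))
# -> ['the', 'group', 'mathformula', 'has', 'order', 'NUM', ',', 'see', 'citationelement', '.']
```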
Examples and baselines
GloVe in-built evaluation (non-expert tasks, e.g. language, relationships, geography)
- no_problem
  - Total accuracy: 26.49% (3665/13833)
  - Highest score: "gram3-comparative.txt", 75.83% (1010/1332)
- warning
  - Total accuracy: 31.16% (4989/16013)
  - Highest score: "gram3-comparative.txt", 75.45% (1005/1332)
- error
  - Total accuracy: 29.63% (4997/16867)
  - Highest score: "gram3-comparative.txt", 76.58% (1020/1332)
- complete
  - Total accuracy: 32.86% (5770/17562)
  - Highest score: "gram3-comparative.txt", 78.53% (1046/1332)
- demo baseline: text8 demo (first 100M characters of Wikipedia)
  - Total accuracy: 23.91% (4262/17827)
  - Highest score: "capital-common-countries.txt", 62.65% (317/506)
Evaluation note: These in-built evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline on the non-expert tasks in the default GloVe suite. A set of test cases tailored to scientific discourse would be needed to evaluate the arXiv-based models competitively.
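The totals above come from GloVe's stock question sets and can be reproduced with the evaluation script bundled in the GloVe repository. Assuming a standard GloVe checkout and the released vocabulary and vector files in the working directory, the invocation mirrors the analogy and distance commands below:

```
python eval/python/evaluate.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
```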
Measuring word analogy
In a cloned GloVe repository, start via:
```
python eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
```
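Conceptually, the script answers "a is to b as c is to ?" by ranking the vocabulary against the offset vector vec(b) - vec(a) + vec(c). Below is a minimal numpy sketch of that query under the assumption of the released text-format vector file; the loading helper and function names are illustrative, not the script's own code.

```python
import numpy as np

def load_glove(path):
    """Load a GloVe text-format file into {word: unit-normalized vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vec = np.asarray(values, dtype=np.float32)
            vectors[word] = vec / np.linalg.norm(vec)
    return vectors

def analogy(vectors, a, b, c, topn=1):
    """Words closest (by cosine similarity) to vec(b) - vec(a) + vec(c)."""
    query = vectors[b] - vectors[a] + vectors[c]
    query /= np.linalg.norm(query)
    scores = {w: float(v @ query) for w, v in vectors.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

vectors = load_glove("glove.arxmliv.5B.300d.txt")
print(analogy(vectors, "abelian", "group", "disjoint"))  # expected: ['union']
```

Example analogies against the complete model: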
- `abelian` is to `group` as `disjoint` is to ?
  - Top hit: `union`, cosine distance 0.644784
- `convex` is to `concave` as `positive` is to ?
  - Top hit: `negative`, cosine distance 0.802866
- `finite` is to `infinite` as `abelian` is to ?
  - Top hit: `nonabelian`, cosine distance 0.664235
- `quantum` is to `classical` as `bottom` is to ?
  - Top hit: `top`, cosine distance 0.719843
- `eq` is to `proves` as `figure` is to ?
  - Top hit: `shows`, cosine distance 0.674743
Nearest word vectors
In a cloned GloVe repository, start via:
```
python eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
```
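Under the hood this is a plain cosine-similarity ranking of the vocabulary against the query word's vector (the script labels the similarity value "Cosine distance"). A minimal numpy sketch under the same assumptions as above, with the illustrative loading helper repeated so the snippet is self-contained:

```python
import numpy as np

def load_glove(path):
    """Load a GloVe text-format file into {word: unit-normalized vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vec = np.asarray(values, dtype=np.float32)
            vectors[word] = vec / np.linalg.norm(vec)
    return vectors

def nearest(vectors, word, topn=5):
    """Top-n vocabulary words ranked by cosine similarity to `word`."""
    query = vectors[word]
    scores = {w: float(v @ query) for w, v in vectors.items() if w != word}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

vectors = load_glove("glove.arxmliv.5B.300d.txt")
print(nearest(vectors, "lattice"))  # expected: lattices, honeycomb, finite, ...
```

Sample queries against the complete model: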
- `lattice` (position in vocabulary: 311)

word | cosine distance |
---|---|
lattices | 0.811057 |
honeycomb | 0.657262 |
finite | 0.625146 |
triangular | 0.608218 |
spacing | 0.605435 |

- `entanglement` (position in vocabulary: 1293)

word | cosine distance |
---|---|
entangled | 0.763964 |
multipartite | 0.730231 |
fidelity | 0.653443 |
concurrence | 0.652454 |
environemtnal | 0.646705 |
negativity | 0.646165 |
quantum | 0.639032 |
discord | 0.624222 |
nonlocality | 0.610661 |
tripartite | 0.609896 |

- `forgetful` (position in vocabulary: 10697)

word | cosine distance |
---|---|
functor | 0.723019 |
functors | 0.653969 |
morphism | 0.626222 |

- `eigenvalue` (position in vocabulary: 1212)

word | cosine distance |
---|---|
eigenvalues | 0.878527 |
eigenvector | 0.766371 |
eigenfunction | 0.761923 |
eigenvectors | 0.747451 |
eigenfunctions | 0.707346 |
eigenspace | 0.661539 |
corresponding | 0.629746 |
laplacian | 0.627187 |
operator | 0.627130 |
eigen | 0.620933 |

- `riemannian`
Word: riemannian Position in vocabulary: 2026 Word Cosine distance ----------------------------------------------------- manifold 0.766196 manifolds 0.745785 metric 0.714120 curvature 0.672975 metrics 0.670006 finsler 0.665079 ricci 0.657058 euclidean 0.650198 endowed 0.626307 riemmanian 0.621626 riemanian 0.618022