SIGMathLing / website, commit 175ee721
Authored 5 years ago by Deyan Ginev
updated 2019 embeddings handling apostrophe tokenization better
Parent: 5fad79dc
Branch containing commit: fix-sidebar-gitlab-link
Merge request: !10 updated 2019 embeddings handling apostrophe tokenization better
Showing 1 changed file: resources/arxmliv-embeddings-082019.md, with 97 additions and 77 deletions
...
...
@@ -48,7 +48,7 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.
 | subset     | tokens         | unique words | unique words (freq 5+ ) |
 | ---------- | -------------: | -----------: | ----------------------: |
-| complete   | 15,214,964,673 | 2,868,070    | 1,013,106               |
+| complete   | 15,192,564,807 | 2,782,667    | 989,136                 |
 ### Citing this Resource
...
...
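For a sense of scale, the change in the `complete` row can be worked out directly from the old and new rows of the hunk above. This is a minimal sketch: the dictionary keys and the helper name `relative_change` are illustrative and not part of the resource.

```python
# Corpus statistics for the "complete" subset, copied from the old and new
# table rows of the hunk above (old row first, new row second).
old = {"tokens": 15_214_964_673, "unique_words": 2_868_070, "unique_words_freq5": 1_013_106}
new = {"tokens": 15_192_564_807, "unique_words": 2_782_667, "unique_words_freq5": 989_136}


def relative_change(before: int, after: int) -> float:
    """Return the signed change from `before` to `after` as a percentage."""
    return 100.0 * (after - before) / before


for key in old:
    delta = new[key] - old[key]
    print(f"{key}: {delta:+,d} ({relative_change(old[key], new[key]):+.2f}%)")
```

All three counts shrink slightly. The token total moves by well under one percent, while the vocabulary sizes change by a few percent.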
@@ -59,7 +59,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
 ([SIGMathLing members](/member/) only)
 ### Generated via
-- [llamapun 0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3),
+- [llamapun 0.3.4](https://github.com/KWARC/llamapun/releases/tag/0.3.4),
 - [GloVe 1.2, 2019](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
 ### Generation Parameters
...
...
@@ -71,7 +71,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
 token_model_error.txt > token_model_complete.txt
 ```
-* [llamapun v0.3.3](https://github.com/KWARC/llamapun/releases/tag/0.3.3), `corpus_token_model` example used for token model extraction
+* [llamapun v0.3.4](https://github.com/KWARC/llamapun/releases/tag/0.3.4), `corpus_token_model` example used for token model extraction
 * processed logical paragraphs, abstracts, captions and keywords, ignore all other content (e.g. tables, bibliography, others)
 * excluded paragraphs containing latexml errors (marked via a `ltx_ERROR` HTML class); also excluded when words over 25 characters were encountered.
 * used llamapun math-aware word tokenization, with sub-formula math lexemes (improved robustness since 2018)
...
...
@@ -80,6 +80,7 @@ Please cite the main dataset when using the word embeddings, as they are generat
 * internal references replaced with `ref` token (e.g. `Figure ref`)
 * textual punctuation is included as-is, while mathematical punctuation is annotated via the latexml-generated lexemes.
 * words are downcased, while math content is kept cased, to mitigate lexical ambiguity.
+* an update was provided on Sep 29, 2019, improving tokenization of apostrophe-including constructs such as `'s`.
 * [GloVe repository at sha 07d59d](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8)
 * build/vocab_count -min-count 5
...
...
@@ -91,8 +92,8 @@ Please cite the main dataset when using the word embeddings, as they are generat
 #### GloVe in-built evaluation (non-expert tasks e.g. language, relationships, geography)
 1. NEW; 2019 model
-* Total accuracy: 38.30% (7017/18322)
-* Highest score: "gram3-comparative.txt", 78.60% (1047/1332)
+* Total accuracy: 37.76% (7017/18322)
+* Highest score: "gram3-comparative.txt", 77.33% (1047/1332)
 2. 2018 [GloVe embeddings](/resources/arxmliv-embeddings-082018/)
...
...
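The percentages and the raw counts in the evaluation hunk above can be cross-checked with a line of arithmetic. The sketch below recomputes each ratio from the counts printed in parentheses; the variable names are illustrative. Both fractions reproduce the first pair of percentages (38.30% and 78.60%), so the counts printed next to 37.76% and 77.33% appear to have been carried over unchanged and may not reflect the updated run.

```python
# (correct, total) counts exactly as printed in parentheses in the hunk above.
counts = {
    "Total accuracy": (7017, 18322),
    'Highest score ("gram3-comparative.txt")': (1047, 1332),
}

# Percentages printed next to those counts, in the order they appear.
reported = {
    "Total accuracy": (38.30, 37.76),
    'Highest score ("gram3-comparative.txt")': (78.60, 77.33),
}

for name, (correct, total) in counts.items():
    recomputed = round(100.0 * correct / total, 2)
    # Keep only the reported percentages that agree with the recomputed ratio.
    matches = [p for p in reported[name] if abs(p - recomputed) < 0.005]
    print(f"{name}: {correct}/{total} = {recomputed:.2f}% (matches reported: {matches})")
```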
@@ -110,23 +111,23 @@ python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_fil
 ```
 1. `abelian` is to `group` as `disjoint` is to `?`
-* Top hit: `union`, cosine distance `0.618853`
+* Top hit: `union`, cosine distance `0.649029`
 2. `convex` is to `concave` as `positive` is to `?`
-* Top hit: `negative`, cosine distance `0.806679`
+* Top hit: `negative`, cosine distance `0.812031`
 3. `finite` is to `infinte` as `abelian` is to `?`
-* Top hit: `nonabelian`, cosine distance `0.698089`
+* Top hit: `nonabelian`, cosine distance `0.689419`
 4. `quantum` is to `classical` as `bottom` is to `?`
-* Top hit: `middle`, cosine distance `0.769180`
-* Close second: `top`, cosine distance `0.765937`
+* Top hit: `middle`, cosine distance `0.770132`
+* Close second: `top`, cosine distance `0.758245`
 5. `eq` is to `proves` as `figure` is to `?`
-* Top hit: `showing`, cosine distance `0.689938`
+* Top hit: `shows`, cosine distance `0.675003`
 6. `italic_x` is to `italic_y` as `italic_a` is to `?`
-* Top hit: `italic_b`, cosine distance `0.915467`
+* Top hit: `italic_b`, cosine distance `0.912827`
 #### Nearest word vectors
...
...
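The analogy queries in the hunk above ("a is to b as c is to ?") are commonly answered with the vector-offset method: form `b - a + c` from unit-length vectors and rank the remaining vocabulary by cosine similarity to it. The following is a minimal sketch of that idea, not the `word_analogy.py` script named in the hunk header. The toy vectors are invented for illustration, and the function name `analogy` is hypothetical. It also assumes that the scores labelled "cosine distance" behave like cosine similarities, where larger means closer.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration; they are NOT
# taken from the arXMLiv embeddings.
toy_vectors = {
    "finite": [1.0, 0.0, 0.0],
    "infinite": [1.0, 1.0, 0.0],
    "abelian": [0.0, 0.0, 1.0],
    "nonabelian": [0.0, 1.0, 1.0],
    "group": [0.2, 0.0, 0.9],
}


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(a, b, c, vectors, top_n=3):
    """Rank candidates for "a is to b as c is to ?" by cosine similarity."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    # Vector-offset query: b - a + c, then normalised to unit length.
    query = normalize([vb - va + vc for va, vb, vc in zip(unit[a], unit[b], unit[c])])
    scores = {
        w: sum(x * y for x, y in zip(query, v))
        for w, v in unit.items()
        if w not in (a, b, c)  # the three input words are excluded
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


print(analogy("finite", "infinite", "abelian", toy_vectors))
```

With these toy vectors the top-ranked candidate is `nonabelian`, mirroring the shape of query 3. The real scores depend entirely on the trained embeddings.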
@@ -139,81 +140,92 @@ python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file gl
 ```
 Word: lattice Position in vocabulary: 515
-Word Cosine distance
+Word Cosine distance
 ---------------------------------------------------------
-lattices 0.865888
+lattices 0.860839
-honeycomb 0.677004
+finite 0.662110
-finite 0.650216
+honeycomb 0.657155
-triangular 0.632165
+crystal 0.635061
-crystal 0.627800
+triangular 0.632298
-sublattice 0.619792
+spacing 0.619840
-cubic 0.609822
+square 0.613936
+sublattice 0.612161
+hexagonal 0.606321
+hypercubic 0.606101
+latter 0.602747
+symmetry 0.601192
+cubic 0.601035
 ```
 2. **entanglement**
 ```
-Word: entanglement Position in vocabulary: 1603
+Word: entanglement Position in vocabulary: 1605
-Word Cosine distance
+Word Cosine distance
 ---------------------------------------------------------
-entangled 0.803443
+entangled 0.795067
+multipartite 0.745711
-multipartite 0.744602
+concurrence 0.708164
-negativity 0.698730
+negativity 0.695089
-concurrence 0.693703
+quantum 0.666254
-tripartite 0.669840
+tripartite 0.653771
-discord 0.660572
+fidelity 0.651990
-fidelity 0.657391
+teleportation 0.639430
-quantum 0.655452
+nonlocality 0.626717
-teleportation 0.628923
+discord 0.622995
-qubits 0.627504
+qubit 0.622836
-bipartite 0.622791
+bipartite 0.614907
-entangling 0.621139
+qubits 0.613029
-nonlocality 0.619905
+entropy 0.612276
-qubit 0.615623
-entropy 0.601869
 ```
 3. **forgetful**
 ```
-Word: forgetful Position in vocabulary: 12259
+Word: forgetful Position in vocabulary: 12229
-Word Cosine distance
+Word Cosine distance
 ---------------------------------------------------------
-functor 0.749501
+functor 0.731004
-functors 0.686806
+functors 0.667090
-morphism 0.632394
+morphisms 0.605955
-morphisms 0.610589
+morphism 0.604947
 ```
...
...
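One way to summarise how a neighbour list shifts between the two runs is to compare the old and new lists as sets. The sketch below does this for `lattice`, the first query in the hunk above. It assumes the old/new attribution read off the diff (for each changed pair, the first row is the removed one), and the helper name `compare_neighbours` is illustrative.

```python
# Neighbour lists for "lattice", transcribed from the hunk above in rank order.
# The old/new split is an assumption read off the diff ordering.
old_neighbours = [
    "lattices", "honeycomb", "finite", "triangular",
    "crystal", "sublattice", "cubic",
]
new_neighbours = [
    "lattices", "finite", "honeycomb", "crystal", "triangular",
    "spacing", "square", "sublattice", "hexagonal", "hypercubic",
    "latter", "symmetry", "cubic",
]


def compare_neighbours(old, new):
    """Summarise which neighbours were kept, dropped, or gained."""
    old_set, new_set = set(old), set(new)
    return {
        "kept": [w for w in old if w in new_set],
        "dropped": [w for w in old if w not in new_set],
        "gained": [w for w in new if w not in old_set],
        # Jaccard index: size of the intersection over size of the union.
        "jaccard": len(old_set & new_set) / len(old_set | new_set),
    }


result = compare_neighbours(old_neighbours, new_neighbours)
for key, value in result.items():
    print(key, value)
```

Under that reading, every old neighbour of `lattice` is retained and the new list adds six words. The same helper can be pointed at the other queries once their lists are transcribed.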
@@ -221,80 +233,88 @@ python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file gl
 ```
 Word: eigenvalue Position in vocabulary: 1527
-Word Cosine distance
+Word Cosine distance
 ---------------------------------------------------------
-eigenvalues 0.903885
+eigenvalues 0.894346
-eigenvector 0.781512
+eigenvector 0.775584
-eigenvectors 0.774260
+eigenfunction 0.772961
-eigenfunction 0.751316
+eigenvectors 0.762914
-eigenfunctions 0.707166
+eigenfunctions 0.700270
-eigenspace 0.683321
+eigenspace 0.686408
-eigen 0.657366
+eigen 0.664881
-laplacian 0.649859
+laplacian 0.646244
-matrix 0.645466
+eigenstate 0.629338
-eigenmode 0.628024
+eigenmode 0.626229
-operator 0.620245
+largest 0.620355
-eigenmodes 0.610912
+matrix 0.618085
-largest 0.607076
+eigenmodes 0.605928
+operator 0.602806
+smallest 0.600443
-eigenstates 0.603603
 ```
 5. **riemannian**
 ```
-Word: riemannian Position in vocabulary: 2285
+Word: riemannian Position in vocabulary: 2428
-Word Cosine distance
+Word Cosine distance
 ---------------------------------------------------------
-manifold 0.771125
+manifolds 0.780788
+manifold 0.771704
+metric 0.725227
+finsler 0.686441
-manifolds 0.770408
+ricci 0.678393
-metric 0.709820
+curvature 0.677207
-finsler 0.699053
+metrics 0.660825
-curvature 0.672640
+euclidean 0.659125
-ricci 0.667813
+noncompact 0.647109
-riemmanian 0.661929
+conformally 0.643647
-euclidean 0.645167
+riemmanian 0.641671
-metrics 0.641648
+submanifold 0.632707
-submanifold 0.638131
+kahler 0.623857
-kahler 0.635828
+geodesic 0.621973
-riemanian 0.626252
+submanifolds 0.617170
-noncompact 0.623363
+endowed 0.616036
-geodesic 0.620316
+riemanian 0.608523
-submanifolds 0.613058
+hyperbolic 0.603709
-endowed 0.608804
+submersion 0.600120
-foliation 0.601818
 ```
\ No newline at end of file