Commit 52ff3b9b authored by Deyan Ginev

update embeddings description with nomath controls

parent fe466e65
1 merge request: !6 Update embeddings description with "nomath" controls
@@ -25,32 +25,35 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.
- `glove.arxmliv.5B.300d.zip` and `vocab.arxmliv.zip`
- 300d GloVe word embeddings for individual subsets
- `glove.subsets.zip`
- Embeddings and vocabulary with math lexemes omitted: `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
- added on July 20, 2019
- used as a control when evaluating the contribution of formula lexemes
- the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082018/)
#### Token Model Statistics

| subset     | documents | paragraphs |   sentences |
| ---------- | --------: | ---------: | ----------: |
| no_problem |   137,864 |  4,646,203 |  21,533,963 |
| warning    |   705,095 | 45,797,794 | 183,246,777 |
| error      |   389,225 | 26,759,524 |  99,641,978 |
| complete   | 1,232,184 | 77,203,521 | 304,422,718 |

| subset     |         words |    formulas | inline cite | numeric literals |
| ---------- | ------------: | ----------: | ----------: | ---------------: |
| no_problem |   430,217,995 |  20,910,732 |   3,709,520 |       11,177,753 |
| warning    | 3,175,663,430 | 281,832,412 |  25,337,574 |       83,606,897 |
| error      | 1,731,971,035 | 153,186,264 |  13,145,561 |       43,399,720 |
| complete   | 5,337,852,460 | 455,929,408 |  42,192,655 |      138,184,370 |

#### GloVe Model Statistics

| subset     |         tokens | unique words | unique words (freq 5+) |
| ---------- | -------------: | -----------: | ---------------------: |
| no_problem |    622,968,267 |      715,433 |                219,304 |
| warning    |  7,203,536,205 |    3,478,235 |                666,317 |
| error      |  3,691,805,321 |    2,444,532 |                574,467 |
| complete   | 11,518,309,793 |    5,285,379 |              1,000,295 |

### Citing this Resource