Commit c09ed9ec authored by Deyan Ginev

Update embeddings description with "nomath" controls

parent fe466e65
@@ -22,35 +22,39 @@ articles, the right of distribution was only given (or assumed) to arXiv itself.
 - An 11.5 billion token model for the arXMLiv 08.2018 dataset, including subformula lexemes
 - `token_model.zip`
 - 300 dimensional GloVe word embeddings for the arXMLiv 08.2018 dataset
-- `glove.arxmliv.5B.300d.zip` and `vocab.arxmliv.zip`
+- `glove.arxmliv.11B.300d.zip` and `vocab.arxmliv.zip`
 - 300d GloVe word embeddings for individual subsets
 - `glove.subsets.zip`
+- Embeddings and vocabulary with math lexemes omitted
+- `glove.arxmliv.nomath.11B.300d.zip` and `vocab.arxmliv.nomath.zip`
+- added on July 20, 2019
+- used as a control when evaluating the contribution of formula lexemes
 - the main arXMLiv dataset is available separately [here](/resources/arxmliv-dataset-082018/)
 #### Token Model Statistics
-subset | documents | paragraphs | sentences |
------------|----------:|-----------:|------------:|
-no_problem | 137,864 | 4,646,203 | 21,533,963 |
-warning | 705,095 | 45,797,794 | 183,246,777 |
-error | 389,225 | 26,759,524 | 99,641,978 |
-complete | 1,232,184 | 77,203,521 | 304,422,718 |
+| subset | documents | paragraphs | sentences |
+| ---------- | --------: | ---------: | ----------: |
+| no_problem | 137,864 | 4,646,203 | 21,533,963 |
+| warning | 705,095 | 45,797,794 | 183,246,777 |
+| error | 389,225 | 26,759,524 | 99,641,978 |
+| complete | 1,232,184 | 77,203,521 | 304,422,718 |
-subset | words | formulas | inline cite | numeric literals |
------------|--------------:|-----------: |------------:|-----------------:|
-no_problem | 430,217,995 | 20,910,732 | 3,709,520 | 11,177,753 |
-warning | 3,175,663,430 | 281,832,412 | 25,337,574 | 83,606,897 |
-error | 1,731,971,035 | 153,186,264 | 13,145,561 | 43,399,720 |
-complete | 5,337,852,460 | 455,929,408 | 42,192,655 | 138,184,370 |
+| subset | words | formulas | inline cite | numeric literals |
+| ---------- | ------------: | ----------: | ----------: | ---------------: |
+| no_problem | 430,217,995 | 20,910,732 | 3,709,520 | 11,177,753 |
+| warning | 3,175,663,430 | 281,832,412 | 25,337,574 | 83,606,897 |
+| error | 1,731,971,035 | 153,186,264 | 13,145,561 | 43,399,720 |
+| complete | 5,337,852,460 | 455,929,408 | 42,192,655 | 138,184,370 |
 #### GloVe Model Statistics
-subset | tokens | unique words | unique words (freq 5+ )
------------|--------------: |-------------:|-----------------------:
-no_problem | 622,968,267 | 715,433 | 219,304
-warning | 7,203,536,205 | 3,478,235 | 666,317
-error | 3,691,805,321 | 2,444,532 | 574,467
-complete | 11,518,309,793 | 5,285,379 | 1,000,295
+| subset | tokens | unique words | unique words (freq 5+ ) |
+| ---------- | -------------: | -----------: | ----------------------: |
+| no_problem | 622,968,267 | 715,433 | 219,304 |
+| warning | 7,203,536,205 | 3,478,235 | 666,317 |
+| error | 3,691,805,321 | 2,444,532 | 574,467 |
+| complete | 11,518,309,793 | 5,285,379 | 1,000,295 |
 ### Citing this Resource
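
For readers who want to try the embeddings named in the hunk above, here is a minimal sketch of reading GloVe vectors and comparing two tokens. It assumes the unpacked archives contain the usual GloVe plain-text layout (one token per line, followed by its space-separated vector components); the page does not confirm the layout of the archive contents, and the file name in the comment is hypothetical. The helper names `read_glove_text` and `cosine` are illustrative only.

```python
import io
import math
from typing import Dict, Iterable, List


def read_glove_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe text-format lines: a token followed by its vector components."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Tiny in-memory stand-in for a real vectors file (3 dimensions instead of 300).
# With an unpacked archive one would instead pass an open file handle, e.g.
# open("glove.arxmliv.11B.300d.txt", encoding="utf-8")  # hypothetical file name
sample = io.StringIO("group 1.0 0.0 0.0\nring 0.8 0.6 0.0\nthe 0.0 0.0 1.0\n")
vectors = read_glove_text(sample)
similarity = cosine(vectors["group"], vectors["ring"])
```

Under the same assumption, the reader could be pointed at the "nomath" vectors as well, so that the same token pair is compared with and without formula lexemes in the training data.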