Commit 7f9ad30d authored by Luis

add description of file structure of the datasets

parent d5d99003
Pipeline #3610 passed
@@ -7,10 +7,13 @@ title: ArGoT 2021 - arXiv Glossary of Terms
- This page documents: ArGoT 2021 (latest)
### Contents
- 5,023 compressed XML files using the arXiv's naming convention.
- 881,301 articles.
- 800 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
- The XML sources total `500 MB` packaged, and `2.1 TB` unpacked.
- NN.v1 directory:
    - 789,896 term-definition pairs.
    - 2,816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
- SGD.v3 directory:
    - 943,006 term-definition pairs.
    - 2,816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
- The XML sources total `521 MB` packaged as `.tar.gz` archives.
### Download
- [Download link](https://gl.kwarc.info/SIGMathLing/dataset-argot-2021)
@@ -20,17 +23,38 @@ title: ArGoT 2021 - arXiv Glossary of Terms
This is the first public release of the ArGoT dataset generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.
ArGoT is a dataset of term-definition pairs automatically extracted from mathematical papers on arXiv.
It is comprised of XML files with the following tags and attributes:
- article: arXiv article entry
- name: link to the article in the arXiv
- num: number of paragraphs in the article
- definition: a paragraph labeled as a definition by the ML classifier
- index: paragraph number inside the article
- dfndum: the term (definiendum) found in the statement of the definition.
Two independently extracted versions of the dataset are provided:
- NN: Neural network approach using an LSTM for classification and an LSTM-CRF for sequence tagging.
- SGD: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
- **NN.v1**: Neural network approach using an LSTM for classification and an LSTM-CRF for sequence tagging.
- **SGD.v3**: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
Both datasets have the same file structure:
```
SGD.v3/
├── math00
│ ├── 0001_001.xml.gz
│ ├── 0002_001.xml.gz
│ ├── 0003_001.xml.gz
.
.
.
├── math01
│ ├── 0101_001.xml.gz
│ ├── 0102_001.xml.gz
│ ├── 0103_001.xml.gz
.
.
.
```
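The layout above can be walked with a few lines of Python. This is only a sketch: the function name `list_archives` and the file-name pattern (four `yymm` digits, an underscore, a running number, then `.xml.gz`) are assumptions inferred from the tree shown, not part of the dataset's documentation.

```python
import re
from pathlib import Path

# Assumed file-name pattern, inferred from the tree above:
# four digits (yymm), an underscore, a running number, then ".xml.gz".
NAME_RE = re.compile(r"^(?P<yymm>\d{4})_(?P<num>\d+)\.xml\.gz$")


def list_archives(root):
    """Return (subdirectory, yymm, num, path) for each matching file under root."""
    entries = []
    for path in sorted(Path(root).glob("*/*.xml.gz")):
        match = NAME_RE.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        entries.append((path.parent.name, match["yymm"], int(match["num"]), path))
    return entries
```

Calling `list_archives("SGD.v3")` on an unpacked copy would then yield one tuple per file, for example the subdirectory `math00` together with the `yymm` prefix `0001`.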
Each dataset consists of XML files with the following tags and attributes:
- _article_: arXiv article entry
    - _name_: link to the article on arXiv
    - _num_: number of paragraphs in the article
- _definition_: a paragraph labeled as a definition by the ML classifier
    - _index_: paragraph number inside the article
- _dfndum_: the term (definiendum) found in the statement of the definition.
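A file with these tags could be read roughly as follows. The sketch makes several assumptions that the description above does not confirm: that `name` and `num` are attributes of `article`, that `index` is an attribute of `definition`, that `dfndum` elements sit inside `definition`, and that the definition text is the text content of the `definition` element. The function name `read_definitions` is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def read_definitions(path):
    """Yield (article_name, paragraph_index, terms, text) from one .xml.gz file.

    Assumes <article name=... num=...> elements containing
    <definition index=...> elements, which in turn contain <dfndum> terms.
    """
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    # iter() also matches the root itself, so the root tag does not matter.
    for article in root.iter("article"):
        for definition in article.iter("definition"):
            terms = [(d.text or "").strip() for d in definition.iter("dfndum")]
            # Join all nested text and normalize whitespace.
            text = " ".join("".join(definition.itertext()).split())
            yield article.get("name"), definition.get("index"), terms, text
```

Attribute values come back as strings, so `index` needs an explicit `int(...)` conversion if it is used for arithmetic or sorting.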
### Citing this Resource