Commit 7f9ad30d authored by Luis

add description of file structure of the datasets

parent d5d99003
Pipeline #3610 passed with stage in 1 minute and 12 seconds
......@@ -7,10 +7,13 @@ title: ArGoT 2021 - arXiv Glossary of Terms
- This page documents: ArGoT 2021 (latest)
### Contents
- 5,023 compressed XML files using the arXiv's naming convention.
- 881,301 articles.
- 800 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
- The XML sources total `500 MB` packaged, and `2.1 TB` unpacked.
- NN.v1 directory:
- 789,896 term-definition pairs.
- 2816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
- SGD.v3 directory:
- 943,006 term-definition pairs.
- 2816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
- The XML sources total `521 MB` packaged as `.tar.gz` archives.
### Download
- [Download link](https://gl.kwarc.info/SIGMathLing/dataset-argot-2021)
......@@ -20,17 +23,38 @@ title: ArGoT 2021 - arXiv Glossary of Terms
This is the first public release of the ArGoT dataset generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.
ArGoT is a dataset of term-definition pairs automatically extracted from the arXiv mathematical papers.
It is comprised of XML files with the following tags and attributes:
- article: arXiv article entry
- name: link to the article in the arXiv
- num: number of paragraphs in the article
- definition: a paragraph labeled as a definition by the ML classifier
- index: paragraph number inside the article
- dfndum: the term (definiendum) found in the statement of the definition.
Two independently extracted versions of the dataset are provided:
- NN: Neural network approach using a combination of LSTM for classification and LSTM-CRF for sequence tagging and
- SGD: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
- **NN.v1**: Neural network approach using an LSTM for classification and an LSTM-CRF for sequence tagging.
- **SGD.v3**: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
Both datasets have the same file structure:
```
SGD.v3/
├── math00
│   ├── 0001_001.xml.gz
│   ├── 0002_001.xml.gz
│   ├── 0003_001.xml.gz
.
.
.
├── math01
│   ├── 0101_001.xml.gz
│   ├── 0102_001.xml.gz
│   ├── 0103_001.xml.gz
.
.
.
```
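As a rough illustration of the layout above, the following Python sketch walks one dataset directory and splits each file name into its `yymm` and `num` parts. The function name `list_dataset_files` and the file-name pattern are assumptions inferred from the listing, not part of any tooling shipped with the dataset.

```python
import re
from pathlib import Path

# Assumed pattern, inferred from names such as 0001_001.xml.gz in the listing.
NAME_RE = re.compile(r"^(\d{4})_(\d{3})\.xml\.gz$")


def list_dataset_files(root):
    """Yield (subdirectory, yymm, num, path) for each matching file under root."""
    for path in sorted(Path(root).glob("*/*.xml.gz")):
        match = NAME_RE.match(path.name)
        if match is None:
            # Skip anything that does not follow the assumed naming scheme.
            continue
        yymm, num = match.groups()
        yield path.parent.name, yymm, int(num), path
```

Calling `list_dataset_files("SGD.v3")` would then iterate over the files in subdirectory order, assuming the archive has been unpacked into the current directory.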
Each dataset consists of XML files with the following tags and attributes:
- _article_: arXiv article entry
- _name_: link to the article in the arXiv
- _num_: number of paragraphs in the article
- _definition_: a paragraph labeled as a definition by the ML classifier
- _index_: paragraph number inside the article
- _dfndum_: the term (definiendum) found in the statement of the definition.
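The tags and attributes listed above could be read with a short Python sketch like the one below. It assumes that `name` and `num` are attributes of `article`, that `index` is an attribute of `definition`, and that each `dfndum` element is nested inside its `definition`. This nesting is a guess from the description, and the function name `iter_term_definitions` is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_term_definitions(source):
    """Yield one record per dfndum element found in a gzipped XML file.

    `source` may be a path or a binary file object. The element nesting
    (article > definition > dfndum) is an assumption, not a documented schema.
    """
    with gzip.open(source, "rb") as handle:
        root = ET.parse(handle).getroot()
    for article in root.iter("article"):
        for definition in article.iter("definition"):
            for dfndum in definition.iter("dfndum"):
                yield {
                    "article": article.get("name"),
                    "index": definition.get("index"),
                    "term": (dfndum.text or "").strip(),
                }
```

Attribute values are returned as strings, as stored in the XML. Convert `index` with `int()` only after confirming that it is always numeric in the files.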
### Citing this Resource
......