add description of file structure of the datasets

Commit 7f9ad30d authored 3 years ago by Luis
--- a/resources/argot-dataset-2020.md
+++ b/resources/argot-dataset-2020.md
@@ -7,10 +7,13 @@ title: ArGoT 2021 - arXiv Glossary of Terms
 - This page documents: ArGoT 2021 (latest)

 ### Contents
-  - 5,023 compressed XML files using the arXiv's naming convention.
-  - 881,301  articles.
-  - 800 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
-  - The XML sources total `500 MB` packaged, and `2.1 TB` unpacked.
+  - NN.v1 directory:
+      - 789,896 term-definition pairs.
+      - 2,816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
+  - SGD.v3 directory:
+      - 943,006 term-definition pairs.
+      - 2,816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
+  - The XML sources total `521 MB` packaged as `.tar.gz` archives.

 ### Download
  - [Download link](https://gl.kwarc.info/SIGMathLing/dataset-argot-2021)
@@ -20,17 +23,38 @@ title: ArGoT 2021 - arXiv Glossary of Terms

 This is the first public release of the ArGoT dataset generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.
 ArGoT is a dataset of term-definition pairs automatically extracted from the arXiv mathematical papers.
-It is comprised of  XML files  with the following tags and attributes:
-   - article: arXiv article entry
-       - name: link to the article in the arXiv
-       - num: number of paragraphs in the article
-   - definition: a paragraph labeled as a definition by the ML classifier
-       - index: paragraph number inside the article
-   - dfndum: the term (definiendum) found in the statement of the definition.

 Two independently extracted versions of the dataset are provided:
-  - NN: Neural network approach using a combination of LSTM for classification and LSTM-CRF for sequence tagging and
-  - SGD: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
+  - **NN.v1**: a neural network approach that combines an LSTM for classification with an LSTM-CRF for sequence tagging.
+  - **SGD.v3**: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
+
+Both datasets have the same file structure:
+```
+SGD.v3/
+├── math00
+│   ├── 0001_001.xml.gz
+│   ├── 0002_001.xml.gz
+│   ├── 0003_001.xml.gz
+      .
+      .
+      .
+├── math01
+│   ├── 0101_001.xml.gz
+│   ├── 0102_001.xml.gz
+│   ├── 0103_001.xml.gz
+      .
+      .
+      .
+```
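The layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the dataset tooling: `parse_name` and `list_files` are hypothetical helper names, and the reading of `yymm_num` as two-digit year, two-digit month, and a running number is an assumption based on the file names shown in the tree.

```python
import re
from pathlib import Path

# Assumed pattern for names such as "0103_001.xml.gz":
# yy (year), mm (month), then a running number after the underscore.
FILE_PATTERN = re.compile(r"^(\d{2})(\d{2})_(\d+)\.xml\.gz$")


def parse_name(filename):
    """Split a file name into (yy, mm, num) strings, or return None if it does not match."""
    match = FILE_PATTERN.match(filename)
    return match.groups() if match else None


def list_files(dataset_root):
    """Return the sorted .xml.gz paths found one directory level below dataset_root."""
    return sorted(Path(dataset_root).glob("*/*.xml.gz"))


parsed = parse_name("0103_001.xml.gz")
print(parsed)
```

Calling `list_files("SGD.v3")` on an unpacked copy would then return the files under `math00`, `math01`, and so on, in sorted order.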
+
+The XML files use the following tags and attributes:
+   - _article_: arXiv article entry
+       - _name_: link to the article in the arXiv
+       - _num_: number of paragraphs in the article
+   - _definition_: a paragraph labeled as a definition by the ML classifier
+       - _index_: paragraph number inside the article
+   - _dfndum_: the term (definiendum) found in the statement of the definition.
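The tags listed above can be read with Python's standard library. The sketch below is illustrative only: `read_term_pairs` is a hypothetical helper, and it assumes that `definition` elements sit somewhere inside an `article` element and that `dfndum` elements sit inside a `definition`. The sample document, including its root element and article link, is made up for the demonstration and may not match the exact layout of the released files.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def read_term_pairs(source):
    """Yield (article_name, paragraph_index, term) tuples from one gzip-compressed XML file.

    `source` may be a file path or a binary file object.
    """
    with gzip.open(source, "rb") as handle:
        root = ET.parse(handle).getroot()
    # iter() matches elements at any depth, so the exact nesting level does not matter.
    for article in root.iter("article"):
        name = article.get("name")
        for definition in article.iter("definition"):
            index = definition.get("index")
            for dfndum in definition.iter("dfndum"):
                yield name, index, (dfndum.text or "").strip()


# Made-up sample standing in for a real file such as 0001_001.xml.gz.
SAMPLE = b"""<root>
  <article name="https://arxiv.org/abs/math/0001001" num="42">
    <definition index="7">A group is abelian if its operation is commutative.<dfndum>abelian</dfndum></definition>
  </article>
</root>"""

sample_file = io.BytesIO(gzip.compress(SAMPLE))
pairs = list(read_term_pairs(sample_file))
print(pairs)
```

Attribute values come back as strings, so `index` and `num` need an explicit `int(...)` conversion if they are used as numbers.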
+


 ### Citing this Resource