diff --git a/resources/argot-dataset-2020.md b/resources/argot-dataset-2020.md
index 6f6bd104ea59bac40cff9fba873a303e31d5a4d3..38ba8368381094cc46b11aa699d6c17937d36aed 100644
--- a/resources/argot-dataset-2020.md
+++ b/resources/argot-dataset-2020.md
@@ -7,10 +7,13 @@ title: ArGoT 2021 - arXiv Glossary of Terms
  - This page documents: ArGoT 2021 (latest)
 
 ### Contents
- - 5,023 compressed XML files using the arXiv's naming convention.
- - 881,301 articles.
- - 800 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
- - The XML sources total `500 MB` packaged, and `2.1 TB` unpacked.
+ - NN.v1 directory:
+   - 789,896 term-definition pairs.
+   - 2816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
+ - SGD.v3 directory:
+   - 943,006 term-definition pairs.
+   - 2816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
+ - The XML sources total `521 MB` packaged as `.tar.gz` archives.
 
 ### Download
  - [Download link](https://gl.kwarc.info/SIGMathLing/dataset-argot-2021)
@@ -20,17 +23,38 @@ title: ArGoT 2021 - arXiv Glossary of Terms
 This is the first public release of the ArGoT dataset generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.
 ArGoT is a dataset of term-definition pairs automatically extracted from the arXiv mathematical papers.
 
-It is comprised of XML files with the following tags and attributes:
- - article: arXiv article entry
- - name: link to the article in the arXiv
- - num: number of paragraphs in the article
- - definition: a paragraph labeled as a definition by the ML classifier
- - index: paragraph number inside the article
- - dfndum: the term (definiendum) found in the statement of the definition.
 
 Two independently extracted versions of the dataset are provided:
- - NN: Neural network approach using a combination of LSTM for classification and LSTM-CRF for sequence tagging and
- - SGD: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
+ - **NN.v1**: Neural network approach using a combination of LSTM for classification and LSTM-CRF for sequence tagging and
+ - **SGD.v3**: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
+
+Both datasets have the same file structure:
+```
+SGD.v3/
+├── math00
+│   ├── 0001_001.xml.gz
+│   ├── 0002_001.xml.gz
+│   ├── 0003_001.xml.gz
+    .
+    .
+    .
+├── math01
+│   ├── 0101_001.xml.gz
+│   ├── 0102_001.xml.gz
+│   ├── 0103_001.xml.gz
+    .
+    .
+    .
+```
+
+It is comprised of XML files with the following tags and attributes:
+ - _article_: arXiv article entry
+ - _name_: link to the article in the arXiv
+ - _num_: number of paragraphs in the article
+ - _definition_: a paragraph labeled as a definition by the ML classifier
+ - _index_: paragraph number inside the article
+ - _dfndum_: the term (definiendum) found in the statement of the definition.
+
 ### Citing this Resource
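
Reviewer note: the tag list added by this change can be illustrated with a short reader. The following is a minimal sketch and not the project's own tooling. It assumes that `article` elements carry `name` and `num` as XML attributes, that `definition` elements are nested inside `article` with an `index` attribute, and that each `dfndum` element sits inside its `definition`. The page does not spell out this nesting, so these are assumptions. The helper names `iter_term_pairs` and `read_archive`, and the sample document, are hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_term_pairs(xml_stream):
    """Yield (article name, paragraph index, term) from one XML stream.

    Assumed layout: <article name=... num=...> containing
    <definition index=...> elements, each holding <dfndum> terms.
    """
    root = ET.parse(xml_stream).getroot()
    for article in root.iter("article"):
        name = article.get("name")
        for definition in article.iter("definition"):
            index = definition.get("index")
            for dfndum in definition.iter("dfndum"):
                # Strip surrounding whitespace from the extracted term.
                yield name, index, (dfndum.text or "").strip()


def read_archive(path):
    """Read one gzip-compressed file such as SGD.v3/math00/0001_001.xml.gz."""
    with gzip.open(path, "rb") as handle:
        return list(iter_term_pairs(handle))


# Made-up sample in the assumed layout, compressed in memory so the
# sketch runs without downloading the dataset.
SAMPLE = b"""<root>
  <article name="example-article" num="12">
    <definition index="3">
      <dfndum>Banach space</dfndum>
    </definition>
  </article>
</root>"""

compressed = io.BytesIO(gzip.compress(SAMPLE))
with gzip.GzipFile(fileobj=compressed) as handle:
    pairs = list(iter_term_pairs(handle))

print(pairs)
```

If the real files nest the tags differently, only the three `iter` calls need adjusting. The gzip handling stays the same for every `.xml.gz` file in the tree.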