Commit d5d99003 authored by Luis

first draft of argot resource page

parent b90a6790
---
layout: page
title: ArGoT 2021 - arXiv Glossary of Terms
---
### Release
- This page documents: ArGoT 2021 (latest)
### Contents
- 5,023 compressed XML files using the arXiv's naming convention.
- 881,301 articles.
- 800 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
- The XML sources total `500 MB` packaged, and `2.1 TB` unpacked.
### Download
- [Download link](https://gl.kwarc.info/SIGMathLing/dataset-argot-2021)
- [SIGMathLing members](/member/) only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!
### Description
This is the first public release of the ArGoT dataset generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.
ArGoT is a dataset of term-definition pairs automatically extracted from mathematical papers on the arXiv.
It consists of XML files with the following tags and their attributes:
- `article`: arXiv article entry
  - `name`: link to the article in the arXiv
  - `num`: number of paragraphs in the article
- `definition`: a paragraph labeled as a definition by the ML classifier
  - `index`: paragraph number inside the article
- `stmnt`: the text of the definition paragraph (the statement)
- `dfndum`: the term (definiendum) found in the statement of the definition
Two independently extracted versions of the dataset are provided:
- NN: Neural network approach that combines an LSTM for classification with an LSTM-CRF for sequence tagging.
- SGD: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
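
Because the two versions are extracted independently, one simple sanity check is to measure how much their extracted terms agree. The following is a minimal sketch, assuming the terms of each version have already been collected into Python sets. The function name `term_overlap` and the sample terms are hypothetical and are not taken from the dataset.

```python
def term_overlap(nn_terms, sgd_terms):
    """Return (shared terms, Jaccard similarity) for two collections of terms."""
    # Normalize case and surrounding whitespace so trivial differences do not count.
    nn = {t.strip().lower() for t in nn_terms}
    sgd = {t.strip().lower() for t in sgd_terms}
    shared = nn & sgd
    union = nn | sgd
    # Jaccard similarity: size of the intersection over size of the union.
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Hypothetical terms, for illustration only.
nn_sample = {"space-time norm", "Banach space", "compact operator"}
sgd_sample = {"Space-time norm", "Banach space", "Hilbert space"}

shared, score = term_overlap(nn_sample, sgd_sample)
print(sorted(shared), score)
```

Lowercasing is a deliberate simplification here; it may merge terms that differ only by capitalization, which may or may not be desirable for mathematical vocabulary.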
### Citing this Resource
#### pure bibTeX
```
@MISC{SML:argot:2021,
author = {Luis Berlioz},
title = {ArGoT:2021 dataset, arXiv Glossary of Terms},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/argot-dataset-2021/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2021}
}
```
#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:argot:2021,
author = {Luis Berlioz},
title = {argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv},
url = {https://sigmathling.kwarc.info/resources/argot-dataset-2021/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2021}
}
```
#### EndNote
```
%0 Generic
%T argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv
%A Berlioz, Luis
%D 2021
%I hosted at https://sigmathling.kwarc.info/resources/argot-dataset-2021/
%F SML:argot:2021
%O SIGMathLing – Special Interest Group on Math Linguistics
```
### Accessibility and License
The content of this dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/), because for most [arXiv](http://arxiv.org)
articles the right of distribution was granted (or assumed) only to arXiv itself.
### Generated via
- [LaTeXML 0.8.5](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.5)
- [latexml-plugin-argot 1.1](docker-singularity classifier)
### About
Part of the [Formal Abstracts](https://formalabstracts.github.io/) research group. Author: Luis Berlioz
### Appendix
**Example entry (formulas appear as `_inline_math_` and `_display_math_` placeholders):**
```xml
<article name="1407_005/1407.2218/1407.2218.xml" num="89">
<definition index="51">
<stmnt> Assume _inline_math_. We define the following space-time
norm if _inline_math_ is a time interval _display_math_ </stmnt>
<dfndum>space-time norm</dfndum>
</definition>
</article>
```
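
An entry like the one above can be read with Python's standard library. This is a minimal sketch rather than an official loader: the helper name `extract_pairs` is hypothetical, the entry is embedded as a string instead of being read from the downloaded archives, and the loop over several `dfndum` tags per definition is an assumption about the format.

```python
import xml.etree.ElementTree as ET

# The example entry from this page, embedded as a string for illustration.
SAMPLE = """<article name="1407_005/1407.2218/1407.2218.xml" num="89">
<definition index="51">
<stmnt> Assume _inline_math_. We define the following space-time
norm if _inline_math_ is a time interval _display_math_ </stmnt>
<dfndum>space-time norm</dfndum>
</definition>
</article>"""


def extract_pairs(article):
    """Yield (term, statement, paragraph index) for each definition in an <article>."""
    for definition in article.iter("definition"):
        index = int(definition.get("index"))
        # Collapse line breaks and repeated spaces in the statement text.
        statement = " ".join((definition.findtext("stmnt") or "").split())
        # Assumption: a definition may contain more than one <dfndum>.
        for dfndum in definition.findall("dfndum"):
            yield (dfndum.text or "").strip(), statement, index


article = ET.fromstring(SAMPLE)
pairs = list(extract_pairs(article))
print(article.get("name"), article.get("num"))
for term, statement, index in pairs:
    print(f"[{index}] {term}: {statement}")
```

For whole files, `ET.parse(path).getroot()` could replace `ET.fromstring`, assuming each file has a single root element; the root tag is not documented on this page.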