Commit d5d99003 authored by Luis

first draft of argot resource page

parent b90a6790
Pipeline #3585 passed with stage
in 1 minute and 13 seconds
---
layout: page
title: ArGoT 2021 - arXiv Glossary of Terms
---
### Release
- This page documents: ArGoT 2021 (latest)
### Contents
- 5,023 compressed XML files using the arXiv's naming convention.
- 881,301 articles.
- 800 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
- The XML sources total `500 MB` packaged, and `2.1 TB` unpacked.
### Download
- [Download link](https://gl.kwarc.info/SIGMathLing/dataset-argot-2021)
- [SIGMathLing members](/member/) only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!
### Description
This is the first public release of the ArGoT dataset generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.
ArGoT is a dataset of term-definition pairs automatically extracted from mathematical papers on the arXiv.
It consists of XML files with the following tags and attributes:
- `article`: arXiv article entry
  - `name` (attribute): link to the article in the arXiv
  - `num` (attribute): number of paragraphs in the article
  - `definition`: a paragraph labeled as a definition by the ML classifier
    - `index` (attribute): paragraph number inside the article
    - `stmnt`: the statement (text) of the definition
    - `dfndum`: the term (definiendum) found in the statement of the definition
Two independently extracted versions of the dataset are provided:
- NN: Neural network approach using a combination of LSTM for classification and LSTM-CRF for sequence tagging.
- SGD: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
### Citing this Resource
#### pure bibTeX
```
@MISC{SML:argot:2021,
author = {Luis Berlioz},
title = {ArGoT:2021 dataset, arXiv Glossary of Terms},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/argot-dataset-2021/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2021}
}
```
#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:argot:2021,
author = {Luis Berlioz},
title = {argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv},
url = {https://sigmathling.kwarc.info/resources/argot-dataset-2021/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2021}
}
```
#### EndNote
```
%0 Generic
%T argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv
%A Berlioz, Luis
%D 2021
%I hosted at https://sigmathling.kwarc.info/resources/argot-dataset-2021/
%F SML:argot:2021
%O SIGMathLing – Special Interest Group on Math Linguistics
```
### Accessibility and License
The content of this dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/) because, for most [arXiv](http://arxiv.org)
articles, the right of distribution was granted (or assumed) only to arXiv itself.
### Generated via
- [LaTeXML 0.8.5](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.5)
- latexml-plugin-argot 1.1 (docker-singularity classifier)
### About
Part of the [Formal Abstracts](https://formalabstracts.github.io/) research group. Author: Luis Berlioz
### Appendix
**Example entry:**
```xml
<article name="1407_005/1407.2218/1407.2218.xml" num="89">
<definition index="51">
<stmnt> Assume _inline_math_. We define the following space-time
norm if _inline_math_ is a time interval _display_math_ </stmnt>
<dfndum>space-time norm</dfndum>
</definition>
</article>
```
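**Reading an entry (sketch):**
An entry like the one above can be read with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch, not part of the dataset tooling: the helper name `extract_terms` is hypothetical, and the code assumes a file may hold either a single `article` element or several of them under a wrapper root.

```python
import xml.etree.ElementTree as ET

# The sample entry from this page, inlined so the sketch is self-contained.
SAMPLE = """<article name="1407_005/1407.2218/1407.2218.xml" num="89">
<definition index="51">
<stmnt> Assume _inline_math_. We define the following space-time
norm if _inline_math_ is a time interval _display_math_ </stmnt>
<dfndum>space-time norm</dfndum>
</definition>
</article>"""


def extract_terms(xml_text):
    """Yield (article name, paragraph index, term) for every <dfndum>.

    Hypothetical helper: root.iter("article") also matches the root
    itself, so both a lone <article> and a wrapper element holding
    many articles are handled.
    """
    root = ET.fromstring(xml_text)
    for article in root.iter("article"):
        for definition in article.iter("definition"):
            index = int(definition.get("index"))
            for dfndum in definition.iter("dfndum"):
                yield article.get("name"), index, (dfndum.text or "").strip()


terms = list(extract_terms(SAMPLE))
print(terms)
# [('1407_005/1407.2218/1407.2218.xml', 51, 'space-time norm')]
```

A definition may carry more than one `dfndum`, so the helper yields one tuple per term rather than one per definition.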