---
layout: page
title: ArGoT 2021 - arXiv Glossary of Terms
---

### Release
 - This page documents: ArGoT 2021 (latest)

### Contents
  - NN.v1 directory:
      - 789,896  term-definition pairs.
      - 2816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
  - SGD.v3 directory:
      - 943,006  term-definition pairs.
      - 2816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
  - The XML sources total `521 MB` packaged as `.tar.gz` archives.
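
As a minimal sketch, the `yymm_num` naming scheme can be split into its parts as follows. The helper `parse_archive_name` is hypothetical and not part of the dataset tooling, and reading `num` as a running number within the month is an assumption. The century rule relies on arXiv having started in 1991.

```python
import re
from typing import NamedTuple


class ArchiveName(NamedTuple):
    year: int   # four-digit year reconstructed from "yy"
    month: int  # month taken from "mm"
    num: int    # trailing number (assumed: running number within the month)


# Matches names such as "0001_001" or "0001_001.xml.gz".
_NAME = re.compile(r"^(\d{2})(\d{2})_(\d+)")


def parse_archive_name(name: str) -> ArchiveName:
    """Split a yymm_num name into year, month and number (illustrative helper)."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a yymm_num name: {name!r}")
    yy, mm, num = (int(group) for group in match.groups())
    if not 1 <= mm <= 12:
        raise ValueError(f"invalid month in {name!r}")
    # arXiv started in 1991, so 91-99 are read as 19xx and the rest as 20xx.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return ArchiveName(year, mm, num)
```

For instance, `parse_archive_name("1407_005")` yields year 2014, month 7 and number 5.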

### Download
  - [Download link](https://gl.kwarc.info/SIGMathLing/dataset-arxiv-argot-2021)
  - Available to [SIGMathLing members](/member/) only. Joining is free and mostly a legal formality on our end; all researchers are welcome!

### Description
This is the first public release of the ArGoT dataset, generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.
ArGoT is a dataset of term-definition pairs automatically extracted from mathematical papers on arXiv.

Two independently extracted versions of the dataset are provided:
  - **NN.v1**: Neural network approach using an LSTM for classification and an LSTM-CRF for sequence tagging.
  - **SGD.v3**: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.

Both datasets have the same file structure:
```
SGD.v3/
├── math00
│   ├── 0001_001.xml.gz
│   ├── 0002_001.xml.gz
│   ├── 0003_001.xml.gz
      .
      .
      .
├── math01
│   ├── 0101_001.xml.gz
│   ├── 0102_001.xml.gz
│   ├── 0103_001.xml.gz
      .
      .
      .
```
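
The layout above can be traversed with the Python standard library. The following is a minimal sketch: the function names are hypothetical, and it assumes that each `.xml.gz` file decompresses to a single well-formed XML document, which this page does not state explicitly.

```python
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterator


def iter_xml_files(root_dir: str) -> Iterator[Path]:
    """Yield every math*/*.xml.gz file under root_dir in sorted order."""
    yield from sorted(Path(root_dir).glob("math*/*.xml.gz"))


def load_xml(path: Path) -> ET.Element:
    """Decompress one file and return the root element of its XML tree.

    Assumption: the file holds exactly one well-formed XML document.
    """
    with gzip.open(path, "rb") as handle:
        return ET.parse(handle).getroot()
```

A typical loop would then be `for path in iter_xml_files("SGD.v3"): root = load_xml(path)`, where `"SGD.v3"` is the location of the unpacked directory.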

Each dataset consists of XML files with the following tags and attributes:
   - _article_: arXiv article entry
       - _name_: path identifying the article in the arXiv, e.g. `1407_005/1407.2218/1407.2218.xml`
       - _num_: number of paragraphs in the article
   - _definition_: a paragraph labeled as a definition by the ML classifier
       - _index_: paragraph number inside the article
   - _stmnt_: the text of the definition paragraph (see the example in the Appendix)
   - _dfndum_: the term (definiendum) found in the statement of the definition
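
The tags and attributes listed above map onto a small record type. The sketch below is illustrative only: `TermRecord` and `iter_terms` are hypothetical names, and the nesting of `dfndum` inside `definition` inside `article` is assumed from the example entry in the Appendix.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, NamedTuple


class TermRecord(NamedTuple):
    article: str    # value of the article's "name" attribute
    paragraph: int  # value of the definition's "index" attribute
    term: str       # text of one dfndum element


def iter_terms(root: ET.Element) -> Iterator[TermRecord]:
    """Yield one record per dfndum element found below root."""
    # Element.iter() also matches the root itself, so this works whether
    # <article> is the document root or sits under a wrapper element.
    for article in root.iter("article"):
        name = article.get("name", "")
        for definition in article.iter("definition"):
            # -1 marks a missing index attribute (an assumption of this sketch).
            index = int(definition.get("index", "-1"))
            for dfndum in definition.iter("dfndum"):
                yield TermRecord(name, index, (dfndum.text or "").strip())
```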


### Citing this Resource

#### pure bibTeX
```
@MISC{SML:argot:2021,
  author = {Luis Berlioz},
  title = {ArGoT:2021 dataset, arXiv Glossary of Terms},
  howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/argot-dataset-2021/}},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2021}
}
```

#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:argot:2021,
  author = {Luis Berlioz},
  title = {argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv},
  url = {https://sigmathling.kwarc.info/resources/argot-dataset-2021/},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2021}
}
```

#### EndNote
```
%0 Generic
%T argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv
%A Berlioz, Luis
%D 2021
%I hosted at https://sigmathling.kwarc.info/resources/argot-dataset-2021/
%F SML:argot:2021b
%O SIGMathLing – Special Interest Group on Math Linguistics
```

### Accessibility and License
The content of this dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.

Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/) because, for most [arXiv](http://arxiv.org)
articles, the right of distribution was only given (or assumed) to arXiv itself.

### Generated via
 - [LaTeXML 0.8.5](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.5)
 - [latexml-plugin-argot 1.1](docker-singularity classifier)

### About
Part of the [Formal Abstracts](https://formalabstracts.github.io/) research group. Author: Luis Berlioz

### Appendix
**Example of an entry in the database:**
```xml
    <article name="1407_005/1407.2218/1407.2218.xml" num="89">
        <definition index="51">
            <stmnt> Assume _inline_math_. We define the following space-time
            norm if _inline_math_ is a time interval _display_math_ </stmnt>
            <dfndum>space-time norm</dfndum>
        </definition>
    </article>
```
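
To show how such an entry might be consumed, the sketch below parses the example above and collects each term with the statements that define it. `build_glossary` is a hypothetical helper; it treats `_inline_math_` and `_display_math_` as plain text, which appears to be how formulas are represented in the entry.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# The example entry from this page, copied as a string.
ENTRY = """
<article name="1407_005/1407.2218/1407.2218.xml" num="89">
<definition index="51">
    <stmnt> Assume _inline_math_. We define the following space-time
    norm if _inline_math_ is a time interval _display_math_ </stmnt>
    <dfndum>space-time norm</dfndum>
</definition>
</article>
"""


def build_glossary(xml_text: str) -> Dict[str, List[str]]:
    """Map each definiendum to the whitespace-normalised statements defining it."""
    glossary: Dict[str, List[str]] = defaultdict(list)
    root = ET.fromstring(xml_text.strip())
    for definition in root.iter("definition"):
        stmnt = definition.find("stmnt")
        text = stmnt.text if stmnt is not None and stmnt.text else ""
        statement = " ".join(text.split())  # collapse line breaks and runs of spaces
        for dfndum in definition.iter("dfndum"):
            glossary[(dfndum.text or "").strip()].append(statement)
    return dict(glossary)


glossary = build_glossary(ENTRY)
```

Here `glossary` holds a single key, `"space-time norm"`, mapped to a one-element list containing the statement text on one line.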