Commit d5d99003, authored 3 years ago by Luis

first draft of argot resource page

parent b90a6790
Pipeline #3585 passed 3 years ago (Stage: deploy)
resources/argot-dataset-2020.md (new file, mode 0 → 100644): 95 additions, 0 deletions
---
layout: page
title: ArGoT 2021 - arXiv Glossary of Terms
---
### Release
- This page documents: ArGoT 2021 (latest)
### Contents
- 5,023 compressed XML files using the arXiv's naming convention.
- 881,301 articles.
- 800 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
- The XML sources total `500 MB` packaged, and `2.1 TB` unpacked.
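
As a small illustration of the `yymm` scheme, the sketch below decodes a four-digit prefix into a calendar year and month. The function name `parse_yymm` is hypothetical, and the century rule (two-digit years from 91 upward are read as 19xx, since arXiv started in 1991) is an assumption about how to interpret the prefix, not something the dataset itself specifies.

```python
from typing import Tuple


def parse_yymm(yymm: str) -> Tuple[int, int]:
    """Decode an arXiv-style yymm prefix into (year, month)."""
    if len(yymm) != 4 or not (yymm.isascii() and yymm.isdigit()):
        raise ValueError(f"expected four ASCII digits, got {yymm!r}")
    yy, mm = int(yymm[:2]), int(yymm[2:])
    if not 1 <= mm <= 12:
        raise ValueError(f"invalid month in {yymm!r}")
    # Assumption: arXiv began in 1991, so 91-99 are read as the 1990s
    # and everything else as 20xx.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return year, mm


if __name__ == "__main__":
    print(parse_yymm("1407"))
```

The same helper could be used to sort or filter archives by date, provided the archive names really start with the `yymm` prefix.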
### Download
- [Download link](https://gl.kwarc.info/SIGMathLing/dataset-argot-2021)
- [SIGMathLing members](/member/) only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!
### Description
This is the first public release of the ArGoT dataset generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.
ArGoT is a dataset of term-definition pairs automatically extracted from mathematical papers on the arXiv.
It consists of XML files with the following tags and attributes:
- article: arXiv article entry
    - name: link to the article in the arXiv
    - num: number of paragraphs in the article
- definition: a paragraph labeled as a definition by the ML classifier
    - index: paragraph number inside the article
- dfndum: the term (definiendum) found in the statement of the definition.
Two independently extracted versions of the dataset are provided:
- NN: Neural network approach using a combination of LSTM for classification and LSTM-CRF for sequence tagging.
- SGD: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
### Citing this Resource
#### pure bibTeX
```
@MISC{SML:argot:2021,
author = {Luis Berlioz},
title = {ArGoT:2021 dataset, arXiv Glossary of Terms},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/argot-dataset-2021/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2021}
}
```
#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:argot:2021,
author = {Luis Berlioz},
title = {argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv},
url = {https://sigmathling.kwarc.info/resources/argot-dataset-2021/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2021}
}
```
#### EndNote
```
%0 Generic
%T argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv
%A Berlioz, Luis
%D 2021
%I hosted at https://sigmathling.kwarc.info/resources/argot-dataset-2021/
%F SML:argot:2021b
%O SIGMathLing – Special Interest Group on Math Linguistics
```
### Accessibility and License
The content of this dataset is licensed to [SIGMathLing members](/member/) for research and tool development purposes.

Access is restricted to [SIGMathLing members](/member/) under the [SIGMathLing Non-Disclosure-Agreement](/nda/), because for most [arXiv](http://arxiv.org) articles the right of distribution was only given (or assumed) to arXiv itself.
### Generated via
- [LaTeXML 0.8.5](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.5),
- [latexml-plugin-argot 1.1](docker-singularity classifier)
### About
Part of the [Formal Abstracts](https://formalabstracts.github.io/) research group. Author: Luis Berlioz
### Appendix
**MathML formula example:**
```xml
<article name="1407_005/1407.2218/1407.2218.xml" num="89">
  <definition index="51">
    <stmnt> Assume _inline_math_. We define the following space-time
    norm if _inline_math_ is a time interval _display_math_ </stmnt>
    <dfndum>space-time norm</dfndum>
  </definition>
</article>
```
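
A minimal sketch of how term-definition pairs might be read from an entry like the one above, using only Python's standard `xml.etree.ElementTree`. The function name `extract_pairs`, the dictionary keys, and the embedded `SAMPLE` string are illustrative assumptions. The sketch also assumes each `definition` holds one `stmnt` and one or more `dfndum` children, as in the example, and it does not handle decompression of the packaged files.

```python
import xml.etree.ElementTree as ET

# Sample entry copied from the example above (assumed representative).
SAMPLE = """<article name="1407_005/1407.2218/1407.2218.xml" num="89">
  <definition index="51">
    <stmnt> Assume _inline_math_. We define the following space-time
    norm if _inline_math_ is a time interval _display_math_ </stmnt>
    <dfndum>space-time norm</dfndum>
  </definition>
</article>"""


def extract_pairs(xml_text):
    """Return one dict per (term, definition statement) pair."""
    root = ET.fromstring(xml_text)
    pairs = []
    # iter() also yields the root itself, so this works whether the
    # root is a single <article> or a container of several.
    for article in root.iter("article"):
        for definition in article.iter("definition"):
            raw = definition.findtext("stmnt", default="")
            statement = " ".join(raw.split())  # collapse whitespace
            for dfndum in definition.iter("dfndum"):
                pairs.append({
                    "article": article.get("name"),
                    "index": definition.get("index"),
                    "term": (dfndum.text or "").strip(),
                    "statement": statement,
                })
    return pairs


if __name__ == "__main__":
    for pair in extract_pairs(SAMPLE):
        print(pair["term"], "<-", pair["article"])
```

Attribute values are kept as strings here, since the page does not state whether `index` and `num` are always numeric.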