Commit 7f9ad30d, authored 3 years ago by Luis
add description of file structure of the datasets
parent d5d99003
Pipeline #3610 passed 3 years ago (stage: deploy)
resources/argot-dataset-2020.md: 37 additions, 13 deletions
...
...
@@ -7,10 +7,13 @@ title: ArGoT 2021 - arXiv Glossary of Terms
 - This page documents: ArGoT 2021 (latest)

 ### Contents
-- 5,023 compressed XML files using the arXiv's naming convention.
-- 881,301 articles.
-- 800 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
-- The XML sources total `500 MB` packaged, and `2.1 TB` unpacked.
+- NN.v1 directory:
+  - 789,896 term-definition pairs.
+  - 2816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
+- SGD.v3 directory:
+  - 943,006 term-definition pairs.
+  - 2816 ZIP archives, in arXiv's Year-Month `yymm_num` naming scheme.
+- The XML sources total `521 MB` packaged as `.tar.gz` archives.

 ### Download
 - [Download link](https://gl.kwarc.info/SIGMathLing/dataset-argot-2021)
...
...
@@ -20,17 +23,38 @@ title: ArGoT 2021 - arXiv Glossary of Terms
 This is the first public release of the ArGoT dataset generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.

 ArGoT is a dataset of term-definition pairs automatically extracted from the arXiv mathematical papers.
-It is comprised of XML files with the following tags and attributes:
-- article: arXiv article entry
-- name: link to the article in the arXiv
-- num: number of paragraphs in the article
-- definition: a paragraph labeled as a definition by the ML classifier
-- index: paragraph number inside the article
-- dfndum: the term (definiendum) found in the statement of the definition.

 Two independently extracted versions of the dataset are provided:
-- NN: Neural network approach using a combination of LSTM for classification and LSTM-CRF for sequence tagging and
-- SGD: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
+- **NN.v1**: Neural network approach using a combination of LSTM for classification and LSTM-CRF for sequence tagging and
+- **SGD.v3**: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
+
+Both datasets have the same file structure:
+```
+SGD.v3/
+├── math00
+│   ├── 0001_001.xml.gz
+│   ├── 0002_001.xml.gz
+│   ├── 0003_001.xml.gz
+.
+.
+.
+├── math01
+│   ├── 0101_001.xml.gz
+│   ├── 0102_001.xml.gz
+│   ├── 0103_001.xml.gz
+.
+.
+.
+```
+
+It is comprised of XML files with the following tags and attributes:
+- _article_: arXiv article entry
+- _name_: link to the article in the arXiv
+- _num_: number of paragraphs in the article
+- _definition_: a paragraph labeled as a definition by the ML classifier
+- _index_: paragraph number inside the article
+- _dfndum_: the term (definiendum) found in the statement of the definition.

 ### Citing this Resource
...
...
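For readers who want to try the file structure and tags documented in this change, here is a minimal Python sketch of one way to read the data. It rests on several assumptions that the page does not confirm: the archives have already been unpacked into the `SGD.v3/mathYY/*.xml.gz` layout shown in the diff, `definition` elements are nested inside `article` elements, `dfndum` elements are nested inside `definition` elements, and `name`, `num`, and `index` are XML attributes. The function names `read_terms` and `iter_dataset` are hypothetical and are not part of the dataset tooling.

```python
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path


def read_terms(xml_gz_path):
    """Yield (article_name, paragraph_index, term) tuples from one .xml.gz file.

    Assumes <definition> sits inside <article> and <dfndum> inside
    <definition>; adjust the loops if the real nesting differs.
    """
    with gzip.open(xml_gz_path, "rb") as fh:
        root = ET.parse(fh).getroot()
    # Element.iter() also matches the root element itself, so this works
    # whether or not the <article> elements are wrapped in an outer element.
    for article in root.iter("article"):
        for definition in article.iter("definition"):
            for dfndum in definition.iter("dfndum"):
                term = (dfndum.text or "").strip()
                yield article.get("name"), definition.get("index"), term


def iter_dataset(version_dir):
    """Walk <version_dir>/math*/*.xml.gz in sorted order and yield all terms."""
    for path in sorted(Path(version_dir).glob("math*/*.xml.gz")):
        yield from read_terms(path)
```

Under those assumptions, `for name, index, term in iter_dataset("SGD.v3"): ...` would stream every extracted term together with the article link and paragraph number it came from. Attribute values are returned as strings, so `index` needs an explicit `int(...)` if numeric ordering matters.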