Commit c50cbdc9 authored by Deyan Ginev

Merge branch 'statement-dataset-release' into 'master'

First announcement of statement dataset

See merge request !8
parents 2688fb64 79670257
Pipeline #1729 failed with stage
in 2 minutes and 27 seconds
---
layout: post
title: Statement Classification Data Set
---
A new data set with annotations for 10.5 million scientific statements has been uploaded to SIGMathLing.
The content of this data set is licensed to [SIGMathLing members](/member/) for research
and tool development purposes subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).
The annotations were extracted automatically from the machine-readable version of arXiv.org, which is also available as a [SIGMathLing resource](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/).
Details can be found on the corresponding [resource page](/resources/arxmliv-statements-082018/).
---
layout: page
title: Scientific statement classification dataset from arXMLiv 08.2018
---
Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group
### Author
- Deyan Ginev
### Current release
- 08.2018
### Accessibility and License
The content of this dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/), because for most [arXiv](http://arxiv.org)
articles the right of distribution was given (or assumed) only to arXiv itself.
### Contents
- 10.5 million plain-text paragraphs associated with a statement class
- 50 directories, each containing entries from the same class of scientific statements
- each filename is a SHA-256 hash of its contents, as a guarantee for uniqueness and random order
- two separate tar bundles over the same data, one with and one without lexemes for mathematical expressions
- data is extracted from the separately distributed [arXMLiv 08.2018](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/) dataset.
- see the bottom of this page for a full statement frequency breakdown
| file name | MD5 | size | size unpacked |
| :------------------------------------------------ | :--------------------------------- | ----: | ------------: |
| `statement_paragraphs_arxmliv_08_2018.tar` | `ff48316737b41c13fbaa786eef8d1b6e` | 22 GB | 45 GB |
| `nomath_statement_paragraphs_arxmliv_08_2018.tar` | `e214eacb3b73fa3e7416f00673f9c298` | 12 GB | 40 GB |
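
The MD5 digests in the table can be used to check a download before unpacking. The following is a minimal sketch, assuming the bundles keep the file names listed above; the helper names `md5_of_file` and `verify_bundle` are illustrative and not part of any SIGMathLing tooling.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the table above.
EXPECTED_MD5 = {
    "statement_paragraphs_arxmliv_08_2018.tar": "ff48316737b41c13fbaa786eef8d1b6e",
    "nomath_statement_paragraphs_arxmliv_08_2018.tar": "e214eacb3b73fa3e7416f00673f9c298",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a 22 GB bundle never sits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(path):
    """Return True if the file's MD5 matches the table entry for its name.

    Raises KeyError for a file name that is not listed in the table.
    """
    path = Path(path)
    return md5_of_file(path) == EXPECTED_MD5[path.name]
```

Calling `verify_bundle("statement_paragraphs_arxmliv_08_2018.tar")` on a complete download should return `True`; a `False` result suggests a truncated or corrupted transfer.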
### Description
For the full details, please read [our paper](https://arxiv.org/abs/1908.10993) announcing the statement classification task.
This is the first public release of an annotated statement dataset derived from [arXMLiv](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/), a machine-readable representation of the arXiv corpus of scientific articles.
This resource contains 10,555,689 paragraphs with associated statement labels, stored as one paragraph per file with one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we selected only the *first* paragraph immediately following the heading. Headings include both structural sections (e.g. *Introduction*) and scholarly statement annotations (e.g. *Definition*, *Proof*, *Remark*).
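
The layout described above (one subdirectory per class, one paragraph per file, one sentence per line) can be read with a short generator. This is only a sketch under assumptions: the unpacked bundle is taken to expose the class directories directly under a root path, files are assumed to be UTF-8 and to end in `.txt` as in the example paths on this page, and `iter_statements` is a hypothetical helper name.

```python
from pathlib import Path


def iter_statements(root):
    """Yield (label, sentences) pairs from an unpacked bundle.

    Assumes root/<class name>/<hash>.txt, with one sentence per line.
    The label is the name of the directory that holds the file.
    """
    root = Path(root)
    class_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for paragraph_file in class_dir.glob("*.txt"):
            text = paragraph_file.read_text(encoding="utf-8")
            # Skip blank lines so each list item is one sentence.
            sentences = [line for line in text.splitlines() if line.strip()]
            yield class_dir.name, sentences
```

The generator reads one file at a time, which matters for a corpus of this size; shuffling and batching are left to the caller.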
We also include a control dataset of the same statements with all mathematical symbolism omitted (`nomath`), numbering 10,137,007 paragraphs. This math-free resource is smaller as omitting the formulas results in fewer unique paragraphs. We consider it a useful benchmark when trying to evaluate the specific impact of mathematical expressions on classification performance.
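
To see why omitting formulas reduces the paragraph count, consider two paragraphs that differ only in their math lexemes. The toy sketch below uses made-up paragraphs (not taken from the dataset) and keys each one by the SHA-256 hash of its content, mirroring the file naming scheme; the exact bytes hashed for the real file names are an assumption here.

```python
import hashlib


def content_key(paragraph):
    """SHA-256 hex digest of the paragraph text, used as a uniqueness key."""
    return hashlib.sha256(paragraph.encode("utf-8")).hexdigest()


# Two made-up paragraphs that differ only in one math lexeme ...
with_math = [
    "where italic_T is an isometry",
    "where italic_tau is an isometry",
]
# ... and the same paragraphs once the math lexemes are omitted.
without_math = [
    "where is an isometry",
    "where is an isometry",
]

unique_with_math = {content_key(p) for p in with_math}
unique_without_math = {content_key(p) for p in without_math}

print(len(unique_with_math), len(unique_without_math))  # prints "2 1"
```

The two distinct paragraphs collapse into a single entry once their formulas are removed, which is the effect behind the smaller `nomath` counts.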
We welcome community feedback on data quality, representation issues, and organization and archival best practices. We plan to release new versions of this data jointly with new releases of the arXMLiv corpus.
### Examples
Definition with math lexemes (main data, single sentence, linebreaks for readability):
```
a directed quantum turing automaton is a quadruple
italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_,
where
caligraphic_H caligraphic_K and caligraphic_L
are finite dimensional hilbert spaces over the complex field blackboard_C and
italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K ARROW_rightarrow
caligraphic_H MULOP_tensor_product caligraphic_L
is an isometry in fdhilb
```
source: `definition/1e4a1aea317bbf363c5314fb25eaf72c8a350a1007bb8aafc542e188405b93d5.txt`
Same definition without math lexemes (nomath data, single sentence, linebreaks for readability):
```
a directed quantum turing automaton is a quadruple
where and are finite dimensional hilbert spaces over the complex field and
is an isometry in fdhilb
```
nomath source: `definition/35b170bae4259a5c430846116142d4e4a45097e52daf818b78ea378d94d14a21.txt`
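
For quick experiments, the relationship between the two examples above can be approximated by dropping every whitespace-separated token that contains an underscore. This is a heuristic sketch based only on the lexeme shapes visible in the example (`italic_T`, `RELOP_equals`, `PUNCT_,` and so on); it is not the procedure used to generate the `nomath` bundle, and `drop_math_lexemes` is a hypothetical helper name.

```python
def drop_math_lexemes(sentence):
    """Remove tokens that look like math lexemes (heuristic: contain '_').

    Assumption: ordinary words in the data never contain an underscore.
    """
    return " ".join(tok for tok in sentence.split() if "_" not in tok)


example = (
    "a directed quantum turing automaton is a quadruple "
    "italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, "
    "caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_, "
    "where caligraphic_H caligraphic_K and caligraphic_L "
    "are finite dimensional hilbert spaces over the complex field blackboard_C and "
    "italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K "
    "ARROW_rightarrow caligraphic_H MULOP_tensor_product caligraphic_L "
    "is an isometry in fdhilb"
)

print(drop_math_lexemes(example))
```

On this one sentence the heuristic reproduces the `nomath` text shown above; for actual benchmarking, the distributed `nomath` bundle remains the reference.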
### Citing this Resource
#### pure bibTeX
```
@MISC{SML:statement-classification:08.2018,
author = {Deyan Ginev},
title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
howpublished = {\url{https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = 2019}
```
#### bibTeX for the bibLaTeX package (preferred)
```
@online{SML:statement-classification:08.2018,
author = {Deyan Ginev},
title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
url = {https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = 2019}
```
#### EndNote
```
%0 Generic
%T Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}
%A Ginev, Deyan
%D 2019
%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/
%F SML:statement-classification:08.2018b
%O SIGMathLing – Special Interest Group on Math Linguistics
```
### Download
[Download link](https://gl.kwarc.info/SIGMathLing/statements-arXMLiv-08-2018)
([SIGMathLing members](/member/) only)
### Generated via
- [llamapun 0.3.2](https://github.com/KWARC/llamapun/releases/tag/0.3.2)
### Contents Breakdown
| **statement class** | **frequency** | **frequency (nomath)** |
| :------------------ | ------------: | ---------------------: |
| abstract | 1,030,774 | 1,030,691 |
| acknowledgement | 162,230 | 162,220 |
| affirmation | 36 | 22 |
| answer | 40 | 39 |
| assumption | 29,577 | 26,890 |
| bound | 47 | 37 |
| case | 3,256 | 2,208 |
| claim | 89,737 | 75,778 |
| comment | 325 | 322 |
| conclusion | 284,585 | 284,536 |
| condition | 3,950 | 3,508 |
| conjecture | 44,893 | 41,780 |
| constraint | 753 | 731 |
| convention | 2,176 | 2,160 |
| corollary | 436,768 | 402,728 |
| criterion | 236 | 219 |
| definition | 686,717 | 667,797 |
| demonstration | 23,043 | 22,842 |
| discussion | 116,650 | 116,643 |
| example | 295,152 | 289,005 |
| exercise | 404 | 404 |
| expansion | 5 | 2 |
| expectation | 13 | 13 |
| experiment | 154 | 153 |
| explanation | 16 | 16 |
| fact | 17,737 | 16,473 |
| hint | 9 | 9 |
| introduction | 688,530 | 688,187 |
| issue | 41 | 28 |
| keywords | 1,565 | 1,565 |
| lemma | 1,320,646 | 1,162,559 |
| method | 50,968 | 50,947 |
| notation | 16,611 | 16,077 |
| note | 4,462 | 4,415 |
| notice | 4 | 4 |
| observation | 18,776 | 18,013 |
| overview | 11,279 | 11,277 |
| principle | 236 | 232 |
| problem | 30,369 | 29,221 |
| proof | 2,125,750 | 2,096,644 |
| proposition | 829,068 | 763,268 |
| question | 27,240 | 26,673 |
| relatedwork | 26,300 | 26,299 |
| remark | 639,038 | 635,180 |
| result | 239,905 | 239,639 |
| rule | 775 | 712 |
| solution | 163 | 144 |
| step | 6,910 | 6,536 |
| summary | 117 | 117 |
| theorem | 1,287,653 | 1,212,044 |
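
After unpacking, a per-class count can be compared against the table above as a sanity check. The sketch below assumes the same layout as the example paths on this page (class directories holding `.txt` files directly under a root path); `class_frequencies` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def class_frequencies(root):
    """Count the paragraph files in each class directory under root."""
    counts = Counter()
    for class_dir in Path(root).iterdir():
        if class_dir.is_dir():
            # Count lazily instead of building a list of millions of paths.
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.txt"))
    return counts
```

`class_frequencies(root).most_common(5)` then lists the five largest classes, which makes the strong class imbalance visible at a glance.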
......@@ -3,11 +3,12 @@ layout: page
title: SIGMathLing - Datasets and Resources
---
## Resources hosted on the SIGMathLing Repository
1. [arXMLiv statements dataset, 08.2018 release](/resources/arxmliv-statements-082018)
1. [arXMLiv word embeddings, 08.2018 release](/resources/arxmliv-embeddings-082018)
1. [arXMLiv corpus, 08.2018 release](/resources/arxmliv-dataset-082018/)
1. [quantity expressions](/resources/quantity-expressions)
1. [arXMLiv word embeddings, 08.2017 release](/resources/arxmliv-embeddings-082017)
1. [arXMLiv corpus, 08.2017 release](/resources/arxmliv-dataset-082017/)
## Resources hosted externally
1. [ACL-math-annotation](http://www-al.nii.ac.jp/acl-math-annotation/)
......