Merge branch 'statement-dataset-release' into 'master'

First announcement of statement dataset

See merge request !8

Commit c50cbdc9 authored 5 years ago by Deyan Ginev
--- a/_posts/2019-28-29-statement-classification-dataset.md
+++ b/_posts/2019-28-29-statement-classification-dataset.md
+---
+layout: post
+title: Statement Classification Data Set
+---
+A new data set with annotations for 10.5 million scientific statements has been uploaded to SIGMathLing.
+The content of this data set is licensed to [SIGMathLing members](/member/) for research
+and tool development purposes subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).
+
+The annotations were extracted automatically from the machine-readable version of arXiv.org, which is also available as a [SIGMathLing resource](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/).
+
+Details can be found on the corresponding [resource page](/resources/arxmliv-statements-082018/).
--- a/resources/arxmliv-statements-082018.md
+++ b/resources/arxmliv-statements-082018.md
+---
+layout: page
+title: Scientific statement classification dataset from arXMLiv 08.2018
+---
+Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group
+
+### Author
+ - Deyan Ginev
+
+### Current release
+ - 08.2018
+
+### Accessibility and License
+The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
+and tool development purposes.
+
+Access is restricted to [SIGMathLing members](/member/) under the
+[SIGMathLing Non-Disclosure-Agreement](/nda/) because, for most [arXiv](http://arxiv.org)
+articles, the right of distribution was only given (or assumed) to arXiv itself.
+
+### Contents
+  - 10.5 million plain-text paragraphs associated with a statement class
+  - 50 directories, each containing entries from the same class of scientific statements
+  - each filename is a SHA-256 hash of its contents, as a guarantee for uniqueness and random order
+  - two separate tar bundles over the same data, one with and one without lexemes for mathematical expressions
+  - data is extracted from the separately distributed [arXMLiv 08.2018](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/) dataset.
+  - see the bottom of this page for a full statement frequency breakdown
+
+  | file name                                         | MD5                                |  size | size unpacked |
+  | :------------------------------------------------ | :--------------------------------- | ----: | ------------: |
+  | `statement_paragraphs_arxmliv_08_2018.tar`        | `ff48316737b41c13fbaa786eef8d1b6e` | 22 GB |         45 GB |
+  | `nomath_statement_paragraphs_arxmliv_08_2018.tar` | `e214eacb3b73fa3e7416f00673f9c298` | 12 GB |         40 GB |
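
Reviewer's sketch (not part of this merge request): one way a member might check a downloaded bundle against the MD5 values in the table above. The function names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the tar file keeps the file name shown in the table.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the table above.
EXPECTED_MD5 = {
    "statement_paragraphs_arxmliv_08_2018.tar": "ff48316737b41c13fbaa786eef8d1b6e",
    "nomath_statement_paragraphs_arxmliv_08_2018.tar": "e214eacb3b73fa3e7416f00673f9c298",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so a multi-gigabyte tar never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the published value for its name.

    Assumption: the local file still carries one of the names listed above;
    any other name raises KeyError.
    """
    path = Path(path)
    return md5_of_file(path) == EXPECTED_MD5[path.name]
```

Calling `verify_download("statement_paragraphs_arxmliv_08_2018.tar")` before unpacking should catch a truncated or corrupted download early.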
+
+### Description
+
+For the full details, please read [our paper](https://arxiv.org/abs/1908.10993) announcing the statement classification task.
+
+This is a first public release of an annotated statement dataset derived from [arXMLiv](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/), a machine-readable representation of the arXiv corpus of scientific articles.
+
+This resource contains 10,555,689 paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we only selected the *first* paragraph, immediately following the heading. Headings include both structural sections (e.g. *Introduction*) and scholarly statement annotations (e.g. *Definition*, *Proof*, *Remark*).
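
Reviewer's sketch (not part of this merge request): a minimal reader for the layout described above, assuming an unpacked bundle of the form `<root>/<class>/<sha256>.txt` with UTF-8 text. The names `iter_statements` and `filename_matches_sha256` are hypothetical. The hash check further assumes the SHA-256 is taken over the exact bytes of each file, which the page does not spell out.

```python
import hashlib
from pathlib import Path


def iter_statements(root):
    """Yield (label, sentences) pairs from an unpacked copy of the dataset.

    The label is the name of the class subdirectory; each non-empty line of
    a file is treated as one sentence of the paragraph.
    """
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        # Files are visited lazily; with millions of entries we avoid sorting them.
        for path in class_dir.iterdir():
            if path.suffix != ".txt":
                continue
            text = path.read_text(encoding="utf-8")  # assumption: UTF-8 content
            sentences = [line for line in text.splitlines() if line.strip()]
            yield class_dir.name, sentences


def filename_matches_sha256(path):
    """Check whether the file stem equals the SHA-256 of the raw file bytes.

    Assumption: the hash covers the file content exactly as stored on disk.
    """
    path = Path(path)
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.stem
```

A consumer can feed `iter_statements("path/to/unpacked")` straight into a tokenizer or a train/test split without loading the whole corpus at once.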
+
+We also include a control dataset of the same statements with all mathematical symbolism omitted (`nomath`), numbering 10,137,007 paragraphs. This math-free resource is smaller, as omitting the formulas results in fewer unique paragraphs. We consider it a useful benchmark for evaluating the specific impact of mathematical expressions on classification performance.
+
+We welcome community feedback on data quality, representation issues, and organization and archival best practices. We plan to release new versions of this data jointly with new releases of the arXMLiv corpus.
+
+### Examples
+
+Definition with math lexemes (main data, single sentence, linebreaks for readability):
+```
+a directed quantum turing automaton is a quadruple
+  italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_,
+where
+  caligraphic_H caligraphic_K and caligraphic_L
+are finite dimensional hilbert spaces over the complex field blackboard_C and
+  italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K ARROW_rightarrow
+    caligraphic_H MULOP_tensor_product caligraphic_L
+is an isometry in fdhilb
+```
+source: `definition/1e4a1aea317bbf363c5314fb25eaf72c8a350a1007bb8aafc542e188405b93d5.txt`
+
+Same definition without math lexemes (nomath data, single sentence, linebreaks for readability):
+```
+a directed quantum turing automaton is a quadruple
+  where and are finite dimensional hilbert spaces over the complex field and
+  is an isometry in fdhilb
+```
+nomath source: `definition/35b170bae4259a5c430846116142d4e4a45097e52daf818b78ea378d94d14a21.txt`
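
Reviewer's sketch (not part of this merge request): in the example above, every math lexeme contains an underscore (`italic_T`, `RELOP_equals`, `PUNCT_,`) while ordinary words do not. Under that assumption, which is inferred from this single example and is not a documented rule of the dataset, dropping underscore-bearing tokens reproduces the nomath text. The helper name `strip_math_lexemes` is hypothetical.

```python
def strip_math_lexemes(sentence):
    """Drop every whitespace-separated token that contains an underscore.

    Heuristic only: it assumes math lexemes always carry an underscore and
    plain words never do, as in the definition example above.
    """
    return " ".join(token for token in sentence.split() if "_" not in token)


# The definition example from the main data, joined back into one sentence.
WITH_MATH = (
    "a directed quantum turing automaton is a quadruple "
    "italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, "
    "caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_, "
    "where caligraphic_H caligraphic_K and caligraphic_L "
    "are finite dimensional hilbert spaces over the complex field blackboard_C and "
    "italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K "
    "ARROW_rightarrow caligraphic_H MULOP_tensor_product caligraphic_L "
    "is an isometry in fdhilb"
)

print(strip_math_lexemes(WITH_MATH))
```

The heuristic is handy for quick experiments on the main bundle, but the separately distributed nomath bundle remains the authoritative math-free version.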
+
+### Citing this Resource
+
+#### pure bibTeX
+```
+@MISC{SML:statement-classification:08.2018,
+  author = {Deyan Ginev},
+  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
+  howpublished = {\url{https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/}},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2019}
+```
+
+#### bibTeX for the bibLaTeX package (preferred)
+```
+@online{SML:statement-classification:08.2018,
+  author = {Deyan Ginev},
+  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
+  url = {https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2019}
+```
+
+#### EndNote
+```
+%0 Generic
+%T Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}
+%A Ginev, Deyan
+%D 2019
+%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/
+%F SML:statement-classification:08.2018b
+%O SIGMathLing – Special Interest Group on Math Linguistics
+```
+
+### Download
+  [Download link](https://gl.kwarc.info/SIGMathLing/statements-arXMLiv-08-2018)
+  ([SIGMathLing members](/member/) only)
+
+### Generated via
+  - [llamapun 0.3.2](https://github.com/KWARC/llamapun/releases/tag/0.3.2)
+
+### Contents Breakdown
+
+  | **statement class** | **frequency** | **frequency (nomath)** |
+  | :------------------ | ------------: | ---------------------: |
+  | abstract            |     1,030,774 |              1,030,691 |
+  | acknowledgement     |       162,230 |                162,220 |
+  | affirmation         |            36 |                     22 |
+  | answer              |            40 |                     39 |
+  | assumption          |        29,577 |                 26,890 |
+  | bound               |            47 |                     37 |
+  | case                |         3,256 |                  2,208 |
+  | claim               |        89,737 |                 75,778 |
+  | comment             |           325 |                    322 |
+  | conclusion          |       284,585 |                284,536 |
+  | condition           |         3,950 |                  3,508 |
+  | conjecture          |        44,893 |                 41,780 |
+  | constraint          |           753 |                    731 |
+  | convention          |         2,176 |                  2,160 |
+  | corollary           |       436,768 |                402,728 |
+  | criterion           |           236 |                    219 |
+  | definition          |       686,717 |                667,797 |
+  | demonstration       |        23,043 |                 22,842 |
+  | discussion          |       116,650 |                116,643 |
+  | example             |       295,152 |                289,005 |
+  | exercise            |           404 |                    404 |
+  | expansion           |             5 |                      2 |
+  | expectation         |            13 |                     13 |
+  | experiment          |           154 |                    153 |
+  | explanation         |            16 |                     16 |
+  | fact                |        17,737 |                 16,473 |
+  | hint                |             9 |                      9 |
+  | introduction        |       688,530 |                688,187 |
+  | issue               |            41 |                     28 |
+  | keywords            |         1,565 |                  1,565 |
+  | lemma               |     1,320,646 |              1,162,559 |
+  | method              |        50,968 |                 50,947 |
+  | notation            |        16,611 |                 16,077 |
+  | note                |         4,462 |                  4,415 |
+  | notice              |             4 |                      4 |
+  | observation         |        18,776 |                 18,013 |
+  | overview            |        11,279 |                 11,277 |
+  | principle           |           236 |                    232 |
+  | problem             |        30,369 |                 29,221 |
+  | proof               |     2,125,750 |              2,096,644 |
+  | proposition         |       829,068 |                763,268 |
+  | question            |        27,240 |                 26,673 |
+  | relatedwork         |        26,300 |                 26,299 |
+  | remark              |       639,038 |                635,180 |
+  | result              |       239,905 |                239,639 |
+  | rule                |           775 |                    712 |
+  | solution            |           163 |                    144 |
+  | step                |         6,910 |                  6,536 |
+  | summary             |           117 |                    117 |
+  | theorem             |     1,287,653 |              1,212,044 |
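
Reviewer's sketch (not part of this merge request): the breakdown above is heavily imbalanced, which matters when judging classifier accuracy. The snippet below takes the four largest classes from the main-data column and the total of 10,555,689 paragraphs quoted in the description, and computes their shares. The names `LARGEST_CLASSES` and `class_share` are hypothetical.

```python
# Total number of paragraphs, as stated in the description (main data).
TOTAL_PARAGRAPHS = 10_555_689

# Frequencies of the four largest classes, copied from the table above.
LARGEST_CLASSES = {
    "proof": 2_125_750,
    "lemma": 1_320_646,
    "theorem": 1_287_653,
    "abstract": 1_030_774,
}


def class_share(count, total=TOTAL_PARAGRAPHS):
    """Fraction of all paragraphs that belong to a class with `count` entries."""
    return count / total


for name, count in LARGEST_CLASSES.items():
    print(f"{name:10s} {class_share(count):.1%}")

# A classifier that always predicts the largest class ("proof") would reach
# this accuracy, so it is a natural floor for comparing models.
majority_baseline = class_share(max(LARGEST_CLASSES.values()))
```

Since `proof` alone accounts for about a fifth of the data, per-class metrics are likely more informative than raw accuracy, especially for classes with only a handful of entries.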
--- a/resources/index.md
+++ b/resources/index.md
@@ -3,11 +3,12 @@ layout: page
 title: SIGMathLing - Datasets and Resources
 ---
 ## Resources hosted on the SIGMathLing Repository
+ 1. [arXMLiv statements dataset, 08.2018 release](/resources/arxmliv-statements-082018)
 1. [arXMLiv word embeddings, 08.2018 release](/resources/arxmliv-embeddings-082018)
- 1. [arXMLiv corpus, 08.2018 release](/resources/arxmliv-dataset-082018/) 
- 1. [quantity expressions](/resources/quantity-expressions) 
- 1. [arXMLiv word embeddings, 08.2017 release](/resources/arxmliv-embeddings-082017) 
- 1. [arXMLiv corpus, 08.2017 release](/resources/arxmliv-dataset-082017/) 
+ 1. [arXMLiv corpus, 08.2018 release](/resources/arxmliv-dataset-082018/)
+ 1. [quantity expressions](/resources/quantity-expressions)
+ 1. [arXMLiv word embeddings, 08.2017 release](/resources/arxmliv-embeddings-082017)
+ 1. [arXMLiv corpus, 08.2017 release](/resources/arxmliv-dataset-082017/)

 ## Resources hosted externally
 1.  [ACL-math-annotation](http://www-al.nii.ac.jp/acl-math-annotation/)