From 9d716bdd38e9783706a87e9e27b22ef4cd5fac1a Mon Sep 17 00:00:00 2001
From: Deyan Ginev <d.ginev@jacobs-university.de>
Date: Mon, 22 Jul 2019 14:40:03 -0400
Subject: [PATCH] first announcement of statement dataset

---
 resources/arxmliv-statements-082018.md | 106 +++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 resources/arxmliv-statements-082018.md

diff --git a/resources/arxmliv-statements-082018.md b/resources/arxmliv-statements-082018.md
new file mode 100644
index 0000000..07ace63
--- /dev/null
+++ b/resources/arxmliv-statements-082018.md
@@ -0,0 +1,106 @@
+---
+layout: page
+title: Scientific statement classification dataset from arXMLiv 08.2018
+---
+Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group.
+
+### Author
+ - Deyan Ginev
+
+### Current release
+ - 08.2018
+
+### Accessibility and License
+The content of this dataset is licensed to [SIGMathLing members](/member/) for research
+and tool development purposes.
+
+Access is restricted to [SIGMathLing members](/member/) under the
+[SIGMathLing Non-Disclosure-Agreement](/nda/), because for most [arXiv](http://arxiv.org)
+articles the right of distribution was only given (or assumed) to arXiv itself.
+
+### Contents
+  - 10.5 million plain-text paragraphs associated with a statement class
+  - 50 directories, each containing entries from the same class of scientific statements
+  - each filename is the SHA-256 hash of the file's contents, which guarantees unique names and an effectively random order
+  - two separate tar bundles over the same data, one with and one without lexemes for mathematical expressions
+  - data is extracted from the separately distributed [arXMLiv 08.2018](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/) dataset
+
+  | file name                                         | MD5                                |  size | size unpacked |
+  | :------------------------------------------------ | :--------------------------------- | ----: | ------------: |
+  | `statement_paragraphs_arxmliv_08_2018.tar`        | `ff48316737b41c13fbaa786eef8d1b6e` | 22 GB |         45 GB |
+  | `nomath_statement_paragraphs_arxmliv_08_2018.tar` | `e214eacb3b73fa3e7416f00673f9c298` | 12 GB |         40 GB |
+
+### Description
+
+For full details, please read [our paper](TODO) announcing the statement classification task.
+
+This is the first public release of an annotated statement dataset derived from [arXMLiv](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/), a machine-readable representation of the arXiv corpus of scientific articles.
+
+This resource contains 10,555,689 paragraphs with associated statement labels, stored as one paragraph per file with one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, from which we selected only the *first* paragraph immediately following the heading. Headings include both structural sections (e.g. *Introduction*) and scholarly statement annotations (e.g. *Definition*, *Proof*, *Remark*).
+
+We also include a control dataset of the same statements with all mathematical symbolism omitted (`nomath`), numbering 10,137,007 paragraphs. This math-free resource is smaller because omitting the formulas leaves fewer unique paragraphs. We consider it a useful benchmark for evaluating the specific impact of mathematical expressions on classification performance.
+
+We welcome community feedback on data quality, representation issues, and organization and archival best practices. We plan to release new versions of this data jointly with new releases of the arXMLiv corpus.
+
+### Examples
+
+Definition with math lexemes (main data, single sentence, linebreaks for readability):
+```
+a directed quantum turing automaton is a quadruple
+  italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_,
+where
+  caligraphic_H caligraphic_K and caligraphic_L
+are finite dimensional hilbert spaces over the complex field blackboard_C and
+  italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K ARROW_rightarrow
+    caligraphic_H MULOP_tensor_product caligraphic_L
+is an isometry in fdhilb
+```
+source: `definition/1e4a1aea317bbf363c5314fb25eaf72c8a350a1007bb8aafc542e188405b93d5.txt`
+
+Same definition without math lexemes (nomath data, single sentence, linebreaks for readability):
+```
+a directed quantum turing automaton is a quadruple
+  where and are finite dimensional hilbert spaces over the complex field and
+  is an isometry in fdhilb
+```
+nomath source: `definition/35b170bae4259a5c430846116142d4e4a45097e52daf818b78ea378d94d14a21.txt`
+
+### Citing this Resource
+
+#### pure bibTeX
+```
+@MISC{SML:statement-classification:08.2018,
+  author = {Deyan Ginev},
+  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
+  howpublished = {\url{https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/}},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2019}
+```
+
+#### bibTeX for the bibLaTeX package (preferred)
+```
+@online{SML:statement-classification:08.2018,
+  author = {Deyan Ginev},
+  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
+  url = {https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2019}
+```
+
+#### EndNote
+```
+%0 Generic
+%T Statement classification dataset, 10.5 million plain-text paragraphs from arXMLiv:08.2018
+%A Ginev, Deyan
+%D 2019
+%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/
+%F SML:statement-classification:08.2018
+%O SIGMathLing – Special Interest Group on Math Linguistics
+```
+
+### Download
+  [Download link](https://gl.kwarc.info/SIGMathLing/statements-arXMLiv-08-2018)
+  ([SIGMathLing members](/member/) only)
+
+### Generated via
+  - [llamapun 0.3.2](https://github.com/KWARC/llamapun/releases/tag/0.3.2)
-- 
GitLab