From 9d716bdd38e9783706a87e9e27b22ef4cd5fac1a Mon Sep 17 00:00:00 2001
From: Deyan Ginev <d.ginev@jacobs-university.de>
Date: Mon, 22 Jul 2019 14:40:03 -0400
Subject: [PATCH] first announcement of statement dataset

---
 resources/arxmliv-statements-082018.md | 106 +++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 resources/arxmliv-statements-082018.md

diff --git a/resources/arxmliv-statements-082018.md b/resources/arxmliv-statements-082018.md
new file mode 100644
index 0000000..07ace63
--- /dev/null
+++ b/resources/arxmliv-statements-082018.md
@@ -0,0 +1,106 @@
+---
+layout: page
+title: Scientific statement classification dataset from arXMLiv 08.2018
+---
+Part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group
+
+### Author
+ - Deyan Ginev
+
+### Current release
+ - 08.2018
+
+### Accessibility and License
+The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
+and tool development purposes.
+
+Access is restricted to [SIGMathLing members](/member/) under the
+[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
+articles, the right of distribution was only given (or assumed) to arXiv itself.
+
+### Contents
+ - 10.5 million plain-text paragraphs associated with a statement class
+ - 50 directories, each containing entries from the same class of scientific statements
+ - each filename is a SHA-256 hash of its contents, as a guarantee for uniqueness and random order
+ - two separate tar bundles over the same data, one with and one without lexemes for mathematical expressions
+ - data is extracted from the separately distributed [arXMLiv 08.2018](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/) dataset.
+
+ | file name                                         | MD5                                | size  | size unpacked |
+ | :------------------------------------------------ | :--------------------------------- | ----: | ------------: |
+ | `statement_paragraphs_arxmliv_08_2018.tar`        | `ff48316737b41c13fbaa786eef8d1b6e` | 22 GB |         45 GB |
+ | `nomath_statement_paragraphs_arxmliv_08_2018.tar` | `e214eacb3b73fa3e7416f00673f9c298` | 12 GB |         40 GB |
+
+### Description
+
+For the full details, please read [our paper](TODO) announcing the statement classification task.
+
+This is the first public release of an annotated statement dataset derived from [arXMLiv](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/), a machine-readable representation of the arXiv corpus of scientific articles.
+
+This resource contains 10,555,689 paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we only selected the *first* paragraph immediately following the heading. Headings include both structural sections (e.g. *Introduction*) and scholarly statement annotations (e.g. *Definition*, *Proof*, *Remark*).
+
+We also include a control dataset of the same statements with all mathematical symbolism omitted (`nomath`), numbering 10,137,007 paragraphs. This math-free resource is smaller, as omitting the formulas results in fewer unique paragraphs. We consider it a useful benchmark for evaluating the specific impact of mathematical expressions on classification performance.
+
+We welcome community feedback on data quality, representation issues, and organization and archival best practices. We plan to release new versions of this data jointly with new releases of the arXMLiv corpus.
+
+### Examples
+
+Definition with math lexemes (main data, single sentence, linebreaks for readability):
+```
+a directed quantum turing automaton is a quadruple
+  italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_,
+where
+  caligraphic_H caligraphic_K and caligraphic_L
+are finite dimensional hilbert spaces over the complex field blackboard_C and
+  italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K ARROW_rightarrow
+  caligraphic_H MULOP_tensor_product caligraphic_L
+is an isometry in fdhilb
+```
+source: `definition/1e4a1aea317bbf363c5314fb25eaf72c8a350a1007bb8aafc542e188405b93d5.txt`
+
+Same definition without math lexemes (nomath data, single sentence, linebreaks for readability):
+```
+a directed quantum turing automaton is a quadruple
+  where and are finite dimensional hilbert spaces over the complex field and
+  is an isometry in fdhilb
+```
+nomath source: `definition/35b170bae4259a5c430846116142d4e4a45097e52daf818b78ea378d94d14a21.txt`
+
+### Citing this Resource
+
+#### pure bibTeX
+```
+@MISC{SML:statement-classification:08.2018,
+  author = {Deyan Ginev},
+  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
+  howpublished = {\url{https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/}},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2019}
+```
+
+#### bibTeX for the bibLaTeX package (preferred)
+```
+@online{SML:statement-classification:08.2018,
+  author = {Deyan Ginev},
+  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
+  url = {https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = 2019}
+```
+
+#### EndNote
+```
+%0 Generic
+%T Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}
+%A Ginev, Deyan
+%D 2019
+%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/
+%F SML:statement-classification:08.2018b
+%O SIGMathLing – Special Interest Group on Math Linguistics
+```
+
+### Download
+ [Download link](https://gl.kwarc.info/SIGMathLing/statements-arXMLiv-08-2018)
+ ([SIGMathLing members](/member/) only)
+
+### Generated via
+ - [llamapun 0.3.2](https://github.com/KWARC/llamapun/releases/tag/0.3.2)
-- 
GitLab