For full details, please read [our paper](TODO) announcing the statement classification task.
This is a first public release of an annotated statement dataset derived from [arXMLiv](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/), a machine-readable representation of the arXiv corpus of scientific articles.
This resource contains 10,555,689 paragraphs with associated statement labels, stored as one paragraph per file with one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we selected only the *first* paragraph immediately following the heading. Headings include both structural sections (e.g. *Introduction*) and scholarly statement annotations (e.g. *Definition*, *Proof*, *Remark*).
We also include a control dataset of the same statements with all mathematical symbolism omitted (`nomath`), numbering 10,137,007 paragraphs. This math-free resource is smaller because omitting the formulas results in fewer unique paragraphs. We consider it a useful benchmark for evaluating the specific impact of mathematical expressions on classification performance.
We welcome community feedback on data quality, representation issues, and organization and archival best practices. We plan to release new versions of this data jointly with new releases of the arXMLiv corpus.
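The on-disk layout described above (one paragraph per file, one sentence per line, one subdirectory per annotated class) can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names `iter_statements` and `count_labels` are hypothetical, and it assumes UTF-8 text files and that every directory directly under the dataset root is a class label.

```python
from collections import Counter
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_statements(root: Path) -> Iterator[Tuple[str, List[str]]]:
    """Yield (label, sentences) for every paragraph file under `root`.

    Assumption: each immediate subdirectory of `root` is named after a
    class, and each file inside it holds one paragraph with one sentence
    per line.
    """
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        label = class_dir.name
        for para_file in sorted(f for f in class_dir.iterdir() if f.is_file()):
            text = para_file.read_text(encoding="utf-8")
            # Drop empty lines so each list item is one sentence.
            sentences = [line for line in text.splitlines() if line.strip()]
            yield label, sentences


def count_labels(root: Path) -> Counter:
    """Count how many paragraphs each class contains."""
    return Counter(label for label, _ in iter_statements(root))
```

Calling `count_labels(Path("path/to/dataset"))` on an unpacked copy (the path is a placeholder) gives a quick view of the class distribution. Because `iter_statements` is a generator, paragraphs are read one at a time instead of being loaded into memory together.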
### Examples
Definition with math lexemes (main data, single sentence, linebreaks for readability):
```
a directed quantum turing automaton is a quadruple