diff --git a/resources/argot-dataset-2020.md b/resources/argot-dataset-2020.md
new file mode 100644
index 0000000000000000000000000000000000000000..6f6bd104ea59bac40cff9fba873a303e31d5a4d3
--- /dev/null
+++ b/resources/argot-dataset-2020.md
@@ -0,0 +1,95 @@
+---
+layout: page
+title: ArGoT 2021 - arXiv Glossary of Terms
+---
+
+### Release
+ - This page documents: ArGoT 2021 (latest)
+
+### Contents
+ - 5,023 compressed XML files using the arXiv's naming convention.
+ - 881,301 articles.
+ - 800 ZIP archives, in arXiv's Year-Month `yymm` naming scheme.
+ - The XML sources total `500 MB` packaged, and `2.1 TB` unpacked.
+
+### Download
+ - [Download link](https://gl.kwarc.info/SIGMathLing/dataset-argot-2021)
+ - [SIGMathLing members](/member/) only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!
+
+### Description
+
+This is the first public release of the ArGoT dataset generated by the [Formal Abstracts](https://formalabstracts.github.io/) research group.
+ArGoT is a dataset of term-definition pairs automatically extracted from mathematical papers on the arXiv.
+It consists of XML files with the following tags and attributes:
+ - article: arXiv article entry
+   - name: link to the article in the arXiv
+   - num: number of paragraphs in the article
+ - definition: a paragraph labeled as a definition by the ML classifier
+   - index: paragraph number inside the article
+ - dfndum: the term (definiendum) found in the statement of the definition.
+
+Two independently extracted versions of the dataset are provided:
+ - NN: Neural network approach using an LSTM for classification and an LSTM-CRF for sequence tagging.
+ - SGD: Stochastic Gradient Descent for classification and ChunkParser for named entity recognition.
+
+
+### Citing this Resource
+
+#### pure bibTeX
+```
+@MISC{SML:argot:2021,
+  author = {Luis Berlioz},
+  title = {ArGoT:2021 dataset, arXiv Glossary of Terms},
+  howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/argot-dataset-2021/}},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = {2021}
+}
+```
+
+#### bibTeX for the bibLaTeX package (preferred)
+```
+@online{SML:argot:2021,
+  author = {Luis Berlioz},
+  title = {argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv},
+  url = {https://sigmathling.kwarc.info/resources/argot-dataset-2021/},
+  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
+  year = {2021}
+}
+```
+
+#### EndNote
+```
+%0 Generic
+%T argot:2021 dataset, an automatically extracted glossary of mathematical terms from the arXiv
+%A Berlioz, Luis
+%D 2021
+%I hosted at https://sigmathling.kwarc.info/resources/argot-dataset-2021/
+%F SML:argot:2021b
+%O SIGMathLing – Special Interest Group on Math Linguistics
+```
+
+### Accessibility and License
+The content of this dataset is licensed to [SIGMathLing members](/member/) for research
+and tool development purposes.
+
+Access is restricted to [SIGMathLing members](/member/) under the
+[SIGMathLing Non-Disclosure-Agreement](/nda/), because for most [arXiv](http://arxiv.org)
+articles the right of distribution was given (or assumed) only to arXiv itself.
+
+### Generated via
+ - [LaTeXML 0.8.5](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.5),
+ - [latexml-plugin-argot 1.1](docker-singularity classifier)
+
+### About
+Part of the [Formal Abstracts](https://formalabstracts.github.io/) research group. Author: Luis Berlioz
+
+### Appendix
+**XML entry example:**
+
+```xml
+ <article name="1407_005/1407.2218/1407.2218.xml" num="89">
+   <definition index="51">
+     <stmnt> Assume _inline_math_. We define the following space-time
+     norm if _inline_math_ is a time interval _display_math_ </stmnt>
+     <dfndum>space-time norm</dfndum>
+   </definition>
+ </article>
+```
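
Reading an entry like the one above could look roughly like the following Python sketch, which uses only the standard library. It assumes each file is well-formed XML whose root is either a single `article` element or a wrapper around several; the page does not document the root element, so treat that as a guess. The helper name `extract_pairs` and the dictionary keys are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Sample entry copied from the appendix. Real files may wrap many
# <article> elements in a root element (an assumption, not documented).
SAMPLE = """
<article name="1407_005/1407.2218/1407.2218.xml" num="89">
  <definition index="51">
    <stmnt> Assume _inline_math_. We define the following space-time
    norm if _inline_math_ is a time interval _display_math_ </stmnt>
    <dfndum>space-time norm</dfndum>
  </definition>
</article>
"""


def extract_pairs(xml_text):
    """Return one dict per (definition, dfndum) pair found in xml_text."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    # iter() also yields the root itself when its tag matches, so this
    # handles both a bare <article> and a wrapper around several.
    for article in root.iter("article"):
        for definition in article.iter("definition"):
            # Collapse line breaks and repeated spaces in the statement.
            statement = " ".join((definition.findtext("stmnt") or "").split())
            for dfndum in definition.iter("dfndum"):
                pairs.append({
                    "article": article.get("name"),
                    # Assumes every definition carries an integer index.
                    "paragraph": int(definition.get("index")),
                    "term": (dfndum.text or "").strip(),
                    "statement": statement,
                })
    return pairs


pairs = extract_pairs(SAMPLE)
for pair in pairs:
    print(pair["term"], "<-", pair["article"], "paragraph", pair["paragraph"])
```

Looping over `dfndum` elements rather than taking only the first lets a definition that introduces several terms yield one pair per term; whether the dataset actually contains such definitions is not stated on this page.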