diff --git a/systems/arXMLiv.md b/systems/arXMLiv.md
index 97d2ce1eb2805a79bbab3c2507b1653dde72833d..3a24db56b57301f7eb4ec4587ca03b3063aa75b0 100644
--- a/systems/arXMLiv.md
+++ b/systems/arXMLiv.md
@@ -9,8 +9,27 @@
 people:
 - mkohlhase
 - dginev
-website: http://cortex.mathweb.info
+website: http://cortex.mathweb.org
 repository: https://github.com/dginev/CorTeX
 ---
 
-The [Cornell e-print arXiv](http://arxiv.org) contains one of the largest corpora of scientific literature in the world. Unfortunately, its contents are locked up in the TeX/LaTeX format, which makes it nearly useless for knowledge management techniques. We translate it to XML to have a basis for uncovering it's structural semantics.
+The [Cornell e-print arXiv](http://arxiv.org) contains one of the largest corpora of
+scientific literature in the world. Unfortunately, its contents are locked up in the
+TeX/LaTeX format, which makes it nearly useless for knowledge management techniques. We
+translate it to XML and "HTML5 with [MathML](http://www.w3.org/TR/MathML/)" via
+[LaTeXML](https://dlmf.nist.gov/LaTeXML/) to have a basis for uncovering it's structural
+semantics (see the [LLaMaPuN](/systems/llamapun/) project for details).
+
+The actual corpus processing (and distribution to hundreds of worker machines) is
+performed by the [CorTeX](https://github.com/dginev/cortex) system; see the system
+state/results: [old but complete](http://cortex.mathweb.org/corpus/arXMLiv),
+[new system in Erlangen](https://corpora.mathweb.org/corpus/arxiv_1712/tex_to_html).
+
+Applications of this include a mathematical search engine [MathWebSearch](/systems/mws/):
+(live [demo on the arXMLiv data set](http://arxivsearch.mathweb.org).
+
+Unfortunately, we cannot re-distribute the results of the transformation freely, due to
+arXiv licensing policies. Therefore we have created the Special Interest Group for Math
+Linguistics ([SIGMathLing](http://SIGMathLing.kwarc.info) that can distribute the data
+sets under an [NDA](https://sigmathling.kwarc.info/nda/) to
+[SIGMathLing members](https://sigmathling.kwarc.info/member/)).