---
layout: project
title: arXMLiv
shorttitle: arXMLiv
active: true
teaser: Translating the arXiv to XML/HTML5
start_date: '2006'
publink: http://kwarc.github.io/bibs/arXMLiv
funding: internal
people:
  - dginev
  - mkohlhase
collaborators:
  - Dr. Bruce Miller (NIST)
  - various Jacobs University undergrads
logo: public/kwarc_logo.svg
website: http://corpora.mathweb.org
repository: https://github.com/dginev/CorTeX
---

The [Cornell e-print arXiv](http://arxiv.org) contains one of the largest corpora of scientific literature in the world. Unfortunately, its contents are locked up in the TeX/LaTeX format, which makes them nearly useless for knowledge management techniques. We translate the corpus to XML and "HTML5 with [MathML](http://www.w3.org/TR/MathML/)" via [LaTeXML](https://dlmf.nist.gov/LaTeXML/) to have a basis for uncovering its structural semantics (see the [LLaMaPuN](/systems/llamapun/) project for details).

The actual corpus processing (and distribution to hundreds of worker machines) is performed by the [CorTeX](https://github.com/dginev/cortex) system; see the [system state/results on arXiv](https://corpora.mathweb.org/history/arxmliv/tex_to_html).

Applications include the mathematical search engine [MathWebSearch](/systems/mws/) (live [demo on the arXMLiv data set](http://arxivsearch.mathweb.org)).

Unfortunately, we cannot freely redistribute the results of the transformation due to arXiv licensing policies. Therefore, we have created the Special Interest Group for Math Linguistics ([SIGMathLing](http://SIGMathLing.kwarc.info)), which can distribute the data sets under an [NDA](https://sigmathling.kwarc.info/nda/) to [SIGMathLing members](https://sigmathling.kwarc.info/member/).