Michael Kohlhase authored
layout: project
title: arXMLiv
shorttitle: arXMLiv
active: true
teaser: Translating the arXiv to XML/HTML5
start_date: '2006'
publink: http://kwarc.github.io/bibs/arXMLiv
funding: internal
- dginev
- mkohlhase
- Dr. Bruce Miller (NIST)
- various Jacobs University undergrads
logo: public/kwarc_logo.svg
website: http://corpora.mathweb.org
repository: https://github.com/dginev/CorTeX
The Cornell e-print arXiv contains one of the largest corpora of scientific literature in the world. Unfortunately, its contents are locked up in the TeX/LaTeX format, which makes them nearly useless for knowledge management techniques. We translate the corpus to XML and "HTML5 with MathML" via LaTeXML to have a basis for uncovering its structural semantics (see the LLaMaPuN project for details).
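As a rough illustration of the per-document translation step, the following minimal sketch builds a `latexmlc` command line that converts a single TeX file to HTML5. It assumes LaTeXML is installed and `latexmlc` is on the `PATH`; the helper names `build_latexml_command` and `convert`, the file name `paper.tex`, and the output directory `out` are hypothetical, and this is not how the project's own pipeline invokes LaTeXML.

```python
import subprocess
from pathlib import Path


def build_latexml_command(tex_file, out_dir):
    """Build a latexmlc command line that converts one TeX file to HTML5.

    Hypothetical helper for illustration; it only assembles the arguments.
    """
    source = Path(tex_file)
    destination = Path(out_dir) / (source.stem + ".html")
    return ["latexmlc", "--format=html5", f"--dest={destination}", str(source)]


def convert(tex_file, out_dir):
    """Run the conversion and return the exit code (needs LaTeXML installed)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    result = subprocess.run(build_latexml_command(tex_file, out_dir), check=False)
    return result.returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without LaTeXML.
    print(" ".join(build_latexml_command("paper.tex", "out")))
```

Calling `convert("paper.tex", "out")` would run the command for one document; processing a whole corpus additionally needs job distribution and error tracking, which this sketch does not attempt.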
The actual corpus processing (and distribution to hundreds of worker machines) is performed by the CorTeX system; see the system state/results on arXiv.
Applications of this corpus include the mathematical search engine MathWebSearch (live demo on the arXMLiv data set).
Unfortunately, we cannot freely redistribute the results of the transformation due to arXiv licensing policies. Therefore, we have created the Special Interest Group for Math Linguistics (SIGMathLing), which can distribute the data sets to its members under an NDA.