arXMLiv.md

    layout: project
    
    title: arXMLiv
    shorttitle: arXMLiv
    active: true
    teaser: Translating the arXiv to XML/HTML5 
    start_date: '2006'
    publink: http://kwarc.github.io/bibs/arXMLiv
    funding: internal
    
    people:
        - mkohlhase
        - dginev
    
    collaborators:
        - Dr. Bruce Miller (NIST)
        - various Jacobs University undergrads
    
    logo: public/kwarc_logo.svg
    website: http://cortex.mathweb.org
    repository: https://github.com/dginev/CorTeX

    The Cornell e-print arXiv contains one of the largest corpora of scientific literature in the world. Unfortunately, its contents are locked up in the TeX/LaTeX format, which makes them nearly useless for knowledge management techniques. We translate the corpus to XML and "HTML5 with MathML" via LaTeXML to have a basis for uncovering its structural semantics (see the LLaMaPuN project for details).

    The actual corpus processing (and its distribution to hundreds of worker machines) is performed by the CorTeX system. The system state and results are available for two instances: the old but complete one, and the new system in Erlangen.

    Applications include the mathematical search engine MathWebSearch (live demo on the arXMLiv data set).

    Unfortunately, arXiv licensing policies do not allow us to redistribute the results of the transformation freely. We have therefore created the Special Interest Group for Math Linguistics (SIGMathLing), which can distribute the data sets to its members under an NDA.