    ---
    
    layout: system
    
    title: arXMLiv
    
    teaser: Translating the arXiv to XML/HTML5 
    
    start_date: '2006'
    
    people:
        - mkohlhase
        - dginev
    
    
    website: http://cortex.mathweb.org
    
    repository: https://github.com/dginev/CorTeX
    
    ---
    
    The [Cornell e-print arXiv](http://arxiv.org) contains one of the largest corpora of
    scientific literature in the world. Unfortunately, its contents are locked up in the
    TeX/LaTeX format, which makes it nearly useless for knowledge management techniques. We
    translate it to XML and "HTML5 with [MathML](http://www.w3.org/TR/MathML/)" via
    [LaTeXML](https://dlmf.nist.gov/LaTeXML/) to have a basis for uncovering its structural
    semantics (see the [LLaMaPuN](/systems/llamapun/) project for details).
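
As a rough illustration of the translation step for a single document, the sketch below builds a `latexmlc` call that converts one TeX source to HTML5. It is not the arXMLiv pipeline itself, which runs through CorTeX across many workers; the helper names `latexml_command` and `convert`, the file name `paper.tex`, and the output directory `html` are assumptions made for this example.

```python
import shutil
import subprocess
from pathlib import Path


def latexml_command(tex_file: str, out_dir: str = "html") -> list[str]:
    """Build a latexmlc invocation that converts one TeX source to HTML5.

    Hypothetical helper: file layout and naming are assumptions of this sketch.
    """
    source = Path(tex_file)
    destination = Path(out_dir) / (source.stem + ".html")
    return [
        "latexmlc",
        "--format=html5",  # HTML5 output with embedded MathML
        f"--destination={destination.as_posix()}",
        source.as_posix(),
    ]


def convert(tex_file: str, out_dir: str = "html") -> int:
    """Run the conversion if latexmlc is installed and return its exit code."""
    if shutil.which("latexmlc") is None:
        raise FileNotFoundError("latexmlc not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return subprocess.run(latexml_command(tex_file, out_dir)).returncode
```

With LaTeXML installed, `convert("paper.tex")` would write `html/paper.html`; a corpus-scale run additionally needs job distribution and result tracking, which this sketch leaves out.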
    
    The actual corpus processing (and distribution to hundreds of worker machines) is
    performed by the [CorTeX](https://github.com/dginev/cortex) system; see the system
    state/results: [old but complete](http://cortex.mathweb.org/corpus/arXMLiv),
    [new system in Erlangen](https://corpora.mathweb.org/corpus/arxiv_1712/tex_to_html).
    
    Applications of this include the mathematical search engine [MathWebSearch](/systems/mws/)
    (live [demo on the arXMLiv data set](http://arxivsearch.mathweb.org)).
    
    Unfortunately, we cannot redistribute the results of the transformation freely, due to
    arXiv licensing policies. Therefore, we have created the Special Interest Group for Math
    Linguistics ([SIGMathLing](http://SIGMathLing.kwarc.info)), which can distribute the data
    sets under an [NDA](https://sigmathling.kwarc.info/nda/) to
    [SIGMathLing members](https://sigmathling.kwarc.info/member/).