    ---
    layout: project
    title: arXMLiv
    shorttitle: arXMLiv
    active: true
    teaser: Translating the arXiv to XML/HTML5
    start_date: '2006'
    publink: http://kwarc.github.io/bibs/arXMLiv
    funding: internal
        - mkohlhase
    collaborators:
        - Dr. Bruce Miller (NIST)
        - various Jacobs University undergrads
    logo: public/kwarc_logo.svg
    website: http://corpora.mathweb.org
    repository: https://github.com/dginev/CorTeX
    ---
    
    The [Cornell e-print arXiv](http://arxiv.org) contains one of the largest corpora of
    scientific literature in the world. Unfortunately, its contents are locked up in the
    TeX/LaTeX format, which makes it nearly useless for knowledge management techniques. We
    translate it to XML and "HTML5 with [MathML](http://www.w3.org/TR/MathML/)" via
    [LaTeXML](https://dlmf.nist.gov/LaTeXML/) to have a basis for uncovering its structural
    semantics (see the [LLaMaPuN](/systems/llamapun/) project for details).
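
As a rough illustration of the per-document translation step, the following sketch calls LaTeXML's `latexmlc` command-line tool from Python to turn a single TeX file into HTML5. It is not the configuration used for the arXMLiv corpus runs; the file names and the helper functions `latexmlc_command` and `convert` are hypothetical, and the sketch assumes that `latexmlc` is installed and on the `PATH`.

```python
import shutil
import subprocess


def latexmlc_command(tex_file, html_file):
    """Build a latexmlc command line that converts one TeX file to HTML5."""
    return [
        "latexmlc",
        "--format=html5",                # request HTML5 output (with MathML)
        f"--destination={html_file}",    # where to write the result
        str(tex_file),
    ]


def convert(tex_file, html_file):
    """Run the conversion if latexmlc is available; return True on success."""
    if shutil.which("latexmlc") is None:
        return False  # assumption: LaTeXML is not installed on this machine
    result = subprocess.run(latexmlc_command(tex_file, html_file))
    return result.returncode == 0
```

Calling `convert("paper.tex", "paper.html")` would then attempt the translation for one (hypothetical) document; processing a whole corpus additionally needs the job distribution that CorTeX provides.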
    
    The actual corpus processing (and distribution to hundreds of worker machines) is
    performed by the [CorTeX](https://github.com/dginev/cortex) system; see the [system
    state/results on arXiv](https://corpora.mathweb.org/history/arxmliv/tex_to_html).
    
    Applications include the mathematical search engine [MathWebSearch](/systems/mws/)
    (live [demo on the arXMLiv data set](http://arxivsearch.mathweb.org)).
    
    Unfortunately, we cannot freely redistribute the results of the transformation due to
    arXiv licensing policies. We have therefore created the Special Interest Group for Math
    Linguistics ([SIGMathLing](http://SIGMathLing.kwarc.info)), which can distribute the data
    sets under an [NDA](https://sigmathling.kwarc.info/nda/) to
    [SIGMathLing members](https://sigmathling.kwarc.info/member/).