    ---
    
    layout: system
    
    title: "LLaMaPUn: Language and Mathematics Processing and Understanding"
    
    teaser: A Rust library for math corpus linguistics.
    
    
    start_date: 2008-11
    
    people: 
        - mkohlhase
        - dginev
        - jfschaefer
        - itoloaca
    
    
    supported-by:
        - mathsearch
    
    
    repository: https://github.com/KWARC/LLaMaPUn/
    
    publink: http://kwarc.github.io/bibs/llamapun
    
    ---
    
    The [LLaMaPUn library](https://github.com/KWARC/LLaMaPUn/) is a
    [Rust](https://www.rust-lang.org) library that provides a wide range of processing
    tools for natural language and mathematics.
    It can be used to investigate the structure and meaning of scientific and technical
    documents, and to build tools that extract semantic representations from them. These
    representations can in turn be used to enhance access to and interaction with document corpora.
    
    
    In particular, the LLaMaPUn library is used on the
    [arXMLiv](https://www.kwarc.info/projects/arXMLiv/) data set, which is a translation
    of the [arXiv](https://arxiv.org/) corpus to "HTML5 with [MathML](https://www.w3.org/TR/MathML/)".
    
    
    
    Some of the library's features are:
     * Plaintext generation with many options (Unicode normalization, word stemming, custom handling of e.g. `math` nodes, ...)
     * Word and sentence tokenization
     * Support for standard NLP tools (token models for GloVe, POS tagging with SENNA, ...)
     * Mapping between plaintext offsets and HTML nodes (using the DNM data structure)
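
The idea behind mapping plaintext offsets back to document nodes can be illustrated with a small, self-contained sketch. This is not the LLaMaPUn API: the types `TextNode` and `OffsetMap`, the functions `build_offset_map` and `node_at`, and the `MathFormula` placeholder token are all hypothetical names chosen for illustration. The sketch assumes a flat list of text-bearing nodes, whereas a real document is a tree.

```rust
/// Hypothetical stand-in for an HTML text-bearing node: element name plus text.
struct TextNode {
    label: &'static str,
    text: &'static str,
}

/// Generated plaintext plus the byte range each node occupies in it.
struct OffsetMap {
    plaintext: String,
    /// (start, end, node index), with `end` exclusive.
    ranges: Vec<(usize, usize, usize)>,
}

/// Concatenate node texts into plaintext while recording each node's byte range.
/// `math` nodes are replaced by a placeholder token (an assumption of this sketch).
fn build_offset_map(nodes: &[TextNode]) -> OffsetMap {
    let mut plaintext = String::new();
    let mut ranges = Vec::new();
    for (index, node) in nodes.iter().enumerate() {
        let rendered = if node.label == "math" { "MathFormula" } else { node.text };
        let start = plaintext.len();
        plaintext.push_str(rendered);
        ranges.push((start, plaintext.len(), index));
    }
    OffsetMap { plaintext, ranges }
}

/// Return the index of the node whose range contains the given byte offset.
fn node_at(map: &OffsetMap, offset: usize) -> Option<usize> {
    map.ranges
        .iter()
        .find(|r| r.0 <= offset && offset < r.1)
        .map(|r| r.2)
}

fn main() {
    let nodes = [
        TextNode { label: "span", text: "Let " },
        TextNode { label: "math", text: "x^2 + y^2" },
        TextNode { label: "span", text: " be positive." },
    ];
    let map = build_offset_map(&nodes);
    println!("{}", map.plaintext);
    if let Some(index) = node_at(&map, 6) {
        println!("offset 6 falls inside <{}>", nodes[index].label);
    }
}
```

Because the ranges are recorded while the plaintext is generated, a tool that works purely on the plaintext (a tokenizer or tagger, say) can report byte offsets, and those offsets can then be resolved to the originating node for annotation.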
    
    For a more complete overview, take a look at the [README file](https://github.com/KWARC/llamapun/blob/master/README.md).