---
layout: system

title: "LLaMaPUn: Language and Mathematics Processing and Understanding"
teaser: A Rust library for math corpus linguistics.

start_date: 2008-11

people: 
    - mkohlhase
    - dginev
    - jfschaefer
    - itoloaca

supported-by:
    - mathsearch

repository: https://github.com/KWARC/LLaMaPUn/
publink: http://kwarc.github.io/bibs/llamapun
---

The [LLaMaPUn library](https://github.com/KWARC/LLaMaPUn/) is a
[Rust](https://www.rust-lang.org) library that provides a wide range of processing
tools for natural language and mathematics.
It can be used to investigate the structure and meaning of scientific and technical
documents, and to build tools that extract semantic representations from them in order to
improve access to and interaction with document corpora.

In particular, the LLaMaPUn library is used on the
[arXMLiv](https://www.kwarc.info/projects/arXMLiv/) data set, which is a translation
of the [arXiv](https://arxiv.org/) corpus to "HTML5 with [MathML](https://www.w3.org/TR/MathML/)".

Some of the library's features are:

 * Plaintext generation with many options (Unicode normalization, word stemming, custom handling of e.g. `math` nodes, ...)
 * Word/Sentence tokenization
 * Support for standard NLP tools (token models for GloVe, POS tagging with SENNA, ...)
 * Mapping between plaintext offsets and HTML nodes (using the DNM data structure)
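
To illustrate the idea behind the last feature, here is a minimal, self-contained sketch of how plaintext offsets can be mapped back to the nodes they came from. It is not LLaMaPUn's actual API: the names `NodeRange`, `PlaintextMap`, `build` and `node_at`, the numeric node ids, and the `MathFormula` placeholder token are all hypothetical and chosen only for this example.

```rust
/// Hypothetical, simplified stand-in for a DNM-style mapping.
/// This is an illustration only, not the data structure used by LLaMaPUn.
#[derive(Debug, Clone, PartialEq)]
struct NodeRange {
    node_id: usize, // illustrative identifier of an HTML node
    start: usize,   // byte offset into the plaintext (inclusive)
    end: usize,     // byte offset into the plaintext (exclusive)
}

struct PlaintextMap {
    plaintext: String,
    ranges: Vec<NodeRange>,
}

impl PlaintextMap {
    /// Concatenate the text of each node and record where it lands
    /// in the resulting plaintext.
    fn build(nodes: &[(usize, &str)]) -> Self {
        let mut plaintext = String::new();
        let mut ranges = Vec::with_capacity(nodes.len());
        for &(node_id, text) in nodes {
            let start = plaintext.len();
            plaintext.push_str(text);
            ranges.push(NodeRange { node_id, start, end: plaintext.len() });
        }
        PlaintextMap { plaintext, ranges }
    }

    /// Return the id of the node whose text covers the given byte offset.
    fn node_at(&self, offset: usize) -> Option<usize> {
        // The ranges are sorted and non-overlapping, so a binary search
        // finds the first range that ends after the offset.
        let idx = self.ranges.partition_point(|r| r.end <= offset);
        self.ranges
            .get(idx)
            .filter(|r| r.start <= offset && offset < r.end)
            .map(|r| r.node_id)
    }
}
```

For instance, building the map from the three segments `(1, "Let ")`, `(2, "MathFormula")` and `(3, " be prime.")` yields the plaintext `Let MathFormula be prime.`, and `node_at(4)` returns the id of the node that produced the placeholder, so a match found by a plaintext tool can be traced back to its node.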

For a more complete overview, take a look at the [README file](https://github.com/KWARC/llamapun/blob/master/README.md).