Commit 5fa0989b authored by Michael Kohlhase

Merge branch 'master' of gl.kwarc.info:kwarc/kwarc.info/www

parents 46364304 104f0887
@@ -19,10 +19,21 @@ repository: https://github.com/KWARC/LLaMaPUn/
 publink: http://kwarc.github.io/bibs/llamapun
 ---
-The LaMaPUn project investigates the structure and meaning of scientific/technical
-documents and builds tools for extracting semantic representations from them that can be
+The [LLaMaPUn library](https://github.com/KWARC/LLaMaPUn/) is a
+[RUST](https://www.rust-lang.org) library that provides a wide range of processing
+tools for natural language and mathematics.
+It can be used to investigate the structure and meaning of scientific/technical
+documents and to build tools for extracting semantic representations from them that can be
 used to enhance access to and interaction with document corpora.
-The LLaMaPUn library consists of a wide range of processing tools for natural language and
-mathematics.
+In particular, the LLaMaPUn library is used on the
+[arXMLiv](https://www.kwarc.info/projects/arXMLiv/) data set, which is a translation
+of the [arxiv](https://arxiv.org/) corpus to "HTML5 with [MathML](https://www.w3.org/TR/MathML/)".
+Some of the library's features are:
+* Plaintext generation with many options (unicode normalization, word stemming, custom handling of e.g. `math` nodes, ...)
+* Word/Sentence tokenization
+* Support for standard NLP tools (token models for GloVe, POS tagging with SENNA, ...)
+* Mapping between plaintext offsets and HTML nodes (using the DNM data structure)
+For a more complete overview, take a look at the [README file](https://github.com/KWARC/llamapun/blob/master/README.md).