Commit 104f0887 authored by jfschaefer

extended llamapun system description

parent b10866f1
@@ -19,10 +19,21 @@ repository: https://github.com/KWARC/LLaMaPUn/
 publink: http://kwarc.github.io/bibs/llamapun
 ---
-The LaMaPUn project investigates the structure and meaning of scientific/technical
-documents and builds tools for extracting semantic representations from them that can be
+The [LLaMaPUn library](https://github.com/KWARC/LLaMaPUn/) is a
+[RUST](https://www.rust-lang.org) library that provides a wide range of processing
+tools for natural language and mathematics.
+It can be used to investigate the structure and meaning of scientific/technical
+documents and to build tools for extracting semantic representations from them that can be
 used to enhance access to and interaction with document corpora.
-The LLaMaPUn library consists of a wide range of processing tools for natural language and
-mathematics.
+In particular, the LLaMaPUn library is used on the
+[arXMLiv](https://www.kwarc.info/projects/arXMLiv/) data set, which is a translation
+of the [arxiv](https://arxiv.org/) corpus to "HTML5 with [MathML](https://www.w3.org/TR/MathML/)".
+Some of the library's features are:
+* Plaintext generation with many options (unicode normalization, word stemming, custom handling of e.g. `math` nodes, ...)
+* Word/Sentence tokenization
+* Support for standard NLP tools (token models for GloVe, POS tagging with SENNA, ...)
+* Mapping between plaintext offsets and HTML nodes (using the DNM data structure)
+For a more complete overview, take a look at the [README file](https://github.com/KWARC/llamapun/blob/master/README.md).
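
The feature list in this change mentions mapping between plaintext offsets and HTML nodes, together with custom handling of `math` nodes. The general idea can be illustrated with a minimal, self-contained Rust sketch. It is not llamapun's DNM implementation and does not use its API: the types `Node` and `OffsetMap`, the function `sample_document`, and the `MathFormula` placeholder token are all hypothetical names chosen for this illustration.

```rust
/// A toy document tree (hypothetical; real HTML parsing is out of scope here).
#[derive(Debug)]
enum Node {
    Text(String),
    Element { name: String, children: Vec<Node> },
}

/// Generated plaintext plus, for each contributing node, the byte range it
/// occupies in the plaintext and its path (child indices from the root).
#[derive(Debug, Default)]
struct OffsetMap {
    plaintext: String,
    spans: Vec<(usize, usize, Vec<usize>)>, // (start, end, path)
}

impl OffsetMap {
    /// Walk the tree in document order and record where each node's text lands.
    fn build(root: &Node) -> Self {
        let mut map = OffsetMap::default();
        let mut path = Vec::new();
        map.visit(root, &mut path);
        map
    }

    fn visit(&mut self, node: &Node, path: &mut Vec<usize>) {
        match node {
            Node::Text(text) => {
                let start = self.plaintext.len();
                self.plaintext.push_str(text);
                self.spans.push((start, self.plaintext.len(), path.clone()));
            }
            Node::Element { name, children } => {
                if name.as_str() == "math" {
                    // Assumed custom handling: a whole formula becomes one
                    // placeholder token that maps back to the math element.
                    let start = self.plaintext.len();
                    self.plaintext.push_str("MathFormula");
                    self.spans.push((start, self.plaintext.len(), path.clone()));
                    return;
                }
                for (index, child) in children.iter().enumerate() {
                    path.push(index);
                    self.visit(child, path);
                    path.pop();
                }
            }
        }
    }

    /// Path of the node that produced the plaintext byte at `offset`, if any.
    fn node_at(&self, offset: usize) -> Option<&[usize]> {
        self.spans
            .iter()
            .find(|span| span.0 <= offset && offset < span.1)
            .map(|span| span.2.as_slice())
    }
}

/// A paragraph with one formula: <p>Let <math>x</math> be prime.</p>
fn sample_document() -> Node {
    Node::Element {
        name: "p".to_string(),
        children: vec![
            Node::Text("Let ".to_string()),
            Node::Element {
                name: "math".to_string(),
                children: vec![Node::Text("x".to_string())],
            },
            Node::Text(" be prime.".to_string()),
        ],
    }
}
```

Calling `OffsetMap::build(&sample_document())` yields the plaintext `Let MathFormula be prime.`, and `node_at` then resolves any byte offset in that string to the path of the node it came from. A tool that finds something interesting in the plaintext (for example a tokenizer or tagger) can use such a lookup to get back to the corresponding place in the original document.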