From 104f0887be804902229787c36d7c0707a293ce47 Mon Sep 17 00:00:00 2001
From: jfschaefer <jfschaefer@outlook.com>
Date: Wed, 4 Apr 2018 10:12:49 +0200
Subject: [PATCH] extended llamapun system description

---
 systems/llamapun.md | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/systems/llamapun.md b/systems/llamapun.md
index c150bda..22ee76e 100644
--- a/systems/llamapun.md
+++ b/systems/llamapun.md
@@ -19,10 +19,21 @@ repository: https://github.com/KWARC/LLaMaPUn/
 publink: http://kwarc.github.io/bibs/llamapun
 ---
 
-The LaMaPUn project investigates the structure and meaning of scientific/technical
-documents and builds tools for extracting semantic representations from them that can be
+The [LLaMaPUn library](https://github.com/KWARC/LLaMaPUn/) is a
+[RUST](https://www.rust-lang.org) library that provides a wide range of processing
+tools for natural language and mathematics.
+It can be used to investigate the structure and meaning of scientific/technical
+documents and to build tools for extracting semantic representations from them that can be
 used to enhance access to and interaction with document corpora.
 
-The LLaMaPUn library consists of a wide range of processing tools for natural language and
-mathematics.
+In particular, the LLaMaPUn library is used on the
+[arXMLiv](https://www.kwarc.info/projects/arXMLiv/) data set, which is a translation
+of the [arxiv](https://arxiv.org/) corpus to "HTML5 with [MathML](https://www.w3.org/TR/MathML/)".
 
+Some of the library's features are:
+ * Plaintext generation with many options (unicode normalization, word stemming, custom handling of e.g. `math` nodes, ...)
+ * Word/Sentence tokenization
+ * Support for standard NLP tools (token models for GloVe, POS tagging with SENNA, ...)
+ * Mapping between plaintext offsets and HTML nodes (using the DNM data structure)
+
+For a more complete overview, take a look at the [README file](https://github.com/KWARC/llamapun/blob/master/README.md).
-- 
GitLab