Commit 425f9f39 authored by jfschaefer

more comments

parent e8eb9401
We want to use GF in order to parse mathematical documents.
Here is an overview of our current approach.
-Note that the current implementation is merely a proof of concept and subject to constant change.
+Note that the current implementation is merely a proof of concept and subject to constant change. Issues and discussions can also be found at <https://gl.kwarc.info/smglom/GF/issues>.
Preprocessing
@@ -13,7 +13,7 @@ A positive integer $n$ is called prime iff there is no integer $1 < m < n$ such
```
This sentence (written in LaTeX) gets transformed into an HTML representation using LaTeXML.
-We have tools to process this HTML representation and generate a (space-separated) token stream.
+We use our LLaMaPUn library to process this HTML representation and generate a (space-separated) token stream.
The formula representation is based directly on the corresponding Presentation MathML.
For the example sentence, this gives us the following token stream:
@@ -21,11 +21,13 @@ For the example sentence, this gives us the following token stream:
a positive integer $ mi( n ) $ is called prime iff there is no integer $ mrow( mn( 1 ) mo( < ) mi( m ) mo( < ) mi( n ) ) $ such that $ mrow( mi( m ) mo( | ) mi( n ) ) $```
```
The LLaMaPUn library allows us to map offsets in this string back to the nodes of the HTML representation.
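As a rough illustration of the token format shown above, the following Python sketch serialises a Presentation MathML fragment into `name( ... )` tokens and records character spans so that an offset in the token stream can be traced back to a source node. This is not the LLaMaPUn implementation; the function names `mathml_tokens`, `formula_tokens`, `build_stream` and `node_at`, the input format, and the node ids are all hypothetical.

```python
import xml.etree.ElementTree as ET


def mathml_tokens(node):
    """Serialise one MathML element as 'name( child child ... )'."""
    name = node.tag.split("}")[-1]  # drop an XML namespace prefix, if present
    children = list(node)
    if children:
        inner = " ".join(mathml_tokens(child) for child in children)
    else:
        inner = (node.text or "").strip()
    return f"{name}( {inner} )"


def formula_tokens(mathml_source):
    """Wrap the serialised children of a <math> element in '$ ... $'."""
    root = ET.fromstring(mathml_source)
    return "$ " + " ".join(mathml_tokens(child) for child in root) + " $"


def build_stream(pieces):
    """Join (text, node_id) pieces with spaces; record (start, end, node_id) spans."""
    spans, parts, pos = [], [], 0
    for text, node_id in pieces:
        spans.append((pos, pos + len(text), node_id))
        parts.append(text)
        pos += len(text) + 1  # +1 for the separating space
    return " ".join(parts), spans


def node_at(spans, offset):
    """Return the id of the node whose span covers `offset`, or None."""
    for start, end, node_id in spans:
        if start <= offset < end:
            return node_id
    return None


# Hypothetical node ids; a real pipeline would use references into the HTML DOM.
formula = formula_tokens("<math><mi>A</mi></math>")
stream, spans = build_stream(
    [("an", "w1"), ("alphabet", "w2"), (formula, "m1"), ("is", "w3")]
)
print(stream)  # an alphabet $ mi( A ) $ is
print(node_at(spans, stream.index("mi(")))  # m1
```

A linear scan over the spans keeps the sketch short; for long documents a binary search over the sorted start offsets would be the obvious refinement.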
Formula parsing
===
-So far, formulae are parsed semantically, which means that the coverage is tiny at the moment.
+So far, the coverage for parsing formulae is tiny.
As an example,
```
@@ -52,6 +54,7 @@ fexpr_fbinrel_fcid_fbinrel_fexpr_to_mathcn : -- introduces on
FExpression -> FBinRelation -> FComplexIdentifier -> FBinRelation -> FExpression -> MathCN;
```
The `MathCN` concept is unsatisfactory, and declarations need more work (<https://gl.kwarc.info/smglom/GF/issues/2>).
Language/English parsing
===
@@ -66,9 +69,11 @@ This can be done using the following rule:
appo_mobj : MObj -> MathCN -> MObj;
```
As mentioned above, more work is needed on `MathCN`, but also `MObj` and declarations in general (<https://gl.kwarc.info/smglom/GF/issues/2>).
Lexica
===
To increase coverage, we want to use unsupervised/semisupervised machine learning techniques for generating lexica.
-So far, a first version of a small lexicon of math objects (`MObj`) has been generated using bootstrapping and a version of the EM algorithm on a subset of arXiv documents.
+So far, a first version of a small lexicon of math objects (`MObj`) has been generated using bootstrapping and a version of the EM algorithm on a subset of arXiv documents (roughly 2*10^6 sentences).
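Independently of how the nouns are obtained, turning a noun list into GF lexicon entries is mechanical. The Python sketch below is an assumption about how that step could look, not the project's actual generator. It emits linearisation lines in the `ident = mkCN (mkN "...");` form used by the `MObj` lexicon, and passes an explicit plural to the two-argument `mkN` paradigm for irregular nouns such as "vertex"/"vertices". The helper name `lexicon_entry` and the input format are hypothetical, and only single-word nouns are handled.

```python
def lexicon_entry(singular, plural=None):
    """Return one GF linearisation line for a single-word noun of type MObj."""
    ident = f"{singular}_MObj"
    if plural is None:
        # Regular noun: the one-argument mkN paradigm derives the plural.
        return f'{ident} = mkCN (mkN "{singular}");'
    # Irregular noun: give singular and plural forms explicitly.
    return f'{ident} = mkCN (mkN "{singular}" "{plural}");'


# Hypothetical input: (singular, plural or None) pairs.
nouns = [("set", None), ("vertex", "vertices")]
for singular, plural in nouns:
    print(lexicon_entry(singular, plural))
```

The matching abstract-syntax declarations would have to be generated alongside these lines; they are omitted here because their exact form is not shown in this commit.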
-i MNlpEng.gf
+i gf/MNlpEng.gf
parse "a positive integer $ mi( n ) $ is called prime iff there is no integer $ mrow( mn( 1 ) mo( < ) mi( m ) mo( < ) mi( n ) ) $ such that $ mrow( mi( m ) mo( | ) mi( n ) ) $"
parse "a set is called empty iff it is the empty set"
parse "an alphabet $ mi( A ) $ is a finite set"
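A script like the one above can be generated from a list of preprocessed sentences. The following Python sketch shows one way to do so. The helper name `gf_script` is hypothetical, and the sketch only builds the command text; it does not invoke GF.

```python
def gf_script(grammar_path, sentences):
    """Build GF shell commands: import the grammar, then parse each sentence."""
    lines = [f"i {grammar_path}"]
    for sentence in sentences:
        if '"' in sentence:
            # The token streams shown here contain no quotes; reject anything else
            # rather than guess at an escaping scheme.
            raise ValueError("sentence must not contain double quotes")
        lines.append(f'parse "{sentence}"')
    return "\n".join(lines) + "\n"


print(gf_script(
    "gf/MNlpEng.gf",
    [
        "a set is called empty iff it is the empty set",
        "an alphabet $ mi( A ) $ is a finite set",
    ],
))
```

The resulting text could be saved to a file or piped to `gf --run`, assuming GF is installed and the grammar compiles.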
@@ -2,6 +2,7 @@
-- Some physics papers were included, so a few physical nouns are contained as well.
-- The machine learning algorithm is still under development and more refined and more complete results can be expected.
-- Currently all nouns are of type MObj. This shouldn't be the case, as e.g. optimality does not represent a mathematical object.
-- NOTE: IRREGULAR NOUNS ARE NOT HANDLED PROPERLY (consider e.g. "vertex" (pl. "vertices"))
concrete NLexiconMObjEng of NLexiconMObj = MCatsEng ** open SyntaxEng, ParadigmsEng in {
lin
set_MObj = mkCN (mkN "set");
......
README
---
This directory contains some experimental scripts. They have only been used for small examples and are probably out of date.