We want to use GF (the Grammatical Framework) to parse mathematical documents.
Here is an overview of our current approach.
Note that the current implementation is merely a proof of concept and subject to constant change.
Note that the current implementation is merely a proof of concept and subject to constant change. Issues and discussions can also be found at <https://gl.kwarc.info/smglom/GF/issues>.
Preprocessing
...
...
@@ -13,7 +13,7 @@ A positive integer $n$ is called prime iff there is no integer $1 < m < n$ such
```
This sentence (written in LaTeX) gets transformed into an HTML representation using LaTeXML.
We have tools to process this HTML representation and generate a (space-separated) token stream.
We use our LLaMaPUn library to process this HTML representation and generate a (space-separated) token stream.
The formula representation is based directly on the corresponding presentation-MathML.
For the example sentence, this gives us the following token stream:
...
...
@@ -21,11 +21,13 @@ For the example sentence, this gives us the following token stream:
a positive integer $ mi( n ) $ is called prime iff there is no integer $ mrow( mn( 1 ) mo( < ) mi( m ) mo( < ) mi( n ) ) $ such that $ mrow( mi( m ) mo( | ) mi( n ) ) $```
```
The LLaMaPUn library allows us to map offsets in this string back to the nodes of the HTML representation.
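
To illustrate the formula part of the token stream, here is a minimal sketch of how a presentation-MathML fragment could be serialized into the `name( ... )` format shown above. It is not the LLaMaPUn implementation: it uses only Python's standard `xml.etree.ElementTree`, and the function names `local_name`, `tokens` and `formula_tokens` are hypothetical and chosen for this example only.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Strip an XML namespace prefix such as "{http://www.w3.org/1998/Math/MathML}mi".
    return tag.rsplit("}", 1)[-1]


def tokens(node):
    """Serialize one presentation-MathML node as 'name( content )'."""
    children = list(node)
    if children:
        # Container elements (e.g. mrow) serialize their children recursively.
        inner = " ".join(tokens(child) for child in children)
    else:
        # Leaf elements (e.g. mi, mn, mo) contribute their text content.
        inner = (node.text or "").strip()
    return f"{local_name(node.tag)}( {inner} )"


def formula_tokens(mathml):
    """Serialize the children of a <math> element, delimited by '$' tokens."""
    root = ET.fromstring(mathml)
    body = " ".join(tokens(child) for child in root)
    return f"$ {body} $"


# Hand-written input for this sketch; real LaTeXML output carries more attributes.
example = (
    "<math><mrow><mn>1</mn><mo>&lt;</mo><mi>m</mi>"
    "<mo>&lt;</mo><mi>n</mi></mrow></math>"
)
print(formula_tokens(example))
```

For the hand-written fragment above, this prints `$ mrow( mn( 1 ) mo( < ) mi( m ) mo( < ) mi( n ) ) $`, which matches the second formula in the example token stream. Mapping offsets back to HTML nodes is not covered by this sketch.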
Formula parsing
===
So far, formulae are parsed semantically, which means that the coverage is tiny at the moment.
So far, the coverage for parsing formulae is tiny.
As an example,
```
...
...
@@ -52,6 +54,7 @@ fexpr_fbinrel_fcid_fbinrel_fexpr_to_mathcn : -- introduces on
The `MathCN` concept is not good and we need to do more work on declarations (<https://gl.kwarc.info/smglom/GF/issues/2>).
Language/English parsing
===
...
...
@@ -66,9 +69,11 @@ This can be done using the following rule:
appo_mobj : MObj -> MathCN -> MObj;
```
As mentioned above, more work is needed on `MathCN`, but also `MObj` and declarations in general (<https://gl.kwarc.info/smglom/GF/issues/2>).
Lexica
===
To increase coverage, we want to use unsupervised/semisupervised machine learning techniques for generating lexica.
So far, a first version of a small lexicon of math objects (`MObj`) has been generated using bootstrapping and a version of the EM algorithm on a subset of arXiv documents.
So far, a first version of a small lexicon of math objects (`MObj`) has been generated using bootstrapping and a version of the EM algorithm on a subset of arXiv documents (roughly 2*10^6 sentences).
parse "a positive integer $ mi( n ) $ is called prime iff there is no integer $ mrow( mn( 1 ) mo( < ) mi( m ) mo( < ) mi( n ) ) $ such that $ mrow( mi( m ) mo( | ) mi( n ) ) $"
parse "a set is called empty iff it is the empty set"