Commit 6ed809d7 authored by jfschaefer

add current version of README

parent 3f1eda47
## The smglom/GF repository for our experiments with Grammatical Framework on smglom
We want to use GF to parse mathematical documents.
This README gives an overview of our current approach.
Note that the current implementation is merely a proof of concept and is subject to constant change.
We use this repository to collect the scripts and GF grammars from our experiments.
Preprocessing
===
Consider the following example sentence:
```
A positive integer $n$ is called prime iff there is no integer $1 < m < n$ such that $m \mid n$.
```
This sentence (written in LaTeX) is transformed into an HTML representation using LaTeXML.
We have tools that process this HTML representation and generate a (space-separated) token stream.
The formula representation is based directly on the corresponding Presentation MathML.
For the example sentence, this gives us the following token stream:
```
a positive integer $ mi( n ) $ is called prime iff there is no integer $ mrow( mn( 1 ) mo( < ) mi( m ) mo( < ) mi( n ) ) $ such that $ mrow( mi( m ) mo( | ) mi( n ) ) $
```
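The serialisation step can be sketched in Python as follows. This is a minimal illustration, not the repository's actual tooling: the function names `local_name`, `formula_tokens` and `sentence_tokens` are hypothetical, and the assumption that text is lower-cased and split on whitespace is inferred only from the example above. Punctuation handling is left out.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.w3.org/1998/Math/MathML}'."""
    return tag.rsplit("}", 1)[-1]


def formula_tokens(elem):
    """Serialise one Presentation MathML element as 'name( ... )'."""
    children = list(elem)
    if children:
        inner = " ".join(formula_tokens(child) for child in children)
    else:
        inner = (elem.text or "").strip()
    return f"{local_name(elem.tag)}( {inner} )"


def sentence_tokens(segments):
    """Join text and formula segments into one space-separated token stream.

    Assumption: text is lower-cased and split on whitespace.
    """
    out = []
    for seg in segments:
        if isinstance(seg, str):
            out.extend(seg.lower().split())
        else:
            # Formulae are delimited by '$' tokens, as in the example stream.
            out.extend(["$", formula_tokens(seg), "$"])
    return " ".join(out)


divides = ET.fromstring("<mrow><mi>m</mi><mo>|</mo><mi>n</mi></mrow>")
print(sentence_tokens(["such that", divides]))
# such that $ mrow( mi( m ) mo( | ) mi( n ) ) $
```

Each MathML element becomes `name( ... )` with spaces around every token, so the resulting stream can be split on whitespace alone.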
Formula parsing
===
So far, formulae are parsed semantically, which means that coverage is still tiny.
As an example,
```
mrow( mi( m ) mo( | ) mi( n ) )
```
matches the rule
```
apply_bin_rel : FBinRelation -> FExpression -> FExpression -> FStatement;
```
The prefix "F" indicates that the categories act on the formula level.
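To make the match concrete, here is a small Python sketch that reads the token form back into a tree and checks for the shape "expression, relation, expression". It only imitates what the GF parser does with `apply_bin_rel`. The helpers `parse_formula` and `match_apply_bin_rel` are hypothetical, and the sketch assumes that a literal `)` never occurs as formula content.

```python
def parse_formula(stream):
    """Parse 'tag( ... )' tokens into (tag, children) or (tag, text) tuples."""
    tokens = stream.split()

    def parse(pos):
        tag = tokens[pos][:-1]  # "mrow(" -> "mrow"
        pos += 1
        children, text = [], []
        while tokens[pos] != ")":
            # A token like "mi(" opens a child; a bare "(" is plain content.
            if len(tokens[pos]) > 1 and tokens[pos].endswith("("):
                child, pos = parse(pos)
                children.append(child)
            else:
                text.append(tokens[pos])
                pos += 1
        return (tag, children or " ".join(text)), pos + 1

    node, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after formula")
    return node


def match_apply_bin_rel(node):
    """Return the rule name with its arguments, or None if the shape differs."""
    tag, body = node
    if tag != "mrow" or not isinstance(body, list) or len(body) != 3:
        return None
    left, rel, right = body
    if rel[0] != "mo":
        return None
    # Argument order follows the rule: relation first, then both expressions.
    return ("apply_bin_rel", rel, left, right)


tree = parse_formula("mrow( mi( m ) mo( | ) mi( n ) )")
print(match_apply_bin_rel(tree))
# ('apply_bin_rel', ('mo', '|'), ('mi', 'm'), ('mi', 'n'))
```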
A complete formula belongs to one of two categories: `Statement` or `MathCN`.
A `Statement` corresponds to a clause (`Cl`). The example above ($m \mid n$) is one.
A `MathCN`, on the other hand, corresponds to a noun.
In "A positive integer $n$", the $n$ is such a `MathCN`.
Similarly, in "there is no integer $1 < m < n$", the formula is also a `MathCN`. It matches the rule:
```
fexpr_fbinrel_fcid_fbinrel_fexpr_to_mathcn : -- introduces one identifier (as in $1 < n < 10$)
FExpression -> FBinRelation -> FComplexIdentifier -> FBinRelation -> FExpression -> MathCN;
```
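The five arguments of this rule line up with the five children of the `mrow` for $1 < m < n$. The following Python sketch shows that correspondence on a hand-built tree. It is illustrative only: `match_chain` is a hypothetical helper, the tuple representation is an assumption, and a plain `mi` is treated as the simplest possible `FComplexIdentifier`.

```python
RULE = "fexpr_fbinrel_fcid_fbinrel_fexpr_to_mathcn"


def match_chain(node):
    """Match 'expr rel identifier rel expr' and report the introduced identifier."""
    tag, body = node
    if tag != "mrow" or not isinstance(body, list) or len(body) != 5:
        return None
    left, rel1, ident, rel2, right = body
    # Assumption: both relations are 'mo' and the middle child is a plain 'mi'.
    if rel1[0] != "mo" or rel2[0] != "mo" or ident[0] != "mi":
        return None
    return {
        "rule": RULE,
        "args": (left, rel1, ident, rel2, right),  # same order as the rule type
        "introduces": ident[1],
    }


# Tree for: mrow( mn( 1 ) mo( < ) mi( m ) mo( < ) mi( n ) )
chain = ("mrow", [("mn", "1"), ("mo", "<"), ("mi", "m"), ("mo", "<"), ("mi", "n")])
print(match_chain(chain)["introduces"])
# m
```

The middle position is what distinguishes this rule from a plain chained comparison: it is the identifier that the surrounding noun phrase ("integer") refers to.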
Language/English parsing
===
The central categories on the language side are: `Statement`, `Definition`, `MObj`.
A `Statement` can be a formula (as described above) or a language statement.
An `MObj` is a "mathematical object", such as an "integer".
Identifiers are often introduced in an apposition (as in "a positive integer $n$").
Such appositions are covered by the following rule:
```
appo_mobj : MObj -> MathCN -> MObj;
```
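Read as a function, `appo_mobj` takes a mathematical object and a formula-level noun and yields a mathematical object again, so the result can be used wherever an `MObj` is expected. A rough Python analogue of that type signature is shown below. The class fields `noun`, `identifier` and `formula` are hypothetical and do not reflect how the GF grammar stores its data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MathCN:
    formula: str  # token form of the formula, e.g. "mi( n )"


@dataclass(frozen=True)
class MObj:
    noun: str
    identifier: Optional[MathCN] = None


def appo_mobj(obj: MObj, cn: MathCN) -> MObj:
    """Attach an identifier to a mathematical object; the result is again an MObj."""
    return MObj(obj.noun, cn)


integer = MObj("positive integer")
named = appo_mobj(integer, MathCN("mi( n )"))
print(named.noun, "/", named.identifier.formula)
# positive integer / mi( n )
```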
Lexica
===
To increase coverage, we want to use unsupervised and semi-supervised machine learning techniques to generate lexica.
So far, a first version of a small lexicon of math objects (`MObj`) has been generated using bootstrapping and a version of the EM algorithm on a subset of arXiv documents.