Commit 5a00c5cc authored by Dennis Müller's avatar Dennis Müller
Browse files


# FiFoM Repository
This repository contains all code associated with my *DAAD (German Academic Exchange Service) Postdoc Scholarship 12/2019-06/2020*; the proposal for which can be found [here]( This project attempts to use Machine Learning (based on a [GPT2]( Transformer-model trained from scratch on a significant subset of []( to augment generic LaTeX with [sTeX]( to disambiguate mathematical expressions.
## Structure
The majority of the code contained is written in Scala (requiring java >8) and managed by [build.sbt]( The Machine Learning parts are written in Python 3 using the [transformers]( and [tokenizers]( libraries based on [PyTorch](
- The [deploy]( contains an assembled [jar-file]( (bundling a working MMT instance), and an [example shell-file]( that attempts to translate the file [example.tex]( to sTeX.
- The [deploy/python]( contains the relevant python-code and pretrained models used. In particular, [python/gpt2]( contains the GPT2-model pretrained on an arxiv dump; [python/l2s]( contains the model fine-tuned for LaTeX->sTeX translation.
- The [deploy/mathhub]( contains the subset of [MathHub]( - consisting of [MMT]( - necessary for various evaluations during running (see below).
- The remaining folders contain various sbt modules for e.g. parsing LaTeX/sTeX, expanding sTeX-macros etc.
## Running
The [example shell-file]( shows how to run the translator on a given LaTeX file. Notably, the file does not need to have a full header, so the translator can be run on fragments as well. The [ML.jar]( takes the following command line arguments:
- `-tex <file>`: The (absolute or relative) path to the tex-file to be translated.
- `-mathhub <folder>`: An (optional) path to a folder containing MMT-archives, which is needed for the following options. If present, all translations will be heuristically checked for whether they are valid sTeX-expressions (on the basis of the [smglom](; contained in the [deploy/mathhub]( and logged to the console. Green fragments are considered sTeX, yellow fragments are assumed to be variables, red fragments are considered invalid (but may still be valid LaTeX).
- `-eval_latex`: If present, all translations will have their containing sTeX-macros expanded, the result is heuristically normalized and compared to the original input.
- `-alignments <folder>`: An optional path to a repository containing [Alignments]( between sTeX/smglom-macros and the strongly typed [Math-in-the-Middle]( MMT-archive (also contained in the [deploy/mathhub](, used for the following option.
- `-typecheck`: If present, all translations will be:
- Converted to [OMDoc]( and imported to MMT. This requires a working installation of [LaTeXML]( and the [LaTeXML-sTeX-Plugin]( to work.
- Translated - using the available alignments - to a typed Math-in-the-Middle expression, and
- Type checked by having its Math-in-the-Middle type inferred. A success guarantees that the translation is well-formed and well-typed.
The Scala and Python parts communicate via TCP on port 65432; with the [Python file]( acting as server and the [ML.jar]( acting as client. If `ML.jar` can not connect to a server on port 65432, it will attempt to run `python3` in `./python`, effectively running a server (which will be closed automatically when the program terminates).
A *known bug* occasionally causes the program to freeze - it is unclear wheher that happens in the python code or the TCP-connection. `` logs to the file `l2s.log`, but without errors. Curiously, the error disappears if server and client are started separately (making it practically impossible to debug this error). Consequently, if this happens, I recommend first running `python3` manually, waiting until it outputs `Server running` and then running `ML.jar`, which will connect to the already running python server.
name := "ML"
organization := "Jazzpirate"
version := "0.1"
scalaVersion := "2.12.8"
resolvers += Resolver.mavenLocal
resolvers += "Sonatype Releases" at ""
lazy val core = project.settings(
name := "Core"
lazy val mmtutil = project.settings(
name := "mmtutil",
libraryDependencies ++= Seq(
lazy val latex = project.settings(
name := "LaTeX"
).dependsOn(mmt) // TODO replace by mmt for final
lazy val stex = project.settings(
name := "sTeX",
libraryDependencies ++= Seq(dependencies.os)
lazy val ml = project.settings(
name := "ML",
libraryDependencies ++= Seq(
).dependsOn(mmt) // TODO replace by mmt for final
lazy val mmt = project.settings(
name := "MMT",
Compile / unmanagedJars ++= Seq(baseDirectory.value / "lib" / "mmt.jar")
lazy val dependencies = new {
val scalaxml = "org.scala-lang.modules" %% "scala-xml" % "1.2.0"
val scalacompiler = "org.scala-lang" % "scala-compiler" % "2.12.8"
val scalaparser = "org.scala-lang.modules" %% "scala-parser-combinators" % "1.1.2"
val tensorflow = "org.platanios" %% "tensorflow" % "0.4.1"
val tensorflow_cpu = tensorflow classifier "linux-cpu-x86_64"
val tensorflow_gpu = tensorflow classifier "linux-gpu-x86_64"
val xz = "org.tukaani" % "xz" % "1.8"
//val akka = "com.typesafe.akka" %% "akka-actor" % "2.6.4"
val os = "com.lihaoyi" %% "os-lib" % "0.2.7"
\ No newline at end of file
import com.jazzpirate.latex.commands.CommandObject
import com.jazzpirate.latex.syntax.LatexObject
import com.jazzpirate.stex.Tokenizer
import info.kwarc.mmt.api.utils.{JSONArray, JSONInt, JSONObject, JSONString}
object Labeller {
private val labels = List(
private def label(in:LatexObject) : Option[List[(String,String,LatexObject)]] = in match {
case co: CommandObject =>
labels.find(_._1 == co.key) match {
case Some((_,s)) =>
val ret = co.command.expand(co).flatMap(Tokenizer.apply).map{case (a,b) => (a,s,b)}
case _ => None
case _ => None
def apply(ls : List[LatexObject]) = {,Some(label)))
def print(ls : List[(String,String,LatexObject)]) = {
val stds ={
case (s,l,o) =>
val nos = o.ref.start.toString + ":" + o.ref.end.toString
val length = List(s.length,l.length,nos.length).max
val ns = if (s.length < length) s + (1 to (length - s.length)).map(_ => " ").mkString("") else s
val nl = if (l.length < length) l + (1 to (length - l.length)).map(_ => " ").mkString("") else l
val no = if (nos.length < length) nos + (1 to (length - nos.length)).map(_ => " ").mkString("") else nos
val (l1,l2,l3) = stds.unzip3
l1.mkString(" | ") + "\n" + l2.mkString(" | ") + "\n" + l3.mkString(" | ")
def toJson(ls : List[(String,String,LatexObject)]) = {
var tokens : List[String] = Nil
var labels : List[String] = Nil
var offsets : List[(Int,Int)] = Nil
ls.foreach {case (s,l,o) =>
tokens ::= s
labels ::= l
offsets ::= (o.ref.start,o.ref.end)
("tokens",JSONString(tokens.reverse.mkString(" "))),
("labels",JSONString(labels.reverse.mkString(" "))),
("offsets",JSONArray({case (i,j) => JSONObject(("start",JSONInt(i)),("end",JSONInt(j)))} :_*))
import com.jazzpirate.{latex, stex}
import com.jazzpirate.latex.{LaTeXProcessor, Normalize, Remove}
import com.jazzpirate.latex.bindings.Commands
import com.jazzpirate.latex.syntax.LatexFile
import com.jazzpirate.mmt.MyMMT
import com.jazzpirate.stex.{StexChecker, Tokenizer}
import com.jazzpirate.stex.bindings.StexCommands
import info.kwarc.mmt.api.utils.{File, JSONArray, JSONInt, JSONObject, JSONString}
val home = File("/home/jazzpirate/work/Scala/ML/data")
val mmt = MyMMT()
val macros = StexChecker.collect(mmt.archivepath / "smglom",true)
val proc = new LaTeXProcessor
val files = mmt.getsmglom()
val paragraphs = files.flatMap {f =>
val doc = proc.parse(f)
val ndoc = proc.expand(doc,Remove(stex.bindings.Basics.ednote) :: Normalize.cleanups,true).asInstanceOf[LatexFile]
val labelled = Labeller(paragraphs)
val vocab =
val labels = => JSONString(e._2)).distinct
println(vocab.length + " Tokens")
println(labels.length + " Labels")
val commands = vocab.filter(_.startsWith("\\"))
println(commands.length + " Commands")
val jsons =
val json_comms = JSONArray(*)
val corpus = JSONArray(jsons:_*)
val filestr = JSONObject(
("max_tokens",JSONInt(commands.length + ((vocab.length - commands.length) / 3))),
val outfile = home / "smglom.json"
\ No newline at end of file
This diff is collapsed.
File added
java -jar ML.jar -tex example.tex -mathhub mathhub -eval_latex -alignments mathhub/Alignments -typecheck
\usepackage{amsmath,bbm, amsthm, hyperref}
\section{Naive Set Theory}
\emph{The} most fundamental notion in mathematics is that of a \emph{set}. While a precise definition is somewhat elaborate, we will preliminarily work with the following informal definition, as stated by Cantor in 18xx:
A \emph{set} is any gathering together of distinct objects (called its \emph{element} into a unity.
If $x$ is an element of a set $A$, we write $x\in A$.
In particular, we assume the following two axioms to hold. Notably, these two axioms together imply contradictions. There are ways of slightly restricting them to avoid that, but doing so requires some deeper understanding of formal logic, hence we stick with the following for now:
Two sets are equal if and only if they have the same elements.
\[\forall A,B. (A=B) \Leftrightarrow (\forall x. x\in A \Leftrightarrow x\in B) \]
For any property $P$, there exists the set $\{x | P(x)\}$ that contains precisely those sets that satisfy $P$.
\[\forall y. y\in \{x|P(x)\}\Leftrightarrow P(y)\]
We write $\{a,b,c,\ldots\}$ for the set $\{x| (x=a)\text{ or }(x=b)\text{ or }(x=c)\text{ or }\ldots\}$
For example:
\[\mathbb N = \{0,1,2,3,\ldots\}\]
\[\mathbb Z = \{\ldots,-2,-1,0,1,2,\ldots\}\]
\[\mathbb Q = \left\{ \frac ab | a\in\mathbb Z\text{ and }b\in\mathbb N\text{ and }b>0\right\}\]
We call a set $A$ a \emph{subset} of a set $B$ (and write $A\subseteq B$) if and only if every element of $A$ is also an element of $B$:
\[(A \subseteq B)\Leftrightarrow (\forall x\in A. x\in B)\]
In that case we also call $B$ a \emph{superset} of $A$ (and write $B\supseteq A$).
If additionally $A\neq B$, then we call $A$ a \emph{proper subset} of $B$ (and $B$ a \emph{proper superset} of $A$).
The set of all subsets of a set $A$ is called its \emph{power set} and written
$\mathcal P(A) := \{x|x\subseteq A\}$.
We introduce some common operations on sets:
\item The \emph{union} of two sets $A$,$B$ is the set
\[A\cup B := \{x|x\in A\text{ or }x\in B\}\]
\item The \emph{intersection} of two sets $A$,$B$ is the set
\[A \cap B := \{x|x\in A\text{ and }x\in B\}\]
\item The \emph{set difference} of two sets $A$,$B$ is the set
\[A\backslash B := \{x\in A| x\notin B\}\]
\item The \emph{cartesian product} of two sets $A$,$B$ is the set of all pairs
\[A\times B := \{ \langle x,y\rangle | x\in A\text{ and }x\in B\} \]
For (for now) \emph{finite} sets, we define:
For a finite set $A$, the \emph{cardinality} of $A$ is the number of its members and denoted by
\[|A|\in\mathbb N\]
There is a more general notion of cardinality which also applies to \emph{infinite} sets, but that requires a more rigorous treatment of sets than presented here.
For finite sets $X$,$Y$ we have
\[|X\times Y|= |X|\cdot|Y|\]
\[|X\cup Y| = |X| + |Y| - |X\cap Y|\]
\section{The Induction Principle}
The following theorem is often used as an example for induction proofs:
$1+2+\ldots+n = \frac{n(n+1)}{2}\text{ for all }n\geq1$
For $n=5$, we have $1+2+3+4+5=(5\cdot6)/2=15$.
\begin{proof} By induction.
Let $A(n)$ be the statement, that the equation holds for a given $n$.
\textbf{Base case:} The statement $A(1)$ is true. Indeed: $1=\frac{1(1+1)}{2}$.
\textbf{Inductive case:} We need to show that $A(n) \text{ implies }A(n+1)$. Assume $1+2+\ldots+n=\frac{n(n+1)}2$.
Hence, the equation holds for $n+1$. Hence, $A(n)$ holds for all $n$.
There's an alternative proof, which is a lot simpler and (apocryphally) was discovered by Gauss when he was seven years old, and as punishment was tasked by his teacher to sum all natural numbers from 1 to 100:
\begin{proof} By reordering.
We pair the first element of the sum with the last element, the second element with the second to last, etc:
That way, we obtain $n/2$ pairs, each of which equals $n+1$.
Hence $1+2+\ldots+n=\frac{n(n+1)}2$
(Assuming $n$ is even, that is)
\section{Sums and Products}
Using the following definition, we can express the same thing a bit more conveniently:
Given $a_1$,$\ldots$,$a_n$, we write
\[\sum_{i=1}^na_i := a_1+a_2+\ldots+a_n\]
Using this notation, the above theorem becomes:
\[\sum_{i=1}^ni=\frac{n(n+1)}2 \text{ for all }n\geq1\]
Similarly, we define:
Given $a_1$,$\ldots$,$a_n$, we write
\[\prod_{i=1}^na_i:=a_1\cdot a_2\cdot\ldots\cdot a_n\]
The multiplicative variant of the above $\sum_{i=1}^ni$ occurs quite frequently:
The \emph{factorial} $n!$ of a natural number $n$ is defined as
\[n! := \prod_{i=1}^ni\]
\paragraph{} It now makes sense to consider what \emph{empty} products and sums are supposed to mean.
Let $n\in\mathbb N$ and $k<n$. Then clearly for any sequence $a_i$ we have\\
$\sum_{i=1}^n a_i = \left(\sum_{i=1}^k a_i\right) + \left(\sum_{i=k+1}^n a_i\right)$. If we want the same to hold for $k=n$, then the right side of the sum would be an empty sum, which consequently needs to be $0$.
Analogously, for products we get\\
$\prod_{i=1}^na_i = \left(\prod_{i=1}^na_i\right)\cdot\left(\prod_{i=n+1}^na_i\right)$, which means we need the empty product on the right side to be $1$ in order for the equation to hold.
\paragraph{} Consequently, we establish the following conventions:
We define
whenever $i$ ranges over the empty set.
We define
whenever $i$ ranges over the empty set.
...from which it follows, that $0!=1$.
\section{Combinatorics and the Binomial Formula}
The above factorial function has a combinatorial interpretation:
Given $n$ distinct objects, there are exactly $n!$ ways of arranging these objects in order.
There are exactly $3!=6$ ways to arrange the three letters \textsf a,\textsf b,\textsf c; namely
\textsf{abc}, \textsf{acb}, \textsf{bac}, \textsf{bca}, \textsf{cab} and \textsf{cba}.
We have $n$ possibilities to pick the first element.
Removing that element leaves $n-1$ possibilities for the second element, and so on.
Multiplying these possibilities yields $n!$ possibilities in total.
Given $n$ distinct objects and $k<n$, there are exactly $\frac{n!}{(n-k)!}$ ways of picking $k$ objects in some order.
Disregarding order, there are exactly $\frac{n!}{k!\cdot(n-k)!}$ ways of picking $k$ objects.
There are exactly $\frac{4!}{(4-2)!}=12$ ways to pick two out of the four letters \textsf a,\textsf b,\textsf c, \textsf d; namely
\textsf{ab}, \textsf{ac}, \textsf{ad}, \textsf{ba}, \textsf{bc}, \textsf{bd}, \textsf{ca}, \textsf{cb}, \textsf{cd}, \textsf{da}, \textsf{db} and \textsf{dc}.
If we do not care about the order, we can equate e.g. the cases \textsf{ad} and \textsf{da}, leaving $\frac{4!}{2!\cdot(4-2)!}=6$ possibilities.
By \autoref{factorialcombinatorics}, we have $n!$ possibilities to pick $n$ elements in order.
We can think of picking $k$ elements out of $n$ as picking $n$ elements and discarding the last $n-k$ elements.
We can thus simply divide out the number of possibilities for the last $n-k$ elements, yielding $\frac{n!}{(n-k)!}$ possibilities.
If we only care about \emph{which} elements we picked regardless of order, we have to additionally divide out the number of possible arrangements for the picked $k$ elements, which is $k!$.
Hence in that case, we get $\frac{n!}{k!(n-k)!}$ possibilities.
The without-order case of \autoref{binomialcombinatorics.theorem} is common enough to deserve its own definition:
Given natural numbers $k<n$, the \emph{binomial coefficient} $n \choose k$ is defined as:
\[ {n \choose k} := \frac{n!}{k!(n-k)!} \]
Immediately obvious then are the following equalities:
We have
\[{n \choose k}=\frac{n!}{k!(n-k)!}=\prod_{j=0}^{k-1}\frac{n-j}{k-j}={n \choose n-k}\]
\[{n \choose k-1}+{n \choose k} = {n + 1 \choose k}\]
The following definition should be obvious:
Given natural numbers $n$ and $m$, we define:
\[ n^m:=\prod_{i=1}^mn \]
By \autoref{emptyproduct.def}, we obtain $n^0=1$ for all natural numbers $n$.
We can now use binomial coefficients to establish the \emph{binomial theorem}:
$\displaystyle \forall n\in\mathbb N. \forall a,b\in\mathbb Z. (a+b)^n=\sum_{k=0}^n{n \choose k}a^kb^{n-k}$
In the case $n=2$ we get the common formula
\[(a+b)^2 = \sum_{k=0}^2{2 \choose k}a^kb^{2-k} = a^2 + 2ab + b^2 \]
By induction on $n$
\textbf{$n=0$}: trivially, both sides are $0$.
\textbf{Inductive case}: Assume
\[(a+b)^n=\sum_{k=0}^n{n \choose k}a^kb^{n-k}\]
Then \[ (a+b)^{n+1} = (a+b)(a+b)^n = (a+b)\left(\sum_{k=0}^n{n \choose k}a^kb^{n-k}\right) \]
By distributivity \[\left(\sum_{k=0}^n{n \choose k}a^{k+1}b^{n-k}\right) + \left(\sum_{k=0}^n{n \choose k}a^kb^{n-k+1}\right)\]
We switch out the index in the first sum, using $i=k-1$
\[\left(\sum_{i=1}^{n+1}{n \choose i-1}a^ib^{n-i+1}\right) + \left(\sum_{k=0}^n{n \choose k}a^kb^{n-k+1}\right)\]
We rename the first index back to $k$ and split off the highest and lowest summands of the first and second sum respectively:
\[a^{n+1}b^0 + \left(\sum_{k=1}^n{n \choose k-1}a^kb^{n-k+1}\right) + \left(\sum_{k=1}^n{n \choose k}a^kb^{n-k+1}\right) + a^0b^{n+1}\]
We merge the two sums using \autoref{binomialequalities}:
\[ a^{n+1}b^0 + \left( \sum_{k=1}^n{n+1 \choose k}a^kb^{n-k+1} \right) + a^0b^{n+1} \]
...and pull the outer two summands back in:
\[ \sum_{k=0}^{n+1}{n+1 \choose k}a^kb^{(n+1)-k} \]
...proving the claim.
\section{Basic Algebra}
A \emph{monoid} is a structure ($G$,$\circ$,$e$) such that $\circ$ is \emph{associative} and $e$ is a \emph{unit} - i.e.
\[ \forall a,b,c. (a\circ b)\circ c = a\circ(b\circ c) \]
\[ \forall a. a\circ e = e \circ a = e \]
A monoid is called \emph{abelian} (or \emph{commutative}), if for all $a,b\in G$ we have
\[a\circ b = b \circ a\]
Given $g\in G$, we call $h\in G$ a \emph{right-inverse} of $g$ if $g\circ h = e$, and a \emph{left-inverse} of $g$ if $h\circ g = e$. This leads to the following definition:
A \emph{group} is a monoid such that for every $g\in G$ there is a $g^{-1}$ such that
\[g\circ g^{-1} = g^{-1}\circ g = e\]
\ No newline at end of file
namespace smalg
namespace mitmalg
namespace sets
namespace fnd
smalg:?magma?base-set sets:?SetStructures/setstruct_theory?universe direction="both" collection="smglom"
smalg:?magma?operation mitmalg:?Magma/magma_theory?op direction="both" collection="smglom"
smalg:?unital?unit mitmalg:?Unital/unital_theory?unit direction="both" collection="smglom"
smalg:?group?inverse mitmalg:?Loop/loop_theory?inverse direction="both" collection="smglom"
namespace smar
namespace mitmar
namespace fnd
// Natural Numbers
smar:?naturalnumbers?NaturalNumbers fnd:?NatLiterals?nat_lit direction="both" collection="smglom"
smar:?naturalnumbers?natural-number fnd:?NatLiterals?nat_lit direction="both" collection="smglom"
smar:?naturalnumbers?NaturalNumbers direction="both" collection="smglom"
smar:?naturalnumbers?PositiveNaturalNumbers fnd:?NatLiterals?pos_lit direction="both" collection="smglom" direction="both" collection="smglom"
smar:?naturalnumbers?successor fnd:?NatLiterals?succ_nat_lit direction="both" collection="smglom"
smar:?natarith?lethan fnd:?NatLiterals?leq_nat_lit direction="both" collection="smglom"
smar:?natarith?remainder mitmar:?NaturalArithmetics?mod direction="both" collection="smglom"
smar:?natarith?exponentiation mitmar:?NaturalArithmetics?exponentiation direction="both" collection="smglom"
smar:?natarith?morethan mitmar:?NaturalArithmetics?morethan direction="both" collection="smglom"
smar:?natarith?lessthan mitmar:?NaturalArithmetics?lessthan direction="both" collection="smglom"
smar:?natarith?methan mitmar:?NaturalArithmetics?geq direction="both" collection="smglom"
smar:?factorial?factorial mitmar:?NaturalArithmetics?factorial direction="both" collection="smglom"
// smar:?natarith?addition fnd:?NatLiterals?plus_nat_lit // n-ary vs binary
// smar:?natarith?multiplication fnd:?NatLiterals?times_nat_lit // n-ary vs binary
// smar:?natarith?subtraction mitmar:?NaturalArithmetics?subtraction // n-ary vs binary, has additional proof argument
// smar:?natarith?modulus mitmar:?NaturalArithmetics?div // has additional proof argument
// smar:?natarith?root mitmar:?NaturalArithmetics?root // has additional proof argument
// smar:?natarith?square-root mitmar:?NaturalArithmetics?squareRoot // has additional proof argument
// smar:?natarith?mmorethan // n-ary
// smar:?natarith?mlessthan // n-ary
// smar:?natarith?mmethan // n-ary