README.MD 6.73 KB
Newer Older
Dennis Müller's avatar
init  
Dennis Müller committed
1
2
3
4
5
6
# FiFoM Repository

This repository contains all code associated with my *DAAD (German Academic Exchange Service) Postdoc Scholarship 12/2019-06/2020*; the proposal for which can be found [here](https://kwarc.info/people/dmueller/pubs/DAAD_proposal.pdf). This project attempts to use Machine Learning (based on a [GPT2](https://github.com/openai/gpt-2) Transformer-model trained from scratch on a significant subset of [arxiv.org](https://arxiv.org)-submissions) to augment generic LaTeX with [sTeX](https://github.com/slatex/sTeX)-annotations to disambiguate mathematical expressions.

## Structure

Dennis Müller's avatar
readme    
Dennis Müller committed
7
The majority of the code contained is written in Scala (requiring java >8) and managed by [build.sbt](https://gl.kwarc.info/dmueller/fifom/-/blob/master/build.sbt). The Machine Learning parts are written in Python 3 using the [transformers](https://github.com/huggingface/transformers) and [tokenizers](https://github.com/huggingface/tokenizers) libraries based on [PyTorch](https://pytorch.org/).
Dennis Müller's avatar
init  
Dennis Müller committed
8

Dennis Müller's avatar
readme    
Dennis Müller committed
9
10
11
12
13
14
15
- The [deploy](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy)-folder contains an assembled [jar-file](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/ML.jar) (bundling a working MMT instance), and an [example shell-file](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/example.sh) that attempts to translate the file [example.tex](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/example.tex) to sTeX.
- The [deploy/python](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/python)-folder contains the relevant python-code and pretrained models used. In particular, [python/gpt2](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/python/gpt2) contains the GPT2-model pretrained on an arxiv dump; [python/l2s](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/python/l2s) contains the model fine-tuned for LaTeX->sTeX translation.
- The [deploy/mathhub](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/mathhub)-folder contains the subset of [MathHub](https://mathhub.info) - consisting of [MMT](https://uniformal.github.io/)-archives - necessary for various evaluations during running (see below).
- The remaining folders contain various sbt modules for e.g. parsing LaTeX/sTeX, expanding sTeX-macros etc.:
  - [latex](https://gl.kwarc.info/dmueller/fifom/-/blob/master/latex) contains (mostly) code related to parsing, normalizing and modifying latex syntax.
  - [stex](https://gl.kwarc.info/dmueller/fifom/-/blob/master/stex) contains code specifically related to sTeX, such as parser extensions for `\symdef` statements and variants, verbalizing (in [Verbalizer.scala](https://gl.kwarc.info/dmueller/fifom/-/blob/master/stex/src/main/scala/com/jazzpirate/stex/Verbalizer.scala)) and checking (in [StexChecker.scala](https://gl.kwarc.info/dmueller/fifom/-/blob/master/stex/src/main/scala/com/jazzpirate/stex/StexChecker.scala)).
  - [mmt](https://gl.kwarc.info/dmueller/fifom/-/blob/master/mmt) contains code relates to interactions with the MMT-system. [MitM.scala](https://gl.kwarc.info/dmueller/fifom/-/blob/master/mmt/src/main/scala/com/jazzpirate/mmt/MitM.scala) contains non-trivial alignment translation too complex for simple `.align`-files and generic code for translating between MitM and SMGLoM, [Generator.scala](https://gl.kwarc.info/dmueller/fifom/-/blob/master/mmt/src/main/scala/com/jazzpirate/mmt/Generator.scala) contains code related to random term generation.
Dennis Müller's avatar
init  
Dennis Müller committed
16
17
18
19


## Running

Dennis Müller's avatar
readme    
Dennis Müller committed
20
The [example shell-file](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/example.sh) shows how to run the translator on a given LaTeX file. Notably, the file does not need to have a full header, so the translator can be run on fragments as well. The [ML.jar](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/ML.jar) takes the following command line arguments:
Dennis Müller's avatar
init  
Dennis Müller committed
21
- `-tex <file>`: The (absolute or relative) path to the tex-file to be translated.
Dennis Müller's avatar
readme    
Dennis Müller committed
22
- `-mathhub <folder>`: An (optional) path to a folder containing MMT-archives, which is needed for the following options. If present, all translations will be heuristically checked for whether they are valid sTeX-expressions (on the basis of the [smglom](https://kwarc.info/systems/smglom/)-archives; contained in the [deploy/mathhub](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/mathhub)-folder) and logged to the console. Green fragments are considered sTeX, yellow fragments are assumed to be variables, red fragments are considered invalid (but may still be valid LaTeX). 
Dennis Müller's avatar
init  
Dennis Müller committed
23
- `-eval_latex`: If present, all translations will have their containing sTeX-macros expanded, the result is heuristically normalized and compared to the original input.
Dennis Müller's avatar
readme    
Dennis Müller committed
24
25
- `-alignments <folder>`: An optional path to a repository containing [Alignments](https://kwarc.info/people/dmueller/pubs/thesis.pdf) between sTeX/smglom-macros and the strongly typed [Math-in-the-Middle](https://kwarc.info/people/dmueller/pubs/thesis.pdf) MMT-archive (also contained in the [deploy/mathhub](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/mathhub)-folder), used for the following option.
- `-stex <file>`: An (optional) path to a text file containing the correct features (one per line) for the input tex-file. If present, the translations will be evaluated in comparison to these.
Dennis Müller's avatar
init  
Dennis Müller committed
26
27
28
29
30
- `-typecheck`: If present, all translations will be:
  - Converted to [OMDoc](https://kwarc.info/systems/omdoc/) and imported to MMT. This requires a working installation of [LaTeXML](https://dlmf.nist.gov/LaTeXML/) and the [LaTeXML-sTeX-Plugin](https://github.com/slatex/LaTeXML-Plugin-sTeX) to work.
  - Translated - using the available alignments - to a typed Math-in-the-Middle expression, and
  - Type checked by having its Math-in-the-Middle type inferred. A success guarantees that the translation is well-formed and well-typed.

Dennis Müller's avatar
readme    
Dennis Müller committed
31
32
33
## Additional Information

The Scala and Python parts communicate via TCP on port 65432; with the [Python file](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/python/la2s.py) acting as server and the [ML.jar](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/ML.jar) acting as client. If `ML.jar` can not connect to a server on port 65432, it will attempt to run `python3 la2s.py` in `./python`, effectively running a server (which will be closed automatically when the program terminates).
Dennis Müller's avatar
init  
Dennis Müller committed
34
35

A *known bug* occasionally causes the program to freeze - it is unclear wheher that happens in the python code or the TCP-connection. `la2s.py` logs to the file `l2s.log`, but without errors. Curiously, the error disappears if server and client are started separately (making it practically impossible to debug this error). Consequently, if this happens, I recommend first running `python3 la2s.py` manually, waiting until it outputs `Server running` and then running `ML.jar`, which will connect to the already running python server.