Commit e09b3c14 authored by Dennis Müller's avatar Dennis Müller
Browse files

readme

parent 9a44e756
......@@ -4,26 +4,32 @@ This repository contains all code associated with my *DAAD (German Academic Exch
## Structure
The majority of the code contained is written in Scala (requiring java >8) and managed by [build.sbt](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/build.sbt). The Machine Learning parts are written in Python 3 using the [transformers](https://github.com/huggingface/transformers) and [tokenizers](https://github.com/huggingface/tokenizers) libraries based on [PyTorch](https://pytorch.org/).
The majority of the code contained is written in Scala (requiring java >8) and managed by [build.sbt](https://gl.kwarc.info/dmueller/fifom/-/blob/master/build.sbt). The Machine Learning parts are written in Python 3 using the [transformers](https://github.com/huggingface/transformers) and [tokenizers](https://github.com/huggingface/tokenizers) libraries based on [PyTorch](https://pytorch.org/).
- The [deploy](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy)-folder contains an assembled [jar-file](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/ML.jar) (bundling a working MMT instance), and an [example shell-file](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/example.sh) that attempts to translate the file [example.tex](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/example.tex) to sTeX.
- The [deploy/python](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/python)-folder contains the relevant python-code and pretrained models used. In particular, [python/gpt2](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/python/gpt2) contains the GPT2-model pretrained on an arxiv dump; [python/l2s](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/python/l2s) contains the model fine-tuned for LaTeX->sTeX translation.
- The [deploy/mathhub](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/mathhub)-folder contains the subset of [MathHub](https://mathhub.info) - consisting of [MMT](https://uniformal.github.io/)-archives - necessary for various evaluations during running (see below).
- The remaining folders contain various sbt modules for e.g. parsing LaTeX/sTeX, expanding sTeX-macros etc.
- The [deploy](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy)-folder contains an assembled [jar-file](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/ML.jar) (bundling a working MMT instance), and an [example shell-file](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/example.sh) that attempts to translate the file [example.tex](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/example.tex) to sTeX.
- The [deploy/python](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/python)-folder contains the relevant python-code and pretrained models used. In particular, [python/gpt2](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/python/gpt2) contains the GPT2-model pretrained on an arxiv dump; [python/l2s](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/python/l2s) contains the model fine-tuned for LaTeX->sTeX translation.
- The [deploy/mathhub](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/mathhub)-folder contains the subset of [MathHub](https://mathhub.info) - consisting of [MMT](https://uniformal.github.io/)-archives - necessary for various evaluations during running (see below).
- The remaining folders contain various sbt modules for e.g. parsing LaTeX/sTeX, expanding sTeX-macros etc.:
- [latex](https://gl.kwarc.info/dmueller/fifom/-/blob/master/latex) contains (mostly) code related to parsing, normalizing and modifying latex syntax.
- [stex](https://gl.kwarc.info/dmueller/fifom/-/blob/master/stex) contains code specifically related to sTeX, such as parser extensions for `\symdef` statements and variants, verbalizing (in [Verbalizer.scala](https://gl.kwarc.info/dmueller/fifom/-/blob/master/stex/src/main/scala/com/jazzpirate/stex/Verbalizer.scala)) and checking (in [StexChecker.scala](https://gl.kwarc.info/dmueller/fifom/-/blob/master/stex/src/main/scala/com/jazzpirate/stex/StexChecker.scala)).
- [mmt](https://gl.kwarc.info/dmueller/fifom/-/blob/master/mmt) contains code relates to interactions with the MMT-system. [MitM.scala](https://gl.kwarc.info/dmueller/fifom/-/blob/master/mmt/src/main/scala/com/jazzpirate/mmt/MitM.scala) contains non-trivial alignment translation too complex for simple `.align`-files and generic code for translating between MitM and SMGLoM, [Generator.scala](https://gl.kwarc.info/dmueller/fifom/-/blob/master/mmt/src/main/scala/com/jazzpirate/mmt/Generator.scala) contains code related to random term generation.
## Running
The [example shell-file](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/example.sh) shows how to run the translator on a given LaTeX file. Notably, the file does not need to have a full header, so the translator can be run on fragments as well. The [ML.jar](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/ML.jar) takes the following command line arguments:
The [example shell-file](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/example.sh) shows how to run the translator on a given LaTeX file. Notably, the file does not need to have a full header, so the translator can be run on fragments as well. The [ML.jar](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/ML.jar) takes the following command line arguments:
- `-tex <file>`: The (absolute or relative) path to the tex-file to be translated.
- `-mathhub <folder>`: An (optional) path to a folder containing MMT-archives, which is needed for the following options. If present, all translations will be heuristically checked for whether they are valid sTeX-expressions (on the basis of the [smglom](https://kwarc.info/systems/smglom/)-archives; contained in the [deploy/mathhub](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/mathhub)-folder) and logged to the console. Green fragments are considered sTeX, yellow fragments are assumed to be variables, red fragments are considered invalid (but may still be valid LaTeX).
- `-mathhub <folder>`: An (optional) path to a folder containing MMT-archives, which is needed for the following options. If present, all translations will be heuristically checked for whether they are valid sTeX-expressions (on the basis of the [smglom](https://kwarc.info/systems/smglom/)-archives; contained in the [deploy/mathhub](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/mathhub)-folder) and logged to the console. Green fragments are considered sTeX, yellow fragments are assumed to be variables, red fragments are considered invalid (but may still be valid LaTeX).
- `-eval_latex`: If present, all translations will have their containing sTeX-macros expanded, the result is heuristically normalized and compared to the original input.
- `-alignments <folder>`: An optional path to a repository containing [Alignments](https://kwarc.info/people/dmueller/pubs/thesis.pdf) between sTeX/smglom-macros and the strongly typed [Math-in-the-Middle](https://kwarc.info/people/dmueller/pubs/thesis.pdf) MMT-archive (also contained in the [deploy/mathhub](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/mathhub)-folder), used for the following option.
- `-alignments <folder>`: An optional path to a repository containing [Alignments](https://kwarc.info/people/dmueller/pubs/thesis.pdf) between sTeX/smglom-macros and the strongly typed [Math-in-the-Middle](https://kwarc.info/people/dmueller/pubs/thesis.pdf) MMT-archive (also contained in the [deploy/mathhub](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/mathhub)-folder), used for the following option.
- `-stex <file>`: An (optional) path to a text file containing the correct features (one per line) for the input tex-file. If present, the translations will be evaluated in comparison to these.
- `-typecheck`: If present, all translations will be:
- Converted to [OMDoc](https://kwarc.info/systems/omdoc/) and imported to MMT. This requires a working installation of [LaTeXML](https://dlmf.nist.gov/LaTeXML/) and the [LaTeXML-sTeX-Plugin](https://github.com/slatex/LaTeXML-Plugin-sTeX) to work.
- Translated - using the available alignments - to a typed Math-in-the-Middle expression, and
- Type checked by having its Math-in-the-Middle type inferred. A success guarantees that the translation is well-formed and well-typed.
The Scala and Python parts communicate via TCP on port 65432; with the [Python file](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/python/la2s.py) acting as server and the [ML.jar](https://github.com/Jazzpirate/DAAD-FiFoM/blob/master/deploy/ML.jar) acting as client. If `ML.jar` can not connect to a server on port 65432, it will attempt to run `python3 la2s.py` in `./python`, effectively running a server (which will be closed automatically when the program terminates).
## Additional Information
The Scala and Python parts communicate via TCP on port 65432; with the [Python file](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/python/la2s.py) acting as server and the [ML.jar](https://gl.kwarc.info/dmueller/fifom/-/blob/master/deploy/ML.jar) acting as client. If `ML.jar` can not connect to a server on port 65432, it will attempt to run `python3 la2s.py` in `./python`, effectively running a server (which will be closed automatically when the program terminates).
A *known bug* occasionally causes the program to freeze - it is unclear wheher that happens in the python code or the TCP-connection. `la2s.py` logs to the file `l2s.log`, but without errors. Curiously, the error disappears if server and client are started separately (making it practically impossible to debug this error). Consequently, if this happens, I recommend first running `python3 la2s.py` manually, waiting until it outputs `Server running` and then running `ML.jar`, which will connect to the already running python server.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment