thesis-projects issues

semantics extraction based on machine learning

2017-04-13T08:05:33Z

We have a couple of corpora from which we want to extract semantic features. Examples are

* quantity expressions like "3m/s" (three meters per second) or "two furlongs per fortnight"
* polarity of identifiers in formulae (essentially, which symbols in a formula can be substituted for)
* the location of "definitions/theorems/assumptions" (and their definienda, definientia, and statements)
* or, more generally, the content form of a formula

If we knew any of those, we could build nice semantic features (e.g. better screen readers for visually challenged people or better scientific search engines) relatively directly.

We have a couple of large corpora, e.g. the [arXMLiv corpus](cortex.mathweb.org/corpus/arXMLiv/tex_to_html) or the data behind the [Online Encyclopaedia of Integer Sequences](http://oeis.org). All of them are (probably) amenable to machine-learning methods. In some cases, we already have some data about the phenomena above, which can act as a baseline.

The topic is to pick one or more of these aspects of semantics, see what contemporary statistical AI methods can do to scale them up to corpus size, and develop a symbolic application (possibly with a lot of help from the group).

Theory Graph Minimization

2017-07-31T19:36:29Z

KWARC develops and imports theory graphs for mathematical knowledge. A theory consists of symbols and declarations (statements that describe the properties and interactions of symbols). For an "object-oriented" development we use special declarations: inclusions and structures that import other theories (symbol/declaration inheritance). Furthermore, theories can be connected by special truth-preserving mappings: views.

In practice (especially when curating theory graphs) it often happens that the inclusions, structures, and views are not minimal: we call an inclusion "A includes B" minimal iff there is no theory C such that including C in A would give the same result (the details are more complicated, but rather straightforward). Minimality of other theory graph components is similar, and a theory graph is minimal iff all its components are.

Theoretically, non-minimal theory graphs are not a problem, but minimizing them maximizes the induced knowledge space (more theorems are induced). Furthermore, in practice, minimal theory graphs are easier to deal with (e.g. inclusions and views form a transitive skeleton, which reduces edge clutter, which can be fatal in large graphs).

For bigger theory graphs, minimality is not trivial (and very tedious) for humans to determine, so we would like a tool to support this. This minimization would be implemented in the [MMT API](http://uniformal.github.io) (i.e. in Scala) and would propose changes that make the graph minimal. These could be delivered as a patch to the surface syntax of the theory graph (e.g. [MMT native syntax](http://uniformal.github.io) or [sTeX](https://github.com/KWARC/sTeX)), which can be applied directly (e.g. in an interactive text-based tool that can be integrated into an editor), or in the form of a pull request on http://gl.mathhub.info (where the theory graphs live).
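
To make the "transitive skeleton" idea concrete, here is a minimal sketch in Python (not the MMT API, which is Scala). It assumes a strongly simplified reading of minimality: an inclusion is treated as redundant if its target is still reachable through other inclusions, and the inclusion graph is assumed to be acyclic. The theory names and the function names `reachable` and `redundant_includes` are made up for illustration; structures and views are ignored.

```python
def reachable(graph, start, skip_edge=None):
    """Return all theories reachable from `start` via includes, ignoring `skip_edge`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for target in graph.get(node, ()):
            if (node, target) == skip_edge or target in seen:
                continue
            seen.add(target)
            stack.append(target)
    return seen


def redundant_includes(graph):
    """List include edges (a, b) where b is still reachable from a without that edge."""
    return sorted(
        (a, b)
        for a, targets in graph.items()
        for b in targets
        if b in reachable(graph, a, skip_edge=(a, b))
    )


# Hypothetical toy graph: Ring includes Group and Monoid, Group includes Monoid.
includes = {"Ring": {"Group", "Monoid"}, "Group": {"Monoid"}, "Monoid": set()}
proposed_removals = redundant_includes(includes)
```

On the toy graph, the only proposed change is to drop the direct inclusion of `Monoid` in `Ring`. A real tool would additionally have to check that removing an edge does not change the induced declarations, which this sketch does not attempt.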

Active Course notes

2017-07-31T19:36:30Z

All of Michael's teaching materials are marked up semantically in [sTeX](https://github.com/KWARC/sTeX), which can be transformed into OMDoc/iMMT-based **active documents**. These have embedded semantic services, e.g.

* **guided tours**: generated mini-courses that explain a particular concept in the document
* **definition lookup**: click on a word or symbol in a formula and get a popup with the definition
* **technical dictionary**: find the German words for technical terms in the English slides
* **user modelling**: the system monitors your progress in understanding the material and adapts its services (e.g. the guided tours do not show you things you already know)
* **pop-quiz**: where you can show (the user model) that you understood things

We have the basic technology to do this, but it needs to be revisited to actually turn it into a tool that students can use.

Finding and Presenting Alignments in Math Libraries

2017-07-31T19:36:30Z

We import formal and informal libraries of mathematics into the [MathHub](http://mathhub.info) system. The formal libraries are mostly theorem prover libraries (Mizar, HOL Light, Isabelle, IMPS, PVS, ...) and the informal ones are based on annotated LaTeX. We observe that there is quite a lot of overlap between the libraries, and we have developed a notion of "alignments" (see a [recent paper](http://kwarc.info/kohlhase/submit/alignments16.pdf)) and have collected an [initial set](https://gl.mathhub.info/alignments/Public).

This general topic has various sub-projects that can be tackled individually or together:

1. building an alignment navigator/interface, which allows users to get an overview of the overlap and how concepts relate (Akbar Oripov is working on this at Jacobs University).
2. manually finding complex alignments and categorizing them, and building a curation system for that.
3. building software that goes over the libraries and suggests alignments of various categories (to be curated in the system above).

Semantics extraction for assistive tools for scientists with disabilities (Ba...

2017-07-31T19:36:30Z

According to one study, about 9% of internet users have disabilities that make reading difficult (blindness, dyslexia, ...). For regular web pages, books, etc. there are assistive technologies such as the screen readers JAWS (Windows) or VoiceOver. Recently, these have acquired functionality to deal with MathML (e.g. MathML3 is now part of the DAISY and ARIA standards). But the functionality is only as good as the input ("junk-in->junk-out" principle), so a formula like $X^{1/2}$ reads as "X upper index 1 slash 2" in e.g. JAWS instead of "X to the power of one half" (or even "square root of X"). If we had a system for semantics extraction (see #2 for an example), we could generate better input for screen readers and annotate the documents with it. Semantics extraction + integration into screen readers is a topic that could be tackled in part (Bachelor) or fully (Master) in a thesis. We have excellent contacts in the assistive technologies community that would lead to real-world deployments and thus real-world impact.

Metalogical formalization of Dynamic Logics

2017-07-31T19:36:30Z

Dynamic logics are popular as target systems for semantics construction. There is quite a variety of them:

* DRT (Kamp & Reyle) and variants (including sDRT and lambda-DRT)
* DPL (Groenendijk and Stokhof) and variants (including DMG)
* ...

It would be nice if the [LATIN](https://mathhub.info/MMT/LATIN) logic atlas had some of these logics. The main problem is that we cannot just use the LF metalogical framework directly, but have to extend it with dynamic primitives. This makes things a lot harder. @dmueller has already worked a bit on this, so he can supervise.

A Mathematical Infrastructure for slightly advanced Analysis

2017-07-31T19:36:30Z

We need (e.g. for #3) a seed infrastructure of flexiformal theories for (slightly more advanced) analysis. This could be done in MMT on the one hand and in [the SMGloM](http://smglom.mathhub.info) on the other (actually in both; they also need to be synchronized by symbols). In particular:

* integrals (Riemann, Lebesgue, path, volume, ...)
* manifolds
* vector fields and tensors
* div/grad representations vs. coordinate representations (can we write down the views?)
* ODEs and PDEs

Spotting and Searching Quantity Expressions technical/scientific documents (t...

2018-02-07T14:28:44Z

The KWARC group has converted the [Cornell EPrint arXiv](http://arxiv.org) (> 1.1 million papers in physics, maths, CS, etc.) into HTML5. This resource contains millions of quantity expressions like "3m/s", "13 square mile feet", or "five astronomical units". One step towards a "physics search engine" would be to make those searchable as quantity expressions, i.e. we could find "3m/s" by searching for "10.8 km/h" or "?x furlongs per fortnight", independently of the unit used to represent the quantity. We have most of the parts needed for such a search engine; here is what would be needed to build it:

* build a spotter for quantity expressions for deployment in [CorTeX](http://cortex.mathweb.org)

cc: @dmueller @miancu
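
As a rough illustration of unit-independent search, the following Python sketch spots a few numeric quantity expressions with a regular expression and normalizes them to SI values, so that "10.8 km/h" matches "3m/s". It is only a toy baseline under stated assumptions: the unit table `UNITS`, the pattern, and the function names `spot` and `matches` are hypothetical, spelled-out numbers ("five astronomical units") and query variables ("?x") are not handled, and none of this reflects the existing CorTeX infrastructure.

```python
import re
from fractions import Fraction

# Hypothetical conversion table: unit -> (exact factor to the SI unit, dimension).
UNITS = {
    "m/s": (Fraction(1), "speed"),
    "km/h": (Fraction(1000, 3600), "speed"),
    # 1 furlong = 201.168 m, 1 fortnight = 14 days
    "furlongs per fortnight": (Fraction(201168, 1000) / (14 * 24 * 3600), "speed"),
}

# Longest unit names first, so that the alternation prefers the most specific unit.
_UNIT_PATTERN = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
QUANTITY = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>" + _UNIT_PATTERN + ")")


def spot(text):
    """Return (dimension, SI value) for every quantity expression found in `text`."""
    found = []
    for match in QUANTITY.finditer(text):
        factor, dimension = UNITS[match.group("unit")]
        found.append((dimension, Fraction(match.group("value")) * factor))
    return found


def matches(query, document):
    """True if some quantity in the query equals some quantity in the document."""
    document_quantities = spot(document)
    return any(q in document_quantities for q in spot(query))
```

Exact rational arithmetic is used so that equality after conversion is not disturbed by floating-point rounding; a corpus-scale index would more likely store normalized values and search with a tolerance.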

implementing Realms as an MMT structural feature

2018-02-07T14:41:37Z

We have this global language feature of realms; see

* http://kwarc.info/kohlhase/papers/cicm14-realms.pdf
* http://kwarc.info/kohlhase/papers/cicm15-induced.pdf
* http://kwarc.info/kohlhase/papers/cicm15-recaps.pdf

This is a very well-understood practical device that brings theory graphs much closer to mathematical practice. It needs to be implemented in MMT (following and building on Mihnea Iancu's thesis), and a management interface needs to be designed. And, of course, it needs to be tested on the examples Mihnea and I have designed.

applications of viewfinding

2018-05-15T06:02:08Z

@dmueller has recently implemented a viewfinder, i.e. a function that finds MMT views (including partial ones) within an OMDoc/MMT library or between several libraries. This [paper](http://kwarc.info/kohlhase/submit/cicm18-viewfinder.pdf) describes the specifics and lists a number of applications at the end. The project is to implement one or more of these applications in [MathHub](https://mathhub.info). The necessary steps are

* fully understand viewfinding and the intended application
* build an extended example and produce the mathematical resources
* specify the context of use and the user interactions
* build a front-end that supports these

Integrate Kwarc tools with GitLab

2018-05-17T13:20:01Z

GitLab is a major open-source, locally installable, GitHub-style git repository manager. Kwarc uses it as the backend of our MathHub system; see [https://gl.mathhub.info/users/sign_in](https://gl.mathhub.info/users/sign_in). GitLab is mostly monolithic, but like any open-source tool it must offer some interface to customize it. In this project, a student

* peruses GitLab for such options
* identifies potential for integrating Kwarc knowledge management solutions with it
* designs and implements these integrations

Examples and starting points:

* GitLab uses some general-purpose syntax highlighting framework, but currently our .mmt files are rendered as plain text. Find out which framework GitLab uses, write a syntax highlighter for .mmt files in it, and configure GitLab to use it.
* GitLab provides a very simple plugin interface for event handling. Write an event handler that scans every new issue for special annotations that tie individual issues to MMT URIs. Then customize MMT-based tools to display links to these issues, e.g., jEdit can display a button whenever an issue for a declaration exists.
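
The issue-scanning idea could start from something like the following Python sketch. It deliberately does not touch any GitLab API; it only shows the text-processing core that an event handler might call. The annotation syntax (`mmt-uri: <URI>` on its own line), the example URI, and the function names are assumptions made up for illustration, since the actual annotation format is yet to be designed.

```python
import re

# Hypothetical annotation syntax: a line of the form "mmt-uri: <URI>" in the issue text.
ANNOTATION = re.compile(r"^mmt-uri:\s*(\S+)\s*$", re.MULTILINE)


def mmt_uris(issue_text):
    """Return all MMT URIs annotated in an issue description, in order of appearance."""
    return ANNOTATION.findall(issue_text)


def index_issues(issues):
    """Map each annotated MMT URI to the ids of the issues that mention it."""
    index = {}
    for issue_id, text in issues.items():
        for uri in mmt_uris(text):
            index.setdefault(uri, []).append(issue_id)
    return index


# Made-up issue texts keyed by issue id.
example_issues = {
    7: "The proof is wrong.\nmmt-uri: http://example.org/math?Group?inverse\n",
    9: "General question without annotation.",
}
issue_index = index_issues(example_issues)
```

An MMT-based tool such as the jEdit plugin could then look up a declaration's URI in `issue_index` to decide whether to show a link or button.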

MMT as a parser framework for Scala APIs

2018-07-02T20:21:23Z

See https://github.com/UniFormal/MMT/issues/368

Data Dimension for MathHub.info

2018-07-25T00:03:43Z

The [MathHub.info](http://mathhub.info) system currently only has active/semantic documents and theory graphs. But we would like to integrate and host mathematical databases like the [OEIS](http://oeis.org) or [LMFDB](http://lmfdb.org) with mathematical interfaces. The theory and part of the implementation have already been done in [virtual theories](https://github.com/OpenDreamKit/OpenDreamKit/blob/master/WP6/MACIS17-vt/crc.pdf) and are proposed as part of #20. Now we need to build the infrastructure in [MathHub.info](http://mathhub.info). Concretely, this involves

* installing and exposing a database server (probably Postgres or similar)
* extending the MMT `lmh` extension to create databases for any archive with virtual theories (i.e. theories for which the necessary codecs and schema theories exist)
* developing a process for syncing the access permissions in the database with those for the archive at http://gl.mathhub.info
* developing an MMT `data` extension with a structure component that creates/updates tables for MMT theories and whose object component uses queries for MMT expressions

Implement Lenses/States for MathHub

2018-11-24T09:41:32Z

We would like to have an infrastructure for quality control in MathHub. The model that seems adequate is the "lenses" model introduced by the [connexions project; now OpenStax](http://cnx.org). The idea is to allow open submission to MathHub and do the quality control later by allowing anyone to define a named "lens", which endorses certain objects as "interesting and (sufficiently) good" and which users can use to direct their attention. The important innovation is that lenses are public and can be created, followed, copied & extended by anyone. Communities will usually have their own "quality approved" lens, and lenses can also be used as an "overlay journal" over MathHub. Some details of the idea are in this [blue note](https://gl.kwarc.info/smglom/blue/tree/master/contmgt/note.pdf), and there is a recent [MathHub issue](https://github.com/MathHubInfo/Frontend/issues/88) that goes into more detail about the implementation.

Formalization of legal content

2019-05-02T10:12:22Z

As part of the ALMANAC project, we are currently developing new MMT modules to deal with conflict-laden content. These include context graphs (theory graphs with attack relations), formalizations of nonmonotonic logics and argumentation theories (as context graphs or in classical MMT), and computation and visualization components (embedding of argumentation semantic solvers, TGView graph viewer). The goal of this project is a first application of these developments in an effort to formalize paradigmatic documents (rulings, example cases) in law. Legal content involves a plethora of interesting formalization challenges such as defeasible reasoning, ambiguity, precedence, and argumentation. Therefore, it provides an ideal arena to gain a sense of the current capabilities of our systems as well as the open requirements. The project would entail choosing a suitable document (or collection of documents) and flexiformalizing it. The final document will include content of various formalization levels, which can be visualized and processed by ALMANAC's graphical and computational tools. The undertaking would be closely supervised jointly by members of the ALMANAC project and our Legal-Tech cooperation partner [Prof. Dr. Axel Adrian](https://www.str2.rw.fau.de/lehrstuhl/honorarprofessor/).

Identifying Mathematical Objects up to Canonical Isomorphism

2019-06-04T22:57:39Z

@mkohlhase @dmueller

# Motivation

Mathematics pervasively identifies objects if they are canonically isomorphic. Examples:

* define the rationals as fractions (pairs of integers, quotiented by cancelling), then identify the integer z with the fraction z/1
* the entire number hierarchy N <: Z <: Q etc. is built like above
* elements of a ring R are identified with constant polynomials over R
* elements of a ring R are identified with elements in localizations over R
* polynomials are identified with the respective elements in the field of fractions
* generators of a group are identified with elements of the generated group (which are technically equivalence classes)

It is very common that

* the isomorphism is from a smaller structure to a subset of a bigger structure
* the bigger structure is built from the smaller one (i.e., we cannot simply *define* the smaller one as a subset of the bigger one because we need it to build the bigger one)
* the embedding of the smaller into the bigger structure preserves some operations (e.g., the embedding N -> Z is a semiring morphism, the embedding of R into R[X] is a ring morphism) and can therefore not simply be represented as just a function
* the set of preserved properties is extended later on as additional properties are considered (e.g., the embedding N -> Z is also an order morphism)

Virtually no tool for formalized mathematics can handle this at all, let alone elegantly. It requires fundamentally different formal systems that we have not designed yet.

# Idea

For a certain special case, theory morphisms may be a solution:

* We can define the structures as theories and the embedding as a theory morphism, e.g.,
  * `N = {n: type, z: n, s: n->n}`
  * `Z = {i: type, z: i, s: i->i, p: i->i, s(p(x))=x, p(s(x))=x, leq: i->i->prop}`
  * `emb: N->Z = {n={x:i|z leq x}, z=z, s=s}`

In particular, the theory morphism would capture which operations are preserved.

In this topic, you build a case study formalizing canonical isomorphisms in this way. You will identify and possibly solve any theoretical or practical problems that come up along the way.

# Technical Problems

## Weak Embeddings

Often the embedding function is not a theory morphism because it does not preserve all properties of a structure s. Such embeddings can only be expressed as theory morphisms out of S if S is an inadequately weak formalization of s. Often the theory S amounts to specifying a category in which s is the initial object. More generally, we can think of s as the structure that is freely generated by the syntax of S.

An important special case is that of non-injective embeddings. For example, the embedding of the natural numbers into the theory of rings is not necessarily injective. This can only be expressed as a theory morphism if the natural numbers are formulated without the injectivity of successor.

Similarly, we may have to allow for non-surjective embeddings if we cannot capture the image of the embedding as a predicate subtype (e.g., when embedding natural numbers into real numbers, where the logic cannot express the needed predicate). In the case of natural numbers, this means the induction axiom should be optional as well.

## Models as Theories

There are multiple ways to represent mathematical structures (here called models) in a formal system.

*Concrete* models of theory T are given as tuples of their universes and operations, e.g., (N,0,1,+,*) for the semiring of natural numbers. (Technically, this tuple also includes proofs of all axioms.) Concrete models can be represented in two different ways:

* internal models: models are represented as record values of a record type corresponding to T
* external models: models are represented as theory morphisms out of T

*Abstract* models of theory T are represented as theories. These theories contain undefined constants and axioms. Examples are the models for N and Z above. Typically, these theories M extend T or admit very simple morphisms, usually injective renamings, T -> M.

The critical idea of this case study is to use abstract models. This is the only way that allows using theory morphisms for the embeddings. It remains open how this relates to other, concrete, representations of models.

Representing Relational Databases in a Logical Framework

2019-10-08T11:08:32Z

@kohlhase @twiesing

There is a very elegant embedding of relational databases into MMT.

* We write a theory D that declares all the basic datatypes (integers, strings, products, etc.) of the database language.
* A table schema is represented as a theory S with meta-theory D. The constants of S correspond to the columns of the table.
* Local consistency conditions (which are often omitted in relational databases) are represented as axioms in S.
* A table entry in S is represented as a morphism S -> D, i.e., entries are records whose fields are given by S.
* A join is represented as a pushout. For example, the join of S with S', where the field f of S should be equal to the field f' of S', is the pushout of {f} -include-> S and {f} -{f=f'}-> S'. The entries of the join are the universal morphisms out of the pushout for every pair (e,e') of entries for which the respective diagram commutes.
* A database view providing a table S backed by a table T is represented as an MMT morphism m: S -> T. As a special case, include morphisms represent the selection of some columns from a table. The entries of the view are the morphisms m;e for all entries e of T.
* A writable view providing S backed by T allows modifying an entry of S if the change can be propagated back to the corresponding entry of T. If T is a declared theory, this is possible if m is a renaming/inclusion. If m is a view or join, this is more complicated.

The goal of this topic is to

* work out the details of the above
* investigate the propagation of changes along writable views in general
* determine how these representations can be used to provide added value to database applications

In particular, we could represent the schema of a database in MMT (as sets of theories and views). The entries would be stored in an actual database. The theory D can use high-level datatypes, which MMT's codecs translate to the concrete types of the database. Queries (like joins, selections, and views) can be formulated in MMT and translated to relational queries; conversely, entries can be translated back to MMT views.
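
The correspondence can be mimicked in a few lines of Python, which may help to see what "entries are morphisms" means operationally. This is only an analogy under strong simplifications, not MMT code: schema theories become dictionaries from column names to Python types (standing in for D), morphisms are restricted to renamings, axioms are ignored, and all table and column names are invented.

```python
# Schema theories: column name -> datatype from the base theory D (here: Python types).
T = {"id": int, "name": str, "city": str}
S = {"person": str, "town": str}  # schema of a view on T

# Morphism m: S -> T, restricted to renamings (each S column is assigned a T column).
m = {"person": "name", "town": "city"}


def is_entry(schema, entry):
    """An entry is a morphism schema -> D: every column gets a value of its type."""
    return entry.keys() == schema.keys() and all(
        isinstance(entry[col], typ) for col, typ in schema.items()
    )


def compose(morphism, entry):
    """The composition m;e: first rename along m, then evaluate in the entry e."""
    return {col: entry[target] for col, target in morphism.items()}


def join(entries, other_entries, f, f_prime):
    """Pairs (e, e') that agree on f = f', i.e. for which the diagram commutes."""
    return [(e, e2) for e in entries for e2 in other_entries if e[f] == e2[f_prime]]


t_entries = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Emmy", "city": "Erlangen"},
]
view_entries = [compose(m, e) for e in t_entries]
```

Because `m` is a renaming, a change to a view entry can be written back to the underlying `T` entry by inverting the dictionary, which is the easy case of writable views described above.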

Intellij support for MMT-jupyter-notebooks

2020-01-03T09:11:22Z

See https://github.com/UniFormal/IntelliJ-MMT/issues/28

More MathWebSearch Instances

2020-03-20T07:45:59Z

There are numerous mathematical resources that need [formula search engines](https://search.mathweb.org). Given Christoph Alt's thesis and the MWS front-end, this is mostly a matter of writing a good harvester. Some examples that come to mind:

* the [Stacks Project](https://stacks.math.columbia.edu/): 7000 pages of algebra in LaTeX; see https://github.com/MathWebSearch/frontend/issues/18
* the [OEIS](https://oeis.org): we had a search engine for the formulae at one point; we just need to resurrect and modernize it (see #27)
* ...

Best of all would be to create a joint index where MWS searches all of these together.

Lexicon Management in GLIF

2020-04-20T14:32:19Z

[**GLIF**](https://github.com/kwarc/GLIF) is a framework for describing the translation of natural language into logical expressions. This requires the specification of a grammar, a target logic, a domain theory in that logic, and a semantics construction (mapping of parse trees into the domain theory). If you participated in the LBS lecture, you should be familiar with the setup.

Adding a new word (like "woman") to a GLIF pipeline requires the following additions:

* abstract syntax: `woman_N : N;`
* concrete syntax: `woman_N = mkN "woman" "women";` (if e.g. German is also supported: `woman_N = mkN "Frau" feminine;`)
* domain theory: `woman : i -> o`
* semantics construction: `woman_N = woman`

There clearly is a lot of repetition here. Your task would be to improve this by designing a lexicon format from which the necessary files can be generated automatically. A naive attempt to write a lexicon entry could look like this:

```
woman noun
    eng: "woman" "women"
    ger: "Frau" feminine
```

We probably want customization in different places. For example, the semantics construction for the name John might be `john_PN = john` or `john_PN = [P] P john` depending on the context. Also, not every project uses the resource grammar library, so the operations used in the concrete syntax might vary.

The lexicon management should also be supported in GLIF's Jupyter front-end. It would also be interesting to take existing lexica and generate a generic lexicon from them, which can be imported into any project (with the required customization to make it fit in).
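
To illustrate the generation step, here is a small Python sketch that produces the four kinds of declarations listed above from a single lexicon entry. The in-memory entry format (a dictionary with the keys `lemma`, `category`, `type`, and `concrete`) and the function name `generate` are assumptions for illustration only; designing the actual lexicon format and its customization hooks is the point of the project.

```python
# Hypothetical in-memory form of one lexicon entry (the real format is to be designed).
entry = {
    "lemma": "woman",
    "category": "N",
    "type": "i -> o",
    "concrete": {"eng": 'mkN "woman" "women"', "ger": 'mkN "Frau" feminine'},
}


def generate(entry):
    """Produce the declarations for one entry, keyed by the kind of target file."""
    fun = f'{entry["lemma"]}_{entry["category"]}'
    files = {
        "abstract": f'{fun} : {entry["category"]};',
        "domain": f'{entry["lemma"]} : {entry["type"]}',
        "semantics": f'{fun} = {entry["lemma"]}',
    }
    # One concrete-syntax line per supported language.
    for lang, linearization in entry["concrete"].items():
        files[f"concrete_{lang}"] = f"{fun} = {linearization};"
    return files


generated = generate(entry)
```

Customization such as `john_PN = [P] P john` could be supported by letting an entry override the default right-hand side of the semantics construction, but that design choice is left open here.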