\caption{Theory Classification for beautiful sets}\label{fig:theory-classification-ex}

\end{figure}

\paragraph{Approach and Contribution}

We have developed the MMT language \cite{RK:mmt} and the concrete syntax of the OMDoc XML format \cite{omdoc} as a uniform representation language for mathematical knowledge.

Moreover, we have exported multiple proof assistant libraries into this format, including the ones of PVS in \cite{KMOR:pvs:17} and HOL Light in \cite{RK:hollight:15}.

This enables us, for the first time, to apply generic methods --- i.e., methods that work at the MMT level --- to search for theory morphisms in these libraries.

\paragraph{Related Work}

Existing systems have so far only worked with explicitly given theory morphisms, e.g., in IMPS \cite{imps} or Isabelle \cite{isabelle}.

Automatically and systematically searching for new theory morphisms was first undertaken in \cite{immanuel's paper at MKM 2006} in 2006.

However, at that time no large corpora of formalized mathematics were available in standardized formats that would have allowed easily testing the ideas in large corpora.

This situation has changed since then as multiple such exports have become available.

In particular, we have developed the MMT language \cite{RK:mmt} and the concrete syntax of the OMDoc XML format \cite{omdoc} as a uniform representation language for such corpora.

And we have translated multiple proof assistant libraries into this format, including the ones of PVS in \cite{KMOR:pvs:17} and HOL Light in \cite{RK:hollight:15}.

Building on these developments, we are now able, for the first time, to apply generic methods --- i.e., methods that work at the MMT level --- to search for theory morphisms in these libraries.

While inspired by the ideas of \cite{immanuel's paper at MKM 2006}, our design and implementation are completely novel.

In particular, the theory makes use of the rigorous language-independent definitions of \emph{theory} and \emph{theory morphism} provided by MMT, and the practical implementation makes use of the MMT system, which provides high-level APIs for these concepts.

\cite{cezary+thibault paper} applies techniques related to ours to a related problem.

Instead, of theory morphisms inside a single corpus, they use machine learning to find similar constants in two different corpora.

Their results can roughly be seen as a single partial morphism from one corpus to the other.

\paragraph{Approach and Contribution}

Our contribution is twofold.

Firstly, we present such a generic theory morphism finder. The algorithm tries to match two symbols by unifying their types. This is made efficient by separating the term into a hashed representation of its abstract syntax tree (which serves as a fast plausibility check for pre-selecting matching candidates) and the list of symbol occurences in the term, into which the algorithm recurses.

Firstly, we present a the design and implementation of a generic theory morphism finder that works with arbitrary corpora represented in MMT.

The algorithm tries to match two symbols by unifying their types.

This is made efficient by separating the term into a hashed representation of its abstract syntax tree (which serves as a fast plausibility check for pre-selecting matching candidates) and the list of symbol occurrences in the term, into which the algorithm recurses.

Secondly, we apply this view finder in two concrete case studies. \ednote{add 1-2 sentences for each case study}

\paragraph{Related Work}

Existing systems have so far only worked with explicitly given theory morphisms, e.g., in IMPS \cite{imps} or Isabelle \cite{isabelle}.

Automatically and systematically searching for new theory morphisms, let alone doing so generically, is entirely novel as far as we know.

\ednote{FR: I really don't know any related work. Is there anything?}

\paragraph{Overview}

In Section~\ref{sec:prelim}, we revise the basics of MMT and the representations of (exemplary) the PVS and HOL Light libraries.

Intuitively, \mmt is a declarative language for theories and theory morphisms over an arbitrary object language.

Its treatment of object languages is abstract enough to subsume most practically relevant logics and type theories.

Fig.~\ref{fig:mmt} gives an overview of the fundamental MMT concepts.

In the simplest case, \textbf{theories}$\Sigma$ are lists of \textbf{constant declarations}$c:E$, where $E$ is an expression that may use the previously declared constants.

Naturally, $E$ must be subject to some type system (which MMT is also parametric in), but the details of this are not critical for our purposes here.

We say that $\Sigma'$ includes $\Sigma$ if it contains every constant declaration of $\Sigma$.

Correspondingly, a \textbf{theory morphism}$\sigma:\Sigma\to\Sigma'$ is a list of \textbf{assignments}$c\mapsto e'$ of $\Sigma'$-expressions $e'$ to $\Sigma$-constants $c$.

To be well-typed, $\sigma$ must preserve typing, i.e., we must have $\vdash_{\Sigma'}e':\ov{\sigma}(E)$.

Here $\ov{\sigma}$ is the homomorphic extension of $\sigma$, i.e., the map of $\Sigma$-expressions to $\Sigma'$-expressions that substitutes every occurrence of a $\Sigma'$-constant with the $\Sigma'$-expression assigned by $\sigma$.

We call $\sigma$\textbf{simple} if the expressions $e'$ are always $\Sigma'$-\emph{constants} rather than complex expressions.

The type-preservation condition for an assignment $c\mapsto c'$ reduces to $\ov{\sigma}(E)=E'$ where $E$ and $E'$ are the types of $c$ and $c'$.

We call $\sigma$\textbf{partial} if it does not contain an assignment for every $\Sigma$-constant.

A partial morphism from $\Sigma$ to $\Sigma'$ is the same as a total morphism from some theory included by $\Sigma$ to $\Sigma'$.

Importantly, we can then show generally at the MMT-level that if $\sigma$ is well-typed, then $\ov{\sigma}$ preserves all $\Sigma$-judgments.

In particular, if we represent proofs as typed terms, theory morphisms preserve the theoremhood of propositions.

This property makes theory morphism so valuable for structuring, refactoring, and integrating large corpora.

MMT achieves language-independence through the use of \textbf{meta-theories}: every MMT-theory may designate a previously defined theory as its meta-theory.

For example, when represent the HOL Light library in MMT, we first write a theory $L$ for HOL Light.

Then each theory in the HOL Light library is represented as a theory with $L$ as its meta-theory.

In fact, we usually go one step further: $L$ itself is a theory, whose meta-theory is a logical framework such as LF.

That allows $L$ to concisely define the syntax and inference system of HOL Light.

However, for our purposes, it suffices to say that the meta-theory is some fixed theory relative to which all concepts are defined.

Thus, we assume that $\Sigma$ and $\Sigma'$ have the same meta-theory $M$, and that $\ov{\sigma}$ maps all $M$-constants to themselves.

\begin{figure}[htb]

\begin{center}

\begin{tabular}{|l|l|l|}

\hline

\multicolumn{3}{|c|}{meta-theory: a fixed theory $M$}\\

%\hline

\hline

& Theory $\Sigma$& Morphism $\sigma:\Sigma\to\Sigma'$\\

\hline

set of & typed constant declarations $c:E$& assignments $c\mapsto E'$\\

$\Sigma$-expressions $E$& formed from $M$- and $\Sigma$-constants & mapped to $\Sigma'$ by homomorphic extension \\

\hline

\end{tabular}

\end{center}

\caption{Overview of MMT Concepts}\label{fig:mmt}

\end{figure}

It remains to define the exact syntax of expressions:

\begin{ctabular}{l@{\quad}c@{\quad}l}

$\Gamma$&$::=$&$(x:E)^\ast$\\

$E$&$::=$&$c \mid x \mid\ombind{E}{\Gamma}{E^+}$\\

\end{ctabular}

Here $c$ refers to constants (of the meta-theory or previously declared in the current theory) and $x$ refers to bound variables.

Complex expressions are of the form $\ombind{o}{x_1:t_1,\ldots,x_m:t_m}{a_1,\ldots,a_n}$, where

\begin{compactitem}

\item$o$ is the operator that forms the complex expression,

\item$x_i:t_i$ declares variable of type $t_i$ that are bound by $o$ in subsequent variable declarations and in the arguments,

\item$a_i$ is an argument of $o$

\end{compactitem}

The bound variable context may be empty, and we write $\oma{o}{\vec{a}}$ instead of $\ombind{o}{\cdot}{\vec{a}}$.

For example, \ednote{give examples}

Finally, we remark on a few additional features of the MMT language that are important for large-scale case studies but not critical to understand the basic intuitions of results.

MMT provides a module systems that allows theories to instantiate and import each other. The module system is conservative: every theory can be elaborated into one that only declares constants.

MMT constants may carry an optional definiens, in which case we write $c:E=e$.

Defined constants can be eliminated by definition expansion.

%universe of kinds & $\cons{\kind}{\cdot}{\cdot}$ & $\kind$ \\

%dependent function types & $\Pi(x:A;t)$ & $\Pi x:A.t$ \\

%\hline

%\end{tabular}

%\end{center}

%\end{example

\begin{oldpart}{FR: replaced with the above}

For the purposes of this paper, we will work with the (only slightly simplified) grammar given in Figure \ref{fig:mmtgrammar}.

\begin{figure}[ht]\centering\vspace*{-1em}

...

...

@@ -42,6 +132,8 @@ We can eliminate all includes in a theory $T$ by simply copying over the constan

An assignment in a view $V:T_1\to T_2$ is syntactically well-formed if for any assignment $C=t$ contained, $C$ is a constant declared in the flattened domain $T_1$ and $t$ is a syntactically well-formed term in the codomain $T_2$. We call a view \emph{total} if all \emph{undefined} constants in the domain have a corresponding assignment and \emph{partial} otherwise.

\end{oldpart}

\subsection{Proof Assistant Libraries in MMT}\label{sec:oaf}

As part of the OAF project \ednote{cite}, we have imported several proof assistant libraries into the MMT system. To motivate some of the design choices made in this paper, we will outline the general procedure behind these imports.

Consider a corpus $C$ of theories with the same fixed meta-theory $M$.

Our goal is to find theory morphisms between them.

The cost of this problem quickly explodes.

First of all, it is advisable to restrict attention to simple morphisms.

Eventually we want to search for arbitrary morphisms as well.

But that problem is massively harder because it subsumes theorem proving: a morphism from $\Sigma$ to $\Sigma'$ maps $\Sigma$-axioms to $\Sigma'$-proofs, i.e., searching for a theory morphism requires searching for proofs.

Secondly, if $C$ has $n$ theories, we have $n^2$ pairs of theories between which to search.

(It is exactly $n^2$ because the direction matters, and even morphisms from a theory to itself are interesting.)

Moreover, for two theories with $m$ and $n$ constants, there are $n^m$ possible simple morphisms.

(It is exactly $n^m$ because morphisms may map different constants to the same one.)

Thus, we can in principle enumerate and check all possible simple morphisms in $C$.

But for large $C$, it quickly becomes important to do so in an efficient way that eliminates ill-typed or uninteresting morphisms early on.

Thirdly, it is desirable to search for \emph{partial} theory morphisms as well.

In fact, identifying refactoring potential is only possible if we find partial morphisms: then we can refactor the involved theories in a way that yields a total morphism.

Moreover, many proof assistant libraries do not follow the little theories paradigm or do not employ any theory-like structuring mechanism at all.

These can only be represented as a single huge theory, in which case we have to search for partial morphisms from this theory to itself.

While partial morphism can be reduced to and then checked like total ones, searching for partial morphisms makes the number of possible morphisms that must be checked much larger.

Finally, even for a simple theory morphism, checking reduces to a set of equality constraints, namely the constraints $\vdash_{\Sigma'}\ov{\sigma}(E)=E'$ for the type-preservation condition.

Depending on $M$, this equality judgment may be undecidable and require theorem proving.

A central motivation for our algorithm is that equality in $M$ can be soundly approximated very efficiently by using a normalization function on $M$-expressions.

This has the additional benefit that relatively little meta-theory-specific knowledge is needed, and all such knowledge is encapsulated in a single well-understood function.

Thus, we can implement theory morphism--search generically for arbitrary $M$.

Our algorithm consists of two steps.

First, we preprocess all constant declarations in $C$ with the goal of moving as much intelligence as possible into a step whose cost is linear in the size of $C$.

Then, we perform the morphism search on the optimized data structures produced by the first step.

\subsection{Preprocessing}

The preprocessing phase computes for every constant declaration $c:E$ a normal form $E'$ and then the long abstract syntax tree for $E'$.

Both steps are described below.

\paragraph{Normalization}

Normalization involves two steps:

\begin{compactenum}

\item MMT-level normalization performs generic transformations that do not depend on the meta-theory $M$.

These are in particular elaboration of structured theories and definition expansion, which we mentioned in Sect.~\ref{sec:prelim}.

\item Meta-theory-level normalization applies an $M$-specific normalization function, which we assume as a black box for now and discuss further in Sect.~\ref{sec:}.

\end{compactenum}

\paragraph{Abstract Syntax Trees}

We define \textbf{abstract syntax trees} as pairs $(t,s)$ where $t$ is subject to the grammar

(where $Nat$ is a non-terminal for natural numbers) and $s$ is a list of constant names.

We obtain an abstract syntax tree from an MMT expression $E$ by (i) switching to de-Bruijn representation of bound variables and (ii) replacing all occurrences of constants with $Con(i)$ in such a way that every $Con(i)$ refers to the $i$-th element of $s$.

Abstract syntax trees have the nice property that they commute with the application of simple morphisms $\sigma$:

If $(t,s)$ represents $E$, then $\sigma(E)$ is represented by $(t,s')$ where $s'$ arises from $s$ by replacing every constant with its $\sigma$-assignment.

The above does not completely specify $i$ and $s$ yet, and there are several possible canonical choices among the abstract syntax trees representing the same expression.

The trade-off is subtle because we want to make it easy to both identify and check theory morphisms later on.

We call $(t,s)$ the \textbf{long} abstract syntax tree for $E$ if $Con(i)$ replaces the $i$-th occurrence of a constant in $E$ when $E$ is read in left-to-right order.

In particular, the long tree does not merge duplicate occurrences of the same constant into the same number.

The \textbf{short} abstract syntax tree for $E$ arises from the long one by removing all duplicates from $s$ and replacing the $Con(i)$ accordingly.

\ednote{continue a running example somewhere here}

In our algorithm, we pick the \emph{long} abstract syntax tree, which may appear surprising.

The reason is that shortness is not preserved when applying a simple morphism: whenever a morphism maps two different constants to the same constant, the resulting tree is not short anymore.

Longness, on the other hand, is preserved.

The disadvantage that long trees take more time to traverse is outweighed by the advantage that we never have to renormalize the trees.

\begin{oldpart}{FR: replaced with the above}

\subsection{The General Algorithm} The aim is to find typing-preserving morphisms between theories, i.e. given a constant $C:t$ in a theory $T_1$, we want to find a view $V:T_1\to T_2$ such that if $V(C)=C'$ and $C':t'\in T_2$, we have $V(t)=t'$. Correspondingly, we need to consider the types of the constants in our two theories, which we assume to be flat.

To run a proper unficiation algorithm is in our situation infeasible, since the flat version of a theory can become prohibitively large (and obviously finding two unifiable types in two theories is a search problem quadratic in the number of constants). To solve that problem, we first preprocess our theories such that pre-selecting plausibly ``unifiable'' constants becomes as fast as possible.

To run a proper unification algorithm is in our situation infeasible, since the flat version of a theory can become prohibitively large (and obviously finding two unifiable types in two theories is a search problem quadratic in the number of constants). To solve that problem, we first preprocess our theories such that pre-selecting plausibly ``unifiable'' constants becomes as fast as possible.

\paragraph{} We do so by first transforming each constant $C$ in a theory to an \emph{encoding}$(\cn{h}(C),\cn{syms}(C))$ the following way:

...

...

@@ -11,6 +80,25 @@ As a result, we get a pair $(\cn{tree}(C),\cn{syms}(C))$. Now an assignment $V(C

Furthermore, since we only need $\cn{tree}(C)$ for an equality check, we can immediately replace $\cn{tree}(C)$ by an integer hashcode $\cn{h}(C)$.

Now given a constant $C\in T_1$, we can find valid matches in $T_2$ by computing the encodings for each constant in $T_1$ and $T_2$, taking the list of constants $C'\in T_2$ with $\cn{h}(C')=\cn{h}(C)$ and recursing into their respective lists of symbols.

\end{oldpart}

\subsection{Search}

Consider two constants $c:E$ and $c':E'$ preprocessed into long abstract syntax trees $(t,s)$ and $(t',s')$.

It is now straightforward to show the following \textbf{lemma}:

the assignment $c\mapsto c'$ is well-typed in a morphism $\sigma$ if $t=t'$ (in which case $s$ and $s'$ must have the same length $l$) and $\sigma$ also contains $s_i\mapsto s'_i$ for $i=1,\ldots,l$.

(Of course, the condition about $s_i\mapsto s'_i$ may be redundant if $s$ contain duplicates; but because $s$ has to be traversed anyway, it is cheap to skip all duplicates.)

We call the set of assignments $s_i\mapsto s'_i$ the \textbf{prerequisites} of $c\mapsto c'$.

\ednote{maybe make this a formal lemma}

This lemma is the center of our search algorithm:

We can pick any two constant declarations $c$ and $c'$ and build a morphism by starting with $\sigma=c\mapsto c'$ and recursively adding all prerequisites to $\sigma$ until

\begin{compactitem}

\item either the process terminates, in which case we have found a morphism,

\item or $\sigma$ contains two different assignments for the same constant (a contradiction), in which case we backtrack to the point where we picked $c$ and $c'$.

\end{compactitem}

\ednote{continue description of algorithm: how do we pick $c\mapsto c'$, and so on}

\subsection{Improvements and Parameters of the Algorithm}