viewfinder.tex 17.5 KB
 Michael Kohlhase committed Apr 26, 2018 1 2 Let $C$ be a corpus of theories with the same fixed meta-theory $M$. We call the problem of finding theory morphisms between theories of $C$ the \defemph{view finding problem} and an algorithm that solves it a \defemph{view finder}.  Florian Rabe committed Apr 27, 2018 3   Michael Kohlhase committed Apr 26, 2018 4 Note that a view finder is sufficient to solve the theory classification use case from the introduction:  Dennis Müller committed Apr 28, 2018 5 Jane provides a $M$-theory $Q$ of beautiful sets, the view finder computes all (total) morphisms from $Q$ into $C$, and returns presentations of target theories and the assignments made by the morphisms.  Florian Rabe committed Apr 26, 2018 6   Florian Rabe committed Apr 27, 2018 7 \paragraph{Efficiency Considerations}  Michael Kohlhase committed Apr 26, 2018 8 The cost of this problem quickly explodes.  Florian Rabe committed Apr 26, 2018 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 First of all, it is advisable to restrict attention to simple morphisms. Eventually we want to search for arbitrary morphisms as well. But that problem is massively harder because it subsumes theorem proving: a morphism from $\Sigma$ to $\Sigma'$ maps $\Sigma$-axioms to $\Sigma'$-proofs, i.e., searching for a theory morphism requires searching for proofs. Secondly, if $C$ has $n$ theories, we have $n^2$ pairs of theories between which to search. (It is exactly $n^2$ because the direction matters, and even morphisms from a theory to itself are interesting.) Moreover, for two theories with $m$ and $n$ constants, there are $n^m$ possible simple morphisms. (It is exactly $n^m$ because morphisms may map different constants to the same one.) Thus, we can in principle enumerate and check all possible simple morphisms in $C$. But for large $C$, it quickly becomes important to do so in an efficient way that eliminates ill-typed or uninteresting morphisms early on. Thirdly, it is desirable to search for \emph{partial} theory morphisms as well. In fact, identifying refactoring potential is only possible if we find partial morphisms: then we can refactor the involved theories in a way that yields a total morphism. Moreover, many proof assistant libraries do not follow the little theories paradigm or do not employ any theory-like structuring mechanism at all. These can only be represented as a single huge theory, in which case we have to search for partial morphisms from this theory to itself. While partial morphism can be reduced to and then checked like total ones, searching for partial morphisms makes the number of possible morphisms that must be checked much larger. Finally, even for a simple theory morphism, checking reduces to a set of equality constraints, namely the constraints $\vdash_{\Sigma'}\ov{\sigma}(E)=E'$ for the type-preservation condition. Depending on $M$, this equality judgment may be undecidable and require theorem proving.  Florian Rabe committed Apr 27, 2018 29 \paragraph{Algorithm Overview}  Florian Rabe committed Apr 26, 2018 30 31 32 33 34 35 36 37 38 39 40 41 42 43 A central motivation for our algorithm is that equality in $M$ can be soundly approximated very efficiently by using a normalization function on $M$-expressions. This has the additional benefit that relatively little meta-theory-specific knowledge is needed, and all such knowledge is encapsulated in a single well-understood function. Thus, we can implement theory morphism--search generically for arbitrary $M$. Our algorithm consists of two steps. First, we preprocess all constant declarations in $C$ with the goal of moving as much intelligence as possible into a step whose cost is linear in the size of $C$. Then, we perform the morphism search on the optimized data structures produced by the first step. \subsection{Preprocessing} The preprocessing phase computes for every constant declaration $c:E$ a normal form $E'$ and then the long abstract syntax tree for $E'$. Both steps are described below. \paragraph{Normalization}  Florian Rabe committed Apr 27, 2018 44 45 46 47 48 49 Normalization involves two steps. Firstly, \textbf{MMT-level} normalization performs generic transformations that do not depend on the meta-theory $M$. These include elaboration of structured theories and definition expansion, which we mentioned in Sect.~\ref{sec:prelim}. Secondly, \textbf{meta-theory-level} normalization applies an $M$-specific normalization function, which we assume as a black box for now and discuss further in Sect.~\ref{sec:preproc}.  Florian Rabe committed Apr 26, 2018 50 51 52  \paragraph{Abstract Syntax Trees} We define \textbf{abstract syntax trees} as pairs $(t,s)$ where $t$ is subject to the grammar  Dennis Müller committed Apr 26, 2018 53  $t ::= C_{Nat} \mid V_{Nat} \mid \ombind{t}{t^+}{t^+}$  Florian Rabe committed Apr 26, 2018 54 (where $Nat$ is a non-terminal for natural numbers) and $s$ is a list of constant names.  Dennis Müller committed Apr 26, 2018 55 We obtain an abstract syntax tree from an MMT expression $E$ by (i) switching to de-Bruijn representation of bound variables and (ii) replacing all occurrences of constants with $C_i$ in such a way that every $C_i$ refers to the $i$-th element of $s$.  Florian Rabe committed Apr 26, 2018 56 57 58 59 60 61  Abstract syntax trees have the nice property that they commute with the application of simple morphisms $\sigma$: If $(t,s)$ represents $E$, then $\sigma(E)$ is represented by $(t,s')$ where $s'$ arises from $s$ by replacing every constant with its $\sigma$-assignment. The above does not completely specify $i$ and $s$ yet, and there are several possible canonical choices among the abstract syntax trees representing the same expression. The trade-off is subtle because we want to make it easy to both identify and check theory morphisms later on.  Dennis Müller committed Apr 26, 2018 62 We call $(t,s)$ the \textbf{long} abstract syntax tree for $E$ if $C_i$ replaces the $i$-th occurrence of a constant in $E$ when $E$ is read in left-to-right order.  Florian Rabe committed Apr 26, 2018 63 In particular, the long tree does not merge duplicate occurrences of the same constant into the same number.  Dennis Müller committed Apr 26, 2018 64 The \textbf{short} abstract syntax tree for $E$ arises from the long one by removing all duplicates from $s$ and replacing the $C_i$ accordingly.  Florian Rabe committed Apr 26, 2018 65   Florian Rabe committed Apr 27, 2018 66 \begin{example}  Dennis Müller committed Apr 28, 2018 67 Consider again the axiom $\forall x:\cn{set},y:\cn{set}.\; \cn{beautiful}(x) \wedge y \subseteq x \Rightarrow \cn{beautiful}(y)$ with internal representation $\ombind{\forall}{x:\cn{set},y:\cn{set}}{\oma{\Rightarrow}{\oma{\wedge}{\oma{\cn{beautiful}}{x},\oma{\subseteq}{y,x}},\oma{\cn{beautiful}}{y}}}.$  Florian Rabe committed Apr 26, 2018 68   Dennis Müller committed Apr 26, 2018 69 The \emph{short} syntax tree and list of constants associated with this term would be:  Dennis Müller committed Apr 28, 2018 70 71 72 73 74 $t=\ombind{C_1}{C_2,C_2}{\oma{C_3}{\oma{C_4}{\oma{C_5}{V_2},\oma{C_6}{V_1,V_2}},\oma{C_5}{V_1}}}$ $s=(\forall,\cn{set},\Rightarrow,\wedge,\cn{beautiful},\subseteq)$ The corresponding long syntax tree is : $t=\ombind{C_1}{C_2,C_3}{\oma{C_4}{\oma{C_5}{\oma{C_6}{V_2},\oma{C_7}{V_1,V_2}},\oma{C_8}{V_1}}}$ $s=(\forall,\cn{set},\cn{set},\Rightarrow,\wedge,\cn{beautiful},\subseteq,\cn{beautiful})$  Florian Rabe committed Apr 27, 2018 75 \end{example}  Dennis Müller committed Apr 26, 2018 76   Florian Rabe committed Apr 27, 2018 77 For our algorithm, we pick the \emph{long} abstract syntax tree, which may appear surprising.  Florian Rabe committed Apr 26, 2018 78 79 80 81 The reason is that shortness is not preserved when applying a simple morphism: whenever a morphism maps two different constants to the same constant, the resulting tree is not short anymore. Longness, on the other hand, is preserved. The disadvantage that long trees take more time to traverse is outweighed by the advantage that we never have to renormalize the trees.  Dennis Müller committed Apr 26, 2018 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 %\begin{oldpart}{FR: replaced with the above} % %\subsection{The General Algorithm} The aim is to find typing-preserving morphisms between theories, i.e. given a constant $C:t$ in a theory $T_1$, we want to find a view $V:T_1\to T_2$ such that if $V(C)=C'$ and $C':t'\in T_2$, we have $V(t)=t'$. Correspondingly, we need to consider the types of the constants in our two theories, which we assume to be flat. % %To run a proper unification algorithm is in our situation infeasible, since the flat version of a theory can become prohibitively large (and obviously finding two unifiable types in two theories is a search problem quadratic in the number of constants). To solve that problem, we first preprocess our theories such that pre-selecting plausibly unifiable'' constants becomes as fast as possible. % %\paragraph{} We do so by first transforming each constant $C$ in a theory to an \emph{encoding} $(\cn{h}(C),\cn{syms}(C))$ the following way: % %Consider the syntax tree of the type $t$ of a constant $C$. We first systematically replace the leaves by an abstract representation, yielding a data structure $\cn{tree}(C)$. We can eliminate variables by replacing them by their De Bruijn index, and symbol references by enumerating them and storing the symbol's names in a list $\cn{syms}(C)$. % %As a result, we get a pair $(\cn{tree}(C),\cn{syms}(C))$. Now an assignment $V(C)=D$ is valid, iff $\cn{tree}(C)=\cn{tree}(D)$, the lists $\cn{syms}(C)$ and $\cn{syms}(D)$ have the same length and their pairwise assignments $V(\cn{syms}(C)_i)=\cn{syms}(D)_i$ are all valid. % %Furthermore, since we only need $\cn{tree}(C)$ for an equality check, we can immediately replace $\cn{tree}(C)$ by an integer hashcode $\cn{h}(C)$. % %Now given a constant $C\in T_1$, we can find valid matches in $T_2$ by computing the encodings for each constant in $T_1$ and $T_2$, taking the list of constants $C'\in T_2$ with $\cn{h}(C')=\cn{h}(C)$ and recursing into their respective lists of symbols. %\end{oldpart}  Florian Rabe committed Apr 26, 2018 98 99 100 101  \subsection{Search} Consider two constants $c:E$ and $c':E'$ preprocessed into long abstract syntax trees $(t,s)$ and $(t',s')$.  Dennis Müller committed Apr 26, 2018 102 103 104 105 106 It is now straightforward to show the following Lemma: \begin{lemma} The assignment $c\mapsto c'$ is well-typed in a morphism $\sigma$ if $t=t'$ (in which case $s$ and $s'$ must have the same length $l$) and $\sigma$ also contains $s_i\mapsto s'_i$ for $i=1,\ldots,l$. \end{lemma} Of course, the condition about $s_i\mapsto s'_i$ may be redundant if $s$ contain duplicates; but because $s$ has to be traversed anyway, it is cheap to skip all duplicates. We call the set of assignments $s_i\mapsto s'_i$ the \textbf{prerequisites} of $c\mapsto c'$.  Florian Rabe committed Apr 26, 2018 107   Florian Rabe committed Apr 27, 2018 108 109 110 111 This lemma is the center of our search algorithm explained in \begin{lemma}[Core Algorithm] Consider two constant declarations $c$ and $c'$ in theories $\Sigma'$ and $\Sigma'$. We define a morphism by starting with $\sigma=c\mapsto c'$ and recursively adding all prerequisites to $\sigma$ until  Florian Rabe committed Apr 26, 2018 112 \begin{compactitem}  Florian Rabe committed Apr 27, 2018 113 114  \item either the recursion terminates \item or $\sigma$ contains two different assignments for the same constant, in which case we fail.  Florian Rabe committed Apr 26, 2018 115 116 \end{compactitem}  Florian Rabe committed Apr 27, 2018 117 118 119 If the above algorithm succeeds, then $\sigma$ is a well-typed partial simple morphism from $\Sigma$ to $\Sigma'$. \end{lemma}  Dennis Müller committed Apr 28, 2018 120 121 122 123 124 125 126 \begin{example}\label{ex:beautifinite} Consider two constants $c,c'$ with types $\forall x:\cn{set},y:\cn{set}.\; \cn{beautiful}(x) \wedge y \subseteq x \Rightarrow \cn{beautiful}(y)$ and $\forall x:\cn{powerset},y:\cn{powerset}.\; \cn{finite}(x) \wedge y \subseteq x \Rightarrow \cn{finite}(y)$. Their syntax trees are $t=t'=\ombind{C_1}{C_2,C_3}{\oma{C_4}{\oma{C_5}{\oma{C_6}{V_2},\oma{C_7}{V_1,V_2}},\oma{C_8}{V_1}}}$ $s=(\forall,\cn{set},\cn{set},\Rightarrow,\wedge,\cn{beautiful},\subseteq,\cn{beautiful})$ $s'=(\forall,\cn{powerset},\cn{powerset},\Rightarrow,\wedge,\cn{finite},\subseteq,\cn{finite})$ Since $t=t'$, we set $c\mapsto c'$ and compare $s$ with $s'$, meaning we check (ignoring duplicates) that $\forall\mapsto\forall$, $\cn{set}\mapsto\cn{powerset}$, $\Rightarrow\mapsto\Rightarrow$, $\wedge\mapsto\wedge$, $\cn{beautiful}\mapsto\cn{finite}$ and $\subseteq\mapsto\subseteq$ are all valid. \end{example}  Florian Rabe committed Apr 27, 2018 127 128 129 130 131 To find all morphisms from $\Sigma$ to $\Sigma'$, we first run the core algorithm on every pair of $\Sigma$-constants and $\Sigma'$-constants. This usually does not yield big morphisms yet. For example, consider the typical case where theories contain some symbol declarations and some axioms, in which the symbols occur. Then the core algorithm will only find morphisms that map at most $1$ axiom.  Dennis Müller committed Apr 28, 2018 132 Depending on what we intend to do with the results, we might prefer to consider them individually (e.g. to yield \emph{alignments} in the sense of \cite{KKMR:alignments:16}). But we can also use these small morphisms as building blocks to construct larger, possibly total ones:  Florian Rabe committed Apr 27, 2018 133 134 135 136 137 \begin{lemma}[Amalgamating Morphisms] We call two partial morphisms \textbf{compatible} if they agree on all constants for which both provide an assignment. The union of compatible well-typed morphisms is again well-typed. \end{lemma}  Dennis Müller committed Apr 28, 2018 138 139 140 141 142 143 \begin{example} Consider the partial morphism from Example \ref{ex:beautifinite} and imagine a second partial morphism for the axioms $\cn{beautiful}(\emptyset)$ and $\cn{finite}(\emptyset)$. The former has the requirements $\forall\mapsto\forall,\quad \cn{set}\mapsto\cn{powerset}\quad \Rightarrow\mapsto\Rightarrow \quad \wedge\mapsto\wedge \quad \cn{beautiful}\mapsto\cn{finite} \quad \subseteq\mapsto\subseteq$ The latter requires only $\cn{set}\mapsto\cn{powerset}$ and $\emptyset\mapsto\emptyset$. Since both morphisms agree on all assignments, we can merge them into one morphism mapping both axioms and all requirements of both. \end{example}  Dennis Müller committed Apr 20, 2018 144   Florian Rabe committed Apr 27, 2018 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 \subsection{Optimizations}\label{sec:algparams} The above presentation is intentionally simple to convey the general idea. In the sequel, we describe several improvements that we have implemented to be scalable. \paragraph{Choice of Meta-Theory and Meta-Morphisms} The above descriptions treats meta-theory constants like any other constant. As such it finds also morphisms that do not fix the meta-theory. Our implementation extends our algorithm so that a meta-theory is fixed. All we have to do is when building the abstract syntax trees to retain all references to constants of the meta-theory instead of replacing them with numbers. %Firstly, consider a library of theories in logic $L$ in which we want to find intra-library morphisms. %At first glance, the meta-theory should consist of $L$ and the logical framework used to define $L$ (if any). %Therefore, our algorithm allows fixing a meta-theory: % %But often the library contains certain theories that represent concrete mathematical objects, e.g., the real numbers and constants for their operations. %Most of the time, we want to fix those constants as well instead of finding morphisms that, e.g., map real addition to any other commutative binary operator anywhere in the library. %But sometimes we do not want to fix them, e.g., if there are multiple axiomatizations of the real numbers that we want to refactor. % %Secondly, consider two libraries in logics $L$ and $L'$ between which we want to find inter-library morphisms. %Here it makes sense to use only the logical framework used to define the two logics as the meta-theory. %Moreover, if we have already found a partial morphism $m$ that maps, e.g., the propositional connectives of $L$ to those of $L'$, we can use $m$ as a \textbf{meta-morphism}. %In that case, we are only looking for morphisms that extend $m$. % %That can be useful because it lets us use the same preprocessing result even if we flexibly change the meta-theory. %Changing the meta-theory happens more often than one might think, and we present two examples.  Dennis Müller committed Apr 20, 2018 172   Florian Rabe committed Apr 27, 2018 173 The symbols will hence be encoded in the component $t$ instead of the list $s$. This will not only exclude spurious matches, but also reduce the number of plausible matches and consequently speed up the algorithm.  Dennis Müller committed Apr 20, 2018 174   Dennis Müller committed Apr 26, 2018 175 \paragraph{Picking starting pairs:} Note, that we will still find many spurious matches if executed in its general form. The reason being that (for example) all atomic types match each other, as will all binary operators etc. Most of these results will not be interesting. Furthermore, since the algorithm needs to recurse into the lists $s$, many potential matches will need to be checked repeatedly. Both problems can be massively reduced by selecting specific pairs of encodings as \emph{starting pairs} for the algorithm, so that the majority of matching constants will only be considered if the algorithm runs into them during recursing. Potential useful starting points are:  Dennis Müller committed Apr 20, 2018 176 \begin{itemize}  Dennis Müller committed Apr 20, 2018 177  \item \emph{Axioms:} Since we are mostly interested in matching constants that share the same mathematical properties, by using axioms as starting point we can ensure that the algorithm only matches constants that have at least one (axiomatic) property in common (e.g. only commutative, or associative operators).  Dennis Müller committed Apr 26, 2018 178  \item \emph{The length of $s$ in the short abstract syntax tree:} By choosing a minimal length for $s$, we can guarantee that only morphisms will be produced that relate a minimal number of distinct constants.  Dennis Müller committed Apr 20, 2018 179 180 \end{itemize}  Dennis Müller committed Apr 28, 2018 181 \paragraph{Picking starting theories:} If we try to find morphisms between whole collections of theories, we can obviously disregard all theories that are already included in some other theory in our collections, since we work with a normalized (and dependency-closed) version of a theory. Consequently, by only using maximal theories we do not find any fewer morphisms but speed up the search significantly.  Dennis Müller committed Apr 28, 2018 182   Michael Kohlhase committed Apr 29, 2018 183 \paragraph{Storing Encodings} Finally, for computing the encodings of a theory we only need to know the symbols to be fixed in the component $t$ of an abstract syntax tree, for which only the meta-theories of the theories are relevant. They also determine the specific preprocessings and translations we want to likely use. Since there is only a small number of meta-theories involved that are relevant in practice, we can store and retrieve the encodings for the most important situations. Since computing the encodings (as well as sorting the theories in a library by their dependencies) is the most expensive part of the algorithm, this -- once computed and stored -- makes the view finding process itself rather efficient.  Michael Kohlhase committed Apr 26, 2018 184 185 186  %%% Local Variables: %%% mode: latex  Michael Kohlhase committed Apr 29, 2018 187 188 %%% eval: (set-fill-column 5000) %%% eval: (visual-line-mode)  Michael Kohlhase committed Apr 26, 2018 189 190 %%% TeX-master: "paper" %%% End:  Michael Kohlhase committed Apr 28, 2018 191 192  % LocalWords: defemph defemph emph vdash_ normalization optimized compactenum textbf th  Michael Kohlhase committed Apr 29, 2018 193 % LocalWords: t,s ednote renormalize oldpart syms mapsto mapsto ldots,l compactitem y,x  Michael Kohlhase committed Apr 28, 2018 194 % LocalWords: normalizing subtyping subterms preprocessings ldots ldots formalization  Michael Kohlhase committed Apr 29, 2018 195 196 197 % LocalWords: vdash Rightarrow vdash vdash vdash forall vdash sec:preproc subseteq c,c % LocalWords: subseteq ex:beautifinite powerset powerset emptyset emptyset normalized % LocalWords: Optimizations sec:algparams axiomatizations