\begin{definition}\rm
  Let $C$ be a corpus of theories with the same fixed meta-theory $M$. We call the problem of finding theory morphisms between theories of $C$ the \defemph{view finding problem} and an algorithm that solves it a \defemph{view finder}.
\end{definition}
Note that a view finder is sufficient to solve the theory classification use case from the introduction: Jane provides an $M$-theory $Q$ of beautiful sets, the view finder computes all (total) views from $Q$ into $C$, and returns presentations of target theories and the assignments made by the views.

The cost of this problem quickly explodes.

First of all, it is advisable to restrict attention to simple morphisms. Eventually we want to search for arbitrary morphisms as well. But that problem is massively harder because it subsumes theorem proving: a morphism from $\Sigma$ to $\Sigma'$ maps $\Sigma$-axioms to $\Sigma'$-proofs, i.e., searching for a theory morphism requires searching for proofs.

Secondly, if $C$ has $n$ theories, we have $n^2$ pairs of theories between which to search. (It is exactly $n^2$ because the direction matters, and even morphisms from a theory to itself are interesting.) Moreover, for two theories with $m$ and $n$ constants, there are $n^m$ possible simple morphisms. (It is exactly $n^m$ because morphisms may map different constants to the same one.) Thus, we can in principle enumerate and check all possible simple morphisms in $C$. But for large $C$, it quickly becomes important to do so in an efficient way that eliminates ill-typed or uninteresting morphisms early on.
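To make the combinatorics concrete, the following minimal sketch counts the candidate space a naive enumeration would face; the corpus sizes are hypothetical and chosen only for illustration:

```python
# Size of the naive search space for simple morphisms.
# The numbers below are hypothetical, not taken from an actual library.

def candidate_pairs(num_theories: int) -> int:
    # Direction matters, and morphisms from a theory to itself are
    # interesting too, so there are n^2 ordered pairs of theories.
    return num_theories ** 2

def candidate_morphisms(m: int, n: int) -> int:
    # Each of the m constants of the source may map to any of the
    # n constants of the target (collisions allowed): n^m candidates.
    return n ** m

print(candidate_pairs(1000))        # 1000000 theory pairs
print(candidate_morphisms(10, 50))  # 50^10, roughly 9.8 * 10^16 for one pair
```

Even for modest sizes the per-pair count is astronomically larger than the pair count, which is why the search must prune ill-typed candidates early rather than enumerate.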
Thirdly, it is desirable to search for \emph{partial} theory morphisms as well. In fact, identifying refactoring potential is only possible if we find partial morphisms: then we can refactor the involved theories in a way that yields a total morphism. Moreover, many proof assistant libraries do not follow the little theories paradigm or do not employ any theory-like structuring mechanism at all. These can only be represented as a single huge theory, in which case we have to search for partial morphisms from this theory to itself. While partial morphisms can be reduced to total ones and then checked like them, searching for partial morphisms makes the number of possible morphisms that must be checked much larger.

Finally, even for a simple theory morphism, checking reduces to a set of equality constraints, namely the constraints $\vdash_{\Sigma'}\ov{\sigma}(E)=E'$ for the type-preservation condition. Depending on $M$, this equality judgment may be undecidable and require theorem proving.

A central motivation for our algorithm is that equality in $M$ can be soundly approximated very efficiently by using a normalization function on $M$-expressions. This has the additional benefit that relatively little meta-theory-specific knowledge is needed, and all such knowledge is encapsulated in a single well-understood function. Thus, we can implement theory morphism search generically for arbitrary $M$.

Our algorithm consists of two steps. First, we preprocess all constant declarations in $C$ with the goal of moving as much intelligence as possible into a step whose cost is linear in the size of $C$. Then, we perform the morphism search on the optimized data structures produced by the first step.

\subsection{Preprocessing}
The preprocessing phase computes for every constant declaration $c:E$ a normal form $E'$ and then the long abstract syntax tree for $E'$. Both steps are described below.
\paragraph{Normalization}
Normalization involves two steps:
\begin{compactenum}
\item MMT-level normalization performs generic transformations that do not depend on the meta-theory $M$. These are in particular elaboration of structured theories and definition expansion, which we mentioned in Sect.~\ref{sec:prelim}.
\item Meta-theory-level normalization applies an $M$-specific normalization function, which we assume as a black box for now and discuss further in Sect.~\ref{sec:}.
\end{compactenum}

\paragraph{Abstract Syntax Trees}
We define \textbf{abstract syntax trees} as pairs $(t,s)$ where $t$ is subject to the grammar $t ::= Con(Nat) \mid Var(Nat) \mid \ombind{t}{t^+}{t^+}$ (where $Nat$ is a non-terminal for natural numbers) and $s$ is a list of constant names. We obtain an abstract syntax tree from an MMT expression $E$ by (i) switching to a de Bruijn representation of bound variables and (ii) replacing all occurrences of constants with $Con(i)$ in such a way that every $Con(i)$ refers to the $i$-th element of $s$.

Abstract syntax trees have the nice property that they commute with the application of simple morphisms $\sigma$: if $(t,s)$ represents $E$, then $\sigma(E)$ is represented by $(t,s')$, where $s'$ arises from $s$ by replacing every constant with its $\sigma$-assignment.

The above does not completely specify $i$ and $s$ yet, and there are several possible canonical choices among the abstract syntax trees representing the same expression. The trade-off is subtle because we want to make it easy to both identify and check theory morphisms later on. We call $(t,s)$ the \textbf{long} abstract syntax tree for $E$ if $Con(i)$ replaces the $i$-th occurrence of a constant in $E$ when $E$ is read in left-to-right order. In particular, the long tree does not merge duplicate occurrences of the same constant into the same number.
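The long abstract syntax tree and its commutation with simple morphisms can be sketched as follows. This is a minimal illustration in Python, not MMT's actual implementation; all class and function names, and the example symbol \texttt{plus}, are ours:

```python
# Sketch of long abstract syntax trees (t, s): the tree t numbers constant
# *occurrences* left to right, and s lists the corresponding constant names.
from dataclasses import dataclass

@dataclass(frozen=True)
class Con:
    index: int   # i-th constant occurrence, refers to s[index]

@dataclass(frozen=True)
class Var:
    index: int   # de Bruijn index of a bound variable

@dataclass(frozen=True)
class App:
    head: object
    args: tuple

def apply_morphism(tree, syms, sigma):
    # Simple morphisms commute with abstract syntax trees: the tree t is
    # unchanged, and only the symbol list s is translated element-wise.
    return tree, [sigma.get(c, c) for c in syms]

# Long tree for "plus(x, plus(x, y))": the two occurrences of plus get
# distinct numbers Con(0) and Con(1), so "plus" appears twice in s.
tree = App(Con(0), (Var(1), App(Con(1), (Var(1), Var(0)))))
syms = ["plus", "plus"]

tree2, syms2 = apply_morphism(tree, syms, {"plus": "union"})
print(tree2 == tree, syms2)   # True ['union', 'union']
```

Note that even a morphism that collapses two constants into one leaves a long tree long, which is exactly the property exploited below.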
The \textbf{short} abstract syntax tree for $E$ arises from the long one by removing all duplicates from $s$ and replacing the $Con(i)$ accordingly.
\ednote{continue a running example somewhere here}

In our algorithm, we pick the \emph{long} abstract syntax tree, which may appear surprising. The reason is that shortness is not preserved when applying a simple morphism: whenever a morphism maps two different constants to the same constant, the resulting tree is no longer short. Longness, on the other hand, is preserved. The disadvantage that long trees take longer to traverse is outweighed by the advantage that we never have to renormalize the trees.

\begin{oldpart}{FR: replaced with the above}
\subsection{The General Algorithm}
The aim is to find type-preserving morphisms between theories, i.e., given a constant $C:t$ in a theory $T_1$, we want to find a view $V:T_1\to T_2$ such that if $V(C)=C'$ and $C':t'\in T_2$, we have $V(t)=t'$. Correspondingly, we need to consider the types of the constants in our two theories, which we assume to be flat.

Running a proper unification algorithm is infeasible in our situation, since the flat version of a theory can become prohibitively large (and finding two unifiable types in two theories is a search problem quadratic in the number of constants). To solve this problem, we first preprocess our theories so that pre-selecting ``plausibly unifiable'' constants becomes as fast as possible.

We do so by first transforming each constant $C$ in a theory into an \emph{encoding} $(\cn{h}(C),\cn{syms}(C))$ in the following way:
Consider the syntax tree of the type $t$ of a constant $C$.
We first systematically replace the leaves by an abstract representation, yielding a data structure $\cn{tree}(C)$. We eliminate variables by replacing them with their de Bruijn indices, and symbol references by enumerating them and storing the symbols' names in a list $\cn{syms}(C)$.

As a result, we get a pair $(\cn{tree}(C),\cn{syms}(C))$. Now an assignment $V(C)=D$ is valid iff $\cn{tree}(C)=\cn{tree}(D)$, the lists $\cn{syms}(C)$ and $\cn{syms}(D)$ have the same length, and the pairwise assignments $V(\cn{syms}(C)_i)=\cn{syms}(D)_i$ are all valid. Furthermore, since we only need $\cn{tree}(C)$ for an equality check, we can immediately replace $\cn{tree}(C)$ by an integer hashcode $\cn{h}(C)$.

Now given a constant $C\in T_1$, we can find valid matches in $T_2$ by computing the encodings for each constant in $T_1$ and $T_2$, taking the list of constants $C'\in T_2$ with $\cn{h}(C')=\cn{h}(C)$, and recursing into their respective lists of symbols.
\end{oldpart}

\subsection{Search}
Consider two constants $c:E$ and $c':E'$, preprocessed into long abstract syntax trees $(t,s)$ and $(t',s')$. It is now straightforward to show the following \textbf{lemma}: the assignment $c\mapsto c'$ is well-typed in a morphism $\sigma$ if $t=t'$ (in which case $s$ and $s'$ must have the same length $l$) and $\sigma$ also contains $s_i\mapsto s'_i$ for $i=1,\ldots,l$. (Of course, the conditions $s_i\mapsto s'_i$ may be redundant if $s$ contains duplicates; but because $s$ has to be traversed anyway, it is cheap to skip all duplicates.) We call the set of assignments $s_i\mapsto s'_i$ the \textbf{prerequisites} of $c\mapsto c'$.
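The lemma-driven search step can be sketched as follows. This is our own simplification in Python: constant declarations are given directly as $(t,s)$ pairs, trees are represented by opaque values that support equality, and the constant names (\texttt{comm\_axiom}, \texttt{plus}, etc.) are made up for illustration:

```python
# Sketch of the core search step: an assignment c -> c' is well-typed if the
# long trees agree, and it induces the prerequisites s_i -> s'_i, which are
# added recursively until the morphism closes or a contradiction appears.

def grow_morphism(c, c_prime, decls1, decls2, sigma=None):
    """Try to extend sigma with c -> c' and all of its prerequisites.
    decls1/decls2 map constant names to (tree, syms) pairs.
    Returns the extended morphism, or None on a contradiction."""
    sigma = dict(sigma or {})
    todo = [(c, c_prime)]
    while todo:
        a, b = todo.pop()
        if a in sigma:
            if sigma[a] != b:     # two different assignments: contradiction
                return None
            continue              # duplicate prerequisite: skip cheaply
        t1, s1 = decls1[a]
        t2, s2 = decls2[b]
        if t1 != t2:              # trees (and hence types) do not match
            return None
        sigma[a] = b
        todo.extend(zip(s1, s2))  # add the prerequisites s_i -> s'_i
    return sigma

decls1 = {"comm_axiom": ("ax", ["plus"]), "plus": ("binop", [])}
decls2 = {"union_comm": ("ax", ["union"]), "union": ("binop", [])}
print(grow_morphism("comm_axiom", "union_comm", decls1, decls2))
# -> {'comm_axiom': 'union_comm', 'plus': 'union'}
```

A `None` result corresponds to the backtracking case below: the attempted starting pair is abandoned and the next candidate pair is tried.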
\ednote{maybe make this a formal lemma}

This lemma is the center of our search algorithm: we can pick any two constant declarations $c$ and $c'$ and build a morphism by starting with $\sigma=c\mapsto c'$ and recursively adding all prerequisites to $\sigma$ until
\begin{compactitem}
\item either the process terminates, in which case we have found a morphism,
\item or $\sigma$ contains two different assignments for the same constant (a contradiction), in which case we backtrack to the point where we picked $c$ and $c'$.
\end{compactitem}
\ednote{continue description of algorithm: how do we pick $c\mapsto c'$, and so on}

\subsection{Improvements and Parameters of the Algorithm}

\paragraph{Extending the abstract syntax tree}
By the very nature of the approach described in Section~\ref{sec:oaf}, many symbols will be common to the domain and codomain of a given viewfinding problem: since all of the libraries available are ultimately based on (extensions of) LF, we never want to reassign the symbols provided by the LF theory. Furthermore, if we want to find views between theories that share a common meta-theory (e.g. theories within the same library), we most certainly do not want to reassign the symbols in that meta-theory. In these situations, we can consider the \emph{common} symbols as a fixed part of the abstract syntax tree of a constant. These symbols will hence be encoded in the component $\cn{h}(C)$ instead of the list $\cn{syms}(C)$. This will not only exclude spurious matches, but also reduce the number of plausible matches and consequently speed up the algorithm.

\paragraph{Picking starting pairs}
Note that the algorithm, if executed in its general form, will still find many spurious matches. The reason is that (for example) all atomic types match each other, as do all binary operators, etc. Most of these results will not be interesting.
Furthermore, since the algorithm needs to recurse into the lists $\cn{syms}(C)$, many potential matches will need to be checked repeatedly. Both problems can be massively reduced by selecting specific pairs of encodings as \emph{starting pairs} for the algorithm, so that the majority of matching constants will only be considered if the algorithm runs into them while recursing. Potentially useful starting points are:
\begin{itemize}
\item \emph{Axioms:} Since we are mostly interested in matching constants that share the same mathematical properties, using axioms as starting points ensures that the algorithm only matches constants that have at least one (axiomatic) property in common (e.g. only commutative, or associative, operators).
\item \emph{The length of $\cn{syms}(C)$:} This way we can guarantee that only matches are produced that relate at least two (or more) constants to each other -- if e.g. we are interested in structures that have a distributive property, this makes sure that we only find those views where both operators are matched as intended.
\end{itemize}

\paragraph{Picking starting theories}
If we try to find views between whole collections of theories, we can obviously disregard all theories that are included in some other theory in our collections, since we need to work with the flat theories anyway. Consequently, by only using maximal theories we do not find any fewer views but speed up the search significantly.

\paragraph{Preprocessing and Translations}
Especially when looking for views between theories in different libraries (and built on different foundations), various differences in available foundational primitives as well as library-specific best practices and idiosyncrasies can prevent the algorithm from finding desired matches.
There are two approaches to increasing the number of results in these cases:
\begin{itemize}
\item We can preprocess theories, e.g. by normalizing expressions, eliminating higher-order abstract syntax encodings or encoding-related redundant information (such as the type of a typed equality, which in the presence of subtyping can differ from the types of both sides of the equation), or elaborating abbreviations/definitions.
\item In many instances, the translation between two foundations is too complex to be discovered purely syntactically. In these cases we can provide arbitrary translations between theories, which are applied before computing the encodings.\ednote{Mention/cite alignment-translation paper}
\end{itemize}
When elaborating definitions, it is important to consider that this may also reduce the number of results: for example, if both theories use similar abbreviations for complex terms, or if the same concept is declared axiomatically in one theory but definitionally in the other. For that reason, we can allow several encodings for the same constant during preprocessing, such as one with definitions expanded and one ``as is''.

Similarly, certain idiosyncrasies -- such as PVS's common use of theory parameters -- call for matching not just symbol references, but also variables or possibly even complex expressions. To handle these situations, we additionally allow for ``holes'' in the lists $\cn{syms}(C)$, which may be unified with any other symbol or hole, but are not recursed into. The subterms that are to be considered holes can be marked as such during preprocessing.
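The effect of holes on prerequisite computation can be sketched as follows; this is our own formulation in Python, and the symbol names are invented for illustration:

```python
# Sketch of matching symbol lists that may contain "holes": a hole unifies
# with any other symbol or hole, and the search does not recurse into it.

HOLE = object()   # marker attached during preprocessing, e.g. to free
                  # variables standing for PVS-style theory parameters

def prerequisites(syms1, syms2):
    """Pairwise prerequisites of two symbol lists, skipping holes.
    Returns None if the lists cannot be matched at all."""
    if len(syms1) != len(syms2):
        return None
    return [(a, b) for a, b in zip(syms1, syms2)
            if a is not HOLE and b is not HOLE]

# A theory parameter became a hole during preprocessing, so only the
# remaining symbol pair generates a prerequisite:
print(prerequisites(["plus", HOLE], ["union", "carrier"]))
# -> [('plus', 'union')]
```

Skipping the hole pairs is what allows a view to match constants whose types mention theory parameters without committing to any particular instantiation.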
\paragraph{Storing Encodings}
Finally, for computing the encodings of a theory we only need to know the symbols to be fixed in the component $\cn{h}(C)$ of an encoding, for which only the meta-theories of the theories are relevant. They also determine the specific preprocessings and translations we will likely want to use. Since only a small number of meta-theories is relevant in practice, we can store and retrieve the encodings for the most important situations. Since computing the encodings (as well as sorting the theories in a library by their dependencies) is the most expensive part of the algorithm, this -- once computed and stored -- makes the viewfinding process itself rather efficient.

\subsection{Preprocessing and Normalization}
The common logical framework used for all the libraries at our disposal makes it easy to systematically normalize theories built on various logical foundations. We currently use the following approaches to preprocessing theories:
\begin{itemize}
\item Free variables in a term, often occurrences of theory parameters as used extensively in the PVS system, are replaced by holes.
\item For foundations that use product types, we curry function types $(A_1 \times\cdots\times A_n)\to B$ to $A_1 \to \ldots \to A_n\to B$. We treat lambda-expressions and applications accordingly.
\item Higher-order abstract syntax encodings are eliminated by raising atomic types, function types, applications, and lambdas to the level of the logical framework. This eliminates (redundant) implicit arguments that only occur due to the formalization in the logical framework, with the advantage that possible differences between the types of the relevant subterms and implicit type arguments (e.g. in the presence of subtyping) do not negatively affect viewfinding.
\item We use the Curry-Howard correspondence to transform axioms and theorems of the form $\vdash (P\Rightarrow Q)$ to function types $\vdash P \to \vdash Q$. Analogously, we transform judgments of the form $\vdash \forall x : A.\;P$ to $\prod_{x:A}\vdash P$.
\item For classical logics, we afterwards rewrite all logical connectives via their usual definitions in terms of negation and conjunction only. Double negations are eliminated.
\item Typed equalities are transformed to untyped ones, again getting rid of the redundant type argument of the equality.
\item The arguments of conjunctions and equalities are reordered (currently only by their number of subterms).
\end{itemize}

%%% Local Variables:
%%% mode: latex
%%% eval: (visual-line-mode) (set-fill-column 5000)
%%% TeX-master: "paper"
%%% End: