diff --git a/doc/report/applications.tex b/doc/report/applications.tex
index 22a680444e5d14f55cb070412760e6cbd9a92db7..f8f34d78504247073def26f4601814bd24493ac4 100644
--- a/doc/report/applications.tex
+++ b/doc/report/applications.tex
@@ -18,7 +18,8 @@ endpoint.
 structured (Section~\ref{sec:tetraq}).
 \end{itemize}

-\noindent Both applications will now be discussed in a dedicated section.
+\noindent These applications will now be discussed in dedicated
+sections.

 \subsection{Exploring Existing Data Sets}\label{sec:expl}

@@ -80,156 +81,164 @@ can be realized with ULO data sets and where other data
 sources are required. Where possible, we evaluate proof of concept
 implementations.

-\begin{itemize}
- \item \textbf{$\mathcal{Q}_1$ ``Find theorems with non-elementary
- proofs.''} Elementary proofs are those that are considered easy
- and obvious~\cite{elempro}. In consequence,~$\mathcal{Q}_1$ has
- to search for all proofs which are not trivial. Of course, just
- like nay distinction between ``theorem'' and ``corollary'' is
- going to be somewhat arbitrary, so is any judgment about whether
- a proof is easy or not.
-
- Existing research on proof difficulty is either very broad or
- specific to one problem. For example, some experiments showed that
- students and prospective school teachers have problems with
- notation, term rewriting and required
- prerequisites~\cite{proofund, proofteach}, none of which seems
- applicable for grading individual proofs for difficulty. On the
- other hand, there is research on rating proofs for individual
- subsets of problems, e.g.\ on the satisfiability of a given CNF
- formula. A particular example is focused on heuristics for how
- long a SAT solver will take to find a solution for a
- given~{CNF}. Here, solutions that take long are considered
- harder~\cite{proofsat}.
-
- \textbf{Organizational Aspect} A first working hypothesis might be
- to assume that elementary proofs are short. In that case, the
- size, that is the number of bytes to store the proof, is our first
- indicator of proof complexity. This is by no means perfect, as
- even identical proofs can be represented in different ways that
- might have vastly different size in bytes. It might be tempting to
- imagine a unique normal form for each proof, but finding such a
- normal form might very well be impossible. As it is very difficult
- to find a generalized definition of proof difficulty, we will
- accept proof size as a first working hypothesis.
-
- {ULO} offers the \texttt{ulo:external-size} predicate which will
- allow us to sort by file size. Maybe proof complexity also leads
- to quick check times in proof assistants and automatic theorem
- provers. With this assumption in mind we could use the
- \texttt{ulo:check-time} predicate. Correlating proof complexity
- with file size allows us to define one indicator of proof
- complexity based on organizational knowledge alone.
-
- \textbf{Other Aspects} A tetrapodal search system should probably
- also take symbolic knowledge into account. Based on some kind of
- measure of formula complexity, different proofs could be
- rated. Similarly, with narrative knowledge available to us, we
- could count the number of words, references and so on to rate the
- narrative complexity of a proof. Combining symbolic knowledge,
- narrative knowledge and organizational knowledge allows us to
- find proofs which are probably straight forward.
-
- \input{applications-q1.tex}
-
- \textbf{Implementation} Implementing a naive version of the
- organizational aspect can be as simple as querying for all
- theorems justified by proofs, ordered by size (or check time).
- Figure~\ref{fig:q1short} illustrates how this can be achieved with
- a SPARQL query. Maybe we want to go one step further and
- calculate a rating that assigns each proof some numeric score of
- complexity based on a number of properties. We can achieve this in
- SPARQL as recent versions support arithmetic as part of the SPARQL
- specification; Figure~\ref{fig:q1long} shows an example. Finding
- a reasonable rating is its own topic of research, but we see that
- as long as it is based on standard arithmetic, it will be possible
- to formulate in a SPARQL query.
-
- The queries in Figure~\ref{fig:q1full} return a list of all
- theorems and associated proofs, naturally this list is bound to be
- very long. A suggested way to solve this problem is to introduce
- some kind of cutoff value for our complexity score. Another potential
- solution is to only list the first~$n$ results, something a user
- interface would do anyways. Either way, this is not so much an issue
- for the organizational storage engine and more one that a tetrapodal
- search aggregator has to account for.
-
- \item \textbf{$\mathcal{Q}_2$ ``Find algorithms that solve
- $NP$-complete graph problems.''} Here we want the tetrapodal search
- system to return a listing of algorithms that solve (graph)
- problems with a given property (runtime complexity). We need
- to consider where each of these three components might be stored.
-
- \textbf{Symbolic and Concrete Aspects} First, let us consider
- algorithms. Algorithms can be formulated as computer code which
- can be understood as symbolic knowledge (code represented as a
- syntax tree) or as concrete knowledge (code as text
- files)~\cites[pp. 8--9]{tetra}{virt}. Either way, we will not be
- able to query these indices for what problem a given algorithm is
- solving, nor is it possible to infer properties as complex as
- $NP$-completeness automatically. Meta data of this kind needs to be
- stored in a separate index for organizational knowledge, it being
- the only fit.
-
- \textbf{Organizational Aspect} If we wish to look up properties
- about algorithms from organizational knowledge, we first have to
- think about how to represent this information. We propose two
- approaches, one using the existing ULO ontology and one that
- recommends extending ULO to nativly support the concept of
- algorithms. Both approaches have their distinct trade-offs.
-
- As a first approach, we can try to represent algorithms and
- problems in terms of existing ULO predicates. As ULO does have a
- concept of \texttt{theorem} and \texttt{proof}, it might be
- tempting to exploit the Curry-Howard correspondence and represent
- algorithms understood as programs in terms of proofs. But that
- does not really capture the canonical understanding of algorithms;
- algorithms are not actually programs, rather there are programs
- that implement algorithms. Even if we do work around this problem,
- it is not clear how to represent problems (e.g.\ the traveling
- salesman problem or the sorting problem) in terms of theorems
- (propositions, types) that get implemented by a proof (algorithm,
- program).
-
- As algorithms make up an important part of certain areas of
- research, it might be reasonable to introduce native level support
- for algorithms in ULO or separately in another ontology. An
- argument for adding support directly to ULO is that ULO aims to be
- universal and as such should not be without algorithms. An
- argument for a separate ontology is that what we understand as ULO
- data sets (Isabelle, Coq exports) already contain triplets from
- other ontologies (e.g.\ Dublin Core meta data~\cite{dcreport,
- dcowl}) and keeping concepts separate is not entirely
- unattractive in itself.
-
- \item \textbf{$\mathcal{Q}_3$ ``All areas of math that {Nicolas G.\
- de Bruijn} has worked in and his main contributions.''} This query
- is asking by works of a given author~$A$. It also asks for their
- main contributions, e.g.\ what paragraphs or code~$A$ has authored.
-
- \textbf{Organizational Aspect} ULO has no concept of authors,
- contributors, dates and so on. Rather, the idea is to take
- advantage of the Dublin Core project which provides an ontology
- for such metadata~\cite{dcreport, dcowl}. For example, Dublin Core
- provides us with the \texttt{dcterms:creator} and
- \texttt{dcterms:contributor} predicates. Servicing~$\mathcal{Q}_3$
- requires us to look for creator~$A$ and then list all associated
- \texttt{dcterms:title}s that they have worked on. Of course this
- requires above authorship predicates to actually be in use. With
- the Isabelle and Coq exports this was hardly the case; running
- some experiments we found less than 15 unique contributors and
- creators, raising suspicion that meta data is missing in the
- original library files. Regardless, in theory ULO allows us to
- query for objects ordered by authors.
-
- Query $\mathcal{Q}_3$ is also asking for the main contributions
- of~$A$, that is those works that~$A$ authored that are the most
- important. Importance is a quality measure, simply sorting the
- result by number of references might be a good start.
-
- \textbf{Implementation} A search for contributions by a given author
- can easily be formulated in {SPARQL}.
- \begin{lstlisting}
+\subsubsection{Elementary Proofs and Computed Scores}
+
+Our first query~$\mathcal{Q}_1$ illustrates how we can compute
+arithmetic scores for some nodes in our knowledge
+graph. Query~$\mathcal{Q}_1$ asks us to ``[f]ind theorems with
+non-elementary proofs.'' Elementary proofs are those that are
+considered easy and obvious~\cite{elempro}. In
+consequence,~$\mathcal{Q}_1$ has to search for all proofs which are
+not trivial. Of course, just like any distinction between ``theorem''
+and ``corollary'' is going to be somewhat arbitrary, so is any
+judgment about whether a proof is easy or not.
+
+Existing research on proof difficulty is either very broad or specific
+to one problem. For example, some experiments showed that students and
+prospective school teachers have problems with notation, term
+rewriting and required prerequisites~\cite{proofund, proofteach}, none
+of which seems applicable for grading individual proofs for
+difficulty. On the other hand, there is research on rating proofs for
+individual subsets of problems, e.g.\ on the satisfiability of a given
+CNF formula. A particular example is focused on heuristics for how
+long a SAT solver will take to find a solution for a
+given~{CNF}. Here, solutions that take long are considered
+harder~\cite{proofsat}.
+
+\noindent\textbf{Organizational Aspect} A first working hypothesis
+might be to assume that elementary proofs are short. In that case, the
+size, that is, the number of bytes needed to store the proof, is our first
+indicator of proof complexity. This is by no means perfect, as even
+identical proofs can be represented in different ways that might have
+vastly different size in bytes. It might be tempting to imagine a
+unique normal form for each proof, but finding such a normal form
+might very well be impossible. As it is very difficult to find a
+generalized definition of proof difficulty, we will accept proof size
+as a first working hypothesis.
+
+{ULO} offers the \texttt{ulo:external-size} predicate which will allow
+us to sort by file size. Perhaps low proof complexity also leads to quick
+check times in proof assistants and automatic theorem provers. With
+this assumption in mind we could use the \texttt{ulo:check-time}
+predicate. Correlating proof complexity with file size allows us to
+define one indicator of proof complexity based on organizational
+knowledge alone.
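+
+A minimal sketch of such a lookup could look as follows. It assumes
+that proofs are linked to the statements they prove by a predicate
+such as \texttt{ulo:justifies}; the exact predicate names are an
+assumption on our part and only serve to make the idea concrete.
+\begin{lstlisting}
+PREFIX ulo: <https://mathhub.info/ulo#>
+
+# Sketch: proofs ordered by their size on disk, largest first.
+SELECT ?theorem ?proof ?size WHERE {
+    ?proof ulo:justifies ?theorem .
+    ?proof ulo:external-size ?size .
+}
+ORDER BY DESC(?size)
+\end{lstlisting}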
+
+\noindent\textbf{Other Aspects} A tetrapodal search system should
+probably also take symbolic knowledge into account. Based on some kind
+of measure of formula complexity, different proofs could be
+rated. Similarly, with narrative knowledge available to us, we could
+count the number of words, references and so on to rate the narrative
+complexity of a proof. Combining symbolic knowledge, narrative
+knowledge and organizational knowledge allows us to find proofs which
+are probably straightforward.
+
+\input{applications-q1.tex}
+
+\noindent\textbf{Implementation} Implementing a naive version of the
+organizational aspect can be as simple as querying for all theorems
+justified by proofs, ordered by size (or check time).
+Figure~\ref{fig:q1short} illustrates how this can be achieved with a
+SPARQL query. Maybe we want to go one step further and calculate a
+rating that assigns each proof some numeric score of complexity based
+on a number of properties. We can achieve this in SPARQL as recent
+versions support arithmetic as part of the SPARQL specification;
+Figure~\ref{fig:q1long} shows an example. Finding a reasonable rating
+is its own topic of research, but we see that as long as it is based
+on standard arithmetic, it will be possible to formulate as a SPARQL
+query.
+
+The queries in Figure~\ref{fig:q1full} return a list of all theorems
+and associated proofs; naturally, this list is bound to be very long. A
+suggested way to solve this problem is to introduce some kind of
+cutoff value for our complexity score. Another potential solution is
+to only list the first~$n$ results, something a user interface would
+do anyway. Either way, this is not so much an issue for the
+organizational storage engine and more one that a tetrapodal search
+aggregator has to account for.
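+
+To illustrate how such a rating and a result limit might be combined,
+the following minimal sketch computes a weighted sum of proof size and
+check time and only returns the highest-scoring results. The
+\texttt{ulo:justifies} link, the weights and the result limit are
+placeholder assumptions; in practice the units of size and check time
+would first have to be normalized.
+\begin{lstlisting}
+PREFIX ulo: <https://mathhub.info/ulo#>
+
+# Sketch: rank proofs by an (arbitrarily weighted) complexity score.
+SELECT ?theorem ?proof ?score WHERE {
+    ?proof ulo:justifies ?theorem .
+    ?proof ulo:external-size ?size .
+    ?proof ulo:check-time ?time .
+    BIND(0.5 * ?size + 0.5 * ?time AS ?score)
+}
+ORDER BY DESC(?score)
+LIMIT 10
+\end{lstlisting}
+A hard cutoff on the score could be expressed with an additional
+\texttt{FILTER} clause instead of, or in addition to, the result
+limit.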
+
+\subsubsection{Categorizing Algorithms and Algorithmic Problems}
+
+The second query~$\mathcal{Q}_2$ we decided to focus on wants to
+``[f]ind algorithms that solve $NP$-complete graph problems.'' Here we
+want the tetrapodal search system to return a listing of algorithms
+that solve (graph) problems with a given property (runtime
+complexity). We need to consider where each of these three components
+(algorithms, problems and their properties) might be stored.
+
+\noindent\textbf{Symbolic and Concrete Aspects} First, let us consider
+algorithms. Algorithms can be formulated as computer code which
+can be understood as symbolic knowledge (code represented as a
+syntax tree) or as concrete knowledge (code as text
+files)~\cites[pp. 8--9]{tetra}{virt}. Either way, we will not be
+able to query these indices for what problem a given algorithm is
+solving, nor is it possible to infer properties as complex as
+$NP$-completeness automatically. Metadata of this kind needs to be
+stored in a separate index for organizational knowledge, as it is
+the only fitting aspect.
+
+\noindent\textbf{Organizational Aspect} If we wish to look up properties
+about algorithms from organizational knowledge, we first have to
+think about how to represent this information. We propose two
+approaches, one using the existing ULO ontology and one that
+recommends extending ULO to natively support the concept of
+algorithms. Both approaches have their distinct trade-offs.
+
+As a first approach, we can try to represent algorithms and
+problems in terms of existing ULO predicates. As ULO does have a
+concept of \texttt{theorem} and \texttt{proof}, it might be
+tempting to exploit the Curry-Howard correspondence and represent
+algorithms understood as programs in terms of proofs. But that
+does not really capture the canonical understanding of algorithms:
+algorithms are not actually programs; rather, there are programs
+that implement algorithms. Even if we do work around this problem,
+it is not clear how to represent problems (e.g.\ the traveling
+salesman problem or the sorting problem) in terms of theorems
+(propositions, types) that get implemented by a proof (algorithm,
+program).
+
+As algorithms make up an important part of certain areas of
+research, it might be reasonable to introduce native support
+for algorithms in ULO or separately in another ontology. An
+argument for adding support directly to ULO is that ULO aims to be
+universal and as such should not be without algorithms. An
+argument for a separate ontology is that what we understand as ULO
+data sets (Isabelle, Coq exports) already contain triples from
+other ontologies (e.g.\ Dublin Core metadata~\cite{dcreport,
+dcowl}) and keeping concepts separate is not entirely
+unattractive in itself.
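+
+As an illustration of the second approach, the following sketch shows
+how $\mathcal{Q}_2$ could be answered once such an ontology exists.
+The prefix \texttt{algo:} and the predicates \texttt{algo:solves},
+\texttt{algo:graph-problem} and \texttt{algo:complexity-class} are
+entirely hypothetical; they are not part of ULO or of any existing
+export.
+\begin{lstlisting}
+PREFIX algo: <https://example.org/algorithms#>
+
+# Sketch: algorithms that solve NP-complete graph problems.
+SELECT ?algorithm ?problem WHERE {
+    ?algorithm algo:solves ?problem .
+    ?problem a algo:graph-problem .
+    ?problem algo:complexity-class algo:np-complete .
+}
+\end{lstlisting}
+Whether such predicates should live in ULO itself or in a separate
+namespace is exactly the trade-off discussed above.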
+
+\subsubsection{Contributors and Number of References}
+
+The final query~$\mathcal{Q}_3$ from the literature~\cite{tetra} we wish
+to look at wants to know ``[a]ll areas of math that {Nicolas G.\ de
+Bruijn} has worked in and his main contributions.'' This query is
+asking for works of a given author~$A$. It also asks for their main
+contributions, e.g.\ what paragraphs or code~$A$ has authored.
+
+\noindent\textbf{Organizational Aspect} ULO has no concept of authors,
+contributors, dates and so on. Rather, the idea is to take
+advantage of the Dublin Core project which provides an ontology
+for such metadata~\cite{dcreport, dcowl}. For example, Dublin Core
+provides us with the \texttt{dcterms:creator} and
+\texttt{dcterms:contributor} predicates. Servicing~$\mathcal{Q}_3$
+requires us to look for creator~$A$ and then list all associated
+\texttt{dcterms:title}s that they have worked on. Of course this
+requires the above authorship predicates to actually be in use. With
+the Isabelle and Coq exports this was hardly the case; running
+some experiments we found fewer than 15 unique contributors and
+creators, raising suspicion that metadata is missing in the
+original library files. Regardless, in theory ULO allows us to
+query for objects ordered by authors.
+
+Query $\mathcal{Q}_3$ is also asking for the main contributions
+of~$A$, that is, those works that~$A$ authored that are the most
+important. Importance is a quality measure; simply sorting the
+result by number of references might be a good start.
+
+\noindent\textbf{Implementation} A search for contributions by a given author
+can easily be formulated in {SPARQL}.
+\begin{lstlisting}
 PREFIX ulo: <https://mathhub.info/ulo#>
 PREFIX dcterms: <http://purl.org/dc/terms/>
@@ -238,13 +247,13 @@
 ?work dcterms:creator|dcterms:contributor "John Smith" .
 }
 GROUP BY ?work
- \end{lstlisting}
+\end{lstlisting}
 To get the main contributions, we rate each individual
 \texttt{?work} by its number of \texttt{ulo:uses} references. Extending
 the {SPARQL} query above, we can query the database for a ordered
 list of works, starting with the one that has the most references.
- \begin{lstlisting}
+\begin{lstlisting}
 PREFIX ulo: <https://mathhub.info/ulo#>
 PREFIX dcterms: <http://purl.org/dc/terms/>
@@ -255,11 +264,10 @@
 }
 GROUP BY ?work ORDER BY DESC(?refcount)
- \end{lstlisting}
- We can formulate~$\mathcal{Q}_3$ with just one SPARQL
- query. Because everything is handled by the database, access
- should be about as quick as we can hope it to be.
-\end{itemize}
+\end{lstlisting}
+We can formulate~$\mathcal{Q}_3$ with just one SPARQL
+query. Because everything is handled by the database, access
+should be about as quick as we can hope it to be.

 Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us
 with some insight into ULO and existing ULO exports. $\mathcal{Q}_1$
 shows