  • \section{Applications}\label{sec:applications}
    
    With programming endpoints in place, we can now query the data set containing
    both Isabelle and Coq exports stored in {GraphDB}. We experimented with
    various queries and applications:
    
    \begin{itemize}
    
    \item Exploring which ULO predicates are actually used in the
      existing Coq and Isabelle exports. We find that more than two
      thirds of the available ULO predicates are not used at all
      (Section~\ref{sec:expl}).
    
    \item We investigated queries that could be used to extend the
      system into a larger tetrapodal search system. While some
      organizational queries have obvious canonical solutions, others
      raise questions about how organizational knowledge should be
      structured (Section~\ref{sec:tetraq}).
    
    
    \item We also experimented with various other, more general
      queries for organizational data recommended in the literature
      (Section~\ref{sec:miscq}).
    
    \item Finally, we built a small web front end that visualizes
      the ULO data set (Section~\ref{sec:webq}).
    \end{itemize}
    
    
    \noindent Each application will now be discussed in a dedicated section.
    
    
    \subsection{Exploring Existing Data Sets}\label{sec:expl}
    
    
For our first application, we looked at which ULO predicates are
actually used by the respective data sets. With more than 250~million
triplets in the store, we hoped that this would give us some insight
into the kind of knowledge we are dealing with.
    
    
Implementing a query for this job is not very difficult. In SPARQL,
this can be achieved with the \texttt{COUNT} aggregate; the full
query is given verbatim in Figure~\ref{fig:preds-query}. It yields a
list of all used predicates with \texttt{?count} being the number of
occurrences (Figure~\ref{fig:preds-result}). Looking at the results,
we find that both the Isabelle and the Coq data sets use only subsets
of the predicates provided by the ULO ontology. The full results are
listed in Appendix~\ref{sec:used}. What stands out is that both
exports use less than a third of the available predicates.
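
In essence, such a query groups all triples by predicate and counts
the size of each group; a minimal sketch might look as follows.
\begin{lstlisting}
SELECT ?predicate (COUNT(?predicate) AS ?count)
WHERE {
    ?subject ?predicate ?object .
}
GROUP BY ?predicate
ORDER BY DESC(?count)
\end{lstlisting}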
    
    
We also see that the Isabelle and Coq exports use different
predicates. For example, the Isabelle export contains organizational
meta information such as information about paragraphs and sections in
the source document, while the Coq export only tells us about the
filename of the Coq source. That is not particularly problematic as
long as we can trace a given object back to the original source.
Regardless, our results do show that both exports have their own
particularities, and with more and more third party libraries
exported to ULO, one has to assume that this heterogeneity will only
grow. In particular, we want to point to the large number of
predicates which remain unused in both the Isabelle and Coq exports.
A user formulating queries for ULO might be oblivious to the fact
that only subsets of exports support given predicates.
    
While not a problem for \emph{ulo-storage} per se, we do expect this
to be a challenge when building a tetrapodal search system.
Recommended ways around this ``missing fields'' problem in database
literature include the clever use of default values or inference of
missing values~\cite{kdisc, aidisc}, neither of which feels
particularly applicable to a ULO data set.
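
SPARQL does support a limited version of the default value approach
through \texttt{OPTIONAL} and \texttt{COALESCE}. The following sketch
assumes a \texttt{ulo:theorem} class and substitutes a size of zero
where \texttt{ulo:external-size} is missing; whether such a default
is meaningful remains questionable.
\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>

SELECT ?object (COALESCE(?size, 0) AS ?sizeWithDefault)
WHERE {
    ?object a ulo:theorem .
    OPTIONAL { ?object ulo:external-size ?size . }
}
\end{lstlisting}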
    
    
    \input{applications-preds.tex}
    
    
    \subsection{Querying for Tetrapodal Search}\label{sec:tetraq}
    
Various queries for a tetrapodal search system have previously been
suggested in the literature~\cite{tetra}. We investigated how some of
them can be realized with ULO data sets and where other data sources
are required. Where possible, we evaluated proof of concept
implementations.
    
    
    \begin{itemize}
    
    \item \textbf{$\mathcal{Q}_1$ ``Find theorems with non-elementary
    proofs.''} Elementary proofs are those that are considered easy
    and obvious~\cite{elempro}. In consequence,~$\mathcal{Q}_1$ has
    to search for all proofs which are not trivial. Of course, just
    as any distinction between ``theorem'' and ``corollary'' is going
    to be somewhat arbitrary, so is any judgment about whether a
    proof is easy or not.
    
        Existing research on proof difficulty is either very broad or
        specific to one problem. For example, some experiments showed that
        students and prospective school teachers have problems with
        notation, term rewriting and required
        prerequisites~\cite{proofund, proofteach}, none of which seems
        applicable for grading individual proofs for difficulty. On the
        other hand, there is research on rating proofs for individual
        subsets of problems, e.g.\ on the satisfiability of a given CNF
    formula. One particular example focuses on heuristics for
    estimating how long a SAT solver will take to find a solution for
    a given~{CNF}~\cite{proofsat}.
    
    \textbf{Organizational Aspect} A first working hypothesis might
    be to assume that elementary proofs are short. In that case, the
    size, that is, the number of bytes needed to store the proof, is
    our first indicator of proof complexity. This is by no means
    perfect, as even identical proofs can be represented in different
    ways that might have vastly different sizes in bytes. It might be
    tempting to imagine a unique normal form for each proof, but
    finding such a normal form might very well be impossible. As it
    is very difficult to find a generalized definition of proof
    difficulty, we accept proof size as a first working hypothesis.
    
    {ULO} offers the \texttt{ulo:external-size} predicate which
    allows us to sort by file size. Small file size might also
    correlate with quick check times in proof assistants and
    automatic theorem provers; with this assumption in mind, we could
    additionally use the \texttt{ulo:check-time} predicate.
    Identifying proof complexity with file size (and check time) thus
    gives us a first indicator of proof complexity based on
    organizational knowledge alone.
    
    \textbf{Other Aspects} A tetrapodal search system should probably
    also take symbolic knowledge into account. Based on some kind of
    measure of formula complexity, different proofs could be
    rated. Similarly, with narrative knowledge available to us, we
    could count the number of words, references and so on to rate the
    narrative complexity of a proof. Combining symbolic, narrative
    and organizational knowledge allows us to find proofs which are
    probably straightforward.
    
        \input{applications-q1.tex}
    
    
    \textbf{Implementation} Implementing a naive version of the
    organizational aspect can be as simple as querying for all
    theorems justified by proofs, ordered by size (or check time).
    Figure~\ref{fig:q1short} illustrates how this can be achieved
    with a SPARQL query. Maybe we want to go one step further and
    calculate a rating that assigns each proof some numeric score of
    complexity based on a number of properties. We can achieve this
    in SPARQL as recent versions support arithmetic as part of the
    SPARQL specification; Figure~\ref{fig:q1long} shows an example.
    Finding a reasonable rating is its own topic of research, but we
    see that as long as it is based on standard arithmetic, it will
    be possible to formulate it as a SPARQL query.
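
    As an illustration, a minimal version of the naive query might
    look as follows; we assume a \texttt{ulo:justifies} predicate
    connecting each proof with the theorem it justifies.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    SELECT ?theorem ?proof ?size
    WHERE {
        ?proof ulo:justifies ?theorem ;
               ulo:external-size ?size .
    }
    ORDER BY ASC(?size)
    \end{lstlisting}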
    
    The queries in Figure~\ref{fig:q1full} return a list of all
    theorems and associated proofs; naturally, this list is bound to
    be very long. A suggested way to solve this problem is to
    introduce some kind of cutoff value for our complexity score.
    Another potential solution is to only list the first~$n$ results,
    something a user interface would do anyway. Either way, this is
    not so much an issue for the organizational storage engine and
    more one that a tetrapodal search aggregator has to account for.
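
    Both mitigations are easy to express in SPARQL. The following
    hedged sketch combines them; the score formula, the cutoff value
    of~$100$ and the \texttt{ulo:justifies} predicate are assumptions
    for illustration, and both size and check time are assumed to be
    stored as numeric literals.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    SELECT ?theorem ?proof ?score
    WHERE {
        ?proof ulo:justifies ?theorem ;
               ulo:external-size ?size ;
               ulo:check-time ?time .
        BIND((?size + ?time) AS ?score)  # hypothetical rating
        FILTER(?score < 100)             # cutoff value
    }
    ORDER BY ASC(?score)
    LIMIT 10                             # only the first n results
    \end{lstlisting}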
    
    \item \textbf{$\mathcal{Q}_2$ ``Find algorithms that solve
    $NP$-complete graph problems.''} Here we want the tetrapodal
    search system to return a listing of algorithms that solve
    (graph) problems with a given property (runtime complexity). We
    need to consider where each of these three components might be
    stored.
    
    
    \textbf{Symbolic and Concrete Aspects} First, let us consider
    algorithms. Algorithms can be formulated as computer code, which
    can be understood as symbolic knowledge (code represented as a
    syntax tree) or as concrete knowledge (code as text
    files)~\cites[pp.~8--9]{tetra}{virt}. Either way, we will not be
    able to query these indices for the problem a given algorithm
    solves, nor is it possible to infer properties as complex as
    $NP$-completeness automatically. Meta data of this kind needs to
    be stored in a separate index for organizational knowledge, it
    being the only fit.
    
    \textbf{Organizational Aspect} If we wish to look up properties
    about algorithms from organizational knowledge, we first have to
    think about how to represent this information. We propose two
    approaches, one using the existing ULO ontology and one that
    recommends extending ULO to natively support the concept of
    algorithms. Both approaches have their distinct trade-offs.
    
        \textbf{Representing Algorithms} As a first approach, we can try
        to represent algorithms and problems in terms of existing ULO
        predicates. As ULO does have a concept of \texttt{theorem} and
        \texttt{proof}, it might be tempting to exploit the Curry-Howard
        correspondence and represent algorithms understood as programs in
        terms of proofs.  But that does not really capture the canonical
        understanding of algorithms; algorithms are not actually programs,
        rather there are programs that implement algorithms. Even if we do
        work around this problem, it is not clear how to represent
        problems (e.g.\ the traveling salesman problem or the sorting
        problem) in terms of theorems (propositions, types) that get
        implemented by a proof (algorithm, program).
    
    As algorithms make up an important part of certain areas of
    research, it might be reasonable to introduce native support for
    algorithms in ULO or separately in another ontology. An argument
    for adding support directly to ULO is that ULO aims to be
    universal and as such should not be without algorithms. An
    argument for a separate ontology is that what we understand as
    ULO data sets (Isabelle, Coq exports) already contain triplets
    from other ontologies (e.g.\ Dublin Core meta
    data~\cite{dcreport, dcowl}) and keeping concepts separate is not
    entirely unattractive in itself.
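
    To make the second approach concrete, the following sketch shows
    how~$\mathcal{Q}_2$ might be formulated given such a dedicated
    ontology. The \texttt{algo:} prefix and its predicates are
    hypothetical inventions for illustration only.
    \begin{lstlisting}
    PREFIX algo: <https://example.org/algo#>  # hypothetical

    SELECT ?algorithm ?problem
    WHERE {
        ?algorithm algo:solves ?problem .
        ?problem algo:complexity-class "NP-complete" ;
                 algo:domain "graph" .
    }
    \end{lstlisting}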
    
    \item \textbf{$\mathcal{Q}_3$ ``All areas of math that {Nicolas G.\
    de Bruijn} has worked in and his main contributions.''} This
    query asks for the works of a given author~$A$. It also asks for
    their main contributions, e.g.\ which paragraphs or code~$A$ has
    authored.
    
    \textbf{Organizational Aspect} ULO has no concept of authors,
    contributors, dates and so on. Rather, the idea is to take
    advantage of the Dublin Core project which provides an ontology
    for such metadata~\cite{dcreport, dcowl}. For example, Dublin
    Core provides us with the \texttt{dcterms:creator} and
    \texttt{dcterms:contributor} predicates. Servicing~$\mathcal{Q}_3$
    requires us to look for creator~$A$ and then list all associated
    \texttt{dcterms:title}s that they have worked on. Of course this
    requires the above authorship predicates to actually be in use.
    With the Isabelle and Coq exports this was hardly the case;
    running some experiments, we found less than 15 unique
    contributors and creators, raising suspicion that meta data is
    missing in the original library files. Regardless, in theory ULO
    allows us to query for objects ordered by authors.
    
    Query $\mathcal{Q}_3$ also asks for the main contributions
    of~$A$, that is, those works authored by~$A$ that are the most
    important. Importance is a quality measure; simply sorting the
    results by number of references might be a good start.
    
    \textbf{Implementation} A search for contributions by a given
    author can easily be formulated in {SPARQL}.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT DISTINCT ?work
    WHERE {
        ?work dcterms:creator|dcterms:contributor "John Smith" .
    }
    \end{lstlisting}
    
    To get the main contributions, we rate each individual
    \texttt{?work} by its number of \texttt{ulo:uses}
    references. Extending the {SPARQL} query above, we can query the
    database for an ordered list of works, starting with the one that
    has the most references.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work (COUNT(?user) AS ?refcount)
    WHERE {
        ?work dcterms:creator|dcterms:contributor "John Smith" .
        ?user ulo:uses ?work .
    }
    GROUP BY ?work
    ORDER BY DESC(?refcount)
    \end{lstlisting}
    
        We can formulate~$\mathcal{Q}_3$ with just one SPARQL
        query. Because everything is handled by the database, access
        should be quick.
    
    \end{itemize}
    
Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with
some insight into ULO and existing ULO exports. $\mathcal{Q}_1$ shows
that while there is no formal definition of ``elementary proof'', ULO
allows us to query for heuristics and calculate a difficulty score for
proofs and their associated theorems. Query~$\mathcal{Q}_2$ illustrates
the difficulty in finding universal schemas. It remains an open question
whether ULO should include algorithms as a first class citizen, as a
concept based around existing ULO predicates, or whether it is a better
idea to design a dedicated ontology, and potentially a dedicated data
store, entirely. Finally, while we were able to formulate a SPARQL
query that should take care of most of~$\mathcal{Q}_3$, we found that
the existing data sets contain very little information about
authorship. This underlines the observations made in
Section~\ref{sec:expl}: developers writing applications that query ULO
storage need to be aware of the fact that existing exports have
``holes''.
    
    \subsection{Organizational Queries}\label{sec:miscq}
    
Various other queries for organizational data sets have previously
been proposed in the literature~\cite{adl}. Here we implement some of
them.
    
    
    \subsection{Experience with Building a Web Frontend}\label{sec:webq}