applications.tex

\section{Applications}\label{sec:applications}

With programming endpoints in place, we can now query the data set containing
both Isabelle and Coq exports stored in {GraphDB}. We experimented with
various queries and applications:

\begin{itemize}
    \item Exploring which ULO predicates are actually used and which
      remain unused (Section~\ref{sec:expl}).

    \item We ran some queries that were suggested as building blocks
      of a larger tetrapodal search system (Section~\ref{sec:tetraq}).

    \item We also experimented with various other more general queries
      for organizational data recommended in literature
      (Section~\ref{sec:miscq}).

    \item Finally we built a small web front end that takes visualizes
      the ULO data set (Section~\ref{sec:webq}).
\end{itemize}

For each example query or application, we try to describe how to
implement it, what results we observed and if possible we conclude
with some recommendations for future development of {ULO}.

\subsection{Exploring Existing Data Sets}\label{sec:expl}

As a very first application, we simply looked at what ULO predicates
are actually used by the respective data sets. With more than
250~million triplets in the store, we hoped that this would give us
some insight into the kind of knowledge we are dealing with.

Implementing a query for this job is not very difficult. In SPARQL,
this can be achieved with the \texttt{COUNT} aggregate.
\begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    SELECT ?predicate (COUNT(?predicate) as ?count)
    WHERE {
        ?s ?predicate ?o .
    }
    GROUP BY ?predicate
    ORDER BY DESC(?count)
\end{lstlisting}
This yields a list of all used predicates with \texttt{?count} being
the number of occurrences. Looking at the results, we find that both
the Isabelle and the Coq data sets only use subsets of the predicates
provided by the ULO ontology. The results are listed in
Figure~\ref{fig:used}. In both cases, the exports use less than a
third of the available predicates.

\input{applications-ulo-table.tex}

We also see that the Isabelle and Coq exports use different
predicates.  For example, the Isabelle contains organizational meta
information such as information about paragraphs and sections in the
source document while the Coq export only tells us about the filename
of the Coq source. That is not particularly problematic as long as we
can trace a given object back to the original source.  Regardless, our
results do show that both exports have their own particularities and
with more and more third party libraries exported to ULO one has to
assume that this heterogeneity will only grow. In particular we want
to point to the large number of predicates which remain unused in both
Isabelle and Coq exports. A user formulating queries for ULO might be
oblivious to the fact that only subsets of exports support given
predicates.

While not a problem for \emph{ulo-storage} per se, we do expect this
to be a challenge when building a tetrapodal search
system. Recommended ways around this ``missing fields'' problem in
database literature include the clever use of default values or
inference of missing values~\cite{kdisc, aidisc}, neither of which
feels particularly applicable to an ULO data set.

\subsection{Querying for Tetrapodal Search}\label{sec:tetraq}

Various queries for a tetrapodal search system were previously
suggested in literature~\cite{tetra}. We investigated how some of them
can be realized with ULO data sets and where other data sources are
required. Where possible, we evaluate proof of concept
implementations.

\begin{itemize}
    \item \textbf{$\mathcal{Q}_1$ ``Find theorems with non-elementary
    proofs.''}  Elementary proofs are those that are considered easy
    and obvious~\cite{elempro}. In consequence,~$\mathcal{Q}_1$ has
    to search for all proofs which are not trivial. Of course, just
    like the distinction between ``theorem'' and ``corollary'' is
    arbitrary, so is any judgment about whether a proof is easy or
    not.

    \textbf{Organizational Aspect} A first working hypothesis might be
    to assume that easy proofs are short. In that case, the size, that
    is the number of bytes to store the proof, is our first indicator
    of proof complexity. {ULO} offers the \texttt{ulo:external-size}
    predicate which will allow us to sort by file size. Maybe small
    file size also leads to quick check times in proof assistants and
    automatic theorem provers. With this assumption in mind we could
    use the \texttt{ulo:check-time} predicate. Correlating proof
    complexity with file size allows us to define one indicator of
    proof complexity based on organizational knowledge alone.

    \textbf{Other Aspects} A tetrapodal search system should probably
    also take symbolic knowledge into account. Based on some kind of
    measure of formula complexity, different proofs could be
    rated. Similarly, with narrative knowledge available to us, we
    could count the number of words, references and so on to rate the
    narrative complexity of a proof. Combining symbolic knowledge,
    narrative knowledge and organizational knowledge allows us to
    find proofs which are probably straight forward.

    \textbf{Implementation} Implementing a naive version of the
    organizational aspect can be as simple as querying for all
    theorems justified by proofs, ordered by size (or check time).
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    SELECT ?theorem ?proof ?size
    WHERE {
        ?theorem ulo:theorem ?i .
        ?proof ulo:justifies ?theorem .
        ?proof ulo:proof ?j .
        ?proof ulo:external-size ?size .
    }
    ORDER BY DESC(?size)
    \end{lstlisting}
    Maybe we want to go one step further and calculate a rating that
    assigns each proof some numeric score of complexity based on a
    number of properties.  We can achieve this in SPARQL as recent
    versions support arithmetic as part of the SPARQL specification.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?theorem ?proof ?size (
        xsd:float(?size) + xsd:float(?checktime) as ?rating
    )
    WHERE {
        ?theorem ulo:theorem ?i .
        ?proof ulo:justifies ?theorem .
        ?proof ulo:proof ?j .
        ?proof ulo:external-size ?size .
        ?proof ulo:check-time ?checktime .
    }
    ORDER BY DESC(?rating)
    \end{lstlisting}
    Finding a reasonable rating is its own topic of research, but we
    see that as long as it is based on basic arithmetic, it will be
    possible to formulate in a SPARQL query.

    \item \textbf{$\mathcal{Q}_2$ ``Find algorithms that solve
    $NP$-complete graph problems.''} Here we want the tetrapodal search
    system to return a listing of algorithms that solve (graph)
    problems with a given property (runtime complexity). We need
    to consider where each of these three components might be stored.

    \textbf{Symbolic Aspect} First, let us consider
    algorithms. Algorithms can be formulated as computer code which
    should be accessible from storage for symbolic knowledge. But
    symbolic knowledge encodes just the algorithm itself, not what the
    algorithm is for nor is it possible to quickly evaluate properties
    such as $NP$~completeness for querying. Such meta data needs to be
    stored in a separate index.

    \textbf{Organizational Aspect} The purpose of a given algorithm
    (what problem it solves) as well as meta properties such as time
    and space complexity needs to be stored in a separate index. For
    this to be easy to look up, we propose to file this meta
    information as organizational knowledge. It certainly isn't easily
    searchable from symbolic or narrative knowledge and nor is it
    concrete knowledge as we are talking about general properties of
    algorithms.

    While ULO has a concept of \texttt{ulo:theorem}
    and \texttt{ulo:proof}, there is no native way to represent
    (a)~problems (e.g.\ the traveling salesman problem) and
    (b)~algorithms that compute a given problem. If ULO had such a
    concept, we could then introduce new data predicates that tell us
    something about the properties of problems and
    algorithms. Organized in such a schema, query~$\mathcal{Q}_2$
    would be easy to service.

    Of course the question here is whether adding another first level
    concept to ULO is a good idea or whether it would be better to
    think of another ontology for algorithms. We leave this for later
    discussion.

    \item \textbf{$\mathcal{Q}_3$ ``Find integer sequences whose
    generating function is a rational polynomial in $\sin(x)$ that has
    a Maple implementation not not affected by the bug in
    module~$x$.''} We see that this query is about finding specific
    instances, integer sequences, with some property.  This is a case
    where information would be split between many sources.  The name
    of the sequences is part of the organizational data set, the
    generating function is part of symbolic knowledge, the Maple
    implementation could be part of concrete knowledge.

    \textbf{Organizational Aspect} Handling this query would probably
    start by filtering for all integer sequences. It is not clear how
    this should be achieved with ULO/RDF as it contains no unified
    concept of a sequence.  It might be possible to take advantage
    of \texttt{aligned-with} or some similar concept to find all such
    sequences~\cite{align}. If this succeeds, an ULO index can provide the first
    step in servicing this query. Here we are in a similar situation
    as with~$\mathcal{Q}_2$. It is not clear whether we should
    represent the idea behind ``integer sequences'' as a native
    component of ULO or as something building on top of what ULO
    provides.

    As for the next steps, finding concrete algorithms and in
    particular looking inside of them is not organizational data and
    other indices will need to be queried. That said, it is an open
    question whether ULO should contain more information (e.g.\ about
    properties of the generating function) or whether such information
    can be deduced from symbolic knowledge.

    \item \textbf{$\mathcal{Q}_4$ ``CAS implementation of Gröbner bases that
    conform to a definition in AFP.''} Gröbner Bases are a field of
    study in mathematics particular attractive for use in computer
    algebra systems (CAS)~\cite{groebner}. This query is asking for
    concrete implementations of Gröbner Bases that match the definition
    in the Archive of Formal Proofs~(AFP)~\cite{afp}.

    We do have ULO/RDF exports for the AFP~\cite{uloisabelle}. Stated
    like this, we can probably assume that $\mathcal{Q}_4$ is a query
    for a very specific definition, identified by an ULO {URI}. No smart
    queries necessary. What is missing is the set of implementations,
    that is symbolic knowledge about actual implementations, and a way
    of matching this symbolic knowledge with the definition in
    {AFP}. While surely an interesting problem, it is not a task for
    organizational knowledge.

    \item \textbf{$\mathcal{Q}_5$ ``All areas of math that {Nicolas G.\
    de Bruijn} has worked in and his main contributions.''}  This query
    is asking by works of a given author~$A$.  It also ask for their
    main contributions, e.g.\ what paragraphs or code~$A$ has authored.

    \textbf{Organizational Aspect} ULO has no concept of authors,
    contributors dates and so on. Rather, the idea is to take advantage
    of the Dublin Core project which provides an ontology for such
    metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides
    us with the \texttt{dcterms:creator} predicate. Servicing this
    query would mean looking for the creator~$A$ and then listing all
    associated \texttt{dcterms:title} that~$A$ has worked on. For a
    first working version, the exports managed by \emph{ulo-storage}
    are enough to service this query.

    As~$\mathcal{Q}_5$ is also asking for the main contributions
    of~$A$, that is those works that~$A$ authored that are the most
    important. Importance is a quality measure, simply sorting the
    result by number of references might be a good start. Again, this
    is something that should serviceable with just organizational
    knowledge.

    \textbf{Implementation} Search for contributions by a given author
    can easily be formulated in {SPARQL}.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work
    WHERE {
        ?work dcterms:contributor "John Smith" .
    }
    GROUP BY ?work
    \end{lstlisting}
    To get all main contributions, we rate each
    individual \texttt{?work} by its number of \texttt{ulo:uses}
    references. Extending the {SPARQL} query, we can query the database
    for a ordered list of works, starting with the one that has the
    most references.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work (COUNT(?user) as ?refcount)
    WHERE {
        ?work dcterms:contributor "John Smith" .
        ?user ulo:uses ?work .
    }
    GROUP BY ?work
    ORDER BY DESC(?refcount)
    \end{lstlisting}
    We see that we can formulate the idea behind~$\mathcal{Q}_5$ with
    one not very complicated SPARQL query. Because here everything is
    handled by the database access should be quick.
\end{itemize}

\subsection{Organizational Queries}\label{sec:miscq}

\emph{{TODO}: SPARQL Queries references in ULO paper}

\subsection{Experience with Building a Web Frontend}\label{sec:webq}