applications.tex

\newpage
\section{Applications}\label{sec:applications}

With endpoints in place, we can now query the ULO/RDF
data set. Depending on the kind of application, different interfaces
and approaches to querying the database might make sense.

\subsection{Querying for Tetrapodal Search}

\emph{ulo-storage} was started with the goal of making organizational
knowledge available for tetrapodal search. We will first take a look
at how ULO/RDF performs at this task. Conveniently, various queries
for a tetrapodal search system were suggested in~\cite{tetra}; we will
investigate how some of them can be realized with ULO/RDF data sets
and where other data sources are required. Where possible, we evaluate
proof of concept implementations.

\begin{itemize}
    \item \textbf{$\mathcal{Q}_{1}$ ``Find theorems with non-elementary
    proofs.''}  Elementary proofs are those that are considered easy and
    obvious. In consequence,~$\mathcal{Q}_{1}$ asks for all proofs
    which are more difficult and not trivial. Of course, just like the
    distinction between ``theorem'' and ``corollary'' is arbitrary, so
    is any judgment about whether a proof is elementary or not.

    \textbf{Organizational Aspect} A first working hypothesis might be
    to assume that easy proofs are short. In that case, the size, that
    is the number of bytes to store the proof, is our first indicator
    of proof complexity. ULO/RDF offers the \texttt{ulo:external-size}
    predicate which will allow us to sort by file size. Maybe small
    file size also leads to quick check times in proof assistants and
    automatic theorem provers. With this assumption in mind we could
    use the \texttt{ulo:check-time} predicate. Correlating proof
    complexity with file size allows us to answer this query with
    organizational data based on {ULO/RDF}.

    \textbf{Other Aspects} A tetrapodal search system should probably
    also take symbolic knowledge into account. Based on some kind of
    measure of formula complexity, different proofs could be
    rated. Similarly, with narrative knowledge available to us, we
    could count the number of words, references and so on to rate the
    narrative complexity of a proof. This shows that combining
    symbolic knowledge, narrative knowledge and organizational
    knowledge could allow us to service~$\mathcal{Q}_{1}$ in a
    heuristic fashion.

    \textbf{Implementation} Implementing a naive version of the
    organizational aspect can be as simple as querying for all proofs,
    ordered by size (or check time).
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    SELECT ?proof ?size
    WHERE {
        ?proof ulo:proof ?o .
        ?proof ulo:external-size ?size .
    }
    ORDER BY DESC(?size)
    \end{lstlisting}
    Maybe we want to go one step further and calculate a rating that
    assigns each proof some numeric value of complexity.  We can
    achieve this in SPARQL as recent versions support arithmetic
    as part of the SPARQL specification.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?proof ?size (xsd:float(?size) + xsd:float(?checktime) as ?rating)
    WHERE {
        ?proof ulo:proof ?o .
        ?proof ulo:external-size ?size .
        ?proof ulo:check-time ?checktime .
    }
    ORDER BY DESC(?rating)
    \end{lstlisting}
    Finding a reasonable rating is its own topic of research, but we
    see that as long as it is based on basic arithmetic, it will be
    possible to formulate in a SPARQL query.

    \item \textbf{$\mathcal{Q}_{2}$ ``Find algorithms that solve
    $NP$-complete graph problems.''} Here we want the tetrapodal search
    system to return a listing of algorithms that solve (graph)
    problems with a given property (runtime complexity). We need
    to consider where each of these three components might be stored.

    \textbf{Symbolic Aspect} First, let us consider
    algorithms. Algorithms can be formulated as computer code which
    should be accessible from storage for symbolic knowledge. But
    symbolic knowledge encodes just the algorithm itself, not what the
    algorithm is for nor is it possible to quickly evaluate properties
    such as $NP$~completeness for querying. Such meta data needs to be
    stored in a separate index.

    \textbf{Organizational Aspect} The purpose of a given algorithm
    (what problem it solves) as well as meta properties such as time
    and space complexity needs to be stored in a separate index. For
    this to be easy to look up, we propose to file this meta
    information as organizational knowledge. It certainly isn't easily
    searchable from symbolic or narrative knowledge and nor is it
    concrete knowledge as we are talking about general properties of
    algorithms.

    While ULO has a concept of \texttt{ulo:theorem}
    and \texttt{ulo:proof}, there is no native way to represent
    (a)~problems (e.g.\ the traveling salesman problem) and
    (b)~algorithms that compute a given problem. If ULO had such a
    concept, we could then introduce new data predicates that tell us
    something about the properties of problems and
    algorithms. Organized in such a schema, query~$\mathcal{Q}_{2}$
    would be easy to service.

    Of course the question here is whether adding another first level
    concept to ULO is a good idea or whether it would be better to
    think of another ontology for algorithms. We leave this for later
    discussion.

    \item \textbf{$\mathcal{Q}_{3}$ ``Find integer sequences whose generating
    function is a rational polynomial in $\sin(x)$ that has a Maple
    implementation not not affected by the bug in module~$x$.''} We
    see that this query is about finding specific instances, integer
    sequences, with some property.

   This is a case where information would be split between many
   sources.  The name of the sequences is part of the organizational
   data set, the generating function is part of symbolic knowledge,
   the Maple implementation could be part of concrete knowledge.

   Handling this query would probably start by filtering for all
   integer sequences. It is not clear how this should be achieved with
   ULO/RDF as it contains no unified concept of a sequence.  It might
   be possible to take advantage of \texttt{aligned-with} or some
   similar concept to find all such sequences. If this succeeds, an
   ULO index can provide the first step in servicing this query.

   As for the next steps, finding concrete algorithms and in
   particular looking inside of them is not organizational data and
   other indices will need to be queried. That said, it is an open
   question whether ULO should contain more information (e.g.\ about
   properties of the generating function) or whether such information
   can be deduced from symbolic knowledge.

   \item \textbf{$\mathcal{Q}_{4}$ ``CAS implementation of Gröbner bases that
   conform to a definition in AFP.''} Gröbner Bases are a field of
   study in mathematics particular attractive for use in computer
   algebra systems (CAS)~\cite{groebner}. This query is asking for
   concrete implementations of Gröbner Bases that match the definition
   in the Archive of Formal Proofs (AFP).

   We do have ULO/RDF exports for the AFP~\cite{uloisabelle}. Stated
   like this, we can probably assume that $\mathcal{Q}_{4}$ is a query
   for a very specific definition, identified by an ULO {URI}. No smart
   queries necessary. What is missing is the set of implementations,
   that is symbolic knowledge about actual implementations, and a way
   of matching this symbolic knowledge with the definition in
   {AFP}. While surely an interesting problem, it is not a task for
   organizational knowledge.

   \item \textbf{$\mathcal{Q}_{5}$ ``All areas of math that {Nicolas G.\
   de Bruijn} has worked in and his main contributions.''}  This query
   is asking by works of a given author~$A$.  It also ask for their
   main contributions, e.g.\ what paragraphs or code~$A$ has authored.

   \textbf{Organizational Aspect} ULO has no concept of authors,
   contributors dates and so on. Rather, the idea is to take advantage
   of the Dublin Core project which provides an ontology for such
   metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides
   us with the \texttt{dcterms:creator} predicate. Servicing this
   query would mean looking for the creator~$A$ and then listing all
   associated \texttt{dcterms:title} that~$A$ has worked on. For a
   first working version, the exports managed by \emph{ulo-storage}
   are enough to service this query.

   As~$\mathcal{Q}_{5}$ is also asking for the main contributions
   of~$A$, that is those works that~$A$ authored that are the most
   important. Importance is a quality measure, simply sorting the
   result by number of references might be a good start. Again, this
   is something that should serviceable with just organizational
   knowledge.

   \textbf{Implementation} Search for contributions by a given author
   can easily be formulated in {SPARQL}.
   \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work
    WHERE {
        ?work dcterms:contributor "John Smith" .
    }
    GROUP BY ?work
   \end{lstlisting}
   To get all main contributions, we rate each
   individual \texttt{?work} by its number of \texttt{ulo:uses}
   references. Extending the {SPARQL} query, we can query the database
   for a ordered list of works, starting with the one that has the
   most references.
   \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work (COUNT(?user) as ?refcount)
    WHERE {
        ?work dcterms:contributor "John Smith" .
        ?user ulo:uses ?work .
    }
    GROUP BY ?work
    ORDER BY DESC(?refcount)
   \end{lstlisting}
   We see that we can formulate the idea behind~$\mathcal{Q}_{5}$ with
   one not very complicated SPARQL query. Because here everything is
   handled by the database access should be quick.
\end{itemize}