\section{Applications}\label{sec:applications}
    
With endpoints in place, we can now query the ULO/RDF data set. This
section describes some experiments with the \emph{ulo-endpoint}
{API}. In particular, we query the storage backend for some
statistics and implement some queries suggested for tetrapodal search.
    
    \subsection{Exploring Existing Data Sets}
    
    As previously stated, there already exist exports to ULO for both
    Isabelle and Coq libraries~\cite{uloisabelle, ulocoq}. As a very first
    application, we simply look at what ULO predicates are actually used
    by the respective data sets.  Implementing such a query is not very
    difficult.  In SPARQL, this can be achieved with the \texttt{COUNT}
    aggregate.
    \begin{lstlisting}
        PREFIX ulo: <https://mathhub.info/ulo#>
    
        SELECT ?predicate (COUNT(?predicate) as ?count)
        WHERE {
            ?s ?predicate ?o .
        }
        GROUP BY ?predicate
        ORDER BY DESC(?count)
    \end{lstlisting}
    This yields a list of all used predicates with \texttt{?count} being
    the number of occurrences. Looking at the results, we find that both
    the Isabelle and the Coq data sets only use subsets of the predicates
    provided by the ULO ontology. The results are listed in
    figure~\ref{fig:used}. In both cases, the exports use less than a
    third of the available predicates.
    
    \input{applications-ulo-table.tex}
    
We also see that the Isabelle and Coq exports use different
predicates. For example, the Isabelle export contains organizational
meta information such as information about paragraphs and sections in
the source document, while the Coq export only tells us about the
filename of the Coq source. That is not particularly problematic as
long as we can trace a given object back to the original Isabelle/Coq
source.
    
However, our results do show that both exports have their own
particularities, and with more and more third party libraries exported
to ULO, one can assume that this heterogeneity will only grow. In
particular, we want to point out the large number of predicates which
remain unused in both the Isabelle and Coq exports. A user formulating
queries for ULO might be oblivious to the fact that only subsets of
exports support given predicates. While not a problem for
\emph{ulo-storage} per se, we expect this to be a major challenge when
building a system of tetrapodal search.
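
One practical mitigation is to probe a data set before relying on a
given predicate. The standard SPARQL \texttt{ASK} form makes this
cheap; the following minimal sketch checks for \texttt{ulo:check-time}.
\begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    # returns true if the data set contains at least one
    # ulo:check-time triple, false otherwise
    ASK { ?s ulo:check-time ?o . }
\end{lstlisting}
Alternatively, export-specific patterns can be wrapped in
\texttt{OPTIONAL} blocks so that a query still returns results when a
predicate is absent from a given export.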
    
    \subsection{Querying for Tetrapodal Search}
    
    
\emph{ulo-storage} was started with the goal of making organizational
knowledge available for tetrapodal search. We will first take a look
at how ULO/RDF performs at this task. Conveniently, various queries
for a tetrapodal search system were suggested in~\cite{tetra}; we will
investigate how some of them can be realized with ULO/RDF data sets
and where other data sources are required. Where possible, we evaluate
proof of concept implementations.
    
    \begin{itemize}
    
\item \textbf{$\mathcal{Q}_1$ ``Find theorems with non-elementary
proofs.''} Elementary proofs are those that are considered easy and
obvious. In consequence,~$\mathcal{Q}_1$ asks for all proofs
which are more difficult and not trivial. Of course, just like the
distinction between ``theorem'' and ``corollary'' is arbitrary, so
is any judgment about whether a proof is elementary or not.
    
    
\textbf{Organizational Aspect} A first working hypothesis might be
to assume that easy proofs are short. In that case, the size, that
is, the number of bytes needed to store the proof, is our first
indicator of proof complexity. ULO/RDF offers the
\texttt{ulo:external-size} predicate which allows us to sort by file
size. Maybe small file size also leads to quick check times in proof
assistants and automatic theorem provers. With this assumption in
mind, we could use the \texttt{ulo:check-time} predicate. Correlating
proof complexity with file size allows us to answer this query with
organizational data based on {ULO/RDF}.
    
    
\textbf{Other Aspects} A tetrapodal search system should probably
also take symbolic knowledge into account. Based on some kind of
measure of formula complexity, different proofs could be
rated. Similarly, with narrative knowledge available to us, we
could count the number of words, references and so on to rate the
narrative complexity of a proof. This shows that combining
symbolic, narrative and organizational knowledge could allow us to
service~$\mathcal{Q}_1$ in a heuristic fashion.
    
        \textbf{Implementation} Implementing a naive version of the
        organizational aspect can be as simple as querying for all proofs,
        ordered by size (or check time).
        \begin{lstlisting}
        PREFIX ulo: <https://mathhub.info/ulo#>
    
        SELECT ?proof ?size
        WHERE {
            ?proof ulo:proof ?o .
            ?proof ulo:external-size ?size .
        }
        ORDER BY DESC(?size)
        \end{lstlisting}
        Maybe we want to go one step further and calculate a rating that
        assigns each proof some numeric value of complexity.  We can
        achieve this in SPARQL as recent versions support arithmetic
        as part of the SPARQL specification.
        \begin{lstlisting}
        PREFIX ulo: <https://mathhub.info/ulo#>
        PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    
        SELECT ?proof ?size (xsd:float(?size) + xsd:float(?checktime) as ?rating)
        WHERE {
            ?proof ulo:proof ?o .
            ?proof ulo:external-size ?size .
            ?proof ulo:check-time ?checktime .
        }
        ORDER BY DESC(?rating)
        \end{lstlisting}
        Finding a reasonable rating is its own topic of research, but we
        see that as long as it is based on basic arithmetic, it will be
        possible to formulate in a SPARQL query.
    
\item \textbf{$\mathcal{Q}_2$ ``Find algorithms that solve
$NP$-complete graph problems.''} Here we want the tetrapodal search
system to return a listing of algorithms that solve (graph)
problems with a given property (runtime complexity). We need
to consider where each of these three components might be stored.
    
    
\textbf{Symbolic Aspect} First, let us consider
algorithms. Algorithms can be formulated as computer code, which
should be accessible from storage for symbolic knowledge. But
symbolic knowledge encodes just the algorithm itself; it tells us
neither what the algorithm is for, nor does it allow us to quickly
evaluate properties such as $NP$-completeness for querying. Such
metadata needs to be stored in a separate index.
    
    
\textbf{Organizational Aspect} The purpose of a given algorithm
(what problem it solves) as well as meta properties such as time
and space complexity need to be stored in a separate index. For
this to be easy to look up, we propose to file this meta
information as organizational knowledge. It certainly is not easily
searchable from symbolic or narrative knowledge, nor is it
concrete knowledge, as we are talking about general properties of
algorithms.
    
While ULO has a concept of \texttt{ulo:theorem}
and \texttt{ulo:proof}, there is no native way to represent
(a)~problems (e.g.\ the traveling salesman problem) and
(b)~algorithms that solve a given problem. If ULO had such
concepts, we could then introduce new data predicates that tell us
something about the properties of problems and
algorithms. Organized in such a schema, query~$\mathcal{Q}_2$
would be easy to service.
    
Of course, the question here is whether adding another first-level
concept to ULO is a good idea or whether it would be better to
design a separate ontology for algorithms. We leave this for later
discussion.
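
To illustrate, suppose ULO were extended with hypothetical
\texttt{ulo:solves} and \texttt{ulo:complexity} predicates; none of
these exist in the current ontology. Under that assumption, servicing
$\mathcal{Q}_2$ could look like the following sketch.
\begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    # ulo:solves and ulo:complexity are hypothetical predicates
    # that would first have to be added to the ontology
    SELECT ?algorithm ?problem
    WHERE {
        ?algorithm ulo:solves ?problem .
        ?problem ulo:complexity "NP-complete" .
    }
\end{lstlisting}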
    
\item \textbf{$\mathcal{Q}_3$ ``Find integer sequences whose
generating function is a rational polynomial in $\sin(x)$ that has
a Maple implementation not affected by the bug in
module~$x$.''} We see that this query is about finding specific
instances, namely integer sequences, with some property.  This is a
case where information would be split between many sources.  The name
of the sequence is part of the organizational data set, the
generating function is part of symbolic knowledge, and the Maple
implementation could be part of concrete knowledge.
    
\textbf{Organizational Aspect} Handling this query would probably
start by filtering for all integer sequences. It is not clear how
this should be achieved with ULO/RDF, as it contains no unified
concept of a sequence.  It might be possible to take advantage
of \texttt{aligned-with} or some similar concept to find all such
sequences. If this succeeds, a ULO index can provide the first
step in servicing this query. Here we are in a similar situation
as with~$\mathcal{Q}_2$: it is not clear whether we should
represent the idea behind ``integer sequences'' as a native
component of ULO or as something built on top of what ULO
provides.
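
If the alignment approach works out, the first filtering step might
look like the following sketch. Note that the URI all integer
sequences are assumed to be aligned with is a made-up placeholder, not
an identifier from any actual data set.
\begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    # the URI below is a hypothetical placeholder for a shared
    # canonical concept of integer sequences
    SELECT ?seq
    WHERE {
        ?seq ulo:aligned-with <https://example.org/concepts#integer-sequence> .
    }
\end{lstlisting}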
    
As for the next steps, finding concrete algorithms, and in
particular looking inside of them, is not a matter of organizational
data; other indices will need to be queried. That said, it is an open
question whether ULO should contain more information (e.g.\ about
properties of the generating function) or whether such information
can be deduced from symbolic knowledge.
    
    
\item \textbf{$\mathcal{Q}_4$ ``CAS implementation of Gröbner bases that
conform to a definition in AFP.''} Gröbner bases are a field of
study in mathematics particularly attractive for use in computer
algebra systems (CAS)~\cite{groebner}. This query is asking for
concrete implementations of Gröbner bases that match the definition
in the Archive of Formal Proofs~(AFP)~\cite{afp}.
    
    
We do have ULO/RDF exports for the AFP~\cite{uloisabelle}. Stated
like this, we can probably assume that $\mathcal{Q}_4$ is a query
for a very specific definition, identified by a ULO {URI}; no smart
queries are necessary. What is missing is the set of implementations,
that is, symbolic knowledge about actual implementations, and a way
of matching this symbolic knowledge with the definition in the
{AFP}. While surely an interesting problem, it is not a task for
organizational knowledge.
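
Assuming the ULO URI of the definition is known, looking up everything
the data set records about it reduces to a single standard SPARQL
\texttt{DESCRIBE} query; the URI below is a made-up placeholder, not
an actual identifier from the AFP export.
\begin{lstlisting}
    # the URI is a hypothetical placeholder for the actual
    # AFP definition of Groebner bases
    DESCRIBE <https://example.org/afp#groebner-bases-definition>
\end{lstlisting}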
    
    
\item \textbf{$\mathcal{Q}_5$ ``All areas of math that {Nicolas G.\
de Bruijn} has worked in and his main contributions.''}  This query
asks for the works of a given author~$A$.  It also asks for their
main contributions, e.g.\ which paragraphs or code~$A$ has authored.
    
\textbf{Organizational Aspect} ULO has no concept of authors,
contributors, dates and so on. Rather, the idea is to take advantage
of the Dublin Core project, which provides an ontology for such
metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides
us with the \texttt{dcterms:creator} predicate. Servicing this
query would mean looking for the creator~$A$ and then listing all
associated \texttt{dcterms:title} that~$A$ has worked on. For a
first working version, the exports managed by \emph{ulo-storage}
are enough to service this query.
    
    
$\mathcal{Q}_5$ also asks for the main contributions
of~$A$, that is, those works authored by~$A$ that are the most
important. Importance is a quality measure; simply sorting the
results by number of references might be a good start. Again, this
is something that should be serviceable with just organizational
knowledge.
    
\textbf{Implementation} Searching for contributions by a given author
can easily be formulated in {SPARQL}.
\begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work
    WHERE {
        ?work dcterms:contributor "John Smith" .
    }
\end{lstlisting}
To get all main contributions, we rate each
individual \texttt{?work} by its number of \texttt{ulo:uses}
references. Extending the {SPARQL} query, we can ask the database
for an ordered list of works, starting with the one that has the
most references.
\begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work (COUNT(?user) as ?refcount)
    WHERE {
        ?work dcterms:contributor "John Smith" .
        ?user ulo:uses ?work .
    }
    GROUP BY ?work
    ORDER BY DESC(?refcount)
\end{lstlisting}
We see that we can formulate the idea behind~$\mathcal{Q}_5$ with
one not very complicated SPARQL query. Because everything here is
handled by the database, access should be quick.
    
    \end{itemize}