  • \section{Applications}\label{sec:applications}
    
    With programming endpoints in place, we can now query the data set containing
    both Isabelle and Coq exports stored in {GraphDB}. We experimented with
    various queries and applications:
    
\begin{itemize}
    \item We explored which ULO predicates are actually used and which
      remain unused (Section~\ref{sec:expl}).

    \item We ran queries that were previously suggested as building blocks
      of a larger tetrapodal search system (Section~\ref{sec:tetraq}).

    \item We experimented with other, more general queries for
      organizational data recommended in the literature
      (Section~\ref{sec:miscq}).

    \item Finally, we built a small web front end that visualizes
      the ULO data set (Section~\ref{sec:webq}).
\end{itemize}
    
For each example query or application, we describe how to implement
it and what results we observed; where possible, we conclude with
recommendations for the future development of {ULO}.
    
    \subsection{Exploring Existing Data Sets}\label{sec:expl}
    
As a very first application, we simply looked at which ULO predicates
are actually used by the respective data sets. With more than
250~million triples in the store, we hoped that this would give us
some insight into the kind of knowledge we are dealing with.
    
    Implementing a query for this job is not very difficult. In SPARQL,
    this can be achieved with the \texttt{COUNT} aggregate.
    
    \begin{lstlisting}
        PREFIX ulo: <https://mathhub.info/ulo#>
    
        SELECT ?predicate (COUNT(?predicate) as ?count)
        WHERE {
            ?s ?predicate ?o .
        }
        GROUP BY ?predicate
        ORDER BY DESC(?count)
    \end{lstlisting}
This yields a list of all used predicates with \texttt{?count} being
the number of occurrences. Looking at the results, we find that both
the Isabelle and the Coq data sets use only subsets of the predicates
provided by the ULO ontology. The results are listed in
Figure~\ref{fig:used}. In both cases, the exports use less than a
third of the available predicates.
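Note that this query counts every predicate in the store, including
those from other ontologies such as Dublin Core. If we only care about
predicates from the ULO namespace, a namespace filter suffices; the
following is a minimal sketch of that variation, not a query we
evaluated in depth.

\begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    SELECT ?predicate (COUNT(?predicate) as ?count)
    WHERE {
        ?s ?predicate ?o .
        # Keep only predicates from the ULO namespace.
        FILTER(STRSTARTS(STR(?predicate), STR(ulo:)))
    }
    GROUP BY ?predicate
    ORDER BY DESC(?count)
\end{lstlisting}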
    
    \input{applications-ulo-table.tex}
    
We also see that the Isabelle and Coq exports use different
predicates. For example, the Isabelle export contains organizational
meta information such as information about paragraphs and sections in
the source document, while the Coq export only tells us about the
filename of the Coq source. That is not particularly problematic as
long as we can trace a given object back to its original
source. Regardless, our results do show that both exports have their
own particularities, and with more and more third party libraries
exported to ULO, one has to assume that this heterogeneity will only
grow. In particular, we want to point out the large number of
predicates that remain unused in both the Isabelle and Coq exports. A
user formulating queries for ULO might be oblivious to the fact that
only subsets of the exports support given predicates.
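One way to make such differences visible is to compare predicate usage
per export. Assuming each export is loaded into its own named graph in
GraphDB (an assumption about our particular deployment, not something
the data set itself prescribes), a single query can group the counts
by graph.

\begin{lstlisting}
    SELECT ?g ?predicate (COUNT(?predicate) as ?count)
    WHERE {
        # ?g binds to the named graph of each export.
        GRAPH ?g {
            ?s ?predicate ?o .
        }
    }
    GROUP BY ?g ?predicate
    ORDER BY ?g DESC(?count)
\end{lstlisting}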
    
While not a problem for \emph{ulo-storage} per se, we do expect this
to be a challenge when building a tetrapodal search system. Ways
around this ``missing fields'' problem recommended in the database
literature include the clever use of default values or the inference
of missing values~\cite{kdisc, aidisc}, neither of which feels
particularly applicable to a ULO data set.
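To illustrate the default value workaround, SPARQL's \texttt{OPTIONAL}
and \texttt{COALESCE} constructs can substitute a fixed fallback where
a predicate is missing. The sketch below assumes that a default size
of zero is acceptable, which in practice it may well not be.

\begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    # COALESCE falls back to 0 for proofs that
    # lack a ulo:external-size triple.
    SELECT ?proof (COALESCE(?size, 0) as ?sizeOrDefault)
    WHERE {
        ?proof ulo:proof ?j .
        OPTIONAL { ?proof ulo:external-size ?size . }
    }
\end{lstlisting}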
    
    \subsection{Querying for Tetrapodal Search}\label{sec:tetraq}
    
Various queries for a tetrapodal search system were previously
suggested in the literature~\cite{tetra}. We investigated how some of
them can be realized with ULO data sets and where other data sources
are required. Where possible, we evaluate proof of concept
implementations.
    
    
    \begin{itemize}
    
    \item \textbf{$\mathcal{Q}_1$ ``Find theorems with non-elementary
    proofs.''}  Elementary proofs are those that are considered easy
    and obvious~\cite{elempro}. In consequence,~$\mathcal{Q}_1$ has
    to search for all proofs which are not trivial. Of course, just
    like the distinction between ``theorem'' and ``corollary'' is
    arbitrary, so is any judgment about whether a proof is easy or
    not.
    
    \textbf{Organizational Aspect} A first working hypothesis might be
    to assume that easy proofs are short. In that case, the size, that
    is, the number of bytes needed to store the proof, is our first
    indicator of proof complexity. {ULO} offers the
    \texttt{ulo:external-size} predicate, which allows us to sort by
    file size. Small file size might also lead to quick check times in
    proof assistants and automatic theorem provers; with this
    assumption in mind, we could use the \texttt{ulo:check-time}
    predicate. Together, file size and check time allow us to define
    an indicator of proof complexity based on organizational knowledge
    alone.
    
    \textbf{Other Aspects} A tetrapodal search system should probably
    also take symbolic knowledge into account. Based on some kind of
    measure of formula complexity, different proofs could be
    rated. Similarly, with narrative knowledge available to us, we
    could count the number of words, references and so on to rate the
    narrative complexity of a proof. Combining symbolic, narrative
    and organizational knowledge allows us to find proofs which are
    probably straightforward.
    
    
    \textbf{Implementation} Implementing a naive version of the
    organizational aspect can be as simple as querying for all
    theorems justified by proofs, ordered by size (or check time).
    
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    SELECT ?theorem ?proof ?size
    WHERE {
        ?theorem ulo:theorem ?i .
        ?proof ulo:justifies ?theorem .
        ?proof ulo:proof ?j .
        ?proof ulo:external-size ?size .
    }
    ORDER BY DESC(?size)
    \end{lstlisting}
    Maybe we want to go one step further and calculate a rating that
    assigns each proof a numeric complexity score based on a number of
    properties. We can achieve this in SPARQL, as recent versions of
    the SPARQL specification support arithmetic.
    
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?theorem ?proof ?size (
        xsd:float(?size) + xsd:float(?checktime) as ?rating
    )
    WHERE {
        ?theorem ulo:theorem ?i .
        ?proof ulo:justifies ?theorem .
        ?proof ulo:proof ?j .
        ?proof ulo:external-size ?size .
        ?proof ulo:check-time ?checktime .
    }
    ORDER BY DESC(?rating)
    \end{lstlisting}
        Finding a reasonable rating is its own topic of research, but we
        see that as long as it is based on basic arithmetic, it will be
        possible to formulate in a SPARQL query.
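    Beyond mere ranking, actually selecting the non-elementary proofs
    that~$\mathcal{Q}_1$ asks for requires a cutoff. A sketch with
    \texttt{FILTER} follows; the threshold of 4096~bytes is an
    arbitrary placeholder, not an empirically justified choice.

    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # Keep only proofs above an arbitrary size threshold.
    SELECT ?theorem ?proof
    WHERE {
        ?theorem ulo:theorem ?i .
        ?proof ulo:justifies ?theorem .
        ?proof ulo:proof ?j .
        ?proof ulo:external-size ?size .
        FILTER(xsd:float(?size) > 4096)
    }
    \end{lstlisting}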
    
    \item \textbf{$\mathcal{Q}_2$ ``Find algorithms that solve
    $NP$-complete graph problems.''} Here we want the tetrapodal search
    system to return a listing of algorithms that solve (graph)
    problems with a given property (runtime complexity). We need
    to consider where each of these three components might be stored.
    
    
    \textbf{Symbolic and Concrete Aspects} First, let us consider
    algorithms. Algorithms can be formulated as computer code, which
    can be understood as symbolic knowledge (code represented as a
    syntax tree) or as concrete knowledge (code as text
    files)~\cites[pp. 8--9]{tetra}{virt}. Either way, we will not be
    able to query these indices for what problem a given algorithm
    solves, nor is it possible to infer properties as complex as
    $NP$-completeness automatically. Meta data of this kind needs to
    be stored in a separate index for organizational knowledge, the
    only aspect that fits.
    
    \textbf{Organizational Aspect} If we wish to look up properties
    of algorithms in organizational knowledge, we first have to
    think about how to represent this information. We propose two
    approaches, one using the existing ULO ontology and one that
    extends ULO to natively support the concept of
    algorithms. Both approaches have their distinct trade-offs.
    
        \textbf{Representing Algorithms} As a first approach, we can try
        to represent algorithms and problems in terms of existing ULO
        predicates. As ULO does have a concept of \texttt{theorem} and
        \texttt{proof}, it might be tempting to exploit the Curry-Howard
        correspondence and represent algorithms understood as programs in
        terms of proofs.  But that does not really capture the canonical
        understanding of algorithms; algorithms are not actually programs,
        rather there are programs that implement algorithms. Even if we do
        work around this problem, it is not clear how to represent
        problems (e.g.\ the traveling salesman problem or the sorting
        problem) in terms of theorems (propositions, types) that get
        implemented by a proof (algorithm, program).
    
    As algorithms make up an important part of certain areas of
    research, it might be reasonable to introduce native support
    for algorithms in ULO, or separately in another ontology. An
    argument for adding support directly to ULO is that ULO aims to be
    universal and as such should not be without algorithms. An
    argument for a separate ontology is that existing ULO data sets
    (the Isabelle and Coq exports) already contain triples from
    other ontologies (e.g.\ Dublin Core meta data~\cite{dcreport,
    dcowl}), and keeping concepts separate is not entirely
    unattractive in itself.
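    If ULO were extended along these lines, servicing~$\mathcal{Q}_2$
    would reduce to a simple lookup. The following sketch illustrates
    this; the predicates \texttt{ulo:algorithm}, \texttt{ulo:solves}
    and \texttt{ulo:complexity} are hypothetical and do not exist in
    the current ontology.

    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    # Hypothetical query: ulo:algorithm, ulo:solves and
    # ulo:complexity are not part of the current ULO ontology.
    SELECT ?algo ?problem
    WHERE {
        ?algo ulo:algorithm ?i .
        ?algo ulo:solves ?problem .
        ?problem ulo:complexity "NP-complete" .
    }
    \end{lstlisting}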
    
    \item \textbf{$\mathcal{Q}_3$ ``All areas of math that {Nicolas G.\
    de Bruijn} has worked in and his main contributions.''}  This query
    asks for the works of a given author~$A$.  It also asks for their
    main contributions, e.g.\ what paragraphs or code~$A$ has authored.
    
    \textbf{Organizational Aspect} ULO has no concept of authors,
    contributors, dates and so on. Rather, the idea is to take advantage
    of the Dublin Core project, which provides an ontology for such
    metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides
    us with the \texttt{dcterms:creator} predicate. Servicing this
    query would mean looking for the creator~$A$ and then listing all
    associated works (\texttt{dcterms:title}) that~$A$ has worked
    on. For a first working version, the exports managed by
    \emph{ulo-storage} are enough to service this query.
    
    
    $\mathcal{Q}_3$ also asks for the main contributions of~$A$, that
    is, those works authored by~$A$ that are the most important.
    Importance is a quality measure; simply sorting the results by
    number of references might be a good start. Again, this is
    something that should be serviceable with just organizational
    knowledge.
    
    \textbf{Implementation} Searching for contributions by a given
    author can easily be formulated in {SPARQL}.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work
    WHERE {
        ?work dcterms:contributor "John Smith" .
    }
    \end{lstlisting}
    To get the main contributions, we rate each
    individual \texttt{?work} by its number of \texttt{ulo:uses}
    references. Extending the {SPARQL} query, we can ask the database
    for an ordered list of works, starting with the one that has the
    most references.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work (COUNT(?user) as ?refcount)
    WHERE {
        ?work dcterms:contributor "John Smith" .
        ?user ulo:uses ?work .
    }
    GROUP BY ?work
    ORDER BY DESC(?refcount)
    \end{lstlisting}
    
    We see that we can formulate the idea behind~$\mathcal{Q}_3$ as
    one not very complicated SPARQL query. Because everything here is
    handled by the database, access should be quick.
    
    \end{itemize}
    
    \subsection{Organizational Queries}\label{sec:miscq}
    
    
    \emph{{TODO}: SPARQL Queries references in ULO paper}
    
    
    \subsection{Experience with Building a Web Frontend}\label{sec:webq}