  • \section{Applications}\label{sec:applications}
    
    With programming endpoints in place, we can now query the data set containing
    both Isabelle and Coq exports stored in {GraphDB}. We experimented with
    various queries and applications:
    
    \begin{itemize}
    
    \item Exploring which ULO predicates are actually used in the
      existing Coq and Isabelle exports. We find that more than two
      thirds of the available ULO predicates are not used at all
      (Section~\ref{sec:expl}).
    
    \item We investigated queries that could be used to extend the
      system into a larger tetrapodal search system. While some
      organizational queries have obvious canonical solutions, others
      raise questions about how organizational knowledge should be
      structured (Section~\ref{sec:tetraq}).
    
    
    \item We also experimented with various other, more general
      queries for organizational data recommended in the literature
      (Section~\ref{sec:miscq}).
    
    \item Finally, we built a small web front end that visualizes
      the ULO data set (Section~\ref{sec:webq}).
    \end{itemize}
    
    
    \noindent Each application will now be discussed in a dedicated section.
    
    
    \subsection{Exploring Existing Data Sets}\label{sec:expl}
    
    
For our first application, we looked at which ULO predicates are
actually used by the respective data sets. With more than 250~million
triplets in the store, we hoped that this would give us some insight
into the kind of knowledge we are dealing with.
    
    
Implementing a query for this job is not very difficult. In SPARQL,
this can be achieved with the \texttt{COUNT} aggregate; the full
query is given verbatim in Figure~\ref{fig:preds-query}. It yields a
list of all used predicates with \texttt{?count} being the number of
occurrences (Figure~\ref{fig:preds-result}). Looking at the results,
we find that both the Isabelle and the Coq data sets use only subsets
of the predicates provided by the ULO ontology. The full results are
listed in Appendix~\ref{sec:used}. What stands out is that both
exports use less than a third of the available predicates.
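
In essence, such a query groups all triples by predicate and counts
the size of each group; a minimal sketch might look as follows.
\begin{lstlisting}
SELECT ?predicate (COUNT(?predicate) AS ?count)
WHERE {
    ?subject ?predicate ?object .
}
GROUP BY ?predicate
ORDER BY DESC(?count)
\end{lstlisting}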
    
    
We also see that the Isabelle and Coq exports use different
predicates. For example, the Isabelle export contains organizational
meta information such as information about paragraphs and sections in
the source document, while the Coq export only tells us about the
filename of the Coq source. That is not particularly problematic as
long as we can trace a given object back to the original source.
Regardless, our results do show that both exports have their own
particularities, and with more and more third party libraries
exported to ULO, one has to assume that this heterogeneity will only
grow. In particular, we want to point to the large number of
predicates which remain unused in both the Isabelle and Coq exports.
A user formulating queries for ULO might be oblivious to the fact
that only subsets of exports support given predicates.
    
While not a problem for \emph{ulo-storage} per se, we do expect this
to be a challenge when building a tetrapodal search system.
Recommended ways around this ``missing fields'' problem in database
literature include the clever use of default values or inference of
missing values~\cite{kdisc, aidisc}, neither of which feels
particularly applicable to a ULO data set.
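
SPARQL does support a limited version of the default value approach
through \texttt{OPTIONAL} and \texttt{COALESCE}. The following sketch
assumes a \texttt{ulo:theorem} class and substitutes a size of zero
where \texttt{ulo:external-size} is missing; whether such a default
is meaningful remains questionable.
\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>

SELECT ?object (COALESCE(?size, 0) AS ?sizeWithDefault)
WHERE {
    ?object a ulo:theorem .
    OPTIONAL { ?object ulo:external-size ?size . }
}
\end{lstlisting}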
    
    
    \input{applications-preds.tex}
    
    
    \subsection{Querying for Tetrapodal Search}\label{sec:tetraq}
    
Various queries for a tetrapodal search system have previously been
suggested in the literature~\cite{tetra}. We investigated how some of
them can be realized with ULO data sets and where other data sources
are required. Where possible, we evaluated proof of concept
implementations.
    
    
    \begin{itemize}
    
    \item \textbf{$\mathcal{Q}_1$ ``Find theorems with non-elementary
    proofs.''} Elementary proofs are those that are considered easy
    and obvious~\cite{elempro}. In consequence,~$\mathcal{Q}_1$ has
    to search for all proofs which are not trivial. Of course, just
    as any distinction between ``theorem'' and ``corollary'' is going
    to be somewhat arbitrary, so is any judgment about whether a
    proof is easy or not.
    
        Existing research on proof difficulty is either very broad or
        specific to one problem. For example, some experiments showed that
        students and prospective school teachers have problems with
        notation, term rewriting and required
        prerequisites~\cite{proofund, proofteach}, none of which seems
        applicable for grading individual proofs for difficulty. On the
        other hand, there is research on rating proofs for individual
        subsets of problems, e.g.\ on the satisfiability of a given CNF
    formula. One particular example focuses on heuristics for
    estimating how long a SAT solver will take to find a solution for
    a given~{CNF}~\cite{proofsat}.
    
    \textbf{Organizational Aspect} A first working hypothesis might
    be to assume that elementary proofs are short. In that case, the
    size, that is, the number of bytes needed to store the proof, is
    our first indicator of proof complexity. This is by no means
    perfect, as even identical proofs can be represented in different
    ways that might have vastly different sizes in bytes. It might be
    tempting to imagine a unique normal form for each proof, but
    finding such a normal form might very well be impossible. As it
    is very difficult to find a generalized definition of proof
    difficulty, we accept proof size as a first working hypothesis.
    
    {ULO} offers the \texttt{ulo:external-size} predicate which
    allows us to sort by file size. Small file size might also
    correlate with quick check times in proof assistants and
    automatic theorem provers; with this assumption in mind, we could
    additionally use the \texttt{ulo:check-time} predicate.
    Identifying proof complexity with file size (and check time) thus
    gives us a first indicator of proof complexity based on
    organizational knowledge alone.
    
    \textbf{Other Aspects} A tetrapodal search system should probably
    also take symbolic knowledge into account. Based on some kind of
    measure of formula complexity, different proofs could be
    rated. Similarly, with narrative knowledge available to us, we
    could count the number of words, references and so on to rate the
    narrative complexity of a proof. Combining symbolic, narrative
    and organizational knowledge allows us to find proofs which are
    probably straightforward.
    
        \input{applications-q1.tex}
    
    
    \textbf{Implementation} Implementing a naive version of the
    organizational aspect can be as simple as querying for all
    theorems justified by proofs, ordered by size (or check time).
    Figure~\ref{fig:q1short} illustrates how this can be achieved
    with a SPARQL query. Maybe we want to go one step further and
    calculate a rating that assigns each proof some numeric score of
    complexity based on a number of properties. We can achieve this
    in SPARQL as recent versions support arithmetic as part of the
    SPARQL specification; Figure~\ref{fig:q1long} shows an example.
    Finding a reasonable rating is its own topic of research, but we
    see that as long as it is based on standard arithmetic, it will
    be possible to formulate it as a SPARQL query.
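
    As an illustration, a minimal version of the naive query might
    look as follows; we assume a \texttt{ulo:justifies} predicate
    connecting each proof with the theorem it justifies.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    SELECT ?theorem ?proof ?size
    WHERE {
        ?proof ulo:justifies ?theorem ;
               ulo:external-size ?size .
    }
    ORDER BY ASC(?size)
    \end{lstlisting}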
    
    The queries in Figure~\ref{fig:q1full} return a list of all
    theorems and associated proofs; naturally, this list is bound to
    be very long. A suggested way to solve this problem is to
    introduce some kind of cutoff value for our complexity score.
    Another potential solution is to only list the first~$n$ results,
    something a user interface would do anyway. Either way, this is
    not so much an issue for the organizational storage engine and
    more one that a tetrapodal search aggregator has to account for.
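
    Both mitigations are easy to express in SPARQL. The following
    hedged sketch combines them; the score formula, the cutoff value
    of~$100$ and the \texttt{ulo:justifies} predicate are assumptions
    for illustration, and both size and check time are assumed to be
    stored as numeric literals.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>

    SELECT ?theorem ?proof ?score
    WHERE {
        ?proof ulo:justifies ?theorem ;
               ulo:external-size ?size ;
               ulo:check-time ?time .
        BIND((?size + ?time) AS ?score)  # hypothetical rating
        FILTER(?score < 100)             # cutoff value
    }
    ORDER BY ASC(?score)
    LIMIT 10                             # only the first n results
    \end{lstlisting}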
    
    \item \textbf{$\mathcal{Q}_2$ ``Find algorithms that solve
    $NP$-complete graph problems.''} Here we want the tetrapodal
    search system to return a listing of algorithms that solve
    (graph) problems with a given property (runtime complexity). We
    need to consider where each of these three components might be
    stored.
    
    
    \textbf{Symbolic and Concrete Aspects} First, let us consider
    algorithms. Algorithms can be formulated as computer code, which
    can be understood as symbolic knowledge (code represented as a
    syntax tree) or as concrete knowledge (code as text
    files)~\cites[pp.~8--9]{tetra}{virt}. Either way, we will not be
    able to query these indices for the problem a given algorithm
    solves, nor is it possible to infer properties as complex as
    $NP$-completeness automatically. Meta data of this kind needs to
    be stored in a separate index for organizational knowledge, it
    being the only fit.
    
    \textbf{Organizational Aspect} If we wish to look up properties
    about algorithms from organizational knowledge, we first have to
    think about how to represent this information. We propose two
    approaches, one using the existing ULO ontology and one that
    recommends extending ULO to natively support the concept of
    algorithms. Both approaches have their distinct trade-offs.
    
        \textbf{Representing Algorithms} As a first approach, we can try
        to represent algorithms and problems in terms of existing ULO
        predicates. As ULO does have a concept of \texttt{theorem} and
        \texttt{proof}, it might be tempting to exploit the Curry-Howard
        correspondence and represent algorithms understood as programs in
        terms of proofs.  But that does not really capture the canonical
        understanding of algorithms; algorithms are not actually programs,
        rather there are programs that implement algorithms. Even if we do
        work around this problem, it is not clear how to represent
        problems (e.g.\ the traveling salesman problem or the sorting
        problem) in terms of theorems (propositions, types) that get
        implemented by a proof (algorithm, program).
    
    As algorithms make up an important part of certain areas of
    research, it might be reasonable to introduce native support for
    algorithms in ULO or separately in another ontology. An argument
    for adding support directly to ULO is that ULO aims to be
    universal and as such should not be without algorithms. An
    argument for a separate ontology is that what we understand as
    ULO data sets (Isabelle, Coq exports) already contain triplets
    from other ontologies (e.g.\ Dublin Core meta
    data~\cite{dcreport, dcowl}) and keeping concepts separate is not
    entirely unattractive in itself.
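
    To make the second approach concrete, the following sketch shows
    how~$\mathcal{Q}_2$ might be formulated given such a dedicated
    ontology. The \texttt{algo:} prefix and its predicates are
    hypothetical inventions for illustration only.
    \begin{lstlisting}
    PREFIX algo: <https://example.org/algo#>  # hypothetical

    SELECT ?algorithm ?problem
    WHERE {
        ?algorithm algo:solves ?problem .
        ?problem algo:complexity-class "NP-complete" ;
                 algo:domain "graph" .
    }
    \end{lstlisting}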
    
    \item \textbf{$\mathcal{Q}_3$ ``All areas of math that {Nicolas G.\
    de Bruijn} has worked in and his main contributions.''} This
    query asks for the works of a given author~$A$. It also asks for
    their main contributions, e.g.\ which paragraphs or code~$A$ has
    authored.
    
    \textbf{Organizational Aspect} ULO has no concept of authors,
    contributors, dates and so on. Rather, the idea is to take
    advantage of the Dublin Core project which provides an ontology
    for such metadata~\cite{dcreport, dcowl}. For example, Dublin
    Core provides us with the \texttt{dcterms:creator} and
    \texttt{dcterms:contributor} predicates. Servicing~$\mathcal{Q}_3$
    requires us to look for creator~$A$ and then list all associated
    \texttt{dcterms:title}s that they have worked on. Of course this
    requires the above authorship predicates to actually be in use.
    With the Isabelle and Coq exports this was hardly the case;
    running some experiments, we found less than 15 unique
    contributors and creators, raising suspicion that meta data is
    missing in the original library files. Regardless, in theory ULO
    allows us to query for objects ordered by authors.
    
    Query $\mathcal{Q}_3$ also asks for the main contributions
    of~$A$, that is, those works authored by~$A$ that are the most
    important. Importance is a quality measure; simply sorting the
    results by number of references might be a good start.
    
    \textbf{Implementation} A search for contributions by a given
    author can easily be formulated in {SPARQL}.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT DISTINCT ?work
    WHERE {
        ?work dcterms:creator|dcterms:contributor "John Smith" .
    }
    \end{lstlisting}
    
    To get the main contributions, we rate each individual
    \texttt{?work} by its number of \texttt{ulo:uses}
    references. Extending the {SPARQL} query above, we can query the
    database for an ordered list of works, starting with the one that
    has the most references.
    \begin{lstlisting}
    PREFIX ulo: <https://mathhub.info/ulo#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?work (COUNT(?user) AS ?refcount)
    WHERE {
        ?work dcterms:creator|dcterms:contributor "John Smith" .
        ?user ulo:uses ?work .
    }
    GROUP BY ?work
    ORDER BY DESC(?refcount)
    \end{lstlisting}
    
        We can formulate~$\mathcal{Q}_3$ with just one SPARQL
        query. Because everything is handled by the database, access
        should be quick.
    
    \end{itemize}
    
Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with
some insight into ULO and existing ULO exports. $\mathcal{Q}_1$ shows
that while there is no formal definition of ``elementary proof'', ULO
allows us to query for heuristics and calculate a difficulty score for
proofs and their associated theorems. Query~$\mathcal{Q}_2$ illustrates
the difficulty in finding universal schemas. It remains an open question
whether ULO should include algorithms as a first class citizen, as a
concept based around existing ULO predicates, or whether it is a better
idea to design a dedicated ontology, and potentially a dedicated data
store, entirely. Finally, while we were able to formulate a SPARQL
query that should take care of most of~$\mathcal{Q}_3$, we found that
the existing data sets contain very little information about
authorship. This underlines the observations made in
Section~\ref{sec:expl}: developers writing applications that query ULO
storage need to be aware of the fact that existing exports have
``holes''.
    
    \subsection{Organizational Queries}\label{sec:miscq}
    
Various other queries for organizational data sets have previously
been proposed in the literature~\cite{adl}. Here we implement some of
them.
    
    
    \subsection{Experience with Building a Web Frontend}\label{sec:webq}