Skip to content
Snippets Groups Projects
applications.tex 16 KiB
Newer Older
  • Learn to ignore specific revisions
  • \section{Applications}\label{sec:applications}
    
    With programming endpoints in place, we can now query the data set
    containing both Isabelle and Coq exports stored in {GraphDB}. We
    
    experimented with two kinds of applications that talk to a GraphDB
    endpoint.
    
    
    \begin{itemize}
    
    Andreas Schärtl's avatar
    Andreas Schärtl committed
        \item Exploring which ULO predicates are actually used in the
          existing Coq and Isabelle exports.  We find that more than two
          thirds of existing ULO predicates were not taken advantage of
          (Section~\ref{sec:expl}).
    
    Andreas Schärtl's avatar
    Andreas Schärtl committed
        \item We investigated queries that could be used to extend the
          system into a larger tetrapodal search system. While some
          organizational queries have obvious canonical solutions others
          introduce questions on how organizational knowledge should be
    
          structured (Section~\ref{sec:tetraq}).
    
    \end{itemize}
    
    
    \noindent These applications will now be discussed in the following
    
    
    \subsection{Exploring Existing Data Sets}\label{sec:expl}
    
    
    For our first application, we looked at which ULO predicates are
    
    actually used by the respective data sets. With more than 250~million
    triplets in the store, we hoped that this would give us some insight
    into the kind of knowledge we are dealing with.
    
    
    Implementing a query for this job is not very difficult. In SPARQL,
    
    this can be achieved with the \texttt{COUNT} aggregate, the full query
    
    is given in verbatim in Figure~\ref{fig:preds-query}.  Our query
    yields a list of all used predicates together with the number of
    
    occurrences (Figure~\ref{fig:preds-result}). Looking at the results,
    we find that both the Isabelle and the Coq data sets only use subsets
    of the predicates provided by the ULO ontology. The full results are
    
    listed in Appendix~\ref{sec:used}. In both cases, what stands out is
    
    that both exports use less than a third of all available ULO~predicates.
    
    
    We also see that the Isabelle and Coq exports use different
    
    predicates.  For example, the Isabelle export contains organizational
    meta information such as information about paragraphs and sections in
    
    the source document while the Coq export only tells us the filename of
    the original Coq source. That is not particularly problematic as long
    as we can trace a given object back to the original source.
    
    Regardless, our results do show that both exports have their own
    particularities and with more and more third party libraries exported
    to ULO one has to assume that this heterogeneity will only grow. In
    particular we want to point to the large number of predicates which
    remain unused in both Isabelle and Coq exports. A user formulating
    queries for ULO might be oblivious to the fact that only subsets of
    exports support given predicates.
    
    \input{applications-preds.tex}
    
    
    We expect the difference between Coq and Isabelle exports to be caused
    by the difference in source material. It is only natural that
    different third party libraries expressed in different languages with
    
    different features will result in different ULO~predicates. However, we
    want to hint at the fact that this could also be an omission in
    
    the exporter code that originally generated the RDF~triplets we
    
    imported. This shows the importance of writing good exporters.
    Exporters taking existing libraries and outputting ULO~triplets must
    lose as little information as possible to ensure good results in a
    larger tetrapodal search system.
    
    Our first application gave us an initial impression of the structure
    of currently available organizational knowledge formulated in ULO
    triplets.  Whether caused by the difference in formal language or
    because of omissions in code that produce ULO~triplets, we must not
    expect predicates to be evenly distributed in the data set. This is
    something to keep in mind especially as the number of ULO exports
    increases.
    
    \subsection{Querying for Tetrapodal Search}\label{sec:tetraq}
    
    Andreas Schärtl's avatar
    Andreas Schärtl committed
    Various queries for a tetrapodal search system were previously
    
    suggested in literature~\cite{tetra}. We will now investigate how
    three of them could be realized with the help of ULO data sets and
    where other data sources are required. Where possible, we construct
    proof of concept implementations and evaluate their applicability.
    
    \subsubsection{Elementary Proofs and Computed Scores}
    
    
    Our first query~$\mathcal{Q}_1$ illustrates how we can compute
    
    arithmetic scores for some nodes in our knowledge graph.
    Query~$\mathcal{Q}_1$ asks us to ``\emph{[f]ind theorems with
    non-elementary proofs}.''  Elementary proofs can be understood as
    
    those proof that are considered easy and obvious~\cite{elempro}. In
    
    consequence,~$\mathcal{Q}_1$ has to search for all proofs which are
    
    not trivial.  Of course, just like any distinction between ``theorem''
    
    and ``corollary'' is going to be somewhat arbitrary, so is any
    judgment about whether a proof is easy or not.
    
    Existing research on proof difficulty is either very broad or specific
    to one problem. For example, some experiments showed that students and
    prospective school teachers have problems with notation, term
    rewriting and required prerequisites~\cite{proofund, proofteach}, none
    
    of which seems applicable for grading individual formal proofs for
    
    difficulty. On the other hand, there is research on rating proofs for
    individual subsets of problems, e.g.\ on the satisfiability of a given
    CNF formula. A particular example is focused on heuristics for how
    long a SAT solver will take to find a solution for a
    given~{CNF}. Here, solutions that take long are considered
    harder~\cite{proofsat}.
    
    
    \noindent\emph{Organizational Aspect.} A first working hypothesis
    
    might be to assume that elementary proofs are short. In that case, the
    size, that is the number of bytes to store the proof, is our first
    indicator of proof complexity. This is by no means perfect, as even
    identical proofs can be represented in different ways that might have
    vastly different size in bytes. It might be tempting to imagine a
    unique normal form for each proof, but finding such a normal form
    might very well be impossible. As it is very difficult to find a
    generalized definition of proof difficulty, we will accept proof size
    as a first working hypothesis.
    
    {ULO} offers the \texttt{ulo:external-size} predicate which will allow
    us to sort by file size. Maybe proof complexity also leads to quick
    check times in proof assistants and automatic theorem provers. With
    this assumption in mind we could use the \texttt{ulo:check-time}
    predicate. Correlating proof complexity with file size allows us to
    define one indicator of proof complexity based on organizational
    knowledge alone.
    
    
    \noindent\emph{Other Aspects.} A tetrapodal search system should
    
    probably also take symbolic knowledge into account. Based on some kind
    of measure of formula complexity, different proofs could be
    rated. Similarly, with narrative knowledge available to us, we could
    
    count the number of words, citations and so on to rate the narrative
    
    complexity of a proof. Combining symbolic knowledge, narrative
    knowledge and organizational knowledge allows us to find proofs which
    are probably straight forward.
    
    \input{applications-q1.tex}
    
    
    \noindent\emph{Implementation.} Implementing a naive version of the
    
    organizational aspect can be as simple as querying for all theorems
    justified by proofs, ordered by size (or check time).
    Figure~\ref{fig:q1short} illustrates how this can be achieved with a
    SPARQL query.  Maybe we want to go one step further and calculate a
    rating that assigns each proof some numeric score of complexity based
    on a number of properties. We can achieve this in SPARQL as recent
    versions support arithmetic as part of the SPARQL specification;
    Figure~\ref{fig:q1long} shows an example.  Finding a reasonable rating
    is its own topic of research, but we see that as long as it is based
    on standard arithmetic, it will be possible to formulate in a SPARQL
    query.
    
    The queries in Figure~\ref{fig:q1full} return a list of all theorems
    
    and associated proofs. Naturally, this list is bound to be very
    long. A suggested way to solve this problem is to introduce some kind
    of cutoff value for our complexity score. Another potential solution
    is to only list the first~$n$ results, something a user interface
    would have to do either way (pagination~\cite{pagination}). Either
    way, this is not so much an issue for the organizational storage
    engine and more one that a tetrapodal search aggregator has to account
    for.
    
    Another problem is that computing these scores can be quite time
    intensive. Even if calculating a score for one given object is fast,
    doing so for the whole data set might quickly turn into a problem.  In
    
    particular, if we wish to show the $n$~objects with best scores, we do
    
    need to compute scores for all relevant triplets for that score. In
    \emph{ulo-storage}, all scores we experimented with were easy enough
    
    and the data sets small enough such that this did not become a
    
    concrete problem. But in a larger tetrapodal search system, caching or
    lazily or ahead of time computed results will probably we a
    necessity. Which component takes care of keeping this cache is not
    clear right now, but we advocate for keeping caches of previously
    computed scores separate from the core \emph{ulo-storage} Endpoint
    such that the Endpoint can be easily updated.
    
    
    Understanding query~$\mathcal{Q}_1$ in the way we did makes it
    difficult to present a definite solution. However while thinking
    about~$\mathcal{Q}_1$ we found out that the infrastructure provided by
    \emph{ulo-storage} allows us to compute arbitrary arithmetic scores,
    something that will surely be useful for many applications.
    
    
    \subsubsection{Categorizing Algorithms and Algorithmic Problems}
    
    The second query~$\mathcal{Q}_2$ we decided to focus on wants to
    
    ``\emph{[f]ind algorithms that solve $NP$-complete graph problems.}''
    Here we want the tetrapodal search system to return a listing of
    algorithms that solve (graph) problems with a given property (runtime
    
    complexity). We need to consider where each of these three components
    might be stored.
    
    
    \noindent\emph{Symbolic and Concrete Aspects.} First, let us consider
    
    algorithms. Algorithms can be formulated as computer code which
    can be understood as symbolic knowledge (code represented as a
    syntax tree) or as concrete knowledge (code as text
    files)~\cites[pp. 8--9]{tetra}{virt}. Either way, we will not be
    able to query these indices for what problem a given algorithm is
    solving, nor is it possible to infer properties as complex as
    
    $NP$-completeness automatically. Metadata of this kind needs to be
    
    stored in a separate index for organizational knowledge, it being
    the only fit.
    
    
    \noindent\emph{Organizational Aspect.} If we wish to look up properties
    
    about algorithms from organizational knowledge, we first have to
    think about how to represent this information. We propose two
    approaches, one using the existing ULO ontology and one that
    recommends extending ULO to nativly support the concept of
    algorithms. Both approaches have their distinct trade-offs.
    
    As a first approach, we can try to represent algorithms and
    problems in terms of existing ULO predicates. As ULO does have a
    concept of \texttt{theorem} and \texttt{proof}, it might be
    tempting to exploit the Curry-Howard correspondence and represent
    algorithms understood as programs in terms of proofs.  But that
    does not really capture the canonical understanding of algorithms;
    algorithms are not actually programs, rather there are programs
    that implement algorithms. Even if we do work around this problem,
    it is not clear how to represent problems (e.g.\ the traveling
    salesman problem or the sorting problem) in terms of theorems
    (propositions, types) that get implemented by a proof (algorithm,
    program).
    
    As algorithms make up an important part of certain areas of
    research, it might be reasonable to introduce native level support
    for algorithms in ULO or separately in another ontology. An
    argument for adding support directly to ULO is that ULO aims to be
    universal and as such should not be without algorithms. An
    argument for a separate ontology is that what we understand as ULO
    data sets (Isabelle, Coq exports) already contain triplets from
    
    other ontologies (e.g.\ Dublin Core metadata~\cite{dcreport,
    
    dcowl}) and keeping concepts separate is not entirely
    unattractive in itself.
    
    
    In summary, we see that it is difficult to service~$\mathcal{Q}_2$
    even though the nature of this query is very much one of
    organizational knowledge. It is probably up to the implementors of
    future ULO~exports to find a good way of encoding algorithmic problems
    and solutions. Perhaps a starting point on this endeavor would be to
    find a formal way of structuring information gathered on sites like
    Rosetta Code~\cite{rc}, a site that provides concrete programs that
    solve algorithms problems.
    
    
    \subsubsection{Contributors and Number of References}
    
    
    Finally, query~$\mathcal{Q}_3$ from literature~\cite{tetra} wants to
    know ``\emph{[a]ll areas of math that {Nicolas G.\ de Bruijn} has
      worked in and his main contributions.}''  $\mathcal{Q}_3$~is asking
    for works of a given author~$A$.  It also asks for their main
    contributions, for example which particularly interesting paragraphs
    or code~$A$ has authored.  We picked this particular query as it
    is asking for metadata, something that should be easily serviced by
    organizational knowledge.
    
    \noindent\emph{Organizational Aspect.} ULO has no concept of authors,
    
    contributors, dates and so on. Rather, the idea is to take
    advantage of the Dublin Core project which provides an ontology
    for such metadata~\cite{dcreport, dcowl}. For example, Dublin Core
    provides us with the \texttt{dcterms:creator} and
    \texttt{dcterms:contributor} predicates. Servicing~$\mathcal{Q}_3$
    requires us to look for creator~$A$ and then list all associated
    
    objects that they have worked on. Of course this
    
    requires above authorship predicates to actually be in use. With
    the Isabelle and Coq exports this was hardly the case; running
    some experiments we found less than 15 unique contributors and
    
    creators, raising suspicion that metadata is missing in the
    
    original library files. Regardless, in theory ULO allows us to
    query for objects ordered by authors.
    
    
    \noindent\emph{Implementation.} A search for contributions by a given
    
    author can easily be formulated in {SPARQL}~(Figure~\ref{fig:q2a}).
    
    Query $\mathcal{Q}_3$ is also asking for the main contributions
    of~$A$, that is those works that~$A$ authored that are the most
    important. Sorting the result by number of references might be a good
    start.  To get the main contributions, we rate each individual work by
    its number of \texttt{ulo:uses} references. Extending the previous
    
    {SPARQL}, we can query the database for a ordered list of works,
    starting with the one that has the most
    references~(Figure~\ref{fig:q2b}).  We can formulate~$\mathcal{Q}_3$
    with just one SPARQL query. Because everything is handled by the
    database, access should be about as quick as we can hope it to be.
    
    
    While the sparse data set available to use only returned a handful of
    results, we see that queries like~$\mathcal{Q}_3$ are easily serviced
    with organizational knowledge formulated in ULO~triplets. More
    advanced queries could look at the interlinks between authors and even
    uncover ``citation cartels'' as was done previously with similar
    approaches~\cite{citcart}.
    
    
    \subsubsection{Summarizing $\mathcal{Q}_1$ to $\mathcal{Q}_3$}
    
    Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with
    some insight into ULO and existing ULO exports. $\mathcal{Q}_1$ shows
    that while there is no formal definition for ``elementary proof'', ULO
    
    allows us to query for heuristics and calculate arbitrary arithmetic scores for
    objects of organizational knowledge. Query~$\mathcal{Q}_2$ illustrates
    
    the difficulty in finding universal schemas. It remains an open question
    whether ULO should include algorithms as a first class citizen, as a
    concept based around existing ULO predicates or whether it is a better
    idea to design a dedicated ontology and potentially data store entirely.
    Finally, while we were able to formulate a SPARQL query that should
    take care of most of~$\mathcal{Q}_3$ we found that the existing data sets
    contain very little information about authorship. This underlines
    
    the observations made previously in Section~\ref{sec:expl} and should
    be on the mind of anyone writing exporters that output ULO~triplets.