\section{Applications}\label{sec:applications}
With programming endpoints in place, we can now query the data set containing
both Isabelle and Coq exports stored in {GraphDB}. We experimented with
various queries and applications:
\begin{itemize}
\item Exploring which ULO predicates are actually used and which
remain unused (Section~\ref{sec:expl}).
\item We ran some queries that were suggested as building blocks
of a larger tetrapodal search system (Section~\ref{sec:tetraq}).
\item We experimented with various other, more general queries
for organizational data recommended in the literature
(Section~\ref{sec:miscq}).
\item Finally, we built a small web front end that visualizes
the ULO data set (Section~\ref{sec:webq}).
\end{itemize}
For each example query or application, we describe how to
implement it, what results we observed, and, where possible, we
conclude with recommendations for the future development of {ULO}.
\subsection{Exploring Existing Data Sets}\label{sec:expl}
As a very first application, we simply looked at what ULO predicates
are actually used by the respective data sets. With more than
250~million triplets in the store, we hoped that this would give us
some insight into the kind of knowledge we are dealing with.
Implementing a query for this job is not very difficult. In SPARQL,
this can be achieved with the \texttt{COUNT} aggregate.
\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?predicate (COUNT(?predicate) as ?count)
WHERE {
?s ?predicate ?o .
}
GROUP BY ?predicate
ORDER BY DESC(?count)
\end{lstlisting}
This yields a list of all used predicates with \texttt{?count} being
the number of occurrences. Looking at the results, we find that both
the Isabelle and the Coq data sets only use subsets of the predicates
provided by the ULO ontology. The results are listed in
Figure~\ref{fig:used}. In both cases, the exports use less than a
third of the available predicates.
\input{applications-ulo-table.tex}
We also see that the Isabelle and Coq exports use different
predicates. For example, the Isabelle export contains organizational
meta information such as information about paragraphs and sections
in the source document, while the Coq export only tells us the
filename of the Coq source. That is not particularly problematic as
long as we
can trace a given object back to the original source. Regardless, our
results do show that both exports have their own particularities and
with more and more third party libraries exported to ULO one has to
assume that this heterogeneity will only grow. In particular, we
want to point out the large number of predicates that remain unused
in both the Isabelle and Coq exports. A user formulating queries for
ULO might be oblivious to the fact that only subsets of exports
support a given predicate.
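Provided the ULO ontology definitions themselves are also loaded into the triple store (an assumption about the setup; whether ULO declares its predicates as OWL properties also has to be checked), a query for predicates that are declared but never used could be sketched with \texttt{FILTER NOT EXISTS}.
\begin{lstlisting}
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?predicate
WHERE {
{ ?predicate rdf:type owl:ObjectProperty . }
UNION
{ ?predicate rdf:type owl:DatatypeProperty . }
FILTER NOT EXISTS { ?s ?predicate ?o . }
}
\end{lstlisting}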
While not a problem for \emph{ulo-storage} per se, we do expect this
to be a challenge when building a tetrapodal search
system. Recommended ways around this ``missing fields'' problem in
database literature include the clever use of default values or
inference of missing values~\cite{kdisc, aidisc}, neither of which
feels particularly applicable to an ULO data set.
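That said, on the query level SPARQL does offer some tooling for missing fields: \texttt{OPTIONAL} matches combined with \texttt{COALESCE} can substitute a fallback when a predicate is absent. The following sketch uses an arbitrary default of zero for missing check times; whether such a default is meaningful depends entirely on the query at hand.
\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?proof (COALESCE(xsd:float(?checktime), 0.0) AS ?time)
WHERE {
?proof ulo:proof ?j .
OPTIONAL { ?proof ulo:check-time ?checktime . }
}
\end{lstlisting}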
\subsection{Querying for Tetrapodal Search}\label{sec:tetraq}
Various queries for a tetrapodal search system were previously
suggested in literature~\cite{tetra}. We investigated how some of them
can be realized with ULO data sets and where other data sources are
required. Where possible, we evaluate proof of concept
implementations.
\begin{itemize}
\item \textbf{$\mathcal{Q}_1$ ``Find theorems with non-elementary
proofs.''} Elementary proofs are those that are considered easy
and obvious~\cite{elempro}. In consequence,~$\mathcal{Q}_1$ has
to search for all proofs which are not trivial. Of course, just
like the distinction between ``theorem'' and ``corollary'' is
arbitrary, so is any judgment about whether a proof is easy or
not.
\textbf{Organizational Aspect} A first working hypothesis might be
to assume that easy proofs are short. In that case, the size, that
is the number of bytes to store the proof, is our first indicator
of proof complexity. {ULO} offers the \texttt{ulo:external-size}
predicate which will allow us to sort by file size. Maybe small
file size also leads to quick check times in proof assistants and
automatic theorem provers. With this assumption in mind we could
use the \texttt{ulo:check-time} predicate. Correlating proof
complexity with file size allows us to define one indicator of
proof complexity based on organizational knowledge alone.
\textbf{Other Aspects} A tetrapodal search system should probably
also take symbolic knowledge into account. Based on some kind of
measure of formula complexity, different proofs could be
rated. Similarly, with narrative knowledge available to us, we
could count the number of words, references and so on to rate the
narrative complexity of a proof. Combining symbolic knowledge,
narrative knowledge and organizational knowledge allows us to
find proofs which are probably straightforward.
\textbf{Implementation} Implementing a naive version of the
organizational aspect can be as simple as querying for all
theorems justified by proofs, ordered by size (or check time).
\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?theorem ?size
WHERE {
?theorem ulo:theorem ?i .
?proof ulo:justifies ?theorem .
?proof ulo:proof ?j .
?proof ulo:external-size ?size .
}
ORDER BY DESC(?size)
\end{lstlisting}
Maybe we want to go one step further and calculate a rating that
assigns each proof some numeric score of complexity based on a
number of properties. We can achieve this in SPARQL as recent
versions support arithmetic as part of the SPARQL specification.
\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?theorem ?proof ?size (
xsd:float(?size) + xsd:float(?checktime) as ?rating
)
WHERE {
?theorem ulo:theorem ?i .
?proof ulo:justifies ?theorem .
?proof ulo:proof ?j .
?proof ulo:external-size ?size .
?proof ulo:check-time ?checktime .
}
ORDER BY DESC(?rating)
\end{lstlisting}
Finding a reasonable rating is its own topic of research, but we
see that as long as it is based on basic arithmetic, it will be
possible to formulate in a SPARQL query.
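For instance, a weighted rating that counts check time more heavily than file size remains expressible in plain SPARQL; the weights below are arbitrary placeholders, not a recommendation.
\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?proof (
0.3 * xsd:float(?size) + 0.7 * xsd:float(?checktime) AS ?rating
)
WHERE {
?proof ulo:external-size ?size .
?proof ulo:check-time ?checktime .
}
ORDER BY DESC(?rating)
\end{lstlisting}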
\item \textbf{$\mathcal{Q}_2$ ``Find algorithms that solve
$NP$-complete graph problems.''} Here we want the tetrapodal search
system to return a listing of algorithms that solve (graph)
problems with a given property (runtime complexity). We need
to consider where each of these three components might be stored.
\textbf{Symbolic and Concrete Aspects} First, let us consider
algorithms. Algorithms can be formulated as computer code which
can be understood as symbolic knowledge (code represented as a
syntax tree) or as concrete knowledge (code as text
files)~\cites[pp. 8--9]{tetra}{virt}. Either way, we will not be
able to query these indices for what problem a given algorithm is
solving, nor is it possible to infer properties as complex as
$NP$-completeness automatically. Meta data of this kind needs to be
stored in a separate index for organizational knowledge, it being
the only fit.
\textbf{Organizational Aspect} If we wish to look up properties
about algorithms from organizational knowledge, we first have to
think about how to represent this information. We propose two
approaches, one using the existing ULO ontology and one that
extends ULO to natively support the concept of
algorithms. Both approaches have their distinct trade-offs.
\textbf{Representing Algorithms} As a first approach, we can try
to represent algorithms and problems in terms of existing ULO
predicates. As ULO does have a concept of \texttt{theorem} and
\texttt{proof}, it might be tempting to exploit the Curry-Howard
correspondence and represent algorithms understood as programs in
terms of proofs. But that does not really capture the canonical
understanding of algorithms; algorithms are not actually programs,
rather there are programs that implement algorithms. Even if we do
work around this problem, it is not clear how to represent
problems (e.g.\ the traveling salesman problem or the sorting
problem) in terms of theorems (propositions, types) that get
implemented by a proof (algorithm, program).
As algorithms make up an important part of certain areas of
research, it might be reasonable to introduce native level support
for algorithms in ULO or separately in another ontology. An
argument for adding support directly to ULO is that ULO aims to be
universal and as such should not be without algorithms. An
argument for a separate ontology is that the existing ULO data
sets (the Isabelle and Coq exports) already contain triplets from
other ontologies (e.g.\ Dublin Core meta data~\cite{dcreport,
dcowl}) and keeping concepts separate is not entirely
unattractive in itself.
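A hypothetical extension in the latter spirit could introduce a small separate vocabulary for algorithms and problems. The following Turtle sketch is purely illustrative; the \texttt{algo:} namespace and all names in it are our invention and not part of any existing ontology.
\begin{lstlisting}
@prefix algo: <https://example.org/algo#> .

algo:dijkstra a algo:algorithm ;
    algo:solves algo:shortest-path ;
    algo:implemented-by <https://example.org/code/dijkstra.ml> .

algo:shortest-path a algo:problem .
\end{lstlisting}
With such a vocabulary in place, servicing~$\mathcal{Q}_2$ would amount to querying for all \texttt{algo:algorithm} instances that \texttt{algo:solves} a problem annotated as $NP$-complete.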
\item \textbf{$\mathcal{Q}_3$ ``All areas of math that {Nicolas G.\
de Bruijn} has worked in and his main contributions.''} This query
asks for the works of a given author~$A$. It also asks for their
main contributions, e.g.\ what paragraphs or code~$A$ has authored.
\textbf{Organizational Aspect} ULO has no concept of authors,
contributors, dates and so on. Rather, the idea is to take advantage
of the Dublin Core project which provides an ontology for such
metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides
us with the \texttt{dcterms:creator} predicate. Servicing this
query would mean looking for the creator~$A$ and then listing all
associated \texttt{dcterms:title} that~$A$ has worked on. For a
first working version, the exports managed by \emph{ulo-storage}
are enough to service this query.
$\mathcal{Q}_3$ also asks for the main contributions of~$A$, that
is, those works authored by~$A$ that are the most important.
Importance is a quality measure; simply sorting the results by
number of references might be a good start. Again, this is
something that should be serviceable with just organizational
knowledge.
\textbf{Implementation} Searching for contributions by a given author
can easily be formulated in {SPARQL}.
\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?work
WHERE {
?work dcterms:contributor "John Smith" .
}
\end{lstlisting}
To get the main contributions, we rate each
individual \texttt{?work} by its number of \texttt{ulo:uses}
references. Extending the {SPARQL} query, we can query the database
for an ordered list of works, starting with the one that has the
most references.
\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?work (COUNT(?user) as ?refcount)
WHERE {
?work dcterms:contributor "John Smith" .
?user ulo:uses ?work .
}
GROUP BY ?work
ORDER BY DESC(?refcount)
\end{lstlisting}
We see that the idea behind~$\mathcal{Q}_3$ can be formulated as a
single, not very complicated SPARQL query. Because everything here
is handled by the database, access should be quick.
\end{itemize}
\subsection{Organizational Queries}\label{sec:miscq}
\emph{{TODO}: SPARQL Queries references in ULO paper}
\subsection{Experience with Building a Web Frontend}\label{sec:webq}