Commit 2003ef78 authored by Andreas Schärtl

report: add some clarifications

Next up: Proofreading and then setting up next tasks for September
parent 82fe3080
@@ -2,8 +2,8 @@
With programming endpoints in place, we can now query the data set
containing both Isabelle and Coq exports stored in {GraphDB}. We
experimented with two kinds of applications that talk to a GraphDB
endpoint.
\begin{itemize}
\item Exploring which ULO predicates are actually used in the
@@ -22,7 +22,7 @@ of our GraphDB endpoint.
\subsection{Exploring Existing Data Sets}\label{sec:expl}
For our first application, we looked at what ULO predicates are
actually used by the respective data sets. With more than 250~million
triplets in the store, we hoped that this would give us some insight
into the kind of knowledge we are dealing with.
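As a rough sketch, this exploration boils down to a single aggregation
query over all triplets; note that the \texttt{ulo:} prefix URI below
is only a stand-in for the actual ontology namespace.

\begin{lstlisting}
# Count how often each ULO predicate occurs in the data set.
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?predicate (COUNT(*) AS ?uses) WHERE {
  ?subject ?predicate ?object .
  FILTER(STRSTARTS(STR(?predicate), STR(ulo:)))
}
GROUP BY ?predicate
ORDER BY DESC(?uses)
\end{lstlisting}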
@@ -38,25 +38,26 @@ listed in Appendix~\ref{sec:used}. In both cases, what stands out is
that both exports use less than a third of the available predicates.
We also see that the Isabelle and Coq exports use different
predicates. For example, the Isabelle export contains organizational
meta information about paragraphs and sections in the source document
while the Coq export only tells us about the filename of the Coq
source. That is not particularly problematic as long as we can trace a
given object back to the original source. Regardless, our results do
show that both exports have their own particularities and with more
and more third party libraries exported to ULO one has to assume that
this heterogeneity will only grow. In particular, we want to point out
the large number of predicates which remain unused in both Isabelle
and Coq exports. A user formulating queries for ULO might be oblivious
to the fact that only subsets of exports support given predicates.
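To make the problem concrete, consider the following sketch of a query
that asks for theorems together with the section they appear in. The
\texttt{ulo:} prefix URI and the exact predicate and class names are
placeholders for illustration; the point is that any object lacking
the section predicate, e.g.\ everything in the Coq export, silently
disappears from the result set.

\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?theorem ?section WHERE {
  ?theorem a ulo:theorem .
  # Only some exports attach section information. Objects
  # without this predicate drop out of the results entirely.
  ?theorem ulo:section ?section .
}
\end{lstlisting}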
While not a problem for \emph{ulo-storage} per se, we do expect this
to be a challenge when building a tetrapodal search
system. Recommended ways around this ``missing fields'' problem in
database literature include the clever use of default values or
inference of missing values~\cite{kdisc, aidisc}, neither of which
feels particularly applicable to vast and heterogeneous sets of
ULO~triplets.
\input{applications-preds.tex}
@@ -87,7 +88,8 @@ implementations.
subsets of problems, e.g.\ on the satisfiability of a given CNF
formula. A particular example is focused on heuristics for how
long a SAT solver will take to find a solution for a
given~{CNF}. Here, solutions that take longer to find are considered
harder~\cite{proofsat}.
\textbf{Organizational Aspect} A first working hypothesis might be
to assume that elementary proofs are short. In that case, the
@@ -100,13 +102,13 @@ implementations.
to find a generalized definition of proof difficulty, we will
accept proof size as a first working hypothesis.
{ULO} offers the \texttt{ulo:external-size} predicate which will
allow us to sort by file size. Maybe small file size also leads
to quick check times in proof assistants and automatic theorem
provers. With this assumption in mind we could use the
\texttt{ulo:check-time} predicate. Taking file size, and perhaps
check time, as proxies allows us to define one indicator of proof
complexity based on organizational knowledge alone.
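A hedged sketch of such a query follows; as before, the \texttt{ulo:}
prefix URI is a stand-in, and check times will only be available for
a subset of objects.

\begin{lstlisting}
# List proofs ordered by file size, with check times where known.
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?proof ?size ?checktime WHERE {
  ?proof a ulo:proof ;
         ulo:external-size ?size .
  OPTIONAL { ?proof ulo:check-time ?checktime . }
}
ORDER BY ASC(?size)
\end{lstlisting}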
\textbf{Other Aspects} A tetrapodal search system should probably
also take symbolic knowledge into account. Based on some kind of
@@ -165,18 +167,18 @@ implementations.
recommends extending ULO to natively support the concept of
algorithms. Both approaches have their distinct trade-offs.
As a first approach, we can try to represent algorithms and
problems in terms of existing ULO predicates. As ULO does have a
concept of \texttt{theorem} and \texttt{proof}, it might be
tempting to exploit the Curry-Howard correspondence and represent
algorithms understood as programs in terms of proofs. But that
does not really capture the canonical understanding of algorithms;
algorithms are not actually programs, rather there are programs
that implement algorithms. Even if we do work around this problem,
it is not clear how to represent problems (e.g.\ the traveling
salesman problem or the sorting problem) in terms of theorems
(propositions, types) that get implemented by a proof (algorithm,
program).
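To illustrate what this first approach would amount to in practice,
the following purely hypothetical sketch queries for algorithms
encoded as proofs that justify a problem statement. The \texttt{ulo:}
prefix URI and the exact predicate names are assumptions, not
something any existing export provides.

\begin{lstlisting}
# Hypothetical encoding: problems as ulo:theorem objects,
# algorithms as ulo:proof objects that justify them.
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?algorithm WHERE {
  ?algorithm a ulo:proof ;
             ulo:justifies ?problem .
  ?problem a ulo:theorem .
  # Some way of identifying ?problem, e.g. as the sorting
  # problem, would still be needed here.
}
\end{lstlisting}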
As algorithms make up an important part of certain areas of
research, it might be reasonable to introduce native level support
@@ -245,7 +247,7 @@ implementations.
\end{lstlisting}
We can formulate~$\mathcal{Q}_3$ with just one SPARQL
query. Because everything is handled by the database, access
should be about as quick as we can hope it to be.
\end{itemize}
Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with
......
@@ -2,17 +2,18 @@
One of the two contributions of \emph{ulo-storage} is that we
implemented components for making organizational mathematical
knowledge (formulated as RDF~triplets) queryable. This section first
identifies the individual components required for this task and then
describes some details of the actual implementation for this project.
\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
With RDF files exported and available for download as Git
repositories~\cite{uloisabelle, ulocoq}, we have the goal of making
the underlying data available for use in applications. Let us first
look at a high-level overview of all involved
components. Figure~\ref{fig:components} illustrates each component and
the flow of data.
\begin{figure}[]\begin{center}
\includegraphics{figs/components}
@@ -28,27 +29,21 @@ component and the flow of data.
\item With streams of ULO files assembled by the Collecter, this
data then gets passed to an \emph{Importer}. An Importer uploads
RDF~streams into some kind of permanent storage. As we will see,
the GraphDB~\cite{graphdb} triplet store was a natural fit.
\item Finally, with all triplets stored in a database, an
\emph{Endpoint} is where applications access the underlying
knowledge base. This does not necessarily need to be any custom
software; rather, the programming interface of the underlying
database itself can be understood as an endpoint of its own.
\end{itemize}
Collecter, Importer and Endpoint provide us with an easy and automated
way of making RDF files available for use within applications. We will
now take a look at the actual implementation created for
\emph{ulo-storage}, beginning with the implementation of Collecter and
Importer.
\subsection{Collecter and Importer}\label{sec:collecter}
@@ -59,11 +54,13 @@ takes such a stream of RDF data and then dumps it to some sort of
persistent storage. However, in the implementation for
\emph{ulo-storage}, both Collecter and Importer ended up being one
piece of monolithic software. This does not need to be the case but
proved convenient because (1)~combining Collecter and Importer forgoes
the need for an additional IPC~mechanism and (2)~neither Collecter nor
Importer is a terribly large piece of software.
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collecter
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml} while the Git Collecter first clones a Git
repository and then passes the checked out working copy to the file
system Collecter. Because it is not uncommon for RDF files to be
@@ -72,19 +69,20 @@ Gzip~\cite{gzip} and XZ~\cite{xz} formats which can greatly reduce the
required disk space in the collection step.
During development of the Collecter, we found that existing exports
from third party mathematical libraries contain RDF syntax errors
which were not discovered previously. In particular, both the Isabelle
and Coq exports contained URIs which do not fit the official syntax
specification~\cite{rfc3986} as they contained illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open Source which does
not properly check URIs according to the specification; in
consequence, these faults were only discovered now. To tackle these
problems, we introduced on-the-fly correction steps during collection
that take the broken RDF files, fix the mentioned problems related to
URIs (by escaping illegal characters) and then continue processing. Of
course this is only a work-around; related bug reports were filed in
the respective export projects to ensure that in the future this extra
step is not necessary.
Our Collecter takes existing RDF files, applies some on-the-fly
transformations (extraction of compressed files, fixing of errors),
@@ -110,10 +108,11 @@ allows the user to automate import jobs. For example, it is possible
to schedule an import of a given Git repository every seven days to a
given GraphDB instance.
Automated job control that regularly imports data from the same
sources leads us to the problem of versioning. ULO
exports~$\mathcal{E}$ depend on an original third party
library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of
Collecter and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
\mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D}
@@ -130,10 +129,10 @@ suggestion to solve the problem of changing third party libraries is
to regularly re-create the full data set~$\mathcal{D}$ from scratch,
say every seven days. This circumvents all problems related to
updating existing data sets, but it does mean additional computation
requirements. It also means that changes in~$\mathcal{L}$ take some
time to propagate to~$\mathcal{D}$. If the number of triplets rises by
orders of magnitude, this approach will eventually not be scalable
anymore.
\subsection{Endpoints}\label{sec:endpoints}
@@ -152,7 +151,7 @@ on the RDF4J Java library. Both approaches have unique advantages.
\item SPARQL is a standardized query language for RDF triplet
data~\cite{sparql}. The specification includes not just syntax
and semantics of the language itself, but also a standardized
REST interface~\cite{rest} for querying database servers.
\textbf{Syntax} SPARQL is inspired by SQL and as such the
\texttt{SELECT} \texttt{WHERE} syntax should be familiar to many
@@ -168,9 +167,9 @@ on the RDF4J Java library. Both approaches have unique advantages.
subject~\texttt{?s}, predicate~\texttt{?p} and
object~\texttt{?o}.
\textbf{Advantage} Probably the biggest advantage is that SPARQL
is ubiquitous. As it is the de facto standard for querying
triplet stores, lots of implementations and documentation are
available~\cite{sparqlbook, sparqlimpls, gosparql}.
\item RDF4J is a Java API for interacting with triplet stores,
@@ -199,8 +198,9 @@ on the RDF4J Java library. Both approaches have unique advantages.
than mixing programming language with awkward query strings.
We also found it quite helpful to generate Java classes from
OWL~ontologies that contain all definitions of the
ontology~\cite{rdf4jgen}. This provides us with powerful IDE
auto-completion features during development of ULO applications.
\end{itemize}
We see that both SPARQL and RDF4J have unique advantages. While SPARQL
@@ -223,8 +223,9 @@ that web interface such that it can persist import jobs and finally
(3)~a GraphDB instance which provides us with the required
Endpoint. All code for Collecter and Importer is available in the
\texttt{ulo-storage-collect} Git repository\footnote{\url{\gorepo}}.
Additional deployment files, that is Docker Compose files and
Dockerfiles, are stored in a separate
repository\footnote{\url{\composerepo}}.
This concludes our discussion of the implementation developed for the
\emph{ulo-storage} project. We designed a system based around (1)~a
......
@@ -24,37 +24,35 @@ structure, tetrapodal search proposes that each kind of mathematical
knowledge should be made available in a storage backend that fits the
kind of data it is providing. With all four areas available for
querying, tetrapodal search intends to then combine the four indexes
into a single query interface. Currently, research is focused on
providing schemas, storage backends and indexes for the four different
kinds of mathematical knowledge. The focus of \emph{ulo-storage} is
the area of organizational knowledge.
A previously proposed way to structure such organizational data is the
\emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an
OWL~ontology~\cite{uloonto} and as such all organizational information
is stored as RDF~triplets with a unified schema of
ULO~predicates~\cite{owl}. Some effort has been made to export
existing databases of formal mathematical knowledge to {ULO}. In
particular, exports from Isabelle and Coq-based libraries are
available~\cite{uloisabelle, ulocoq}. The resulting data set is already
quite large, the Isabelle export alone containing more than
200~million triplets.
Existing exports from Isabelle and Coq result in single or multiple
XML~files that contain RDF~triplets. This is a convenient format for
exchange and easily versioned using Git. However, considering the vast
number of triplets, it is impossible to query easily and efficiently
in this state. This is what \emph{ulo-storage} is focused on: Making
ULO data sets accessible for querying and analysis. We collected RDF
files spread over different Git repositories, imported them into a
database and then experimented with APIs for accessing that data set.
The main contribution of \emph{ulo-storage} is twofold. First, (1)~we
built up various infrastructure components for making organizational
knowledge queryable. These components can make up building blocks of
a larger tetrapodal search system. Their design and implementation are
discussed in Section~\ref{sec:implementation}. Second, (2)~we ran
sample prototype applications and queries on top of this
interface. While the applications themselves are admittedly not very
......
@@ -314,3 +314,10 @@
url = {http://graphdb.ontotext.com/documentation/free/graphdb-feature-comparison.html},
}
@book{rest,
  title = {Architectural styles and the design of network-based software architectures},
  author = {Fielding, Roy T. and Taylor, Richard N.},
  volume = {7},
  year = {2000},
  publisher = {University of California, Irvine}
}