\section{Implementation}\label{sec:implementation}
One of the two contributions of \emph{ulo-storage} is that we
implemented components for making organizational mathematical
knowledge (formulated as RDF~triplets) queryable. This section first
identifies the individual components involved in this task. We then
discuss the actual implementation created for this project.
\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
Figure~\ref{fig:components} illustrates how data flows through the
different components. In total, we identified three components that make
up the infrastructure provided by \emph{ulo-storage}.
\begin{figure}\begin{center}
\includegraphics[width=0.9\textwidth]{figs/components.png}
\caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
\end{center}\end{figure}
\begin{itemize}
\item ULO triplets are present in various locations, be they Git
repositories, web servers or the local disk. It is the job of a
\emph{Collector} to assemble these {RDF}~files and forward them for further
processing. This may involve cloning a Git repository or crawling
the file system.
\item The streams of ULO files assembled by the Collector are then
passed to an \emph{Importer}, which uploads the RDF~streams into some
kind of permanent storage. As we
will see, the GraphDB~\cite{graphdb} triple store was a natural
fit.
\item Finally, with all triplets stored in a database, an
\emph{Endpoint} is where applications access the underlying
knowledge base. This does not necessarily need to be custom
software; rather, the programming interface of the underlying
database server itself can be understood as an Endpoint in its own
right.
\end{itemize}
Collector, Importer and Endpoint provide us with an automated way of
making RDF files available for use within applications. We will now
take a look at the actual implementation created for
\emph{ulo-storage}, beginning with Collector and Importer.
\subsection{Collector and Importer}\label{sec:collector}
We previously described Collector and Importer as two distinct
components. First, a Collector pulls RDF data from various sources as
an input and outputs a stream of standardized RDF data. Second, an
Importer takes such a stream of RDF data and then dumps it to some
sort of persistent storage. In our implementation, both Collector and
Importer ended up as one piece of monolithic software. This does not
need to be the case, but it proved convenient: combining Collector and
Importer forgoes the need for an additional IPC~mechanism between the
two. In addition, neither our Collector nor our Importer is a
particularly complicated piece of software, so there is no pressing
need to force them into separate processes.
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
repository and then passes the checked out working copy to the file
system Collector. Because we found that it is not uncommon for RDF
files to be compressed, our implementation supports on-the-fly
extraction of the gzip~\cite{gzip} and xz~\cite{xz} formats, which can
greatly reduce the required disk space in the collection step.
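To give an impression of how this can be realized, the following
sketch (not the actual Collector code) opens a collected file as an
RDF~stream and transparently decompresses it based on its file
extension; it assumes the Apache Commons Compress library for
xz~support.
\begin{lstlisting}
// Illustrative sketch only: open a collected file as an RDF stream,
// transparently decompressing gzip and xz based on the file extension.
// Assumes the Apache Commons Compress library for xz support.
import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

class RdfFileOpener {
    static InputStream openRdfStream(Path file) throws IOException {
        InputStream raw = new BufferedInputStream(Files.newInputStream(file));
        String name = file.getFileName().toString();
        if (name.endsWith(".gz")) {
            return new GZIPInputStream(raw);         // *.rdf.gz
        } else if (name.endsWith(".xz")) {
            return new XZCompressorInputStream(raw); // *.rdf.xz
        }
        return raw;                                  // plain *.rdf / *.xml
    }
}
\end{lstlisting}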
During development of the Collector, we found that existing exports
from third party mathematical libraries contain RDF syntax errors
that had not been discovered previously. In particular, both Isabelle
and Coq exports contained URIs that do not fit the official syntax
specification~\cite{rfc3986} as they contain illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open
Source~\cite{wikivirtuoso}, which does not strictly check URIs against
the specification; in consequence, these faults were only discovered
now. To tackle this problem, we introduced an on-the-fly correction
step during collection that escapes the URIs in question and then
continues processing. Of course this is only a work-around. Related
bug reports were filed in the respective export projects to ensure
that this extra step will not be necessary in the future.
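To illustrate what such a correction step can look like, the following
sketch percent-encodes characters that must not appear verbatim in a
URI; it is a simplified stand-in for the escaping logic in our
Collector, not a faithful copy of it.
\begin{lstlisting}
// Simplified stand-in for the Collector's URI correction step:
// percent-encode every character that RFC 3986 does not allow to appear
// verbatim in a URI reference (spaces, control characters, non-ASCII
// characters and a handful of forbidden symbols). Note: simplified,
// assumes characters in the Basic Multilingual Plane.
import java.nio.charset.StandardCharsets;

class UriFixer {
    private static final String FORBIDDEN = "\"<>\\^`{|}";

    static String escapeIllegalCharacters(String uri) {
        StringBuilder out = new StringBuilder();
        for (char c : uri.toCharArray()) {
            boolean illegal = c <= 0x20 || c >= 0x7f || FORBIDDEN.indexOf(c) >= 0;
            if (!illegal) {
                out.append(c);
                continue;
            }
            // encode the character as UTF-8, emit one %XX escape per byte
            for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
                out.append(String.format("%%%02X", b & 0xff));
            }
        }
        return out.toString();
    }
}
\end{lstlisting}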
The output of the Collector is a stream of RDF~data. This stream gets
passed to the Importer which imports the encoded RDF triplets into
some kind of persistent storage. In theory, multiple implementations
of this Importer are possible, namely different implementations for
different database backends. As we will see in
Section~\ref{sec:endpoints}, for our project we settled on the GraphDB
triple store alone. The Importer merely needs to make the necessary
API~calls to import the RDF stream into the database. As such, the
import itself is straightforward: our software only needs to upload
the RDF file stream as-is to an HTTP endpoint provided by our GraphDB
instance.
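To make this concrete, the following sketch (an illustration, not our
production code) uploads a collected RDF~stream by POSTing it to the
statements endpoint of a GraphDB repository; the server address and
the repository name \texttt{ulo} are placeholders for the actual
deployment.
\begin{lstlisting}
// Illustrative sketch: push a collected RDF/XML stream into GraphDB by
// POSTing it to the statements endpoint of a repository. The base URL
// and repository name ("ulo") are placeholders.
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class Importer {
    static void importStream(InputStream rdfXml) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:7200/repositories/ulo/statements"))
                .header("Content-Type", "application/rdf+xml")
                .POST(HttpRequest.BodyPublishers.ofInputStream(() -> rdfXml))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException("import failed: " + response.body());
        }
    }
}
\end{lstlisting}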
To review, our combination of Collector and Importer fetches XML~files
from Git repositories, applies on-the-fly decompression and fixes, and
then imports the collected RDF~triplets into persistent database
storage.
Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
command line interface and a graphical web front end. While the
command line interface is only useful for manually starting single
runs, the web interface (Figure~\ref{fig:ss}) allows for more
flexibility. In particular, import jobs can be started either manually
or scheduled to run at fixed intervals. The web interface also
persists error messages and logs.
\input{implementation-screenshots.tex}
\subsubsection{Version Management}
Automated job control leads us to the problem of versioning. In our
current design, given ULO exports~$\mathcal{E}_i$ depend on
original third party libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$
through the workflow of Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
\mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\
\mathcal{L}_2 \rightarrow \; &\mathcal{E}_2 \rightarrow \mathcal{D} \\
&\vdots{} \\
\mathcal{L}_n \rightarrow \; &\mathcal{E}_n \rightarrow \mathcal{D}
\end{align*}
from $n$~individual libraries~$\mathcal{L}_i$ into a single
database storage~$\mathcal{D}$ that is used for querying.
However, we must not ignore that mathematical knowledge is
ever-changing, not static. When a given library~$\mathcal{L}^{t}_i$ at
revision~$t$ gets updated to a new version~$\mathcal{L}^{t+1}_i$, this
change will eventually propagate to the associated export and result
in a new set of RDF triplets~$\mathcal{E}^{t+1}_i$. Our global
database state~$\mathcal{D}$ needs to get updated to match the changes
between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$.
Finding an efficient implementation for this problem is not trivial.
While it should be possible to compute the difference between two
exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and infer the
changes that need to be applied to~$\mathcal{D}$, the large number of
triplets makes this appear infeasible. As this is a problem an
implementer of a greater tetrapodal search system will most likely
encounter, we suggest the following approaches to tackle this
challenge.
One approach is to annotate each triplet in~$\mathcal{D}$ with
versioning information about which particular
export~$\mathcal{E}^{t}_i$ it was derived from. During an import
from~$\mathcal{E}^{s}_i$ into~$\mathcal{D}$, we could (1)~first remove
all triplets in~$\mathcal{D}$ that were derived from the previous
version~$\mathcal{E}^{s-1}_i$ and (2)~then re-import all triplets from
the current version~$\mathcal{E}^{s}_i$. Annotating triplets with
versioning information is an approach that should work, but it does
introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$, where
$n$~is the number of triplets in~$\mathcal{D}$. After all, we need to
annotate each of the $n$~triplets with versioning information,
effectively doubling the required storage space, which is not a very
satisfying solution.
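For illustration, the remove-and-reimport step described above could,
for example, be realized with RDF4J by attaching a named graph (an
RDF4J \emph{context}) to each imported statement as the version
annotation; all identifiers in the sketch below are placeholders
chosen for this example.
\begin{lstlisting}
// Illustration of the remove-and-reimport step. A context (named graph)
// IRI attached to every statement serves as the per-triplet annotation
// recording which export a statement came from. Repository URL, context
// IRI and file name are placeholders.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;
import org.eclipse.rdf4j.rio.RDFFormat;

class Reimporter {
    public static void main(String[] args) throws Exception {
        HTTPRepository repo =
                new HTTPRepository("http://localhost:7200/repositories/ulo");
        try (RepositoryConnection conn = repo.getConnection();
             InputStream export = Files.newInputStream(Path.of("coq-export.rdf"))) {
            IRI context = conn.getValueFactory()
                    .createIRI("https://example.org/exports/coq");
            conn.clear(context);  // (1) remove triplets of the previous version
            conn.add(export, "https://example.org/", RDFFormat.RDFXML,
                    context);     // (2) re-import the current version
        }
    }
}
\end{lstlisting}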
Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
the problems related to updating existing data sets, but also means
that changes in a given library~$\mathcal{L}_i$ take some time to propagate
to~$\mathcal{D}$. Continuing this train of thought, an advanced
version of this approach could forgo the requirement for one single
database storage~$\mathcal{D}$ entirely. Instead of maintaining just
one global database state~$\mathcal{D}$, we suggest experimenting with
dedicated database instances~$\mathcal{D}_i$ for each given
library~$\mathcal{L}_i$. The advantage here is that re-creating a
given database representation~$\mathcal{D}_i$ is fast as
exports~$\mathcal{E}_i$ are comparably small. The disadvantage is that
we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require
the development of some cross-database query mechanism, functionality
for which existing systems currently offer only limited
support~\cite{graphdbnested}.
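One candidate for such a mechanism is SPARQL~1.1 federation, where the
\texttt{SERVICE} keyword delegates parts of a query to other
endpoints; the sketch below shows what a query across two hypothetical
per-library endpoints could look like and makes no claim about how
well it performs in practice.
\begin{lstlisting}
class FederatedQueryExample {
    // Sketch of a federated SPARQL 1.1 query combining two hypothetical
    // per-library endpoints; the endpoint URLs are placeholders.
    static final String QUERY = String.join("\n",
        "SELECT ?s ?p ?o WHERE {",
        "  { SERVICE <http://localhost:7201/repositories/coq> { ?s ?p ?o } }",
        "  UNION",
        "  { SERVICE <http://localhost:7202/repositories/isabelle> { ?s ?p ?o } }",
        "}");
}
\end{lstlisting}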
In summary, we see that versioning is a potential challenge for a
greater tetrapodal search system. While not a pressing issue for
\emph{ulo-storage} now, we consider it a topic of future research.
\subsection{Endpoint}\label{sec:endpoints}
Finally, we need to discuss how \emph{ulo-storage} realizes the
Endpoint. Recall that an Endpoint provides the programming interface
for applications that wish to query our collection of organizational
knowledge. In practice, the choice of Endpoint programming interface
is determined by the choice of database system as the Endpoint is
provided directly by the database system.
In our project, organizational knowledge is formulated as
RDF~triplets. The canonical choice for us is to use a triple store,
that is, a database optimized for storing RDF triplets~\cite{triponto,
tripw3c}. We chose the GraphDB~\cite{graphdb} triple
store. A free version that fits our needs is available
at~\cite{graphdbfree}.
\subsubsection{Transitive Queries}
A notable advantage of GraphDB compared to other systems such as
Virtuoso Open Source~\cite{wikivirtuoso, ulo} is that GraphDB supports
recent versions of the SPARQL query language~\cite{graphdbsparql} and
OWL~Reasoning~\cite{owlspec, graphdbreason}. In particular, this
means that GraphDB offers support for transitive queries as described
in previous work on~ULO~\cite{ulo}. A transitive query is one that,
given a relation~$R$, asks for the transitive closure~$S$ of~$R$
(Figure~\ref{fig:tc}).
\input{implementation-transitive-closure.tex}
In fact, GraphDB supports two approaches for realizing transitive
queries. On one hand, GraphDB supports the
\texttt{owl:TransitiveProperty}~\cite[Section 4.4.1]{owlspec} property
that defines a given predicate~$P$ to be transitive. With $P$~marked
this way, querying the knowledge base is equivalent to querying the
transitive closure of~$P$. This requires transitivity to be hard-coded
into the knowledge base. If we only wish to query the transitive
closure for a given query, we can take advantage of so-called
``property paths''~\cite{paths} which allow us to indicate that a
given predicate~$P$ is to be understood as transitive when
querying. Only during querying is the transitive closure then
evaluated. Either way, GraphDB supports transitive queries without
awkward workarounds necessary in other systems~\cite{ulo}.
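For illustration, a property path query for the transitive closure of
a predicate simply appends the \texttt{+} operator to it; the prefix
and predicate \texttt{ex:dependsOn} in the sketch below are
placeholders rather than actual ULO vocabulary.
\begin{lstlisting}
class TransitiveQueryExample {
    // Sketch of a SPARQL 1.1 property path query: the "+" operator asks
    // for one or more applications of the predicate, i.e. its transitive
    // closure. Prefix, predicate and subject IRI are placeholders.
    static final String TRANSITIVE_QUERY = String.join("\n",
        "PREFIX ex: <https://example.org/vocabulary#>",
        "SELECT ?ancestor WHERE {",
        "  <https://example.org/objects/x> ex:dependsOn+ ?ancestor",
        "}");
}
\end{lstlisting}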
There are multiple approaches to querying the GraphDB triple store,
one based around the standardized SPARQL query language and the other
on the RDF4J Java library. Both approaches have unique advantages.
\subsubsection{SPARQL Endpoint}
Let us first take a look at {SPARQL}, which is a standardized query
language for RDF triplet data~\cite{sparql}. The specification
includes not just syntax and semantics of the language itself, but
also a standardized REST interface~\cite{rest} for querying database
servers.
The SPARQL syntax was inspired by SQL and as such the \texttt{SELECT}
\texttt{WHERE} syntax should be familiar to many software developers.
A simple query that returns all triplets in the store looks like
\begin{lstlisting}
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
\end{lstlisting}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The result of any query is the set of valid substitutions
for the query variables. In this particular case, the database would
return a table of all triplets in the store with columns for
subject~\texttt{?s}, predicate~\texttt{?p} and object~\texttt{?o}.
Probably the biggest advantage is that SPARQL is ubiquitous. As it is
the de facto standard for querying triple stores, lots of
implementations (client and server) as well as documentation are
available~\cite{sparqlbook, sparqlimpls, gosparql}.
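Because the SPARQL protocol is plain HTTP, queries can be issued with
any HTTP client. The following sketch sends a variant of the query
from above to a GraphDB repository and asks for results in the SPARQL
JSON format; the server address and repository name are again
placeholders.
\begin{lstlisting}
// Illustrative sketch: run a SPARQL query against a GraphDB repository
// over HTTP and retrieve the results as SPARQL JSON. The base URL and
// repository name ("ulo") are placeholders.
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class SparqlHttpExample {
    public static void main(String[] args) throws Exception {
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
        String url = "http://localhost:7200/repositories/ulo?query="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder()  // GET by default
                .uri(URI.create(url))
                .header("Accept", "application/sparql-results+json")
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // JSON table of bindings
    }
}
\end{lstlisting}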
\subsubsection{RDF4J Endpoint}
SPARQL is one way of accessing a triple store database. Another
approach is RDF4J, a Java API for interacting with RDF graphs,
implemented based on a superset of the {SPARQL} REST
interface~\cite{rdf4j}. GraphDB is one of the database servers that
support RDF4J; in fact, it is the recommended way of interacting with
GraphDB repositories~\cite{graphdbapi}.
Instead of formulating textual queries, RDF4J allows developers to
query a knowledge base by calling Java library methods. The previous
query that asks for all triplets in the store looks like
\begin{lstlisting}
connection.getStatements(null, null, null);
\end{lstlisting}
in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets that
have matching subject~\texttt{s}, predicate~\texttt{p} and
object~\texttt{o}. Any argument that is \texttt{null} matches any
value; that is, it acts as a query variable to be filled by the call
to \texttt{getStatements}.
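For completeness, the call above needs an open repository connection.
A minimal, self-contained version could look as follows, where the
repository URL is once more a placeholder for our GraphDB Endpoint.
\begin{lstlisting}
// Minimal self-contained version of the query above; the repository URL
// is a placeholder for the GraphDB instance acting as our Endpoint.
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.RepositoryResult;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

class Rdf4jExample {
    public static void main(String[] args) {
        HTTPRepository repo =
                new HTTPRepository("http://localhost:7200/repositories/ulo");
        try (RepositoryConnection connection = repo.getConnection();
             RepositoryResult<Statement> statements =
                     connection.getStatements(null, null, null)) {
            while (statements.hasNext()) {
                System.out.println(statements.next());  // print every triplet
            }
        }
    }
}
\end{lstlisting}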
Using RDF4J does introduce a dependency on the JVM and its
languages. But in practice, we found RDF4J to be quite convenient,
especially for simple queries, as it allows us to formulate everything
in a single programming language rather than mixing programming
language with awkward query strings. We also found it quite helpful to
generate Java classes from OWL~ontologies that contain all definitions
of the ontology as easily accessible constants~\cite{rdf4jgen}. This
provides us with powerful IDE auto completion features during
development of ULO applications.
Summarizing the last two sections, we see that both SPARQL and RDF4J
have unique advantages. While SPARQL is an official W3C~\cite{w3c}
standard and implemented by more database systems, RDF4J can be more
convenient when dealing with JVM-based projects. For
\emph{ulo-storage}, we experimented with both interfaces and chose
whichever seemed more convenient at the time. We recommend that
implementers do the same.
\subsection{Deployment and Availability}
Software needs not only to be developed, but also deployed. To deploy
the combination of Collector, Importer and Endpoint, we use Docker
Compose. Docker itself is a technology for wrapping software into
containers, that is, lightweight virtual machines with a fixed
environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of the overarching
setup is stored in a Docker Compose file that describes the software
stack.
For \emph{ulo-storage}, we provide a single Docker Compose file which
starts three containers, namely (1)~the Collector/Importer web
interface, (2)~a GraphDB instance which provides us with the required
Endpoint and (3)~some test applications that use that Endpoint. All
code for Collector and Importer is available in the
\texttt{ulo-storage-collect} Git repository~\cite{gorepo}. Additional
deployment files, that is Docker Compose configuration and additional
Dockerfiles, are stored in a separate repository~\cite{dockerfilerepo}.
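As a rough illustration, the shape of such a Compose file is sketched
below; the service names, image name and build paths are placeholders
and not the actual configuration from~\cite{dockerfilerepo}.
\begin{lstlisting}
# Schematic sketch of a Docker Compose file for the three containers.
# Service names, image name and build paths are placeholders.
version: "3"
services:
  collector:                  # Collector/Importer web interface
    build: ./ulo-storage-collect
  graphdb:                    # triple store providing the Endpoint
    image: graphdb-placeholder-image
    ports:
      - "7200:7200"
  applications:               # test applications using the Endpoint
    build: ./applications
\end{lstlisting}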
With this, we conclude our discussion of the implementation developed
for the \emph{ulo-storage} project. We designed a system based around
(1)~a Collector which collects RDF triplets from third party sources,
(2)~an Importer which imports these triplets into a GraphDB database
and (3)~an Endpoint provided by GraphDB through which the stored
triplets can be queried in different ways. All of this is easy to
deploy using a single Docker Compose file.
Our concrete implementation is useful insofar as we can use it
to experiment with ULO data sets. But development also provided
insight into (1)~which components this class of system requires and
(2)~which problems need to be solved. One topic we discussed at length
is version management. It is easy to dismiss it in these early stages
of development, but it is without question something to keep in mind.