\section{Implementation}\label{sec:implementation}
One of the two contributions of \emph{ulo-storage} is that we
implemented components for making organizational mathematical
knowledge (formulated as RDF~triplets) queryable. This section first
lays out the individual components required for this task and then
describes some details of the actual implementation for this project.
\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
With RDF files exported and available for download as Git
repositories~\cite{uloisabelle, ulocoq}, we have the goal of making
the underlying data available for use in applications. Let us first
look at a high level overview of all involved
components. Figure~\ref{fig:components} illustrates each component and
the flow of data.
\begin{figure}[]\begin{center}
\includegraphics{figs/components}
\caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
\end{center}\end{figure}
\begin{itemize}
\item ULO triplets are present in various locations, be it Git
repositories, web servers or the local disk. It is the job of a
\emph{Collector} to assemble these {RDF}~files and forward them for further
processing. This may involve cloning a Git repository or crawling
the file system.
\item With streams of ULO files assembled by the Collector, this
data then gets passed to an \emph{Importer}. An Importer uploads
RDF~streams into some kind of permanent storage. As we will see,
the GraphDB~\cite{graphdb} triplet store was a natural fit.
\item Finally, with all triplets stored in a database, an
\emph{Endpoint} is where applications access the underlying
knowledge base. This does not necessarily need to be any custom
software, rather the programming interface of the underlying
database itself can be understood as an endpoint of its own.
\end{itemize}
Collector, Importer and Endpoint provide us with an easy and automated
way of making RDF files available for use within applications. We will
now take a look at the actual implementation created for
\emph{ulo-storage}, beginning with Collector and Importer.
\subsection{Collector and Importer}\label{sec:collector}
We previously described Collector and Importer as two distinct
components. The Collector pulls RDF data from various sources and
outputs a stream of standardized RDF data. The Importer then takes
such a stream and dumps it to some sort of persistent storage. In the
implementation for \emph{ulo-storage}, Collector and Importer ended up
being one piece of monolithic software. This does not need to be the
case, but it proved convenient because (1)~combining Collector and
Importer forgoes the need for an additional IPC~mechanism and
(2)~neither Collector nor Importer is a terribly large piece of
software in itself.
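Even within a single binary, the split between the two components can
be kept explicit. The following Go sketch illustrates one way of doing
so; the interface and function names are our own invention for
illustration and need not match the actual \texttt{ulo-storage-collect}
code.
\begin{lstlisting}
// Hypothetical interfaces illustrating the Collector/Importer
// split; names are for illustration only.
package collector

import "io"

// A Collector assembles RDF files from some source (Git
// repository, local file system, ...) into a single stream.
type Collector interface {
    Collect() (io.ReadCloser, error)
}

// An Importer writes a stream of RDF data to persistent
// storage, for example a GraphDB repository.
type Importer interface {
    Import(rdf io.Reader) error
}

// run wires both components together: whatever the Collector
// produces is handed directly to the Importer.
func run(c Collector, i Importer) error {
    stream, err := c.Collect()
    if err != nil {
        return err
    }
    defer stream.Close()
    return i.Import(stream)
}
\end{lstlisting}
With a structure like this, supporting a new source of RDF files only
requires another \texttt{Collector} implementation; the Importer side
stays untouched.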
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml}, while the Git Collector first clones a Git
repository and then passes the checked out working copy to the file
system Collector. Because it is not uncommon for RDF files to be
compressed, our Collector supports on the fly extraction of the
gzip~\cite{gzip} and xz~\cite{xz} formats, which can greatly reduce the
required disk space in the collection step.
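A minimal sketch of the file system Collector, again in illustrative Go
rather than a verbatim excerpt, could look as follows. It walks a
directory, treats files ending in \texttt{.rdf} or \texttt{.rdf.gz} as
input and decompresses the latter on the fly; xz support works
analogously but requires a third party library and is omitted here.
\begin{lstlisting}
package collector

import (
    "compress/gzip"
    "io"
    "io/fs"
    "os"
    "path/filepath"
    "strings"
)

// crawl walks root and calls handle with a reader for every
// RDF/XML file it finds. Files ending in .gz are decompressed
// on the fly.
func crawl(root string, handle func(io.Reader) error) error {
    return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
        if err != nil || d.IsDir() {
            return err
        }
        if !strings.HasSuffix(path, ".rdf") && !strings.HasSuffix(path, ".rdf.gz") {
            return nil // not an RDF/XML file
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        var r io.Reader = f
        if strings.HasSuffix(path, ".gz") {
            gz, err := gzip.NewReader(f)
            if err != nil {
                return err
            }
            defer gz.Close()
            r = gz
        }
        return handle(r)
    })
}
\end{lstlisting}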
During development of the Collector, we found that existing exports
from third party mathematical libraries contain RDF syntax errors that
had not been discovered previously. In particular, both the Isabelle
and the Coq exports contained URIs that do not fit the official syntax
specification~\cite{rfc3986} as they contain illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open Source, which
does not strictly check URIs against the specification; as a
consequence, these faults were only discovered now. To tackle this
problem, we introduced an on the fly correction step during collection
that escapes the URIs in question and then continues processing. Of
course this is only a work-around. Related bug reports were filed in
the respective export projects to ensure that this extra step will not
be necessary in the future.
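The correction step itself can be as simple as re-serializing each
offending URI with proper percent-encoding. The sketch below relies on
the lenient parser in Go's \texttt{net/url} package; it is meant as an
illustration of the idea, not as the exact fix applied in our code.
\begin{lstlisting}
package collector

import "net/url"

// fixURI repairs URIs that contain characters which are illegal
// according to RFC 3986: parsing is lenient, re-serializing
// percent-encodes the offending characters. URIs that cannot be
// parsed at all are returned unchanged.
func fixURI(raw string) string {
    u, err := url.Parse(raw)
    if err != nil {
        return raw
    }
    return u.String() // e.g. ".../Lemma 1" becomes ".../Lemma%201"
}
\end{lstlisting}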
The output of the Collector is a stream of RDF data. This stream gets
passed to the Importer which imports the encoded RDF triplets into
some kind of persistent storage. The canonical choice for this task is
to use a triple store, that is a database optimized for storing RDF
triplets~\cite{triponto, tripw3c}. For our project, we used the
GraphDB~\cite{graphdb} triple store. A free version that fits our
needs is available at~\cite{graphdbfree}. The import itself is
straight-forward, our software only needs to upload the RDF file
stream as-is to an HTTP endpoint provided by our GraphDB instance.
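The following sketch illustrates this upload step. It assumes a GraphDB
instance listening on \texttt{localhost:7200} with a repository named
\texttt{ulo}, both of which are placeholders, and uses the RDF4J-style
\texttt{/statements} endpoint that GraphDB exposes for adding data.
\begin{lstlisting}
package importer

import (
    "fmt"
    "io"
    "net/http"
)

// upload POSTs a stream of RDF/XML data to a GraphDB repository.
// Host and repository name are placeholders for illustration.
func upload(rdf io.Reader) error {
    const endpoint = "http://localhost:7200/repositories/ulo/statements"
    req, err := http.NewRequest(http.MethodPost, endpoint, rdf)
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/rdf+xml")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode < 200 || resp.StatusCode > 299 {
        return fmt.Errorf("import failed: %s", resp.Status)
    }
    return nil
}
\end{lstlisting}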
\emph{({TODO}: Write down a small comparison of different database
types, triplet stores and implementations. Honestly the main
advantage of GraphDB is that it's easy to set up and import to;
maybe I'll also write an Importer for another DB to show that the
choice of database is not that important.)}
\subsubsection{Scheduling and Version Management}
Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
command line interface as well as a graphical web front end. While the
command line interface is only useful for manually starting single
jobs, the web interface allows scheduling of jobs. In particular, it
allows the user to automate import jobs. For example, it is possible
to schedule an import of a given Git repository every seven days to a
given GraphDB instance.
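Under the hood, such a recurring job needs little more than a timer
that re-runs the Collector/Importer pipeline. A minimal sketch is given
below; the real web front end additionally keeps track of job state and
history.
\begin{lstlisting}
package scheduler

import "time"

// schedule re-runs job at the given interval, for example every
// seven days, reporting errors on errs. A real implementation
// would also support cancellation and keep a job history.
func schedule(interval time.Duration, job func() error, errs chan<- error) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        if err := job(); err != nil {
            errs <- err
        }
    }
}
\end{lstlisting}
A weekly Git import then amounts to calling \texttt{schedule} with an
interval of \texttt{7 * 24 * time.Hour} and a job that runs the
Collector/Importer pipeline for the repository in question.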
Automated job control that regularly imports data from the same
sources leads us to the problem of versioning. ULO
exports~$\mathcal{E}$ depend on an original third party
library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of
Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
\mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D}
\end{align*}
which means that if records in~$\mathcal{L}$ change, this will
probably result in different triplets~$\mathcal{E}$ which in turn
results in a need to update~$\mathcal{D}$. This is non-trivial. As it
stands, \emph{ulo-storage} only knows about what is in~$\mathcal{E}$.
While it should be possible to find out the difference between a new
version of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and
compute the changes that need to be applied to~$\mathcal{D}$, the large
number of triplets makes this appear infeasible. So far, our only
suggestion to solve the problem of changing third party libraries is
to regularly re-create the full data set~$\mathcal{D}$ from scratch,
say every seven days. This circumvents all problems related to
updating existing data sets, but it does mean additional computational
requirements. It also means that changes in~$\mathcal{L}$ take some
time to propagate to~$\mathcal{D}$. If the number of triplets grows by
orders of magnitude, this approach will eventually no longer be
scalable.
\subsection{Endpoints}\label{sec:endpoints}
With ULO triplets imported into the GraphDB triplet store by Collector
and Importer, we now have all data available necessary for querying.
As discussed before, querying from applications happens through an
Endpoint that exposes some kind of {API}. The interesting question
here is probably not so much the implementation of the Endpoint itself,
rather it is the choice of API that can make or break such a project.
There are two approaches to querying the GraphDB triplet store, one
based around the standardized SPARQL query language and the other
around the RDF4J Java library. Both approaches have unique advantages.
\begin{itemize}
\item SPARQL is a standardized query language for RDF triplet
data~\cite{sparql}. The specification includes not just syntax
and semantics of the language itself, but also a standardized
REST interface~\cite{rest} for querying database servers.
\textbf{Syntax} SPARQL is inspired by SQL and as such the
\texttt{SELECT} \texttt{WHERE} syntax should be familiar to many
software developers. A simple query that returns all triplets
in the store looks like
\begin{lstlisting}
SELECT * WHERE { ?s ?p ?o }
\end{lstlisting}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The result of such a query is the set of valid
substitutions for the query variables. In this particular case, the
database would return a table of all triplets in the store, with
columns for subject~\texttt{?s}, predicate~\texttt{?p} and
object~\texttt{?o}.
\textbf{Advantage} Probably the biggest advantage is that SPARQL
is ubiquitous. As it is the de facto standard for querying
triplet stores, lots of implementations and documentation are
available~\cite{sparqlbook, sparqlimpls, gosparql}.
\item RDF4J is a Java API for interacting with triplet stores,
implemented based on a superset of the {SPARQL} REST
interface~\cite{rdf4j}. GraphDB is one of the database
servers that supports RDF4J, in fact it is the recommended way
of interacting with GraphDB repositories~\cite{graphdbapi}.
\textbf{Syntax} Instead of formulating textual queries, RDF4J
allows developers to query a repository by calling Java API
methods. The previous query that requests all triplets in the store
looks like
\begin{lstlisting}
connection.getStatements(null, null, null);
\end{lstlisting}
in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets
that have matching subject~\texttt{s}, predicate~\texttt{p} and
object~\texttt{o}. Any argument that is \texttt{null} matches any
value, i.e.\ it is a query variable to be filled by the call to
\texttt{getStatements}.
\textbf{Advantage} Using RDF4J does introduce a dependency on
the JVM and its languages. But in practice, we found RDF4J to be
quite convenient, especially for simple queries, as it allows us
to formulate everything in a single programming language rather
than mixing programming language with awkward query strings.
We also found it quite helpful to generate Java classes from
OWL~ontologies that contain all definitions of the
ontology~\cite{rdf4jgen}. This provides us with powerful IDE
auto completion features during development of ULO applications.
\end{itemize}
We see that both SPARQL and RDF4J have unique advantages. While SPARQL
is an official W3C standard and implemented by more database systems,
RDF4J can be more convenient when dealing with JVM-based code bases.
For \emph{ulo-storage}, we played around with both interfaces and
chose whichever seemed more convenient at the time. We recommend
that implementors do the same.
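For completeness, the sketch below shows what querying the SPARQL-based
Endpoint can look like from outside the JVM. It is illustrative Go
code; host \texttt{localhost:7200} and repository name \texttt{ulo} are
placeholders, and we assume that the repository URL accepts SPARQL
protocol requests as GraphDB instances do.
\begin{lstlisting}
package endpoint

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

// queryAll asks a GraphDB repository for all triplets using the
// SPARQL protocol over HTTP. Host and repository name are
// placeholders for illustration.
func queryAll() (string, error) {
    const repo = "http://localhost:7200/repositories/ulo"
    params := url.Values{"query": {"SELECT * WHERE { ?s ?p ?o }"}}
    resp, err := http.Get(repo + "?" + params.Encode())
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("query failed: %s", resp.Status)
    }
    body, err := io.ReadAll(resp.Body)
    return string(body), err
}
\end{lstlisting}
Setting an \texttt{Accept} header such as
\texttt{application/sparql-results+json} would additionally let the
caller pick the result serialization.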
\subsection{Deployment and Availability}
\def\gorepo{https://gitlab.cs.fau.de/kissen/ulo-storage-collect}
\def\composerepo{https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose}
Software not only needs to get developed, but also deployed. To deploy
the combination of Collector, Importer and Endpoint, we use Docker
Compose. Docker itself is a technology for wrapping software into
containers, that is lightweight virtual machines with a fixed
environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of such a setup is
stored in a Docker Compose file that describes the tech stack.
For \emph{ulo-storage}, we provide a single Docker Compose file which
starts three containers, namely (1)~the Collector/Importer web
interface, (2)~a database server for that web interface such that it
can persist import jobs and finally (3)~a GraphDB instance which
provides us with the required Endpoint. All code for Collector and
Importer is available in the \texttt{ulo-storage-collect} Git
repository~\cite{gorepo}. Additional deployment files, that is, the
Docker Compose configuration and additional Dockerfiles, are stored in
a separate repository~\cite{dockerfilerepo}.
This concludes our discussion of the implementation developed for the
\emph{ulo-storage} project. We designed a system based around (1)~a
Collector which collects RDF triplets from third party sources, (2)~an
Importer which imports these triplets into a GraphDB database and
(3)~an Endpoint through which applications can query that database. All
of this is easy to deploy using a single Docker Compose file. With this
stack ready for use, we will continue with a look at some interesting
applications and queries built on top of this interface.