\section{Implementation}\label{sec:implementation}
One of the two contributions of \emph{ulo-storage} is the
implementation of components that make organizational mathematical
knowledge queryable. This section first identifies the individual
components required for this task and then describes some details
of the actual implementation for this project.
\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
With RDF files exported and available for download as Git repositories
on MathHub, we have the goal of making the underlying data available
for use in applications. Figure~\ref{fig:components} illustrates the
implemented components and their relationships.
\begin{figure}[]\begin{center}
\includegraphics{figs/components}
\caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
\end{center}\end{figure}
\begin{itemize}
\item ULO triplets are present in various locations, be it in Git
repositories, on web servers or on the local disk. It is the job of a
\emph{Collecter} to assemble these {RDF}~files and forward them for further
processing. This may involve cloning a Git repository or crawling
the file system.
\item The streams of ULO files assembled by the Collecter are then
passed to an \emph{Importer}. An Importer uploads
RDF~streams into some kind of permanent storage. For
this project, the GraphDB~\cite{graphdb} triplet store was
a natural fit.

In this project, Collecter and Importer ended up as one
monolithic piece of software, but this does not have to be the case.
\item Finally, with all triplets stored in a database, an
\emph{Endpoint} is where applications access the underlying
knowledge base. This does not necessarily need to be custom
software; the programming interface of the underlying
database itself could be understood as an endpoint of its own.
Regardless, some thought can go into designing an Endpoint as a
layer between application and database that is more
convenient to use than the interface provided by the database. It comes
down to the programming interface we wish to provide to a developer
using this system.
\end{itemize}
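
The division of labor between Collecter and Importer can be sketched in
a few lines of Java. This is only an illustration of the architecture
described above and not the code of \emph{ulo-storage}: the class
\texttt{PipelineSketch}, both interfaces, the method names and the
assumption that RDF files end in \texttt{.rdf} are all hypothetical.

\begin{lstlisting}[language=Java]
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; none of these names are taken from ulo-storage.
class PipelineSketch {
    /** A Collecter locates RDF files, here as paths on the local disk. */
    interface Collecter {
        List<Path> collect() throws IOException;
    }

    /** An Importer uploads one RDF stream into permanent storage. */
    interface Importer {
        void importStream(String name, InputStream rdf) throws IOException;
    }

    /** Collecter that crawls a directory tree for files ending in ".rdf"
     *  (the file extension is an assumption of this sketch). */
    static Collecter fileSystemCollecter(Path root) {
        return () -> {
            try (Stream<Path> files = Files.walk(root)) {
                return files.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".rdf"))
                        .sorted()
                        .collect(Collectors.toList());
            }
        };
    }

    /** Forwards every collected file to the Importer; returns the file count. */
    static int run(Collecter collecter, Importer importer) throws IOException {
        List<Path> files = collecter.collect();
        for (Path file : files) {
            try (InputStream in = Files.newInputStream(file)) {
                importer.importStream(file.toString(), in);
            }
        }
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        // Stand-in Importer that only reports sizes; a real one would
        // upload the stream to the triplet store.
        Importer printer = (name, rdf) ->
                System.out.println(name + ": " + rdf.readAllBytes().length + " bytes");
        System.out.println(run(fileSystemCollecter(root), printer) + " file(s) imported");
    }
}
\end{lstlisting}

Keeping the two roles behind separate interfaces means that a Collecter
for Git repositories or an Importer for a different store could be
swapped in without touching the other side, even if both end up in one
program.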
Collecter, Importer and Endpoint provide us with an easy and automated
way of making RDF files ready for use with applications. We will now
take a look at the actual implementation created for
\emph{ulo-storage}.
\subsection{Collecter}\label{sec:collecter}
\emph{here be dragons}
\subsection{Importer}\label{sec:importer}
\emph{here be dragons}
\subsection{Endpoints}\label{sec:endpoints}
With ULO triplets imported into the GraphDB triplet store by Collecter
and Importer, all data necessary for querying is now available.
As discussed before, querying from applications happens through an
Endpoint that exposes some kind of {API}. The interesting question
here is probably not so much the implementation of the endpoint itself;
rather, it is the choice of API that can make or break such a project.

Among the approaches to querying the GraphDB triplet store, we look at
two: one based on the standardized SPARQL query language and the other
on the RDF4J Java library. Both approaches have unique advantages.
\begin{itemize}
\item SPARQL is a standardized query language for RDF triplet
data~\cite{sparql}. The specification includes not just syntax
and semantics of the language itself, but also a standardized
REST interface for querying database servers.

\textbf{Syntax} SPARQL is inspired by SQL and as such the
\texttt{SELECT} \texttt{WHERE} syntax should be familiar to many
software developers. A simple query that returns all triplets
in the store looks like
\begin{lstlisting}
SELECT * WHERE { ?s ?p ?o }
\end{lstlisting}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The result of a query is the set of valid substitutions
for the query variables. In this particular case, the database would
return a table of all triplets in the store, with one column each for
subject~\texttt{?s}, predicate~\texttt{?p} and
object~\texttt{?o}.
\textbf{Advantage} Probably the biggest advantage is that
SPARQL is ubiquitous. As it is the de facto standard for
querying triplet stores, lots of literature and documentation is
available~\cite{sparqlbook, sparqlimpls, gosparql}.
\item RDF4J is a Java API for interacting with triplet stores,
implemented based on a superset of the {SPARQL} REST
interface~\cite{rdf4j}. GraphDB is one of the database
servers that support RDF4J; in fact, RDF4J is the recommended way
of interacting with GraphDB repositories~\cite{graphdbapi}.

\textbf{Syntax} Instead of formulating textual queries, RDF4J
allows developers to query a repository by calling Java API
methods. The previous query, which requests all triplets in the
store, looks like
\begin{lstlisting}
connection.getStatements(null, null, null);
\end{lstlisting}
in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets
that have matching subject~\texttt{s}, predicate~\texttt{p} and
object~\texttt{o}. Any argument that is \texttt{null} can be
replaced with any value, i.e.\ it is a query variable to be
filled by the call to \texttt{getStatements}.
\textbf{Advantage} Using RDF4J does introduce a dependency on
the JVM and its languages. In practice, however, we found RDF4J to be
quite convenient, especially for simple queries, as it allows us
to formulate everything in a single programming language rather
than mixing a programming language with awkward query strings.
We also found it quite helpful to generate Java classes from
OWL ontologies that contain all definitions of the ontology and
make it readable by any IDE~\cite{rdf4jgen}.
\end{itemize}
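
The way \texttt{null} arguments act as query variables in
\texttt{getStatements} can be modelled in plain Java. The sketch below
is not RDF4J code and makes no claim about how RDF4J is implemented: the
class \texttt{WildcardSketch}, its \texttt{Triple} type with plain
strings and the example identifiers prefixed with \texttt{ex:} are
hypothetical stand-ins chosen for illustration.

\begin{lstlisting}[language=Java]
import java.util.List;
import java.util.stream.Collectors;

// Toy model of pattern matching with null as wildcard; NOT the RDF4J API.
class WildcardSketch {
    /** Minimal stand-in for an RDF statement made of three strings. */
    static final class Triple {
        final String subject, predicate, object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return "(" + subject + " " + predicate + " " + object + ")";
        }
    }

    /** A null pattern component matches anything, otherwise it must be equal. */
    static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }

    /** Returns all triples in the store that match the (s, p, o) pattern. */
    static List<Triple> getStatements(List<Triple> store, String s, String p, String o) {
        return store.stream()
                .filter(t -> matches(s, t.subject)
                        && matches(p, t.predicate)
                        && matches(o, t.object))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up example data.
        List<Triple> store = List.of(
                new Triple("ex:lemma1", "ex:uses", "ex:def1"),
                new Triple("ex:lemma2", "ex:uses", "ex:def1"),
                new Triple("ex:lemma1", "ex:name", "Lemma 1"));
        // All three arguments null: every triple matches.
        System.out.println(getStatements(store, null, null, null));
        // Only the subject is fixed: everything known about ex:lemma1.
        System.out.println(getStatements(store, "ex:lemma1", null, null));
    }
}
\end{lstlisting}

Fixing more arguments narrows the result in the same way that replacing
a query variable with a concrete value does in a SPARQL
\texttt{WHERE} clause.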
We see that both SPARQL and RDF4J have unique advantages. While SPARQL
is an official W3C standard and implemented by more database systems,
RDF4J can be more convenient when dealing with JVM-based code bases.
For \emph{ulo-storage}, we experimented with both interfaces and
chose whichever seemed more convenient at the moment. We recommend that
implementors do the same.
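
Because SPARQL comes with a standardized HTTP interface, a query can
also be sent from Java without any RDF-specific library. The following
sketch builds such a request with the HTTP client of the Java standard
library. The class \texttt{SparqlOverHttp}, the repository name
\texttt{ulo} and the URL \texttt{http://localhost:7200/repositories/ulo}
are assumptions made for this example; the actual endpoint address
depends on the GraphDB installation.

\begin{lstlisting}[language=Java]
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of querying a SPARQL endpoint over HTTP.
class SparqlOverHttp {
    /** Builds a GET request that carries the query in the "query" parameter. */
    static HttpRequest buildRequest(String endpoint, String query) {
        // URLEncoder produces form encoding ("+" for spaces); convert
        // these to "%20" so the value is plainly percent-encoded.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return HttpRequest.newBuilder(URI.create(endpoint + "?query=" + encoded))
                // Ask for SPARQL results serialized as JSON.
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed default; pass the real endpoint URL as first argument.
        String endpoint = args.length > 0
                ? args[0]
                : "http://localhost:7200/repositories/ulo";
        HttpRequest request =
                buildRequest(endpoint, "SELECT * WHERE { ?s ?p ?o } LIMIT 10");
        System.out.println(request.uri());

        // Only contact a server when an endpoint was given explicitly.
        if (args.length > 0) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }
}
\end{lstlisting}

The query travels as a plain string here, which is exactly the mixing of
programming language and query text that the RDF4J approach avoids. In
exchange, the same request should work against any store that
implements the SPARQL protocol.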
\subsection{Deployment}
\emph{here be dragons}