Commit 7e38b25d authored by Andreas Schärtl

report: review implementation

parent 05f78daa
......@@ -18,7 +18,7 @@
data sets take advantage of the \texttt{dcterms}
namespace~\cite{dcowl}.}\label{fig:q2a}
\end{subfigure}
\vspace{8mm}
\begin{subfigure}[]{0.9\textwidth}
\begin{lstlisting}
......@@ -39,6 +39,7 @@
of references. The idea is that works which were referenced more
often are more important.}\label{fig:q2b}
\end{subfigure}
\vspace{8mm}
\begin{subfigure}[]{0.9\textwidth}
\large
......@@ -48,8 +49,8 @@
1 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Group?...} & 17 \\
2 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Group?...} & 17 \\
3 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Group?...} & 11 \\
% 4 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Ring?...} & 11 \\
% 5 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Group?...} & 10 \\
$\vdots$ & $\vdots$ & $\vdots$ \\ \bottomrule
\end{tabular*}
\caption{Result of query from Figure~\ref{fig:q2b}. In this particular case
......
......@@ -12,7 +12,7 @@
\includegraphics[width=\textwidth]{figs/tree-transitive.pdf}
\caption{Transitive closure~$S$ of relation~$R$. In addition to
each tuple from~$R$ (solid edges), $S$~also contains
transitive (dotted) edges.}
\end{subfigure}
\caption{Illustrating the idea behind transitive closures. A
transitive closure~$S$ of relation~$R$ is defined as the
......
......@@ -3,18 +3,15 @@
One of the two contributions of \emph{ulo-storage} is that we
implemented components for making organizational mathematical
knowledge (formulated as RDF~triplets) queryable. This section first
identifies the individual components involved in this task. We then
discuss the actual implementation created for this project.
\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
\FloatBarrier{}
With RDF files exported and available for download as Git
repositories~\cite{uloisabelle, ulocoq}, we have the goal of making
the underlying data available for use in applications.
Figure~\ref{fig:components} illustrates how data flows through the
different components. In total, we identified three components that
make up the infrastructure provided by \emph{ulo-storage}.
\begin{figure}[]\begin{center}
\includegraphics[width=0.9\textwidth]{figs/components.png}
......@@ -28,16 +25,17 @@ the different components.
processing. This may involve cloning a Git repository or crawling
the file system.
\item With streams of ULO files assembled by the Collector, these
streams then get passed to an \emph{Importer}. The Importer then
uploads RDF~streams into some kind of permanent storage. As we
will see, the GraphDB~\cite{graphdb} triple store was a natural
fit.
\item Finally, with all triplets stored in a database, an
\emph{Endpoint} is where applications access the underlying
knowledge base. This does not necessarily need to be custom
software; rather, the programming interface of the underlying
database server itself can be understood as an Endpoint of its own.
\end{itemize}
Collector, Importer and Endpoint provide us with an automated way of
......@@ -49,14 +47,16 @@ Importer.
\subsection{Collector and Importer}\label{sec:collector}
We previously described Collector and Importer as two distinct
components. First, a Collector pulls RDF data from various sources as
an input and outputs a stream of standardized RDF data. Second, an
Importer takes such a stream of RDF data and then dumps it to some
sort of persistent storage. In our implementation, both Collector and
Importer ended up as one piece of monolithic software. This does not
need to be the case but proved convenient, as combining Collector and
Importer forgoes the need for an additional IPC~mechanism between the
two. In addition, neither our Collector nor our Importer is a
particularly complicated piece of software; as such, there is no
pressing need to force them into separate processes.
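
To make this hand-off concrete, the listing below sketches the
interplay of Collector and Importer as two interfaces connected by a
plain in-process call. The listing is hypothetical Java written only
for illustration; the interface and method names are invented and do
not correspond to the actual implementation.

\begin{lstlisting}
import java.io.IOException;
import java.io.InputStream;
import java.util.stream.Stream;

// Hypothetical interfaces that only illustrate the data flow
// between Collector and Importer.
interface Collector {
    // Produce a stream of RDF documents, for example by cloning a
    // Git repository or walking the local file system.
    Stream<InputStream> collect() throws IOException;
}

interface Importer {
    // Write a single RDF document to persistent storage, for
    // example by uploading it to a GraphDB instance.
    void importRdf(InputStream rdf) throws IOException;
}

final class Pipeline {
    // Because Collector and Importer live in one process, the
    // hand-off is a simple method call rather than an IPC mechanism.
    static void run(Collector collector, Importer importer)
            throws IOException {
        collector.collect().forEach(rdf -> {
            try {
                importer.importRdf(rdf);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
}
\end{lstlisting}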
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
......@@ -71,12 +71,12 @@ required disk space in the collection step.
During development of the Collector, we found that existing exports
from third party mathematical libraries contain RDF syntax errors
which were not discovered previously. In particular, both Isabelle and
Coq exports contained URIs which do not fit the official syntax
specification~\cite{rfc3986} as they contained illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open
Source~\cite{wikivirtuoso} which does not properly check URIs according
to spec; in consequence, these faults were only discovered now. To
tackle these problems, we introduced on-the-fly correction steps
during collection that escape the URIs in question and then continue
processing. Of course this is only a work-around. Related bug reports
......@@ -87,12 +87,13 @@ The output of the Collector is a stream of RDF~data. This stream gets
passed to the Importer which imports the encoded RDF triplets into
some kind of persistent storage. In theory, multiple implementations
of this Importer are possible, namely different implementations for
different database backends. As we will see in
Section~\ref{sec:endpoints}, for our project we selected the GraphDB
triple store alone. The Importer merely needs to make the necessary
API~calls to import the RDF stream into the database. As such, the
import itself is straightforward: our software only needs to upload
the RDF file stream as-is to an HTTP endpoint provided by our GraphDB
instance.
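
To give an impression of how small this step is, the following sketch
uses the RDF4J \texttt{HTTPRepository} API to upload a single RDF/XML
stream into a GraphDB repository. The repository URL and file name are
placeholders and the actual Importer may talk to GraphDB's REST
interface directly; the listing only illustrates the idea.

\begin{lstlisting}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;
import org.eclipse.rdf4j.rio.RDFFormat;

public class ImportSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; GraphDB exposes an RDF4J-compatible HTTP
        // interface at /repositories/<id>.
        Repository repo =
            new HTTPRepository("http://localhost:7200/repositories/ulo");

        try (RepositoryConnection conn = repo.getConnection();
             InputStream rdf = Files.newInputStream(Path.of("export.rdf"))) {
            // Upload the RDF/XML stream as-is; the database parses
            // and stores the contained triplets.
            conn.add(rdf, "https://example.org/base/", RDFFormat.RDFXML);
        } finally {
            repo.shutDown();
        }
    }
}
\end{lstlisting}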
To review, our combination of Collector and Importer fetches XML~files
from Git repositories, applies on-the-fly decompression and fixes and
......@@ -105,19 +106,19 @@ Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
command line interface as well as a graphical web front end. While the
command line interface is only useful for manually starting single
runs, the web interface (Figure~\ref{fig:ss}) allows for more
flexibility. In particular, import jobs can be started either manually
or scheduled to run at fixed intervals. The web interface also
persists error messages and logs.
\input{implementation-screenshots.tex}
\subsubsection{Version Management}
Automated job control leads us to the problem of versioning. In our
current design, the given ULO exports~$\mathcal{E}_i$ depend on
original third party libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$
through the workflow of Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
\mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\
......@@ -141,51 +142,53 @@ While it should be possible to compute the difference between two
exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and infer the
changes necessary to be applied to~$\mathcal{D}$, the large number of
triplets makes this appear infeasible. As this is a problem an
implementer of a greater tetrapodal search system will most likely
encounter, we suggest two approaches for tackling this challenge.
One approach is to annotate each triplet in~$\mathcal{D}$ with
versioning information about which particular
export~$\mathcal{E}^{t}_i$ it was derived from. During an import
from~$\mathcal{E}^{s}_i$ into~$\mathcal{D}$, we could (1)~first remove
all triplets in~$\mathcal{D}$ that were derived from the previous
version~$\mathcal{E}^{s-1}_i$ and (2)~then re-import all triplets from
the current version~$\mathcal{E}^{s}_i$. Annotating triplets with
versioning information is an approach that should work, but it does
introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$ where
$n$~is the number of triplets in~$\mathcal{D}$. After all, we need to
annotate each of the $n$~triplets with versioning information,
effectively doubling the required storage space. This is not a very
satisfying solution.
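
A minimal sketch of the resulting remove-and-re-import cycle is given
below. It assumes an RDF4J connection to the triple store and uses one
named graph (context) per library as a stand-in for the per-triplet
version annotation discussed above; the IRIs are placeholders.

\begin{lstlisting}
import java.io.InputStream;

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.rio.RDFFormat;

public class ReimportSketch {
    // Replace all triplets previously derived from a given library
    // with the triplets of its current export. The context IRI is a
    // hypothetical stand-in for the version annotation.
    static void reimport(Repository repo, String libraryId,
                         InputStream export) throws Exception {
        ValueFactory vf = SimpleValueFactory.getInstance();
        IRI context =
            vf.createIRI("https://example.org/exports/" + libraryId);

        try (RepositoryConnection conn = repo.getConnection()) {
            conn.begin();
            conn.clear(context);                 // (1) drop old triplets
            conn.add(export, "https://example.org/base/",
                     RDFFormat.RDFXML, context); // (2) import new ones
            conn.commit();
        }
    }
}
\end{lstlisting}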
Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
the problems related to updating existing data sets, but also means
that changes in a given library~$\mathcal{L}_i$ take some time to propagate
to~$\mathcal{D}$. Building on this idea, an advanced version of this
approach could forgo the requirement for one single database
storage~$\mathcal{D}$ entirely. Instead of maintaining just one global
database state~$\mathcal{D}$, we suggest experimenting with dedicated
database instances~$\mathcal{D}_i$ for each given
library~$\mathcal{L}_i$. The advantage here is that re-creating a
given database representation~$\mathcal{D}_i$ is fast as
exports~$\mathcal{E}_i$ are comparably small. The disadvantage is that
we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require the
development of some cross-database query mechanism, functionality GraphDB
currently only offers limited support for~\cite{graphdbnested}.
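
One conceivable building block for such a mechanism is SPARQL~1.1
federation, where a query addressed to one Endpoint pulls in matching
data from another Endpoint via the \texttt{SERVICE} keyword. The
following sketch is purely illustrative: the repository URLs and the
predicate IRI are placeholders, and we did not evaluate how well
GraphDB handles such federated queries in practice.

\begin{lstlisting}
import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

public class FederationSketch {
    public static void main(String[] args) {
        // Query one repository and, via SPARQL 1.1 federation, join
        // against a second one. URLs and predicate are placeholders.
        String query =
            "SELECT ?s ?o WHERE {\n"
            + "  ?s <https://example.org/ulo#uses> ?o .\n"
            + "  SERVICE <http://localhost:7200/repositories/coq> {\n"
            + "    ?o <https://example.org/ulo#uses> ?t .\n"
            + "  }\n"
            + "}";

        Repository repo = new HTTPRepository(
            "http://localhost:7200/repositories/isabelle");
        try (RepositoryConnection conn = repo.getConnection();
             TupleQueryResult result =
                 conn.prepareTupleQuery(QueryLanguage.SPARQL, query)
                     .evaluate()) {
            while (result.hasNext()) {
                System.out.println(result.next());
            }
        } finally {
            repo.shutDown();
        }
    }
}
\end{lstlisting}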
In summary, we see that versioning is a potential challenge for a
greater tetrapodal search system. While not a pressing issue for
\emph{ulo-storage} now, we consider it a topic of future research.
\subsection{Endpoint}\label{sec:endpoints}
Finally, we need to discuss how \emph{ulo-storage} realizes the
Endpoint. Recall that an Endpoint provides the programming interface
for applications that wish to query our collection of organizational
knowledge. In practice, the choice of Endpoint programming interface
is determined by the choice of database system, as the Endpoint is
provided directly by the database.
In our project, organizational knowledge is formulated as
RDF~triplets. The canonical choice for us is to use a triple store,
......@@ -201,9 +204,9 @@ Virtuoso Open Source~\cite{wikivirtuoso, ulo} is that GraphDB supports
recent versions of the SPARQL query language~\cite{graphdbsparql} and
OWL~Reasoning~\cite{owlspec, graphdbreason}. In particular, this
means that GraphDB offers support for transitive queries as described
in previous work on~ULO~\cite{ulo}. A transitive query is one that,
given a relation~$R$, asks for the transitive closure~$S$
of~$R$~\cite{tc} (Figure~\ref{fig:tc}).
\input{implementation-transitive-closure.tex}
......@@ -211,15 +214,15 @@ In fact, GraphDB supports two approaches for realizing transitive
queries. On one hand, GraphDB supports the
\texttt{owl:TransitiveProperty}~\cite[Section 4.4.1]{owlspec} property
that defines a given predicate~$P$ to be transitive. With $P$~marked
this way, querying the knowledge base is equivalent to querying the
transitive closure of~$P$. This requires transitivity to be hard-coded
into the knowledge base. If we only wish to query the transitive
closure for a given query, we can take advantage of so-called
``property paths''~\cite{paths} which allow us to indicate that a
given predicate~$P$ is to be understood as transitive when
querying. Only during querying is the transitive closure then
evaluated. Either way, GraphDB supports transitive queries without
the awkward workarounds necessary in other systems~\cite{ulo}.
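
To give an impression of the second approach, the sketch below issues
a query containing a SPARQL property path; the \texttt{+} operator
asks for one or more applications of the predicate, that is its
transitive closure. The predicate IRI is a placeholder and not an
actual ULO name.

\begin{lstlisting}
import java.util.ArrayList;
import java.util.List;

import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;

public class TransitiveQuerySketch {
    // Return everything reachable from the given subject by following
    // the (placeholder) predicate one or more times, that is its
    // transitive closure.
    static List<BindingSet> reachableFrom(RepositoryConnection conn,
                                          String subject) {
        String query =
            "SELECT ?x WHERE { <" + subject + "> "
            + "<https://example.org/ulo#uses>+ ?x . }";

        List<BindingSet> results = new ArrayList<>();
        try (TupleQueryResult r =
                 conn.prepareTupleQuery(QueryLanguage.SPARQL, query)
                     .evaluate()) {
            while (r.hasNext()) {
                results.add(r.next());
            }
        }
        return results;
    }
}
\end{lstlisting}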
\subsubsection{SPARQL Endpoint}
......@@ -266,9 +269,9 @@ that asks for all triplets in the store looks like
\end{lstlisting}
in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets that
have matching subject~\texttt{s}, predicate~\texttt{p} and
object~\texttt{o}. Any argument that is \texttt{null} can be
substituted with any value; that is, it acts as a query variable to be
filled by the call to \texttt{getStatements}.
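
For completeness, the following self-contained sketch shows how such a
call might be embedded into a small program; the repository URL is
again a placeholder.

\begin{lstlisting}
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.RepositoryResult;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

public class GetStatementsSketch {
    public static void main(String[] args) {
        // Placeholder URL of the GraphDB repository acting as Endpoint.
        Repository repo =
            new HTTPRepository("http://localhost:7200/repositories/ulo");

        try (RepositoryConnection conn = repo.getConnection();
             // All three arguments are null, that is unbound; the
             // call returns every triplet in the store.
             RepositoryResult<Statement> statements =
                 conn.getStatements(null, null, null)) {
            while (statements.hasNext()) {
                System.out.println(statements.next());
            }
        } finally {
            repo.shutDown();
        }
    }
}
\end{lstlisting}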
Using RDF4J does introduce a dependency on the JVM and its
languages. But in practice, we found RDF4J to be quite convenient,
......@@ -281,18 +284,15 @@ provides us with powerful IDE auto completion features during
development of ULO applications.
Summarizing the last two sections, we see that both SPARQL and RDF4J
have unique advantages. While SPARQL is an official W3C~\cite{w3c}
standard and implemented by more database systems, RDF4J can be more
convenient when dealing with JVM-based projects. For
\emph{ulo-storage}, we experimented with both interfaces and chose
whichever seemed more convenient at the time. We recommend that
implementors do the same.
\subsection{Deployment and Availability}
\def\gorepo{https://gitlab.cs.fau.de/kissen/ulo-storage-collect}
\def\composerepo{https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose}
Software not only needs to be developed, but also deployed. To deploy
the combination of Collector, Importer and Endpoint, we use Docker
Compose. Docker itself is a technology for wrapping software into
......@@ -300,8 +300,9 @@ containers, that is lightweight virtual machines with a fixed
environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of the overarching
setup is stored in a Docker Compose file that describes the software
stack.
For \emph{ulo-storage}, we provide a single Docker Compose file which
starts three containers, namely (1)~the Collector/Importer web
......@@ -309,8 +310,8 @@ interface, (2)~a GraphDB instance which provides us with the required
Endpoint and (3)~some test applications that use that Endpoint. All
code for Collector and Importer is available in the
\texttt{ulo-storage-collect} Git repository~\cite{gorepo}. Additional
deployment files, that is, the Docker Compose configuration and
additional Dockerfiles, are stored in a separate
repository~\cite{dockerfilerepo}.
This concludes our discussion of the implementation developed for the
\emph{ulo-storage} project. We designed a system based around (1)~a
......@@ -319,4 +320,4 @@ Importer which imports these triplets into a GraphDB database and
(3)~looked at different ways of querying a GraphDB Endpoint. All of
this is easy to deploy using a single Docker Compose file. With this
stack ready for use, we will continue with a look at some interesting
applications and queries built on top of this infrastructure.
......@@ -437,7 +437,7 @@
}
@misc{afpexport,
title={Making Isabelle Content Accessible in Knowledge Representation Formats},
author={Michael Kohlhase and Florian Rabe and Makarius Wenzel},
year={2020},
eprint={2005.08884},
......@@ -455,3 +455,10 @@
publisher={Frontiers}
}
@online{w3c,
organization = {W3C},
title = {About W3C},
urldate = {2020-10-05},
url = {https://www.w3.org/Consortium/},
}