Commit a40c5d96 authored by Andreas Schärtl

report: review impl

parent 97199bf4
@@ -61,10 +61,10 @@ is no pressing need to force them into separate processes.
 Our implementation supports two sources for RDF files, namely Git
 repositories and the local file system. The file system Collector
 crawls a given directory on the local machine and looks for
-RDF~XMl~files~\cite{rdfxml} while the Git Collector first clones a Git
+RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
 repository and then passes the checked out working copy to the file
-system Collector. Because we found that is not uncommon for RDF files to be
-compressed, our Collector supports on the fly extraction of
+system Collector. Because we found that it is not uncommon for RDF files
+to be compressed, our implementation supports on the fly extraction of
 gzip~\cite{gzip} and xz~\cite{xz} formats which can greatly reduce the
 required disk space in the collection step.
 
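The on-the-fly extraction described in this hunk can be sketched in a few lines of Python. The function name and the extension-based dispatch below are illustrative assumptions; the commit does not show the project's actual code.

```python
import gzip
import lzma

def open_rdf_stream(path):
    """Open an RDF file for reading, transparently decompressing
    gzip (.gz) and xz (.xz) archives based on the file extension.
    Hypothetical sketch, not the ulo-storage implementation."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    if path.endswith(".xz"):
        return lzma.open(path, "rb")
    return open(path, "rb")
```

Because both `gzip.open` and `lzma.open` return file-like streams, a parser can consume the decompressed bytes directly without a full extraction to disk, which is where the disk-space saving in the collection step comes from.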
@@ -75,7 +75,7 @@ Coq exports contained URIs which does not fit the official syntax
 specification~\cite{rfc3986} as they contained illegal
 characters. Previous work~\cite{ulo} that processed Coq and Isabelle
 exports used database software such as Virtuoso Open
-Source~\cite{wikivirtuoso} which do not properly check URIs according
+Source~\cite{wikivirtuoso} which does not properly check URIs according
 to spec; in consequence these faults were only discovered now. To
 tackle these problems, we introduced on the fly correction steps
 during collection that escape the URIs in question and then continue
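A correction step of this kind can be approximated with the Python standard library. The exact escaping rules used by the project are not shown in the commit, so the set of characters left unescaped below is an assumption.

```python
from urllib.parse import quote

def escape_uri(uri):
    """Percent-encode characters that are illegal in a URI while
    keeping the reserved delimiters of RFC 3986 intact.  A
    simplified, hypothetical sketch of an on-the-fly correction
    step, not the project's actual rule set."""
    return quote(uri, safe=":/?#[]@!$&'()*+,;=%-._~")
```

Already-legal URIs pass through unchanged, while illegal characters such as spaces or non-ASCII letters are percent-encoded so that downstream tooling that checks URIs strictly against the spec accepts them.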
@@ -164,18 +164,19 @@ Another approach is to regularly re-create the full data
 set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
 the problems related to updating existing data sets, but also means
 that changes in a given library~$\mathcal{L}_i$ take some time to propagate
-to~$\mathcal{D}$. Building on this idea, an advanced version of this
-approach could forgo the requirement for one single database
-storage~$\mathcal{D}$ entirely. Instead of maintaining just one global
-database state~$\mathcal{D}$, we suggest experimenting with dedicated
-database instances~$\mathcal{D}_i$ for each given
+to~$\mathcal{D}$. Continuing this train of thought, an advanced
+version of this approach could forgo the requirement for one single
+database storage~$\mathcal{D}$ entirely. Instead of maintaining just
+one global database state~$\mathcal{D}$, we suggest experimenting with
+dedicated database instances~$\mathcal{D}_i$ for each given
 library~$\mathcal{L}_i$. The advantage here is that re-creating a
 given database representation~$\mathcal{D}_i$ is fast as
 exports~$\mathcal{E}_i$ are comparably small. The disadvantage is that
 we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
-\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require the
-development of some cross-database query mechanism, functionality GraphDB
-currently only offers limited support for~\cite{graphdbnested}.
+\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require
+the development of some cross-database query mechanism, functionality
+existing systems currently only offer limited support
+for~\cite{graphdbnested}.
 
 In summary, we see that versioning is a potential challenge for a
 greater tetrapodal search system. While not a pressing issue for
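One candidate for such a cross-database query mechanism is the SPARQL~1.1 \texttt{SERVICE} keyword for federated queries, which lets a single query collect bindings from several endpoints. The repository URLs below are made up for illustration and do not refer to an actual deployment.

```sparql
# Query two hypothetical per-library repositories D1 and D2 in one go;
# the endpoint URLs are illustrative assumptions only.
SELECT ?s ?p ?o WHERE {
  { SERVICE <http://localhost:7200/repositories/D1> { ?s ?p ?o } }
  UNION
  { SERVICE <http://localhost:7200/repositories/D2> { ?s ?p ?o } }
}
```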
@@ -188,7 +189,7 @@ Endpoint. Recall that an Endpoint provides the programming interface
 for applications that wish to query our collection of organizational
 knowledge. In practice, the choice of Endpoint programming interface
 is determined by the choice of database system as the Endpoint is
-provided directly by the database.
+provided directly by the database system.
 
 In our project, organizational knowledge is formulated as
 RDF~triplets. The canonical choice for us is to use a triple store,
@@ -206,7 +207,7 @@ OWL~Reasoning~\cite{owlspec, graphdbreason}. In particular, this
 means that GraphDB offers support for transitive queries as described
 in previous work on~ULO~\cite{ulo}. A transitive query is one that,
 given a relation~$R$, asks for the transitive closure~$S$
-of~$R$~\cite{tc} (Figure~\ref{fig:tc}).
+(Figure~\ref{fig:tc}) of~$R$.
 
 \input{implementation-transitive-closure.tex}
 
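In SPARQL~1.1, such transitive queries can be expressed with property paths, where the \texttt{+} operator matches one or more applications of a predicate, that is, the transitive closure of the corresponding relation. The prefix URI and predicate name below are assumptions for illustration, not taken from the commit.

```sparql
PREFIX ulo: <https://mathhub.info/ulo#>
# All pairs (?x, ?y) such that ?y is reachable from ?x via one or
# more steps of the (hypothetical) ulo:uses predicate, i.e. the
# transitive closure of that relation.
SELECT ?x ?y WHERE {
  ?x ulo:uses+ ?y .
}
```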
@@ -235,14 +236,14 @@ includes not just syntax and semantics of the language itself, but
 also a standardized REST interface~\cite{rest} for querying database
 servers.
 
-SPARQL was inspired by SQL and as such the \texttt{SELECT}
+The SPARQL syntax was inspired by SQL and as such the \texttt{SELECT}
 \texttt{WHERE} syntax should be familiar to many software developers.
 A simple query that returns all triplets in the store looks like
 \begin{lstlisting}
 SELECT * WHERE { ?s ?p ?o }
 \end{lstlisting}
 where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
-variables. The result of any query are valid substitutions for the
+variables. The result of any query is a set of valid substitutions for all
 query variables. In this particular case, the database would return a
 table of all triplets in the store sorted by subject~\texttt{?s},
 predicate~\texttt{?p} and object~\texttt{?o}.
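The idea that query results are substitutions for query variables can be illustrated with a toy in-memory model. This is a conceptual sketch in Python, not how GraphDB actually evaluates SPARQL.

```python
def query_all(triples):
    """Answer the pattern { ?s ?p ?o } over a list of triples by
    returning one substitution (a dict from query variable to value)
    per stored triple.  Toy model of SPARQL evaluation only."""
    return [{"?s": s, "?p": p, "?o": o} for (s, p, o) in triples]

# A tiny hypothetical store with two triples.
store = [
    ("ex:f", "rdf:type", "ulo:function"),
    ("ex:f", "ulo:name", '"f"'),
]
```

Running `query_all(store)` yields one substitution per triple, which is exactly the table of results that `SELECT * WHERE { ?s ?p ?o }` describes.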
@@ -300,7 +301,7 @@ containers, that is lightweight virtual machines with a fixed
 environment for running a given application~\cite[pp. 22]{dockerbook}.
 Docker Compose then is a way of combining individual Docker containers
 to run a full tech stack of application, database server and so
-on~\cite[pp. 42]{dockerbook}. All configuration of the overarching a
+on~\cite[pp. 42]{dockerbook}. All configuration of the overarching
 setup is stored in a Docker Compose file that describes the software
 stack.
 
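Such a Compose file might look as follows. The service names, image tag, build path and port mapping are hypothetical stand-ins; the project's actual deployment files live in a separate repository not shown here.

```yaml
# Hypothetical docker-compose.yml; all names and tags are assumptions,
# not the project's actual configuration.
version: "3"
services:
  graphdb:                        # triple store backing the Endpoint
    image: ontotext/graphdb:free  # image name assumed
    ports:
      - "7200:7200"               # GraphDB's default HTTP port
  importer:                       # imports collected RDF into GraphDB
    build: ./importer             # build path assumed
    depends_on:
      - graphdb
```

A single `docker-compose up` then brings up the database server and the services that depend on it in the right order.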
@@ -313,11 +314,12 @@ code for Collector and Importer is available in the
 deployment files, that is Docker Compose configuration and additional
 Dockerfiles are stored in a separate repository~\cite{dockerfilerepo}.
 
-This concludes our discussion of the implementation developed for the
-\emph{ulo-storage} project. We designed a system based around (1)~a
-Collector which collects RDF triplets from third party sources, (2)~an
-Importer which imports these triplets into a GraphDB database and
-(3)~looked at different ways of querying a GraphDB Endpoint. All of
-this is easy to deploy using a single Docker Compose file. With this
-stack ready for use, we will continue with a look at some interesting
-applications and queries built on top of this infrastructure.
+With this, we conclude our discussion of the implementation developed
+for the \emph{ulo-storage} project. We designed a system based around
+(1)~a Collector which collects RDF triplets from third party sources,
+(2)~an Importer which imports these triplets into a GraphDB database
+and (3)~looked at different ways of querying a GraphDB Endpoint. All
+of this is easy to deploy using a single Docker Compose file. With
+this stack ready for use, we will now continue with a look at some
+interesting applications and queries built on top of this
+infrastructure.