Commit 807c00ca authored by Andreas Schärtl's avatar Andreas Schärtl
report: implementation: explain how to implement transitive queries

(The prose needs some more revisions, but the information is there.)
digraph Tree
{
A -> B
A -> E
B -> C
B -> D
E -> F
F -> G
}
digraph TreeTransitive
{
A -> B
A -> C [style=dotted]
A -> D [style=dotted]
A -> E
A -> F [style=dotted]
A -> G [style=dotted]
B -> C
B -> D
E -> F
E -> G [style=dotted]
F -> G
}
\begin{figure*}
\centering
\begin{subfigure}[b]{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figs/tree-simple.pdf}
\caption{We can think of this tree as visualizing a relation~$R$ where
$(X, Y)~\in~R$ iff there is an edge from~$X$ to~$Y$.}
\end{subfigure}
\begin{subfigure}[b]{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figs/tree-transitive.pdf}
\caption{Transitive closure~$S$ of relation~$R$. In addition to
  each tuple from~$R$ (solid edges), $S$~contains further
  transitive edges (dotted lines).}
\end{subfigure}
\caption{Illustrating the idea behind the transitive closure. The
  transitive closure~$S$ of a relation~$R$ is defined as the
  ``minimal transitive relation that contains~$R$''~\cite{tc}.}\label{fig:tc}
\end{figure*}
which were not discovered previously. In particular, both Isabelle and
Coq export contained URIs which do not fit the official syntax
specification~\cite{rfc3986} as they contained illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open
Source~\cite{wikivirtuoso}, which does not properly check URIs against
the specification; as a consequence, these faults were only discovered
now. To address these problems, we introduced on-the-fly correction
steps during collection that escape the URIs in question and then
continue processing. Of course, this is only a work-around; bug
reports were filed with the respective export projects to ensure that
this extra step will not be necessary in the future.
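The correction step can be sketched in Java as follows. This is only a
minimal illustration, assuming a small hard-coded set of offending
characters; the helper name \texttt{escapeUri} is made up and does not
appear in our code base.

\begin{lstlisting}
// Sketch: percent-encode a few characters that are
// illegal in URIs before further processing.
static String escapeUri(String uri) {
    StringBuilder sb = new StringBuilder();
    for (char c : uri.toCharArray()) {
        if (c == ' ' || c == '<' || c == '>' || c == '"') {
            sb.append(String.format("%%%02X", (int) c));
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}
\end{lstlisting}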
The output of the Collector is a stream of RDF~data. This stream gets
passed to the Importer which imports the encoded RDF triplets into
some kind of persistent storage. In theory, multiple implementations
of this Importer are possible, one for each database backend. As we
will see in Section~\ref{sec:endpoints}, for our project we selected
the GraphDB triple store. The Importer merely needs to make the
necessary API~calls to import the RDF stream into the database. As
such, the import itself is straightforward: our software only needs to
upload the RDF file stream as-is to an HTTP endpoint provided by our
GraphDB instance.
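For illustration, such an upload could be performed with a single HTTP
request against the RDF4J-style REST interface exposed by GraphDB.
Host, port, repository name~(\texttt{ulo}) and file name below are
placeholder assumptions, not the actual deployment values:

\begin{lstlisting}
curl -X POST \
     -H "Content-Type: text/turtle" \
     --data-binary @collector-output.ttl \
     http://localhost:7200/repositories/ulo/statements
\end{lstlisting}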
\subsection{Scheduling and Version Management}
whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup
cross-repository query mechanism, something GraphDB currently only
offers limited support for~\cite{graphdbnested}.
\subsection{Endpoint}\label{sec:endpoints}
Finally, we need to discuss how \emph{ulo-storage} realizes the
Endpoint. Recall that the Endpoint provides the programming interface
for systems that wish to query our collection of organizational
knowledge. In practice, the choice of Endpoint programming interface
is determined by the choice of backing database storage.
In our project, organizational knowledge is formulated as
RDF~triplets. The canonical choice for us is to use a triple store,
that is, a database optimized for storing RDF triplets~\cite{triponto,
tripw3c}. For our project, we used the GraphDB~\cite{graphdb} triple
store. A free version that fits our needs is available
at~\cite{graphdbfree}.
\subsubsection{Transitive Queries}
A big advantage of GraphDB over other systems, such as the Virtuoso
Open Source~\cite{wikivirtuoso} server used in previous work related
to the upper level ontology~\cite{ulo}, is that it supports recent
versions of SPARQL~\cite{graphdbsparql} and OWL~reasoning~\cite{owlspec,
  graphdbreason}. In particular, this means that GraphDB offers
support for transitive queries as described in previous
work~\cite{ulo}. A transitive query is one that, given a relation~$R$,
asks for the transitive closure~$S$ of~$R$~\cite{tc}
(Figure~\ref{fig:tc}).
In fact, GraphDB supports two approaches for realizing transitive
queries. On the one hand, GraphDB supports the
\texttt{owl:TransitiveProperty} property, which defines a given
predicate~$P$ to be transitive. With $P$~marked in this way, querying
the knowledge base is equivalent to querying the transitive closure
of~$P$. This, of course, requires transitivity to be hard-coded into
the knowledge base. On the other hand, if we only wish to query the
transitive closure for a single query, we can take advantage of
property paths~\cite{paths}, which allow us to indicate that a given
predicate~$P$ is to be understood as transitive for that query
alone. The transitive closure is then evaluated only during querying.
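As a sketch, the following query uses a property path to ask for the
transitive closure of a predicate. The predicate \texttt{:parentOf}
and node \texttt{:A} are made-up names mirroring Figure~\ref{fig:tc};
the \texttt{+}~operator matches one or more applications of the
predicate:

\begin{lstlisting}
SELECT ?y WHERE { :A :parentOf+ ?y }
\end{lstlisting}

Evaluated against the tree in Figure~\ref{fig:tc}, this query would
return all of \texttt{B} to~\texttt{G}, without any transitive edges
being stored in the knowledge base.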
\input{implementation-transitive-closure.tex}
\subsubsection{SPARQL Endpoint}
SPARQL is a standardized query language for RDF triplet
data~\cite{sparql}. The specification includes not just syntax and
semantics of the language itself, but also a standardized REST
interface~\cite{rest} for querying database servers.
\noindent\textbf{Syntax} SPARQL is inspired by SQL and as such the
\texttt{SELECT} \texttt{WHERE} syntax should be familiar to many
software developers. A simple query that returns all triplets in the
store looks like
\begin{lstlisting}
SELECT * WHERE { ?s ?p ?o }
\end{lstlisting}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The result of a query is the set of valid substitutions for
the query variables. In this particular case, the database would
return a table of all triplets in the store, listed by
subject~\texttt{?s}, predicate~\texttt{?p} and object~\texttt{?o}.
\noindent\textbf{Advantage} Probably the biggest advantage is that SPARQL
is ubiquitous. As it is the de facto standard for querying
triple stores, lots of implementations and documentation are
available~\cite{sparqlbook, sparqlimpls, gosparql}.
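Because the REST interface is standardized, queries can be issued with
nothing but an HTTP client. The following example again assumes a
local GraphDB instance with a placeholder repository named
\texttt{ulo}:

\begin{lstlisting}
curl -G http://localhost:7200/repositories/ulo \
     -H "Accept: application/sparql-results+json" \
     --data-urlencode "query=SELECT * WHERE { ?s ?p ?o }"
\end{lstlisting}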
\subsubsection{RDF4J Endpoint}
RDF4J is a Java API for interacting with triple stores, implemented
based on a superset of the {SPARQL} REST interface~\cite{rdf4j}.
GraphDB is one of the database servers that supports RDF4J, in fact it
is the recommended way of interacting with GraphDB
repositories~\cite{graphdbapi}.
\noindent\textbf{Syntax} Instead of formulating textual queries, RDF4J allows
developers to query a repository by calling Java API methods. The
previous query, which requests all triplets in the store, looks like
\begin{lstlisting}
connection.getStatements(null, null, null);
\end{lstlisting}
in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets that
have matching subject~\texttt{s}, predicate~\texttt{p} and
object~\texttt{o}. Any argument that is \texttt{null} matches any
value, i.e.\ it acts as a query variable to be filled by the call to
\texttt{getStatements}.
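In context, such a call requires an open connection to a
repository. The following sketch connects to a repository over HTTP
(the repository URL is a placeholder) and prints all stored
statements:

\begin{lstlisting}
// Sketch: connect to a remote repository and
// iterate over all stored triplets.
Repository repo = new HTTPRepository(
    "http://localhost:7200/repositories/ulo");
try (RepositoryConnection connection = repo.getConnection()) {
    try (RepositoryResult<Statement> result =
             connection.getStatements(null, null, null)) {
        while (result.hasNext()) {
            System.out.println(result.next());
        }
    }
}
\end{lstlisting}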
\noindent\textbf{Advantage} Using RDF4J does introduce a dependency on the JVM
and its languages. But in practice, we found RDF4J to be quite
convenient, especially for simple queries, as it allows us to
formulate everything in a single programming language rather than
mixing a programming language with awkward query strings.
We also found it quite helpful to generate Java classes from
OWL~ontologies that contain all definitions of the
ontology~\cite{rdf4jgen}. This provides us with powerful IDE auto
completion features during development of ULO applications.
\subsubsection{Endpoints in \emph{ulo-storage}}
We see that both SPARQL and RDF4J have unique advantages. While SPARQL
is an official W3C standard and implemented by more database systems,
@misc{wikivirtuoso,
title={Virtuoso Open-Source Edition},
author={Wiki, Virtuoso Open-Source},
url = {http://vos.openlinksw.com/owiki/wiki/VOS},
urldate = {2020-09-27},
}
@online{tripw3c,
date = {2020},
urldate = {2020-09-23},
url = {http://graphdb.ontotext.com/documentation/standard/nested-repositories.html},
}
@online{graphdbsparql,
title = {SPARQL Compliance},
organization = {Ontotext},
date = {2020},
urldate = {2020-09-27},
url = {http://graphdb.ontotext.com/documentation/standard/sparql-compliance.html},
}
@online{graphdbreason,
title = {Reasoning},
organization = {Ontotext},
date = {2020},
urldate = {2020-09-27},
url = {http://graphdb.ontotext.com/documentation/standard/sparql-compliance.html},
}
@article{owlspec,
title={OWL web ontology language reference},
author={Bechhofer, Sean and Van Harmelen, Frank and Hendler, Jim and Horrocks, Ian and McGuinness, Deborah L and Patel-Schneider, Peter F and Stein, Lynn Andrea and others},
journal={W3C recommendation},
volume={10},
number={02},
year={2004},
url={https://www.w3.org/TR/owl-ref/},
urldate = {2020-09-27},
}
@online{tc,
author = {Weisstein, Eric W.},
title = {Transitive Closure},
urldate = {2020-09-27},
url = {https://mathworld.wolfram.com/TransitiveClosure.html},
}
@online{paths,
organization = {W3C},
year = {2009},
urldate = {2020-09-27},
url = {https://www.w3.org/2009/sparql/wiki/Feature:PropertyPaths},
}