Commit 807c00ca authored by Andreas Schärtl's avatar Andreas Schärtl
report: implementation: explain how to implement transitive queries

(The prose needs some more revisions, but the information is there.)
digraph Tree
{
A -> B
A -> E
B -> C
B -> D
E -> F
F -> G
}
digraph TreeTransitive
{
A -> B
A -> C [style=dotted]
A -> D [style=dotted]
A -> E
A -> F [style=dotted]
A -> G [style=dotted]
B -> C
B -> D
E -> F
E -> G [style=dotted]
F -> G
}
\begin{figure*}
\centering
\begin{subfigure}[b]{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figs/tree-simple.pdf}
\caption{We can think of this tree as visualizing a relation~$R$ where
$(X, Y)~\in~R$ iff there is an edge from~$X$ to~$Y$.}
\end{subfigure}
\begin{subfigure}[b]{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figs/tree-transitive.pdf}
\caption{Transitive closure~$S$ of relation~$R$. In addition to
  each tuple from~$R$ (solid edges), $S$~contains further
  transitive edges (dotted lines).}
\end{subfigure}
\caption{Illustrating the idea behind the transitive closure. The
  transitive closure~$S$ of a relation~$R$ is defined as the
  ``minimal transitive relation that contains~$R$''~\cite{tc}.}\label{fig:tc}
\end{figure*}
which were not discovered previously. In particular, both Isabelle and
Coq export contained URIs which do not fit the official syntax
specification~\cite{rfc3986} as they contained illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open
Source~\cite{wikivirtuoso}, which does not properly check URIs against
the specification; as a consequence, these faults were only discovered
now. To address these problems, we introduced on-the-fly correction
steps during collection that escape the URIs in question and then
continue processing. Of course, this is only a work-around; bug
reports were filed with the respective export projects to ensure that
this extra step will not be necessary in the future.
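The correction step can be sketched in Java as follows. This is only a
minimal illustration, assuming a small hard-coded set of offending
characters; the helper name \texttt{escapeUri} is made up and does not
appear in our code base.

\begin{lstlisting}
// Sketch: percent-encode a few characters that are
// illegal in URIs before further processing.
static String escapeUri(String uri) {
    StringBuilder sb = new StringBuilder();
    for (char c : uri.toCharArray()) {
        if (c == ' ' || c == '<' || c == '>' || c == '"') {
            sb.append(String.format("%%%02X", (int) c));
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}
\end{lstlisting}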
The output of the Collector is a stream of RDF~data. This stream gets
passed to the Importer which imports the encoded RDF triplets into
some kind of persistent storage. In theory, multiple implementations
of this Importer are possible, one for each database backend. As we
will see in Section~\ref{sec:endpoints}, for our project we selected
the GraphDB triple store. The Importer merely needs to make the
necessary API~calls to import the RDF stream into the database. As
such, the import itself is straightforward: our software only needs to
upload the RDF file stream as-is to an HTTP endpoint provided by our
GraphDB instance.
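For illustration, such an upload could be performed with a single HTTP
request against the RDF4J-style REST interface exposed by GraphDB.
Host, port, repository name~(\texttt{ulo}) and file name below are
placeholder assumptions, not the actual deployment values:

\begin{lstlisting}
curl -X POST \
     -H "Content-Type: text/turtle" \
     --data-binary @collector-output.ttl \
     http://localhost:7200/repositories/ulo/statements
\end{lstlisting}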
\subsection{Scheduling and Version Management}
whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup
cross-repository query mechanism, something GraphDB currently only
offers limited support for~\cite{graphdbnested}.
\subsection{Endpoint}\label{sec:endpoints}
Finally, we need to discuss how \emph{ulo-storage} realizes the
Endpoint. Recall that the Endpoint provides the programming interface
for systems that wish to query our collection of organizational
knowledge. In practice, the choice of Endpoint programming interface
is determined by the choice of backing database storage.
In our project, organizational knowledge is formulated as
RDF~triplets. The canonical choice for us is to use a triple store,
that is, a database optimized for storing RDF triplets~\cite{triponto,
tripw3c}. For our project, we used the GraphDB~\cite{graphdb} triple
store. A free version that fits our needs is available
at~\cite{graphdbfree}.
\subsubsection{Transitive Queries}
A big advantage of GraphDB over other systems, such as the Virtuoso
Open Source~\cite{wikivirtuoso} server used in previous work related
to the upper level ontology~\cite{ulo}, is that it supports recent
versions of SPARQL~\cite{graphdbsparql} and OWL~reasoning~\cite{owlspec,
  graphdbreason}. In particular, this means that GraphDB offers
support for transitive queries as described in previous
work~\cite{ulo}. A transitive query is one that, given a relation~$R$,
asks for the transitive closure~$S$ of~$R$~\cite{tc}
(Figure~\ref{fig:tc}).
In fact, GraphDB supports two approaches for realizing transitive
queries. On the one hand, GraphDB supports the
\texttt{owl:TransitiveProperty} property, which defines a given
predicate~$P$ to be transitive. With $P$~marked in this way, querying
the knowledge base is equivalent to querying the transitive closure
of~$P$. This, of course, requires transitivity to be hard-coded into
the knowledge base. On the other hand, if we only wish to query the
transitive closure for a single query, we can take advantage of
property paths~\cite{paths}, which allow us to indicate that a given
predicate~$P$ is to be understood as transitive for that query
alone. The transitive closure is then evaluated only during querying.
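As a sketch, the following query uses a property path to ask for the
transitive closure of a predicate. The predicate \texttt{:parentOf}
and node \texttt{:A} are made-up names mirroring Figure~\ref{fig:tc};
the \texttt{+}~operator matches one or more applications of the
predicate:

\begin{lstlisting}
SELECT ?y WHERE { :A :parentOf+ ?y }
\end{lstlisting}

Evaluated against the tree in Figure~\ref{fig:tc}, this query would
return all of \texttt{B} to~\texttt{G}, without any transitive edges
being stored in the knowledge base.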
\input{implementation-transitive-closure.tex}
\subsubsection{SPARQL Endpoint}
SPARQL is a standardized query language for RDF triplet
data~\cite{sparql}. The specification includes not just syntax and
semantics of the language itself, but also a standardized REST
interface~\cite{rest} for querying database servers.
\noindent\textbf{Syntax} SPARQL is inspired by SQL and as such the
\texttt{SELECT} \texttt{WHERE} syntax should be familiar to many
software developers. A simple query that returns all triplets in the
store looks like
\begin{lstlisting}
SELECT * WHERE { ?s ?p ?o }
\end{lstlisting}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The result of a query is the set of valid substitutions for
the query variables. In this particular case, the database would
return a table of all triplets in the store, listed by
subject~\texttt{?s}, predicate~\texttt{?p} and object~\texttt{?o}.
\noindent\textbf{Advantage} Probably the biggest advantage is that SPARQL
is ubiquitous. As it is the de facto standard for querying
triple stores, lots of implementations and documentation are
available~\cite{sparqlbook, sparqlimpls, gosparql}.
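Because the REST interface is standardized, queries can be issued with
nothing but an HTTP client. The following example again assumes a
local GraphDB instance with a placeholder repository named
\texttt{ulo}:

\begin{lstlisting}
curl -G http://localhost:7200/repositories/ulo \
     -H "Accept: application/sparql-results+json" \
     --data-urlencode "query=SELECT * WHERE { ?s ?p ?o }"
\end{lstlisting}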
\subsubsection{RDF4J Endpoint}
RDF4J is a Java API for interacting with triple stores, implemented
based on a superset of the {SPARQL} REST interface~\cite{rdf4j}.
GraphDB is one of the database servers that supports RDF4J, in fact it
is the recommended way of interacting with GraphDB
repositories~\cite{graphdbapi}.
\noindent\textbf{Syntax} Instead of formulating textual queries, RDF4J allows
developers to query a repository by calling Java API methods. The
previous query, which requests all triplets in the store, looks like
\begin{lstlisting}
connection.getStatements(null, null, null);
\end{lstlisting}
in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets that
have matching subject~\texttt{s}, predicate~\texttt{p} and
object~\texttt{o}. Any argument that is \texttt{null} matches any
value, i.e.\ it acts as a query variable to be filled by the call to
\texttt{getStatements}.
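In context, such a call requires an open connection to a
repository. The following sketch connects to a repository over HTTP
(the repository URL is a placeholder) and prints all stored
statements:

\begin{lstlisting}
// Sketch: connect to a remote repository and
// iterate over all stored triplets.
Repository repo = new HTTPRepository(
    "http://localhost:7200/repositories/ulo");
try (RepositoryConnection connection = repo.getConnection()) {
    try (RepositoryResult<Statement> result =
             connection.getStatements(null, null, null)) {
        while (result.hasNext()) {
            System.out.println(result.next());
        }
    }
}
\end{lstlisting}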
\noindent\textbf{Advantage} Using RDF4J does introduce a dependency on the JVM
and its languages. But in practice, we found RDF4J to be quite
convenient, especially for simple queries, as it allows us to
formulate everything in a single programming language rather than
mixing a programming language with awkward query strings.
We also found it quite helpful to generate Java classes from
OWL~ontologies that contain all definitions of the
ontology~\cite{rdf4jgen}. This provides us with powerful IDE auto
completion features during development of ULO applications.
\subsubsection{Endpoints in \emph{ulo-storage}}
We see that both SPARQL and RDF4J have unique advantages. While SPARQL
is an official W3C standard and implemented by more database systems,
@misc{wikivirtuoso,
title={Virtuoso Open-Source Edition},
author={Wiki, Virtuoso Open-Source},
url = {http://vos.openlinksw.com/owiki/wiki/VOS},
urldate = {2020-09-27},
}
@online{tripw3c,
date = {2020},
urldate = {2020-09-23},
url = {http://graphdb.ontotext.com/documentation/standard/nested-repositories.html},
}
@online{graphdbsparql,
title = {SPARQL Compliance},
organization = {Ontotext},
date = {2020},
urldate = {2020-09-27},
url = {http://graphdb.ontotext.com/documentation/standard/sparql-compliance.html},
}
@online{graphdbreason,
title = {Reasoning},
organization = {Ontotext},
date = {2020},
urldate = {2020-09-27},
url = {http://graphdb.ontotext.com/documentation/standard/sparql-compliance.html},
}
@article{owlspec,
title={OWL web ontology language reference},
author={Bechhofer, Sean and Van Harmelen, Frank and Hendler, Jim and Horrocks, Ian and McGuinness, Deborah L and Patel-Schneider, Peter F and Stein, Lynn Andrea and others},
journal={W3C recommendation},
volume={10},
number={02},
year={2004},
url={https://www.w3.org/TR/owl-ref/},
urldate = {2020-09-27},
}
@online{tc,
author = {Weisstein, Eric W.},
title = {Transitive Closure},
urldate = {2020-09-27},
url = {https://mathworld.wolfram.com/TransitiveClosure.html},
}
@online{paths,
organization = {W3C},
year = {2009},
urldate = {2020-09-27},
url = {https://www.w3.org/2009/sparql/wiki/Feature:PropertyPaths},
}