Commit a40c5d96 authored by Andreas Schärtl

report: review impl

parent 97199bf4
@@ -61,10 +61,10 @@ is no pressing need to force them into separate processes.
 Our implementation supports two sources for RDF files, namely Git
 repositories and the local file system. The file system Collector
 crawls a given directory on the local machine and looks for
-RDF~XMl~files~\cite{rdfxml} while the Git Collector first clones a Git
+RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
 repository and then passes the checked out working copy to the file
-system Collector. Because we found that is not uncommon for RDF files to be
-compressed, our Collector supports on the fly extraction of
+system Collector. Because we found that it is not uncommon for RDF files
+to be compressed, our implementation supports on the fly extraction of
 gzip~\cite{gzip} and xz~\cite{xz} formats which can greatly reduce the
 required disk space in the collection step.
 
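The on-the-fly extraction described in this hunk can be sketched in a few lines of Python. The function name and the extension-based dispatch below are illustrative assumptions; the commit does not show the project's actual code.

```python
import gzip
import lzma

def open_rdf_stream(path):
    """Open an RDF file for reading, transparently decompressing
    gzip (.gz) and xz (.xz) archives based on the file extension.
    Hypothetical sketch, not the ulo-storage implementation."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    if path.endswith(".xz"):
        return lzma.open(path, "rb")
    return open(path, "rb")
```

Because both `gzip.open` and `lzma.open` return file-like streams, a parser can consume the decompressed bytes directly without a full extraction to disk, which is where the disk-space saving in the collection step comes from.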
@@ -75,7 +75,7 @@ Coq exports contained URIs which does not fit the official syntax
 specification~\cite{rfc3986} as they contained illegal
 characters. Previous work~\cite{ulo} that processed Coq and Isabelle
 exports used database software such as Virtuoso Open
-Source~\cite{wikivirtuoso} which do not properly check URIs according
+Source~\cite{wikivirtuoso} which does not properly check URIs according
 to spec; in consequence these faults were only discovered now. To
 tackle these problems, we introduced on the fly correction steps
 during collection that escape the URIs in question and then continue
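A correction step of this kind can be approximated with the Python standard library. The exact escaping rules used by the project are not shown in the commit, so the set of characters left unescaped below is an assumption.

```python
from urllib.parse import quote

def escape_uri(uri):
    """Percent-encode characters that are illegal in a URI while
    keeping the reserved delimiters of RFC 3986 intact.  A
    simplified, hypothetical sketch of an on-the-fly correction
    step, not the project's actual rule set."""
    return quote(uri, safe=":/?#[]@!$&'()*+,;=%-._~")
```

Already-legal URIs pass through unchanged, while illegal characters such as spaces or non-ASCII letters are percent-encoded so that downstream tooling that checks URIs strictly against the spec accepts them.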
@@ -164,18 +164,19 @@ Another approach is to regularly re-create the full data
 set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
 the problems related to updating existing data sets, but also means
 that changes in a given library~$\mathcal{L}_i$ take some time to propagate
-to~$\mathcal{D}$. Building on this idea, an advanced version of this
-approach could forgo the requirement for one single database
-storage~$\mathcal{D}$ entirely. Instead of maintaining just one global
-database state~$\mathcal{D}$, we suggest experimenting with dedicated
-database instances~$\mathcal{D}_i$ for each given
+to~$\mathcal{D}$. Continuing this train of thought, an advanced
+version of this approach could forgo the requirement for one single
+database storage~$\mathcal{D}$ entirely. Instead of maintaining just
+one global database state~$\mathcal{D}$, we suggest experimenting with
+dedicated database instances~$\mathcal{D}_i$ for each given
 library~$\mathcal{L}_i$. The advantage here is that re-creating a
 given database representation~$\mathcal{D}_i$ is fast as
 exports~$\mathcal{E}_i$ are comparably small. The disadvantage is that
 we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
-\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require the
-development of some cross-database query mechanism, functionality GraphDB
-currently only offers limited support for~\cite{graphdbnested}.
+\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require
+the development of some cross-database query mechanism, functionality
+existing systems currently only offer limited support
+for~\cite{graphdbnested}.
 
 In summary, we see that versioning is a potential challenge for a
 greater tetrapodal search system. While not a pressing issue for
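One candidate for such a cross-database query mechanism is the SPARQL~1.1 \texttt{SERVICE} keyword for federated queries, which lets a single query collect bindings from several endpoints. The repository URLs below are made up for illustration and do not refer to an actual deployment.

```sparql
# Query two hypothetical per-library repositories D1 and D2 in one go;
# the endpoint URLs are illustrative assumptions only.
SELECT ?s ?p ?o WHERE {
  { SERVICE <http://localhost:7200/repositories/D1> { ?s ?p ?o } }
  UNION
  { SERVICE <http://localhost:7200/repositories/D2> { ?s ?p ?o } }
}
```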
@@ -188,7 +189,7 @@ Endpoint. Recall that an Endpoint provides the programming interface
 for applications that wish to query our collection of organizational
 knowledge. In practice, the choice of Endpoint programming interface
 is determined by the choice of database system as the Endpoint is
-provided directly by the database.
+provided directly by the database system.
 
 In our project, organizational knowledge is formulated as
 RDF~triplets. The canonical choice for us is to use a triple store,
@@ -206,7 +207,7 @@ OWL~Reasoning~\cite{owlspec, graphdbreason}. In particular, this
 means that GraphDB offers support for transitive queries as described
 in previous work on~ULO~\cite{ulo}. A transitive query is one that,
 given a relation~$R$, asks for the transitive closure~$S$
-of~$R$~\cite{tc} (Figure~\ref{fig:tc}).
+(Figure~\ref{fig:tc}) of~$R$.
 
 \input{implementation-transitive-closure.tex}
 
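In SPARQL~1.1, such transitive queries can be expressed with property paths, where the \texttt{+} operator matches one or more applications of a predicate, that is, the transitive closure of the corresponding relation. The prefix URI and predicate name below are assumptions for illustration, not taken from the commit.

```sparql
PREFIX ulo: <https://mathhub.info/ulo#>
# All pairs (?x, ?y) such that ?y is reachable from ?x via one or
# more steps of the (hypothetical) ulo:uses predicate, i.e. the
# transitive closure of that relation.
SELECT ?x ?y WHERE {
  ?x ulo:uses+ ?y .
}
```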
@@ -235,14 +236,14 @@ includes not just syntax and semantics of the language itself, but
 also a standardized REST interface~\cite{rest} for querying database
 servers.
 
-SPARQL was inspired by SQL and as such the \texttt{SELECT}
+The SPARQL syntax was inspired by SQL and as such the \texttt{SELECT}
 \texttt{WHERE} syntax should be familiar to many software developers.
 A simple query that returns all triplets in the store looks like
 \begin{lstlisting}
 SELECT * WHERE { ?s ?p ?o }
 \end{lstlisting}
 where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
-variables. The result of any query are valid substitutions for the
+variables. The result of any query is a set of valid substitutions for all
 query variables. In this particular case, the database would return a
 table of all triplets in the store sorted by subject~\texttt{?s},
 predicate~\texttt{?p} and object~\texttt{?o}.
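The idea that query results are substitutions for query variables can be illustrated with a toy in-memory model. This is a conceptual sketch in Python, not how GraphDB actually evaluates SPARQL.

```python
def query_all(triples):
    """Answer the pattern { ?s ?p ?o } over a list of triples by
    returning one substitution (a dict from query variable to value)
    per stored triple.  Toy model of SPARQL evaluation only."""
    return [{"?s": s, "?p": p, "?o": o} for (s, p, o) in triples]

# A tiny hypothetical store with two triples.
store = [
    ("ex:f", "rdf:type", "ulo:function"),
    ("ex:f", "ulo:name", '"f"'),
]
```

Running `query_all(store)` yields one substitution per triple, which is exactly the table of results that `SELECT * WHERE { ?s ?p ?o }` describes.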
@@ -300,7 +301,7 @@ containers, that is lightweight virtual machines with a fixed
 environment for running a given application~\cite[pp. 22]{dockerbook}.
 Docker Compose then is a way of combining individual Docker containers
 to run a full tech stack of application, database server and so
-on~\cite[pp. 42]{dockerbook}. All configuration of the overarching a
+on~\cite[pp. 42]{dockerbook}. All configuration of the overarching
 setup is stored in a Docker Compose file that describes the software
 stack.
 
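Such a Compose file might look as follows. The service names, image tag, build path and port mapping are hypothetical stand-ins; the project's actual deployment files live in a separate repository not shown here.

```yaml
# Hypothetical docker-compose.yml; all names and tags are assumptions,
# not the project's actual configuration.
version: "3"
services:
  graphdb:                        # triple store backing the Endpoint
    image: ontotext/graphdb:free  # image name assumed
    ports:
      - "7200:7200"               # GraphDB's default HTTP port
  importer:                       # imports collected RDF into GraphDB
    build: ./importer             # build path assumed
    depends_on:
      - graphdb
```

A single `docker-compose up` then brings up the database server and the services that depend on it in the right order.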
@@ -313,11 +314,12 @@ code for Collector and Importer is available in the
 deployment files, that is Docker Compose configuration and additional
 Dockerfiles are stored in a separate repository~\cite{dockerfilerepo}.
 
-This concludes our discussion of the implementation developed for the
-\emph{ulo-storage} project. We designed a system based around (1)~a
-Collector which collects RDF triplets from third party sources, (2)~an
-Importer which imports these triplets into a GraphDB database and
-(3)~looked at different ways of querying a GraphDB Endpoint. All of
-this is easy to deploy using a single Docker Compose file. With this
-stack ready for use, we will continue with a look at some interesting
-applications and queries built on top of this infrastructure.
+With this, we conclude our discussion of the implementation developed
+for the \emph{ulo-storage} project. We designed a system based around
+(1)~a Collector which collects RDF triplets from third party sources,
+(2)~an Importer which imports these triplets into a GraphDB database
+and (3)~looked at different ways of querying a GraphDB Endpoint. All
+of this is easy to deploy using a single Docker Compose file. With
+this stack ready for use, we will now continue with a look at some
+interesting applications and queries built on top of this
+infrastructure.