Commit 3ecf7300 authored by Andreas Schärtl

report: review implementation
the flow of data.
\begin{itemize}
\item ULO triplets are present in various locations, be it Git
repositories, web servers or the local disk. It is the job of a
\emph{Collector} to assemble these {RDF}~files and forward them for further
processing. This may involve cloning a Git repository or crawling
the file system.
\item With streams of ULO files assembled by the Collector, this
data then gets passed to an \emph{Importer}. An Importer uploads
RDF~streams into some kind of permanent storage. As we will see,
the GraphDB~\cite{graphdb} triplet store was a natural fit.
\item Finally, an \emph{Endpoint} provides an {API} through which
applications can query the stored data set; depending on the setup, the
database itself can be understood as an endpoint of its own.
\end{itemize}

Collector, Importer and Endpoint provide us with an easy and automated
way of making RDF files available for use within applications. We will
now take a look at the actual implementation created for
\emph{ulo-storage}, beginning with the implementation of Collector and
Importer.
\subsection{Collector and Importer}\label{sec:collector}
We previously described Collector and Importer as two distinct
components. The Collector pulls RDF data from various sources as an
input and outputs a stream of standardized RDF data. Then, the
Importer takes such a stream of RDF data and dumps it to some
sort of persistent storage. In the implementation for
\emph{ulo-storage}, both Collector and Importer ended up being one
piece of monolithic software. This does not need to be the case but
proved convenient because (1)~combining Collector and Importer forgoes
the need for an additional IPC~mechanism and (2)~neither Collector
nor Importer is a terribly large piece of software in itself.
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
repository and then passes the checked out working copy to the file
system Collector. Because it is not uncommon for RDF files to be
compressed, our Collector supports on the fly extraction of the
gzip~\cite{gzip} and xz~\cite{xz} formats which can greatly reduce the
required disk space in the collection step.
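
To make the collection step more concrete, the following is a minimal
sketch of such a file system crawl, written in Go. Note that this is
only an illustration, not the actual \emph{ulo-storage} code: all
names are made up for this example, and xz decompression is assumed
to come from a third party package such as
\texttt{github.com/ulikunitz/xz} as the Go standard library only
ships gzip support.

\begin{verbatim}
package collect

import (
  "compress/gzip"
  "io"
  "os"
  "path/filepath"
  "strings"

  "github.com/ulikunitz/xz"
)

// CrawlDirectory walks root and calls handle with a reader for
// every RDF/XML file it finds, extracting compressed files on
// the fly.
func CrawlDirectory(root string, handle func(io.Reader) error) error {
  return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
    if err != nil || info.IsDir() {
      return err
    }
    f, err := os.Open(path)
    if err != nil {
      return err
    }
    defer f.Close()

    var r io.Reader = f
    switch {
    case strings.HasSuffix(path, ".rdf.gz"):
      if r, err = gzip.NewReader(f); err != nil {
        return err
      }
    case strings.HasSuffix(path, ".rdf.xz"):
      if r, err = xz.NewReader(f); err != nil {
        return err
      }
    case !strings.HasSuffix(path, ".rdf"):
      return nil // not an RDF file, skip
    }
    return handle(r)
  })
}
\end{verbatim}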
During development of the Collector, we found that existing exports
from third party mathematical libraries contain RDF syntax errors
which were not discovered previously. In particular, both the Isabelle
and Coq exports contained URIs which do not fit the official syntax
specification as they contain illegal characters. Previous
work~\cite{ulo} that processed Coq and Isabelle exports used database
software such as Virtuoso Open Source which does not properly check
URIs according to spec; in consequence, these faults were only
discovered now. To tackle these problems, we introduced on the fly
correction steps during collection that escape the URIs in question
and then continue processing. Of course this is only a work-around.
Related bug reports were filed in the respective export projects to
ensure that in the future this extra step is not necessary.
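
As a sketch of what such a correction step can look like, the
following Go function percent-encodes every byte that may not appear
literally in a URI. Again, the names are illustrative and the actual
implementation may differ in detail.

\begin{verbatim}
package collect

import (
  "fmt"
  "strings"
)

// allowed reports whether byte c may appear literally in a URI,
// that is, whether it is an RFC 3986 unreserved or reserved
// character (or the escape character itself).
func allowed(c byte) bool {
  switch {
  case 'a' <= c && c <= 'z', 'A' <= c && c <= 'Z', '0' <= c && c <= '9':
    return true
  }
  return strings.ContainsRune("-._~:/?#[]@!$&'()*+,;=%", rune(c))
}

// escapeURI percent-encodes every byte that must not appear
// literally in a URI, leaving all other bytes untouched.
func escapeURI(uri string) string {
  var b strings.Builder
  for i := 0; i < len(uri); i++ {
    if allowed(uri[i]) {
      b.WriteByte(uri[i])
    } else {
      fmt.Fprintf(&b, "%%%02X", uri[i])
    }
  }
  return b.String()
}
\end{verbatim}

For example, \texttt{escapeURI} maps a URI containing a space such as
\texttt{https://example.com/a b} to \texttt{https://example.com/a\%20b}.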
The output of the Collector is a stream of RDF data. This stream gets
passed to the Importer which imports the encoded RDF triplets into
some kind of persistent storage. The canonical choice for this task is
to use a triple store, that is, a database optimized for storing RDF
triplets~\cite{triponto, tripw3c}. For our project, we used the
GraphDB~\cite{graphdb} triple store; a free version that fits our
needs is available~\cite{graphdbfree}. The import itself is
straight-forward: our software only needs to upload the RDF file
stream as-is to an HTTP endpoint provided by our GraphDB instance.
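
A sketch of this upload step follows, assuming the RDF4J-style REST
interface that GraphDB exposes, where each repository provides a
\texttt{/statements} endpoint; the exact URL layout and repository
name depend on the local setup.

\begin{verbatim}
package collect

import (
  "fmt"
  "io"
  "net/http"
)

// Upload posts a stream of RDF/XML data to the statements
// endpoint of a GraphDB repository, e.g. base URL
// "http://localhost:7200" and repository name "ulo".
func Upload(base, repo string, rdf io.Reader) error {
  url := fmt.Sprintf("%s/repositories/%s/statements", base, repo)
  req, err := http.NewRequest(http.MethodPost, url, rdf)
  if err != nil {
    return err
  }
  req.Header.Set("Content-Type", "application/rdf+xml")

  resp, err := http.DefaultClient.Do(req)
  if err != nil {
    return err
  }
  defer resp.Body.Close()

  if resp.StatusCode != http.StatusNoContent {
    return fmt.Errorf("import failed: %s", resp.Status)
  }
  return nil
}
\end{verbatim}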
\emph{({TODO}: Write down a small comparison of different database
types, triplet stores and implementations. Honestly the main
\subsubsection{Scheduling and Version Management}
Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
command line interface and a graphical web front end. While the
command line interface is only useful for manually starting single

Automated job control that regularly imports data from the same
sources leads us to the problem of versioning. ULO
exports~$\mathcal{E}$ depend on an original third party
library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of
Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
  \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D}
\end{align*}
which means that if records in~$\mathcal{L}$ change, this will
probably result in different triplets~$\mathcal{E}$ which in turn
results in a need to update~$\mathcal{D}$. This is non-trivial. As it
stands, \emph{ulo-storage} only knows about what is in~$\mathcal{E}$.
While it should be possible to find out the difference between a new
version of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and
suggestion to solve the problem of changing third party libraries is
to regularly re-create the full data set~$\mathcal{D}$ from scratch,
say every seven days. This circumvents all problems related to
updating existing data sets, but it does mean additional computation
requirements. It also means that changes in~$\mathcal{L}$ take some
time to propagate to~$\mathcal{D}$. If the number of triplets rises
by orders of magnitude, this approach will eventually not be scalable
anymore.
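
In the simplest case, such a periodic full re-import is just a timer
around the existing pipeline. A minimal sketch, with the hypothetical
\texttt{pipeline} argument standing in for one full Collector and
Importer run:

\begin{verbatim}
package collect

import (
  "log"
  "time"
)

// RunEvery re-runs the full import pipeline in fixed intervals,
// e.g. every seven days, recreating the data set from scratch.
func RunEvery(interval time.Duration, pipeline func() error) {
  for range time.Tick(interval) {
    if err := pipeline(); err != nil {
      log.Printf("scheduled import failed: %v", err)
    }
  }
}
\end{verbatim}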
\subsection{Endpoints}\label{sec:endpoints}
With ULO triplets imported into the GraphDB triplet store by Collector
and Importer, we now have all data necessary for querying available.
As discussed before, querying from applications happens through an
Endpoint that exposes some kind of {API}. The interesting question
implementors to do the same.
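
Regardless of which API an Endpoint exposes, the underlying queries
are formulated in SPARQL. As an illustration, the following Go snippet
sends a query to the read endpoint of a GraphDB repository; the URL
layout again assumes the RDF4J-style REST interface and the names are
made up for this sketch.

\begin{verbatim}
package collect

import (
  "fmt"
  "io"
  "net/http"
  "net/url"
)

// Query sends a SPARQL query to the read endpoint of a GraphDB
// repository and returns the raw response body.
func Query(base, repo, sparql string) (string, error) {
  endpoint := fmt.Sprintf("%s/repositories/%s?query=%s",
    base, repo, url.QueryEscape(sparql))
  resp, err := http.Get(endpoint)
  if err != nil {
    return "", err
  }
  defer resp.Body.Close()
  body, err := io.ReadAll(resp.Body)
  return string(body), err
}
\end{verbatim}

For example, passing the query \texttt{SELECT (COUNT(*) AS ?n) WHERE
\{ ?s ?p ?o \}} asks the store for the total number of triplets it
contains.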
\def\composerepo{https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose}
Software not only needs to get developed, but also deployed. To deploy
the combination of Collector, Importer and Endpoint, we use Docker
Compose. Docker itself is a technology for wrapping software into
containers, that is, lightweight virtual machines with a fixed
environment for running a given application~\cite[pp.~22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp.~42]{dockerbook}. All configuration of such a setup is
stored in a Docker Compose file that describes the tech stack.

For \emph{ulo-storage}, we provide a single Docker Compose file which
starts three containers, namely (1)~the Collector/Importer web
interface, (2)~a database server for that web interface such that it
can persist import jobs and finally (3)~a GraphDB instance which
provides us with the required Endpoint. All code for Collector and
Importer is available in the \texttt{ulo-storage-collect} Git
repository~\cite{gorepo}. Additional deployment files, that is Docker
Compose files and additional Dockerfiles, are stored in a separate
repository~\cite{dockerfilerepo}.
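
A Compose file for this three container setup could look roughly like
the following. This is a simplified sketch, not the actual file from
the deployment repository; service names, images and ports are made up
for illustration.

\begin{verbatim}
version: "3"

services:
  collect:             # Collector/Importer web interface
    build: ./collect
    ports:
      - "8080:8080"
    depends_on:
      - jobdb
      - graphdb

  jobdb:               # persists scheduled import jobs
    image: postgres:12

  graphdb:             # triple store, acts as our Endpoint
    build: ./graphdb
    ports:
      - "7200:7200"
\end{verbatim}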
This concludes our discussion of the implementation developed for the
\emph{ulo-storage} project. We designed a system based around (1)~a
Collector which collects RDF triplets from third party sources and
(2)~an Importer which imports these triplets into a GraphDB database;
we then (3)~looked at different ways of querying a GraphDB
Endpoint. All of this is easy to deploy using a single Docker Compose
file. With this
@online{gorepo,
  title = {ULO RDF Collector},
  author = {Andreas Schärtl},
  date = {2020},
  url = {https://gitlab.cs.fau.de/kissen/ulo-storage-collect},
  urldate = {2020-09-14},
}
@online{dockerfilerepo,
  title = {Supervision Repository},
  author = {Andreas Schärtl},
  date = {2020},
  url = {https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose},
  urldate = {2020-09-14},
}
@book{dockerbook,
  title = {Docker Orchestration},
  author = {Smith, Randall},
  year = {2017},
  publisher = {Packt Publishing Ltd},
}