diff --git a/doc/report/implementation.tex b/doc/report/implementation.tex index b60bfbaf0c345000e8e3a568de22c47b5e3c2dee..35d4019e4a25a38a9b7148b794a7b23425fa59eb 100644 --- a/doc/report/implementation.tex +++ b/doc/report/implementation.tex @@ -22,12 +22,12 @@ the flow of data. \begin{itemize} \item ULO triplets are present in various locations, be it Git - repositories, on web servers or the local disk. It is the job of a - \emph{Collecter} to assemble these {RDF}~files and forward them for further + repositories, web servers or the local disk. It is the job of a + \emph{Collector} to assemble these {RDF}~files and forward them for further processing. This may involve cloning a Git repository or crawling the file system. - \item With streams of ULO files assembled by the Collecter, this + \item With streams of ULO files assembled by the Collector, this data then gets passed to an \emph{Importer}. An Importer uploads RDF~streams into some kind of permanent storage. As we will see, the GraphDB~\cite{graphdb} triplet store was a natural fit. @@ -39,36 +39,36 @@ the flow of data. database itself can be understood as an endpoint of its own. \end{itemize} -Collecter, Importer and Endpoint provide us with an easy and automated +Collector, Importer and Endpoint provide us with an easy and automated way of making RDF files available for use within applications. We will now take a look at the actual implementation created for -\emph{ulo-storage}, beginning with the implementation of Collecter and +\emph{ulo-storage}, beginning with the implementation of Collector and Importer. -\subsection{Collecter and Importer}\label{sec:collecter} +\subsection{Collector and Importer}\label{sec:collector} -We previously described Collecter and Importer as two distinct -components. The Collecter pulls RDF data from various sources as an -input and outputs a stream of standardized RDF data while the Importer -takes such a stream of RDF data and then dumps it to some sort of -persistent storage. In the implementation for \emph{ulo-storage}, -both Collecter and Importer ended up being one piece of monolithic -software. This does not need to be the case but proved convenient -because (1)~combining Collecter and Importer forgoes the needs for an -additional IPC~mechanism and (2)~neither Collecter nor Importer are -terribly large pieces of software in themselves. +We previously described Collector and Importer as two distinct +components. The Collector pulls RDF data from various sources as an +input and outputs a stream of standardized RDF data. Then, the +Importer takes such a stream of RDF data and dumps it to some +sort of persistent storage. In the implementation for +\emph{ulo-storage}, both Collector and Importer ended up being one +piece of monolithic software. This does not need to be the case but +proved convenient because (1)~combining Collector and Importer forgoes +the need for an additional IPC~mechanism and (2)~neither Collector +nor Importer are terribly large pieces of software in themselves. Our implementation supports two sources for RDF files, namely Git -repositories and the local file system. The file system Collecter +repositories and the local file system. The file system Collector crawls a given directory on the local machine and looks for -RDF~XMl~files~\cite{rdfxml} while the Git Collecter first clones a Git +RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git repository and then passes the checked out working copy to the file -system Collecter. Because it is not uncommon for RDF files to be -compressed, our Collecter supports on the fly extraction of +system Collector. Because it is not uncommon for RDF files to be +compressed, our Collector supports on-the-fly extraction of gzip~\cite{gzip} and xz~\cite{xz} formats which can greatly reduce the required disk space in the collection step.
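+
+To make the collection step more concrete, the following Go sketch
+shows one way of crawling a directory for RDF files and transparently
+unwrapping gzip and xz compression before handing the stream on for
+further processing. It is meant as an illustration only, not as an
+excerpt from the actual code base; in particular, the function names
+and the choice of the third party \texttt{github.com/ulikunitz/xz}
+package are assumptions made for this sketch.
+
+\begin{verbatim}
+package collect
+
+import (
+        "compress/gzip"
+        "io"
+        "io/fs"
+        "os"
+        "path/filepath"
+        "strings"
+
+        "github.com/ulikunitz/xz" // third party xz decoder (assumption)
+)
+
+// CollectFile opens a single RDF file and passes a decompressed stream
+// of its contents to handle. Whether the file is gzip or xz compressed
+// is decided by its file extension alone.
+func CollectFile(path string, handle func(io.Reader) error) error {
+        f, err := os.Open(path)
+        if err != nil {
+                return err
+        }
+        defer f.Close()
+
+        var r io.Reader = f
+        switch {
+        case strings.HasSuffix(path, ".gz"):
+                if r, err = gzip.NewReader(f); err != nil {
+                        return err
+                }
+        case strings.HasSuffix(path, ".xz"):
+                if r, err = xz.NewReader(f); err != nil {
+                        return err
+                }
+        }
+        return handle(r)
+}
+
+// Crawl walks dir recursively and collects every RDF file found below
+// it, including gzip and xz compressed ones.
+func Crawl(dir string, handle func(io.Reader) error) error {
+        return filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
+                if err != nil || d.IsDir() || !strings.Contains(d.Name(), ".rdf") {
+                        return err
+                }
+                return CollectFile(path, handle)
+        })
+}
+\end{verbatim}
+
+The Git Collector can reuse exactly this code path: after cloning a
+repository, it simply runs the same crawl over the checked out working
+copy.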
-During development of the Collecter, we found that existing exports +During development of the Collector, we found that existing exports from third party mathematical libraries contain RDF syntax errors which were not discovered previously. In particular, both Isabelle and Coq export contained URIs which do not fit the official syntax @@ -76,26 +76,22 @@ specification~\cite{rfc3986} as they contained illegal characters. Previous work~\cite{ulo} that processed Coq and Isabelle exports used database software such as Virtuoso Open Source which do not properly check URIs according to spec, in consequence these faults -were only discovered now. To tackle these problems, we introduced on -the fly correction steps during collection that take the broken RDF -files, fix the mentioned problems related to URIs (by escaping illegal -characters) and then continue processing. Of course this is only a +were only discovered now. To tackle these problems, we introduced +on-the-fly correction steps during collection that escape the URIs in +question and then continue processing. Of course this is only a work-around. Related bug reports were filed in the respective export projects to ensure that in the future this extra step is not necessary. -Our Collecter takes existing RDF files, applies some on the fly -transformations (extraction of compressed files, fixing of errors), -the result is a stream of RDF data. This stream gets passed to the -Importer which imports the encoded RDF triplets into some kind of -persistent storage. The canonical choice for this task is to use a -triple store, that is a database optimized for storing RDF triplets~\cite{triponto, tripw3c}. For our project, we used the -GraphDB~\cite{graphdb} triple store as it is easy to use an a free -version that fits our needs is available~\cite{graphdbfree}. The -import itself is straight-forward, our software only needs to upload -the RDF file stream as-is to an HTTP endpoint provided by our GraphDB -instance. +The output of the Collector is a stream of RDF data. This stream gets +passed to the Importer which imports the encoded RDF triplets into +some kind of persistent storage. The canonical choice for this task is +to use a triple store, that is, a database optimized for storing RDF +triplets~\cite{triponto, tripw3c}. For our project, we used the +GraphDB~\cite{graphdb} triple store. A free version that fits our +needs is available~\cite{graphdbfree}. The import itself is +straightforward: our software only needs to upload the RDF file +stream as-is to an HTTP endpoint provided by our GraphDB instance.
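+
+To make this import step more concrete, the following Go sketch posts
+an RDF~XML stream to a GraphDB repository. GraphDB implements the
+RDF4J REST interface, in which statements are imported by posting them
+to the statements resource of a repository. The base URL, the
+repository name and the file name used here are placeholders and not
+taken from the actual deployment.
+
+\begin{verbatim}
+package main
+
+import (
+        "fmt"
+        "io"
+        "net/http"
+        "os"
+)
+
+// importRDF uploads a single RDF/XML stream into a GraphDB repository
+// by posting it to the statements resource of that repository.
+func importRDF(base, repository string, rdf io.Reader) error {
+        url := fmt.Sprintf("%s/repositories/%s/statements", base, repository)
+        resp, err := http.Post(url, "application/rdf+xml", rdf)
+        if err != nil {
+                return err
+        }
+        defer resp.Body.Close()
+        if resp.StatusCode < 200 || resp.StatusCode > 299 {
+                return fmt.Errorf("import failed: %s", resp.Status)
+        }
+        return nil
+}
+
+func main() {
+        f, err := os.Open("export.rdf") // some RDF file produced by the Collector
+        if err != nil {
+                panic(err)
+        }
+        defer f.Close()
+        if err := importRDF("http://localhost:7200", "ulo", f); err != nil {
+                panic(err)
+        }
+}
+\end{verbatim}
+
+Everything beyond the upload itself, in particular parsing and
+indexing of the triplets, is left to the database.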
\emph{({TODO}: Write down a small comparison of different database types, triplet stores and implementations. Honestly the main @@ -105,7 +101,7 @@ instance. \subsubsection{Scheduling and Version Management} -Collecter and Importer were implemented as library code that can be +Collector and Importer were implemented as library code that can be called from various front ends. For this project, we provide both a command line interface as well as a graphical web front end. While the command line interface is only useful for manually starting single @@ -118,14 +114,14 @@ Automated job control that regularly imports data from the same sources leads us to the problem of versioning. ULO exports~$\mathcal{E}$ depend on an original third party library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of -Collecter and Importer, we get some database +Collector and Importer, we get some database representation~$\mathcal{D}$. We see that data flows \begin{align*} \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D} \end{align*} which means that if records in~$\mathcal{L}$ change, this will probably result in different triplets~$\mathcal{E}$ which in turn -results in a need to update~$\mathcal{D}$. This is difficult. As it +results in a need to update~$\mathcal{D}$. This is non-trivial. As it stands, \emph{ulo-storage} only knows about what is in~$\mathcal{E}$. While it should be possible to find out the difference between a new version of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and @@ -135,14 +131,14 @@ suggestion to solve the problem of changing third party libraries is to regularly re-create the full data set~$\mathcal{D}$ from scratch, say every seven days. This circumvents all problems related to updating existing data sets, but it does mean additional computation -requirements. It also means that changes in~$\mathcal{L}$ takes some +requirements. It also means that changes in~$\mathcal{L}$ take some time to propagate to~$\mathcal{D}$. If the number of triplets raises by orders of magnitude, this approach will eventually not be scalable anymore. \subsection{Endpoints}\label{sec:endpoints} -With ULO triplets imported into the GraphDB triplet store by Collecter +With ULO triplets imported into the GraphDB triplet store by Collector and Importer, we now have all data available necessary for querying. As discussed before, querying from applications happens through an Endpoint that exposes some kind of {API}. The interesting question @@ -222,20 +218,28 @@ implementors to do the same. \def\composerepo{https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose} Software not only needs to get developed, but also deployed. To deploy -the combination of Collecter, Importer and Endpoint, we provide a -single Docker Compose file which starts three containers, namely -(1)~the Collecter/Importer web interface, (2)~a database server for -that web interface such that it can persist import jobs and finally -(3)~a GraphDB instance which provides us with the required -Endpoint. All code for Collecter and Importer is available in the -\texttt{ulo-storage-collect} Git repository\footnote{\url{\gorepo}} -Additional deployment files, that is Docker Compose and additional -Dockerfiles are stored in a separate -repository\footnote{\url{\composerepo}}. +the combination of Collector, Importer and Endpoint, we use Docker +Compose. Docker itself is a technology for wrapping software into +containers, that is, lightweight virtual machines with a fixed +environment for running a given application~\cite[pp. 22]{dockerbook}. +Docker Compose, in turn, is a way of combining individual Docker +containers to run a full tech stack of application, database server +and so on~\cite[pp. 42]{dockerbook}. All configuration of such a setup +is stored in a Docker Compose file that describes the tech stack.
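+
+As an illustration, a Compose file for the kind of three container
+setup described next might look roughly like the following sketch.
+Service names, build contexts, images and ports are placeholders
+rather than the configuration we actually ship; the real Compose file
+and the accompanying Dockerfiles live in the deployment
+repository~\cite{dockerfilerepo}.
+
+\begin{verbatim}
+version: "3"
+services:
+  collector:              # Collector/Importer web front end
+    build: ./collect      # placeholder build context
+    ports:
+      - "8080:8080"       # placeholder port for the web interface
+    depends_on:
+      - jobdb
+      - graphdb
+  jobdb:                  # persists import jobs for the front end
+    image: postgres       # placeholder; any database the front end supports
+    environment:
+      POSTGRES_PASSWORD: changeme
+  graphdb:                # triple store that acts as our Endpoint
+    build: ./graphdb      # e.g. a Dockerfile wrapping GraphDB Free
+    ports:
+      - "7200:7200"       # GraphDB's default HTTP port
+\end{verbatim}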
+ +For \emph{ulo-storage}, we provide a single Docker Compose file which +starts three containers, namely (1)~the Collector/Importer web +interface, (2)~a database server for that web interface such that it +can persist import jobs, and finally (3)~a GraphDB instance which +provides us with the required Endpoint. All code for Collector and +Importer is available in the \texttt{ulo-storage-collect} Git +repository~\cite{gorepo}. Additional deployment files, that is, the +Docker Compose file and additional Dockerfiles, are stored in a +separate repository~\cite{dockerfilerepo}. This concludes our discussion of the implementation developed for the \emph{ulo-storage} project. We designed a system based around (1)~a -Collecter which collects RDF triplets from third party sources, (2)~an +Collector which collects RDF triplets from third party sources, (2)~an Importer which imports these triplets into a GraphDB database and (3)~looked at different ways of querying a GraphDB Endpoint. All of this is easy to deploy using a single Docker Compose file. With this diff --git a/doc/report/references.bib b/doc/report/references.bib index 5e542c5a23e402fe8b27a01ad8999f6fc511d163..9e99f35de0ce7ad42ca440d91cd357db1b57154c 100644 --- a/doc/report/references.bib +++ b/doc/report/references.bib @@ -327,3 +327,26 @@ author={Sloane, Neil JA and others}, year={2003} } + +@online{gorepo, + title = {ULO RDF Collector}, + date = {2020}, + urldate = {2020-09-14}, + url = {https://gitlab.cs.fau.de/kissen/ulo-storage-collect}, + author = {Andreas Schärtl}, +} + +@online{dockerfilerepo, + title = {Supervision Repository}, + date = {2020}, + urldate = {2020-09-14}, + url = {https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose}, + author = {Andreas Schärtl}, +} + +@book{dockerbook, + title = {Docker Orchestration}, + author = {Smith, Randall}, + year = {2017}, + publisher = {Packt Publishing Ltd} +}