From 7e38b25dddca24b37cfd6e15c35f804528c44eeb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andreas=20Sch=C3=A4rtl?= <andreas@schaertl.me>
Date: Mon, 5 Oct 2020 14:42:00 +0200
Subject: [PATCH] report: review implementation

---
 doc/report/applications-q3.tex                |   7 +-
 .../implementation-transitive-closure.tex     |   2 +-
 doc/report/implementation.tex                 | 325 +++++++++++++-----
 doc/report/references.bib                     |   9 +-
 4 files changed, 251 insertions(+), 92 deletions(-)

diff --git a/doc/report/applications-q3.tex b/doc/report/applications-q3.tex
index 2fa1c7b..41ee73f 100644
--- a/doc/report/applications-q3.tex
+++ b/doc/report/applications-q3.tex
@@ -18,7 +18,7 @@
       data sets take advantage of the \texttt{dcterms}
       namespace~\cite{dcowl}.}\label{fig:q2a}
   \end{subfigure}
-  \vspace{0.5cm}
+  \vspace{8mm}
 
   \begin{subfigure}[]{0.9\textwidth}
     \begin{lstlisting}
@@ -39,6 +39,7 @@
     of references. The idea is works that were referenced more often are
     more important.}\label{fig:q2b}
   \end{subfigure}
+  \vspace{8mm}
 
   \begin{subfigure}[]{0.9\textwidth}
     \large
@@ -48,8 +49,8 @@
       1 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Group?...} & 17 \\
       2 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Group?...} & 17 \\
       3 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Group?...} & 11 \\
-      4 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Ring?...} & 11 \\
-      5 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Group?...} & 10 \\
+      % 4 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Ring?...} & 11 \\
+      % 5 & \texttt{https://isabelle.in.tum.de?HOL-Algebra.Group?...} & 10 \\
       $\vdots$ & $\vdots$ & $\vdots$ \\ \bottomrule
     \end{tabular*}
     \caption{Result of query from Figure~\ref{fig:q2b}. In this particular case
diff --git a/doc/report/implementation-transitive-closure.tex b/doc/report/implementation-transitive-closure.tex
index 6be0206..fc4ca65 100644
--- a/doc/report/implementation-transitive-closure.tex
+++ b/doc/report/implementation-transitive-closure.tex
@@ -12,7 +12,7 @@
     \includegraphics[width=\textwidth]{figs/tree-transitive.pdf}
     \caption{Transitive closure~$S$ of relation~$R$. Additionally to
       each tuple from~$R$ (solid edges), $S$~also contains additional
-      transitive edges (dotted edges).}
+      transitive (dotted) edges.}
   \end{subfigure}
   \caption{Illustrating the idea behind transitive closures. A
     transitive closure~$S$ of relation~$R$ is defined as the
diff --git a/doc/report/implementation.tex b/doc/report/implementation.tex
index 10f1094..58676a7 100644
--- a/doc/report/implementation.tex
+++ b/doc/report/implementation.tex
@@ -3,18 +3,15 @@
 One of the two contributions of \emph{ulo-storage} is that we
 implemented components for making organizational mathematical
 knowledge (formulated as RDF~triplets) queryable. This section first
-makes out the individual required component for this tasks and then
-describes some details of the actual implementation for this project.
+identifies the individual components involved in this task. We then
+discuss the actual implementation created for this project.
 
 \subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
 \FloatBarrier{}
 
-With RDF files exported and available for download as Git
-repositories~\cite{uloisabelle, ulocoq}, we have the goal of making
-the underlying data available for use in applications. Let us first
-take a high level look at the three components we made out for this
-job. Figure~\ref{fig:components} illustrates how data flows through
-the different components.
+In total, we identified three components that make up the
+infrastructure provided by \emph{ulo-storage}.
+Figure~\ref{fig:components} illustrates how data flows through them.
 
 \begin{figure}[]\begin{center}
     \includegraphics[width=0.9\textwidth]{figs/components.png}
@@ -28,16 +25,17 @@
     processing. This may involve cloning a Git repository or crawling
     the file system.
 
-  \item With streams of ULO files assembled by the Collector, this
-    data then gets passed to an \emph{Importer}. An Importer uploads
-    RDF~streams into some kind of permanent storage. As we will see,
-    the GraphDB~\cite{graphdb} triple store was a natural fit.
+  \item With streams of ULO files assembled by the Collector, these
+    streams are then passed to an \emph{Importer}. The Importer
+    uploads RDF~streams into some kind of permanent storage. As we
+    will see, the GraphDB~\cite{graphdb} triple store was a natural
+    fit.
 
   \item Finally, with all triplets stored in a database, an
     \emph{Endpoint} is where applications access the underlying
     knowledge base. This does not necessarily need to be any custom
     software, rather the programming interface of the underlying
-    database server itself can be understood as an endpoint of its own.
+    database server itself can be understood as an Endpoint of its own.
 \end{itemize}
 
 Collector, Importer and Endpoint provide us with an automated way of
@@ -49,14 +47,52 @@ Importer.
 
 \subsection{Collector and Importer}\label{sec:collector}
 
 We previously described Collector and Importer as two distinct
-components. The Collector pulls RDF data from various sources as an
-input and outputs a stream of standardized RDF data. The Importer
-takes such a stream of RDF data and then dumps it to some sort of
-persistent storage. In our implementation, both Collector and
+components. First, a Collector pulls RDF data from various sources as
+an input and outputs a stream of standardized RDF data. Second, an
+Importer takes such a stream of RDF data and then dumps it to some
+sort of persistent storage. In our implementation, both Collector and
 Importer ended up as one piece of monolithic software. This does not
-need to be the case but proved convenient because combining Collector
-and Importer forgoes the needs for an additional IPC~mechanism
-between Collector and Importer.
+need to be the case but proved convenient as combining Collector and
+Importer forgoes the need for an additional IPC~mechanism between
+the two. In addition, neither our Collector nor our Importer is a
+particularly complicated piece of software; as such, there is no
+pressing need to force them into separate processes.
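+
+Because both components live in one process, the hand-off between
+them can be as simple as passing a stream object, with no IPC
+involved. The following listing sketches this idea in simplified
+Java; the class and method names are illustrative rather than the
+actual interfaces of our implementation, and gzip compression of the
+exports is an assumption made for the sake of the example.
+
+\begin{lstlisting}
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.zip.GZIPInputStream;
+
+class Collector {
+    // Open an RDF export for reading, transparently decompressing
+    // gzip-compressed files on the fly.
+    static InputStream open(Path file) throws IOException {
+        InputStream raw = Files.newInputStream(file);
+        return file.toString().endsWith(".gz")
+            ? new GZIPInputStream(raw) : raw;
+    }
+}
+
+class Importer {
+    // Consume a stream of RDF data, e.g. by uploading it to
+    // persistent storage.
+    static void consume(InputStream rdf) throws IOException {
+        rdf.transferTo(System.out); // placeholder for a real upload
+    }
+}
+\end{lstlisting}
+
+With this design, \texttt{Importer.consume(Collector.open(path))} is
+all the plumbing the two components need.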
 
 Our implementation supports two sources for RDF files, namely Git
 repositories and the local file system. The file system Collector
@@ -71,12 +107,12 @@ required disk space in the collection step.
 During development of the Collector, we found that existing exports
 from third party mathematical libraries contain RDF syntax errors which
 were not discovered previously. In particular, both Isabelle and
 Coq exports contained URIs which do not fit the official syntax
 specification~\cite{rfc3986} as they contained illegal
 characters. Previous work~\cite{ulo} that processed Coq and Isabelle
 exports used database software such as Virtuoso Open
 Source~\cite{wikivirtuoso} which do not properly check URIs according
-to spec, in consequence these faults were only discovered now. To
+to spec; in consequence these faults were only discovered now. To
 tackle these problems, we introduced on the fly correction steps
 during collection that escape the URIs in question and then continue
 processing. Of course this is only a work-around. Related bug reports
@@ -87,12 +123,43 @@ The output of the Collector is a stream of RDF~data. This stream gets
 passed to the Importer which imports the encoded RDF triplets into
 some kind of persistent storage. In theory, multiple implementations
 of this Importer are possible, namely different implementations for
-different database backends. As we will see in Section~\ref{sec:endpoints},
-for our projected we selected the GraphDB triple store. The Importer
-merely needs to make the necessary API~calls to import the RDF stream
-into the database. As such the import itself is straight-forward, our
-software only needs to upload the RDF file stream as-is to an HTTP
-endpoint provided by our GraphDB instance.
+different database backends. As we will see in
+Section~\ref{sec:endpoints}, for our project we selected the GraphDB
+triple store alone. The Importer merely needs to make the necessary
+API~calls to import the RDF stream into the database. As such, the
+import itself is straightforward: our software only needs to upload
+the RDF file stream as-is to an HTTP endpoint provided by our GraphDB
+instance.
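+
+The listing below sketches such an import call. It is a minimal
+example built on the RDF4J REST interface that GraphDB implements;
+the server URL, the repository name \texttt{ulo} and the file name
+are placeholders rather than the configuration of our deployment.
+
+\begin{lstlisting}
+import java.net.URI;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
+import java.net.http.HttpResponse;
+import java.nio.file.Path;
+
+class Import {
+    public static void main(String[] args) throws Exception {
+        // POST the RDF export as-is to the statements endpoint of
+        // the repository; GraphDB speaks the RDF4J REST protocol.
+        HttpRequest req = HttpRequest.newBuilder()
+            .uri(URI.create(
+                "http://localhost:7200/repositories/ulo/statements"))
+            .header("Content-Type", "application/rdf+xml")
+            .POST(HttpRequest.BodyPublishers.ofFile(
+                Path.of("export.rdf")))
+            .build();
+        HttpResponse<Void> res = HttpClient.newHttpClient()
+            .send(req, HttpResponse.BodyHandlers.discarding());
+        System.out.println(res.statusCode()); // 204 on success
+    }
+}
+\end{lstlisting}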
 
 To review, our combination of Collector and Importer fetches XML~files
 from Git repositories, applies on the fly decompression and fixes and
@@ -105,19 +172,19 @@ Collector and Importer were implemented as library code that can be
 called from various front ends. For this project, we provide both a
 command line interface as well as a graphical web front end. While the
 command line interface is only useful for manually starting single
-jobs, the web interface allows scheduling of jobs. Import jobs can be
-started manually or scheduled automatically in a web interface
-(Figure~\ref{fig:ss}) which also keeps history and logs.
+runs, the web interface (Figure~\ref{fig:ss}) allows for more
+flexibility. In particular, import jobs can be started either manually
+or scheduled to run at fixed intervals. The web interface also
+persists error messages and logs.
 
 \input{implementation-screenshots.tex}
 
 \subsubsection{Version Management}
 
-Automated job control that regularly imports data from the same
-sources leads us to the problem of versioning. In our current design,
-multiple ULO exports~$\mathcal{E}_i$ depend on original third party
-libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the
-workflow of Collector and Importer, we get some database
+Automated job control leads us to the problem of versioning. In our
+current design, the ULO exports~$\mathcal{E}_i$ depend on original
+third party libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$
+through the workflow of Collector and Importer, we get some database
 representation~$\mathcal{D}$. We see that data flows
 \begin{align*}
   \mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\
@@ -141,51 +208,80 @@ While it should be possible to compute the difference between two
 exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and infer the
 changes necessary to be applied to~$\mathcal{D}$, the big number of
 triplets makes this appear unfeasible. As this is a problem an
-implementer of a greater tetrapodal search system will encounter, we
-suggest two possible approaches to solving this problem.
+implementer of a greater tetrapodal search system will most likely
+encounter, we suggest the following approaches to tackle this
+challenge.
 
 One approach is to annotate each triplet in~$\mathcal{D}$ with
 versioning information about which particular
 export~$\mathcal{E}^{t}_i$ it was derived from. During an import
-from~$\mathcal{E}^{s}_i$ into~$\mathcal{D}$, we could (1)~first remove all
-triplets in~$\mathcal{D}$ that were derived from the previous version
-of~$\mathcal{E}^{s-1}_i$ and (2)~then re-import all triplets from the current
-version~$\mathcal{E}^{s}_i$. Annotating triplets with versioning
-information is an approach that should work, but it does
+from~$\mathcal{E}^{s}_i$ into~$\mathcal{D}$, we could (1)~first remove
+all triplets in~$\mathcal{D}$ that were derived from the previous
+version~$\mathcal{E}^{s-1}_i$ and (2)~then re-import all triplets from
+the current version~$\mathcal{E}^{s}_i$. Annotating triplets with
+versioning information is an approach that should work, but it does
 introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$ where
-$n$~is the number of triplets in~$\mathcal{D}$. This does mean
-effectively doubling the database storage space, a not very satisfying
-solution.
+$n$~is the number of triplets in~$\mathcal{D}$. After all, we need to
+annotate each of the $n$~triplets with versioning information,
+effectively doubling the required storage space, which is not a very
+satisfying solution.
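+
+One way to implement this annotation is to place all triplets derived
+from one export version into a named graph, a context in RDF4J terms,
+which steps (1)~and (2)~can then address as a unit. The following
+sketch is hypothetical and not part of our implementation; the
+version IRIs in particular are placeholders.
+
+\begin{lstlisting}
+import java.io.InputStream;
+import org.eclipse.rdf4j.model.IRI;
+import org.eclipse.rdf4j.repository.RepositoryConnection;
+import org.eclipse.rdf4j.rio.RDFFormat;
+
+class Versioning {
+    // Replace the triplets derived from the previous version of an
+    // export with those from the current version. Each version
+    // lives in a named graph (context) of its own.
+    static void reimport(RepositoryConnection conn, InputStream rdf,
+                         IRI previous, IRI current) throws Exception {
+        conn.begin();
+        conn.clear(previous);                         // step (1)
+        conn.add(rdf, "", RDFFormat.RDFXML, current); // step (2)
+        conn.commit();
+    }
+}
+\end{lstlisting}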
 
 Another approach is to regularly re-create the full data
 set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
 the problems related to updating existing data sets, but also means
-that changes in a given library~$\mathcal{L}_i$ take some to propagate
-to~$\mathcal{D}$. Building on top of this idea, an advanced version
-of this approach could forgo the requirement of only one single
-database storage~$\mathcal{D}$ entirely. Instead of only maintaining
-one global database state~$\mathcal{D}$, we suggest the use of
-dedicated database instances~$\mathcal{D}_i$ for each given
+that changes in a given library~$\mathcal{L}_i$ take some time to
+propagate to~$\mathcal{D}$. Building on this idea, an advanced version
+of this approach could forgo the requirement for one single database
+storage~$\mathcal{D}$ entirely. Instead of maintaining just one global
+database state~$\mathcal{D}$, we suggest experimenting with dedicated
+database instances~$\mathcal{D}_i$ for each given
 library~$\mathcal{L}_i$. The advantage here is that re-creating a
 given database representation~$\mathcal{D}_i$ is fast as
 exports~$\mathcal{E}_i$ are comparably small. The disadvantage is that
 we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
-\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This requires the
-development of some cross-repository query mechanism, something
-GraphDB currently only offers limited support
-for~\cite{graphdbnested}.
+\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require the
+development of some cross-database query mechanism, functionality for
+which GraphDB currently offers only limited support~\cite{graphdbnested}.
 
 In summary, we see that versioning is a potential challenge for a
 greater tetrapodal search system. While not a pressing issue for
-\emph{ulo-storage} now, we consider this a topic of future research.
+\emph{ulo-storage} now, we consider it a topic of future research.
 
 \subsection{Endpoint}\label{sec:endpoints}
 
 Finally, we need to discuss how \emph{ulo-storage} realizes the
-Endpoint. Recall that the Endpoint provides the programming interface
+Endpoint. Recall that an Endpoint provides the programming interface
 for applications that wish to query our collection of organizational
 knowledge. In practice, the choice of Endpoint programming interface
-is determined by the choice of what database system to use.
+is determined by the choice of database system, as the Endpoint is
+provided directly by the database.
 
 In our project, organizational knowledge is formulated as
 RDF~triplets. The canonical choice for us is to use a triple store,
@@ -201,9 +297,9 @@ Virtuoso Open Source~\cite{wikivirtuoso, ulo} is that GraphDB supports
 recent versions of the SPARQL query language~\cite{graphdbsparql} and
 OWL~Reasoning~\cite{owlspec, graphdbreason}. In particular, this means
 that GraphDB offers support for transitive queries as described
-in previous work~\cite{ulo}. A transitive query is one that, given a
-relation~$R$, asks for the transitive closure~$S$ of~$R$~\cite{tc}
-(Figure~\ref{fig:tc}).
+in previous work on~ULO~\cite{ulo}. A transitive query is one that,
+given a relation~$R$, asks for the transitive closure~$S$
+of~$R$~\cite{tc} (Figure~\ref{fig:tc}).
 
 \input{implementation-transitive-closure.tex}
 
@@ -211,15 +307,43 @@ In fact, GraphDB supports two approaches for realizing transitive
 queries. On one hand, GraphDB supports the
 \texttt{owl:TransitiveProperty}~\cite[Section 4.4.1]{owlspec} property
 that defines a given predicate~$P$ to be transitive. With $P$~marked
-in this way, querying the knowledge base is equivalent to querying the
-transitive closure of~$P$. This, of course, requires transitivity to
-be hard-coded into the knowledge base. If we only wish to query the
-transitive closure for a given query, we can take advantage of
-property paths~\cite{paths} which allows us to indicate that a given
-predicate~$P$ is to be understood as transitive when querying. Only
-during querying is the transitive closure then evaluated. Either way,
-GraphDB supports transitive queries without awkward workarounds
-necessary in other systems~\cite{ulo}.
+this way, querying the knowledge base is equivalent to querying the
+transitive closure of~$P$. This requires transitivity to be hard-coded
+into the knowledge base. If we only wish to query the transitive
+closure for a given query, we can take advantage of so-called
+``property paths''~\cite{paths}, which allow us to indicate that a
+given predicate~$P$ is to be understood as transitive when
+querying. The transitive closure is then only evaluated during
+querying. Either way, GraphDB supports transitive queries without the
+awkward workarounds necessary in other systems~\cite{ulo}.
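+
+To make this concrete, the following sketch formulates such a
+property path query using the RDF4J interface discussed in more
+detail below. The \texttt{+}~operator marks the predicate as
+transitive for this one query. Prefix, predicate and IRIs are
+illustrative placeholders, not names taken from our data set.
+
+\begin{lstlisting}
+import org.eclipse.rdf4j.query.TupleQuery;
+import org.eclipse.rdf4j.query.TupleQueryResult;
+import org.eclipse.rdf4j.repository.RepositoryConnection;
+
+class TransitiveQuery {
+    // Ask for every ?x reachable from a given object through one
+    // or more applications of the (hypothetical) ulo:uses predicate.
+    static void run(RepositoryConnection conn) {
+        String q =
+            "PREFIX ulo: <https://example.org/ulo#> " +
+            "SELECT ?x WHERE { <https://example.org/thm> ulo:uses+ ?x }";
+        TupleQuery query = conn.prepareTupleQuery(q);
+        try (TupleQueryResult result = query.evaluate()) {
+            while (result.hasNext()) {
+                System.out.println(result.next().getValue("x"));
+            }
+        }
+    }
+}
+\end{lstlisting}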
 
 \subsubsection{SPARQL Endpoint}
 
@@ -266,9 +390,38 @@ that asks for all triplets in the store looks like
 \end{lstlisting}
 in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets that
 have matching subject~\texttt{s}, predicate~\texttt{p} and
-object~\texttt{o}. Any argument that is \texttt{null} can be replace
-with any value, that is it is a query variable to be filled by the
-call to \texttt{getStatements}.
+object~\texttt{o}. Any argument that is \texttt{null} acts as a
+wildcard, that is, as a query variable to be filled by the call
+to \texttt{getStatements}.
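+
+A complete version of this call, connecting to a hypothetical GraphDB
+repository over HTTP, could look as follows; the server URL and the
+repository name are again placeholders.
+
+\begin{lstlisting}
+import org.eclipse.rdf4j.model.Statement;
+import org.eclipse.rdf4j.repository.Repository;
+import org.eclipse.rdf4j.repository.RepositoryConnection;
+import org.eclipse.rdf4j.repository.RepositoryResult;
+import org.eclipse.rdf4j.repository.http.HTTPRepository;
+
+class Dump {
+    public static void main(String[] args) {
+        Repository repo = new HTTPRepository(
+            "http://localhost:7200/repositories/ulo");
+        try (RepositoryConnection conn = repo.getConnection();
+             RepositoryResult<Statement> result =
+                 conn.getStatements(null, null, null)) {
+            // The three null arguments act as wildcards, so the
+            // call matches every triplet in the store.
+            while (result.hasNext()) {
+                System.out.println(result.next());
+            }
+        }
+        repo.shutDown();
+    }
+}
+\end{lstlisting}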
 
 Using RDF4J does introduce a dependency on the JVM and its languages.
 But in practice, we found RDF4J to be quite convenient,
@@ -281,18 +434,15 @@ provides us with powerful IDE auto completion features during
 development of ULO applications.
 
 Summarizing the last two sections, we see that both SPARQL and RDF4J
-have unique advantages. While SPARQL is an official W3C standard and
-implemented by more database systems, RDF4J can be more convenient
-when dealing with JVM-based projects. For \emph{ulo-storage}, we
-played around with both interfaces and chose whatever seemed more
-convenient at the moment. We recommend any implementors to do the
-same.
+have unique advantages. While SPARQL is an official W3C~\cite{w3c}
+standard and implemented by more database systems, RDF4J can be more
+convenient when dealing with JVM-based projects. For
+\emph{ulo-storage}, we played around with both interfaces and chose
+whatever seemed more convenient at the moment. We recommend that
+implementors of similar systems do the same.
 
 \subsection{Deployment and Availability}
 
-\def\gorepo{https://gitlab.cs.fau.de/kissen/ulo-storage-collect}
-\def\composerepo{https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose}
-
 Software not only needs to get developed, but also deployed. To deploy
 the combination of Collector, Importer and Endpoint, we use Docker
 Compose. Docker itself is a technology for wrapping software into
@@ -300,8 +450,9 @@ containers, that is lightweight virtual machines with a fixed
 environment for running a given application~\cite[pp. 22]{dockerbook}.
 Docker Compose then is a way of combining individual Docker containers
 to run a full tech stack of application, database server and so
-on~\cite[pp. 42]{dockerbook}. All configuration of such a setup is
-stored in a Docker Compose file that describes the software stack.
+on~\cite[pp. 42]{dockerbook}. All configuration of the overarching
+setup is stored in a Docker Compose file that describes the software
+stack.
 
 For \emph{ulo-storage}, we provide a single Docker Compose file which
 starts three containers, namely (1)~the Collector/Importer web
@@ -309,8 +460,8 @@ interface, (2)~a GraphDB instance which provides us with the required
 Endpoint and (3)~some test applications that use that Endpoint. All
 code for Collector and Importer is available in the
 \texttt{ulo-storage-collect} Git repository~\cite{gorepo}. Additional
-deployment files, that is Docker Compose and additional Dockerfiles
-are stored in a separate repository~\cite{dockerfilerepo}.
+deployment files (Docker Compose configuration and additional
+Dockerfiles) are stored in a separate repository~\cite{dockerfilerepo}.
 
 This concludes our discussion of the implementation developed for the
 \emph{ulo-storage} project. We designed a system based around (1)~a
@@ -319,4 +470,4 @@ Importer which imports these triplets into a GraphDB database and
 (3)~looked at different ways of querying a GraphDB Endpoint. All of
 this is easy to deploy using a single Docker Compose file. With this
 stack ready for use, we will continue with a look at some interesting
-applications and queries built on top of this interface.
+applications and queries built on top of this infrastructure.
diff --git a/doc/report/references.bib b/doc/report/references.bib
index c12e02d..6d27d8e 100644
--- a/doc/report/references.bib
+++ b/doc/report/references.bib
@@ -437,7 +437,7 @@
 }
 
 @misc{afpexport,
-  title={Making Isabelle Content Accessible in Knowledge Representation Formats},
+  title={Making Isabelle Content Accessible in Knowledge Representation Formats},
   author={Michael Kohlhase and Florian Rabe and Makarius Wenzel},
   year={2020},
   eprint={2005.08884},
@@ -455,3 +455,10 @@
   publisher={Frontiers}
 }
+
+@online{w3c,
+  organization = {W3C},
+  title = {About W3C},
+  urldate = {2020-10-05},
+  url = {https://www.w3.org/Consortium/},
+}
\ No newline at end of file
-- 
GitLab