From 2003ef783f1a8eeeafc474f61d0539663c0e29c0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andreas=20Sch=C3=A4rtl?= <andreas@schaertl.me> Date: Wed, 19 Aug 2020 09:52:50 +0200 Subject: [PATCH] report: add some clarifications Next up: Proofreading and then setting up next tasks for September --- doc/report/applications.tex | 76 +++++++++++++------------ doc/report/implementation.tex | 103 +++++++++++++++++----------------- doc/report/introduction.tex | 34 ++++++----- doc/report/references.bib | 7 +++ 4 files changed, 114 insertions(+), 106 deletions(-) diff --git a/doc/report/applications.tex b/doc/report/applications.tex index f4fa274..e19bb78 100644 --- a/doc/report/applications.tex +++ b/doc/report/applications.tex @@ -2,8 +2,8 @@ With programming endpoints in place, we can now query the data set containing both Isabelle and Coq exports stored in {GraphDB}. We -experimented with two kinds of applications thatt take advantage -of our GraphDB endpoint. +experimented with two kinds of applications that talk to a GraphDB +endpoint. \begin{itemize} \item Exploring which ULO predicates are actually used in the @@ -22,7 +22,7 @@ of our GraphDB endpoint. \subsection{Exploring Existing Data Sets}\label{sec:expl} -Four our first application, we looked at what ULO predicates are +For our first application, we looked at what ULO predicates are actually used by the respective data sets. With more than 250~million triplets in the store, we hoped that this would give us some insight into the kind of knowledge we are dealing with. @@ -38,25 +38,26 @@ listed in Appendix~\ref{sec:used}. In both cases, what stands out is that both exports use less than a third of the available predicates. We also see that the Isabelle and Coq exports use different -predicates. For example, the Isabelle contains organizational meta -information such as information about paragraphs and sections in the -source document while the Coq export only tells us about the filename -of the Coq source. That is not particularly problematic as long as we -can trace a given object back to the original source. Regardless, our -results do show that both exports have their own particularities and -with more and more third party libraries exported to ULO one has to -assume that this heterogeneity will only grow. In particular we want -to point to the large number of predicates which remain unused in both -Isabelle and Coq exports. A user formulating queries for ULO might be -oblivious to the fact that only subsets of exports support given -predicates. +predicates. For example, the Isabelle export contains organizational +meta information such as information about paragraphs and sections in +the source document while the Coq export only tells us about the +filename of the Coq source. That is not particularly problematic as +long as we can trace a given object back to the original source. +Regardless, our results do show that both exports have their own +particularities and with more and more third party libraries exported +to ULO one has to assume that this heterogeneity will only grow. In +particular we want to point to the large number of predicates which +remain unused in both Isabelle and Coq exports. A user formulating +queries for ULO might be oblivious to the fact that only subsets of +exports support given predicates. While not a problem for \emph{ulo-storage} per se, we do expect this to be a challenge when building a tetrapodal search system. 
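+For illustration, the kind of query behind such an exploration can be
+sketched as follows; the sketch assumes nothing beyond a plain SPARQL
+endpoint on top of the imported triplets and simply groups all
+triplets by predicate, counting how often each one occurs. Predicates
+missing from the result are exactly the unused predicates discussed
+above.
+
+\begin{lstlisting}
+# illustrative sketch: census of predicates used in the data set
+SELECT ?predicate (COUNT(?predicate) AS ?count)
+WHERE { ?subject ?predicate ?object . }
+GROUP BY ?predicate
+ORDER BY DESC(?count)
+\end{lstlisting}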
Recommended ways around this ``missing fields'' problem in database literature include the clever use of default values or inference of missing values~\cite{kdisc, aidisc}, neither of which -feels particularly applicable to an ULO data set. +feels particularly applicable to vast and heterogeneous sets of +ULO~triplets. \input{applications-preds.tex} @@ -87,7 +88,8 @@ implementations. subsets of problems, e.g.\ on the satisfiability of a given CNF formula. A particular example is focused on heuristics for how long a SAT solver will take to find a solution for a - given~{CNF}~\cite{proofsat}. + given~{CNF}. Here, solutions that take longer to find are considered + harder~\cite{proofsat}. \textbf{Organizational Aspect} A first working hypothesis might be to assume that elementary proofs are short. In that case, the @@ -100,13 +102,13 @@ implementations. to find a generalized definition of proof difficulty, we will accept proof size as a first working hypothesis. - {ULO} offers the \texttt{ulo:external-size} - predicate which will allow us to sort by file size. Maybe small - file size also leads to quick check times in proof assistants and - automatic theorem provers. With this assumption in mind we could - use the \texttt{ulo:check-time} predicate. Correlating proof - complexity with file size allows us to define one indicator of - proof complexity based on organizational knowledge alone. + {ULO} offers the \texttt{ulo:external-size} predicate which will + allow us to sort by file size. Maybe proof complexity also manifests + itself in long check times in proof assistants and automatic theorem + provers. With this assumption in mind, we could use the + \texttt{ulo:check-time} predicate. Correlating proof complexity + with file size allows us to define one indicator of proof + complexity based on organizational knowledge alone. \textbf{Other Aspects} A tetrapodal search system should probably also take symbolic knowledge into account. Based on some kind of @@ -165,18 +167,18 @@ implementations. recommends extending ULO to natively support the concept of algorithms. Both approaches have their distinct trade-offs. - \textbf{Representing Algorithms} As a first approach, we can try - to represent algorithms and problems in terms of existing ULO - predicates. As ULO does have a concept of \texttt{theorem} and - \texttt{proof}, it might be tempting to exploit the Curry-Howard - correspondence and represent algorithms understood as programs in - terms of proofs. But that does not really capture the canonical - understanding of algorithms; algorithms are not actually programs, - rather there are programs that implement algorithms. Even if we do - work around this problem, it is not clear how to represent - problems (e.g.\ the traveling salesman problem or the sorting - problem) in terms of theorems (propositions, types) that get - implemented by a proof (algorithm, program). + As a first approach, we can try to represent algorithms and + problems in terms of existing ULO predicates. As ULO does have a + concept of \texttt{theorem} and \texttt{proof}, it might be + tempting to exploit the Curry-Howard correspondence and represent + algorithms understood as programs in terms of proofs. But that + does not really capture the canonical understanding of algorithms; + algorithms are not actually programs, rather there are programs + that implement algorithms.
Even if we do work around this problem, + it is not clear how to represent problems (e.g.\ the traveling + salesman problem or the sorting problem) in terms of theorems + (propositions, types) that get implemented by a proof (algorithm, + program). As algorithms make up an important part of certain areas of research, it might be reasonable to introduce native level support @@ -245,7 +247,7 @@ implementations. \end{lstlisting} We can formulate~$\mathcal{Q}_3$ with just one SPARQL query. Because everything is handled by the database, access - should be quick. + should be about as quick as we can hope it to be. \end{itemize} Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with diff --git a/doc/report/implementation.tex b/doc/report/implementation.tex index 67cd447..0285c83 100644 --- a/doc/report/implementation.tex +++ b/doc/report/implementation.tex @@ -2,17 +2,18 @@ One of the two contributions of \emph{ulo-storage} is that we implemented components for making organizational mathematical -knowledge queryable. This section first makes out the individual -required component for this tasks and then describes some details -of the actual implementation for this project. +knowledge (formulated as RDF~triplets) queryable. This section first +lays out the individual components required for this task and then +describes some details of the actual implementation for this project. \subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components} -With RDF files exported and available for download as Git repositories -on MathHub, we have the goal of making the underlying data available -for use in applications. Let us first look at a high level overview of -all involved components. Figure~\ref{fig:components} illustrates each -component and the flow of data. +With RDF files exported and available for download as Git +repositories~\cite{uloisabelle, ulocoq}, we have the goal of making +the underlying data available for use in applications. Let us first +look at a high-level overview of all involved +components. Figure~\ref{fig:components} illustrates each component and +the flow of data. \begin{figure}[]\begin{center} \includegraphics{figs/components} @@ -27,28 +28,22 @@ component and the flow of data. the file system. \item With streams of ULO files assembled by the Collecter, this - data then gets passed to an \emph{Importer}. An Importer uploads - RDF~streams into some kind of permanent storage. For - use in this project, the GraphDB~\cite{graphdb} triplet store was - a natural fit. + data then gets passed to an \emph{Importer}. An Importer uploads + RDF~streams into some kind of permanent storage. As we will see, + the GraphDB~\cite{graphdb} triplet store was a natural fit. \item Finally, with all triplets stored in a database, an \emph{Endpoint} is where applications access the underlying knowledge base. This does not necessarily need to be any custom software, rather the programming interface of the underlying - database itself could be understood as an endpoint of its own. - - Regardless, some thought can be put into designing an Endpoint as a - layer that lives between application and database that is more - convenient to use than the one provided by the database. It comes - down to the programming interface we wish to provide to a developer - using this system. + database itself can be understood as an endpoint of its own.
\end{itemize} Collecter, Importer and Endpoint provide us with an easy and automated way of making RDF files available for use within applications. We will now take a look at the actual implementation created for -\emph{ulo-storage}. +\emph{ulo-storage}, beginning with the implementation of Collecter and +Importer. \subsection{Collecter and Importer}\label{sec:collecter} takes such a stream of RDF data and then dumps it to some sort of persistent storage. However in the implementation for \emph{ulo-storage}, both Collecter and Importer ended up being one piece of monolithic software. This does not need to be the case but -simply proved convenient. +proved convenient because (1)~combining Collecter and Importer forgoes +the need for an additional IPC~mechanism and (2)~neither Collecter +nor Importer is a terribly large piece of software. Our implementation supports two sources for RDF files, namely Git repositories and the local file system. The file system Collecter -simply crawls a given directory on the local machine and looks for +crawls a given directory on the local machine and looks for RDF~XML~files~\cite{rdfxml} while the Git Collecter first clones a Git repository and then passes the checked out working copy to the file system Collecter. Because it is not uncommon for RDF files to be Gzip~\cite{gzip} and XZ~\cite{xz} formats which can greatly reduce the required disk space in the collection step. During development of the Collecter, we found that existing exports -from third party mathematical libraries contain RDF syntax errors which -were not discovered previously. In particular, both Isabelle and Coq -export contained URIs which do not fit the official -specification~\cite{rfc3986}. Previous work that processed Coq and -Isabelle exports used database software such as Virtuoso Open -Source~\cite{ulo} which does not properly check URIs according to -spec, in consequence these faults were only discovered now. To tackle -these problems, we introduced on the fly correction steps during -collection that take the broken RDF files, fix the mentioned problems -related to URIs (by escaping illegal characters) and then -continue processing. Of course this is only a work-around; related -bugs were filed in the respective export projects to ensure that in the -future this extra step is not necessary. +from third party mathematical libraries contain RDF syntax errors +which were not discovered previously. In particular, both Isabelle and +Coq exports contained URIs which do not fit the official syntax +specification~\cite{rfc3986} as they contained illegal +characters. Previous work~\cite{ulo} that processed Coq and Isabelle +exports used database software such as Virtuoso Open Source which does +not properly check URIs according to spec; in consequence, these faults +were only discovered now. To tackle these problems, we introduced on +the fly correction steps during collection that take the broken RDF +files, fix the mentioned problems related to URIs (by escaping illegal +characters) and then continue processing. Of course this is only a +work-around; related bug reports were filed in the respective export +projects to ensure that in the future this extra step is not +necessary. Our Collecter takes existing RDF files, applies some on the fly transformations (extraction of compressed files, fixing of errors), @@ -110,10 +108,11 @@ allows the user to automate import jobs.
For example, it is possible to schedule an import of a given Git repository every seven days to a given GraphDB instance. -Automated job control alone however do not solve the problem of -versioning. ULO exports~$\mathcal{E}$ depend on an original third -party library~$\mathcal{L}$. Running~$\mathcal{E}$ through the -workflow of Collecter and Importer, we get some database +Automated job control that regularly imports data from the same +sources leads us to the problem of versioning. ULO +exports~$\mathcal{E}$ depend on an original third party +library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of +Collecter and Importer, we get some database representation~$\mathcal{D}$. We see that data flows \begin{align*} \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D} \end{align*} suggestion to solve the problem of changing third party libraries is to regularly re-create the full data set~$\mathcal{D}$ from scratch, say every seven days. This circumvents all problems related to updating existing data sets, but it does mean additional computation -requirements. For currently existing exports from Coq and Isabelle -this is not a problem as even on weak laptop hardware the imports take -less than an hour. But if the number of triplets raises by orders of -magnitude, this approach will eventually not be scalable anymore. +requirements. It also means that changes in~$\mathcal{L}$ take some +time to propagate to~$\mathcal{D}$. If the number of triplets rises +by orders of magnitude, this approach will eventually not be scalable +anymore. \subsection{Endpoints}\label{sec:endpoints} on the RDF4J Java library. Both approaches have unique advantages. \item SPARQL is a standardized query language for RDF triplet data~\cite{sparql}. The specification includes not just syntax and semantics of the language itself, but also a standardized - REST interface for querying database servers. + REST interface~\cite{rest} for querying database servers. \textbf{Syntax} SPARQL is inspired by SQL and as such the \texttt{SELECT} \texttt{WHERE} syntax should be familiar to many @@ -168,9 +167,9 @@ on the RDF4J Java library. Both approaches have unique advantages. subject~\texttt{?s}, predicate~\texttt{?p} and object~\texttt{?o}. - \textbf{Advantage} Probably the biggest advantage is that - SPARQL is ubiquitous. As it is the de facto standard for - querying triplet stores, lots of literature and documentation is + \textbf{Advantage} Probably the biggest advantage is that SPARQL + is ubiquitous. As it is the de facto standard for querying + triplet stores, lots of implementations and documentation are available~\cite{sparqlbook, sparqlimpls, gosparql}. \item RDF4J is a Java API for interacting with triplet stores, @@ -199,8 +198,9 @@ on the RDF4J Java library. Both approaches have unique advantages. than mixing a programming language with awkward query strings. We also found it quite helpful to generate Java classes from - OWL ontologies that contain all definitions of the ontology and - make it readable by any IDE~\cite{rdf4jgen}. + OWL~ontologies that contain all definitions of the + ontology~\cite{rdf4jgen}. This provides us with powerful IDE + auto-completion features during development of ULO applications. \end{itemize} We see that both SPARQL and RDF4J have unique advantages. While SPARQL @@ -223,8 +223,9 @@ that web interface such that it can persist import jobs and finally (3)~a GraphDB instance which provides us with the required Endpoint.
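+To give an impression of what talking to such an Endpoint looks like
+from an application's point of view, the following sketch asks the
+SPARQL endpoint for the ten largest objects in terms of
+\texttt{ulo:external-size}. Note that the \texttt{ulo:} prefix URI
+below is only a placeholder and needs to be replaced with the
+namespace actually used by the exports.
+
+\begin{lstlisting}
+# placeholder namespace, substitute the URI used by the ULO exports
+PREFIX ulo: <https://example.org/ulo#>
+
+SELECT ?object ?size
+WHERE { ?object ulo:external-size ?size . }
+ORDER BY DESC(?size)
+LIMIT 10
+\end{lstlisting}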
All code for Collecter and Importer is available in the \texttt{ulo-storage-collect} Git repository\footnote{\url{\gorepo}}. -Additional deployment files, that is Docker Compose and Dockerfiles -are stored in a separate repository\footnote{\url{\composerepo}}. +Additional deployment files, that is, the Docker Compose configuration and +Dockerfiles, are stored in a separate +repository\footnote{\url{\composerepo}}. This concludes our discussion of the implementation developed for the \emph{ulo-storage} project. We designed a system based around (1)~a diff --git a/doc/report/introduction.tex b/doc/report/introduction.tex index 85a6aa6..e63d1c5 100644 --- a/doc/report/introduction.tex +++ b/doc/report/introduction.tex @@ -24,37 +24,35 @@ structure, tetrapodal search proposes that each kind of mathematical knowledge should be made available in a storage backend that fits the kind of data it is providing. With all four areas available for querying, tetrapodal search intends to then combine the four indexes -into a single query interface. - -Currently, research is focused on providing schemas, storage backends -and indexes for the four different kinds of mathematical -knowledge. The focus of \emph{ulo-storage} is the area of -organizational knowledge. +into a single query interface. Currently, research is focused on +providing schemas, storage backends and indexes for the four different +kinds of mathematical knowledge. The focus of \emph{ulo-storage} is +the area of organizational knowledge. A previously proposed way to structure such organizational data is the \emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an OWL~ontology~\cite{uloonto} and as such all organization information is stored as RDF~triplets with a unified schema of ULO~predicates~\cite{owl}. Some effort has been made to export -existing databases of formal mathematical knowledge to {ULO}, in -particular, there exist exports from Isabelle and Coq -libraries~\cite{uloisabelle, ulocoq}. The resulting data set is -already quite large, the Isabelle export alone containing more than +existing databases of formal mathematical knowledge to {ULO}. In +particular, exports from Isabelle and Coq-based libraries are +available~\cite{uloisabelle, ulocoq}. The resulting data set is already +quite large, the Isabelle export alone containing more than 200~million triplets. Existing exports from Isabelle and Coq result in single or multiple -RDF~files. This is a convenient format for exchange and easily -versioned using Git. However, considering the vast number of triplets, -it is impossible to query easily and efficiently in this state. This -is what \emph{ulo-storage} is focused on: Making ULO data sets -accessible for querying and analysis. We collected RDF files spread -over different Git repositories, imported them into a database and -then experimented with APIs for accessing that data set. +XML~files that contain RDF~triplets. This is a convenient format for +exchange and easily versioned using Git. However, considering the vast +number of triplets, they are impossible to query easily and efficiently +in this state. This is what \emph{ulo-storage} is focused on: Making +ULO data sets accessible for querying and analysis. We collected RDF +files spread over different Git repositories, imported them into a +database and then experimented with APIs for accessing that data set. The main contribution of \emph{ulo-storage} is twofold. First, (1)~we built up various infrastructure components for making organizational knowledge queryable.
These components can make up building blocks of -a larger tetrapodal search system. Design and implementation are +a larger tetrapodal search system. Their design and implementation are discussed in Section~\ref{sec:implementation}. Second, (2)~we ran sample prototype applications and queries on top of this interface. While the applications themselves are admittedly not very diff --git a/doc/report/references.bib b/doc/report/references.bib index 39a883a..80d5889 100644 --- a/doc/report/references.bib +++ b/doc/report/references.bib @@ -314,3 +314,10 @@ url = {http://graphdb.ontotext.com/documentation/free/graphdb-feature-comparison.html}, } +@book{rest, + title={Architectural styles and the design of network-based software architectures}, + author={Fielding, Roy T and Taylor, Richard N}, + volume={7}, + year={2000}, + publisher={University of California, Irvine} +} -- GitLab