Commit 2003ef78 authored by Andreas Schärtl

report: add some clarifications

Next up: Proofreading and then setting up next tasks for September
parent 82fe3080
@@ -2,8 +2,8 @@
With programming endpoints in place, we can now query the data set
containing both Isabelle and Coq exports stored in {GraphDB}. We
experimented with two kinds of applications that talk to a GraphDB
endpoint.
\begin{itemize}
\item Exploring which ULO predicates are actually used in the
@@ -22,7 +22,7 @@ of our GraphDB endpoint.
\subsection{Exploring Existing Data Sets}\label{sec:expl}
For our first application, we looked at what ULO predicates are
actually used by the respective data sets. With more than 250~million
triplets in the store, we hoped that this would give us some insight
into the kind of knowledge we are dealing with.
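As a rough sketch, this exploration boils down to a single aggregation
query over all triplets; note that the \texttt{ulo:} prefix URI below
is only a stand-in for the actual ontology namespace.

\begin{lstlisting}
# Count how often each ULO predicate occurs in the data set.
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?predicate (COUNT(*) AS ?uses) WHERE {
  ?subject ?predicate ?object .
  FILTER(STRSTARTS(STR(?predicate), STR(ulo:)))
}
GROUP BY ?predicate
ORDER BY DESC(?uses)
\end{lstlisting}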
@@ -38,25 +38,26 @@ listed in Appendix~\ref{sec:used}. In both cases, what stands out is
that both exports use less than a third of the available predicates.
We also see that the Isabelle and Coq exports use different
predicates. For example, the Isabelle export contains organizational
meta information about paragraphs and sections in the source document
while the Coq export only tells us about the filename of the Coq
source. That is not particularly problematic as long as we can trace a
given object back to the original source. Regardless, our results do
show that both exports have their own particularities and with more
and more third party libraries exported to ULO one has to assume that
this heterogeneity will only grow. In particular, we want to point out
the large number of predicates which remain unused in both Isabelle
and Coq exports. A user formulating queries for ULO might be oblivious
to the fact that only subsets of exports support given predicates.
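To make the problem concrete, consider the following sketch of a query
that asks for theorems together with the section they appear in. The
\texttt{ulo:} prefix URI and the exact predicate and class names are
placeholders for illustration; the point is that any object lacking
the section predicate, e.g.\ everything in the Coq export, silently
disappears from the result set.

\begin{lstlisting}
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?theorem ?section WHERE {
  ?theorem a ulo:theorem .
  # Only some exports attach section information. Objects
  # without this predicate drop out of the results entirely.
  ?theorem ulo:section ?section .
}
\end{lstlisting}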
While not a problem for \emph{ulo-storage} per se, we do expect this
to be a challenge when building a tetrapodal search
system. Recommended ways around this ``missing fields'' problem in
database literature include the clever use of default values or
inference of missing values~\cite{kdisc, aidisc}, neither of which
feels particularly applicable to vast and heterogeneous sets of
ULO~triplets.
\input{applications-preds.tex}
@@ -87,7 +88,8 @@ implementations.
subsets of problems, e.g.\ on the satisfiability of a given CNF
formula. A particular example is focused on heuristics for how
long a SAT solver will take to find a solution for a
given~{CNF}. Here, solutions that take longer to find are considered
harder~\cite{proofsat}.
\textbf{Organizational Aspect} A first working hypothesis might be
to assume that elementary proofs are short. In that case, the
@@ -100,13 +102,13 @@ implementations.
to find a generalized definition of proof difficulty, we will
accept proof size as a first working hypothesis.
{ULO} offers the \texttt{ulo:external-size} predicate which will
allow us to sort by file size. Maybe small file size also leads
to quick check times in proof assistants and automatic theorem
provers. With this assumption in mind we could use the
\texttt{ulo:check-time} predicate. Taking file size, and perhaps
check time, as proxies allows us to define one indicator of proof
complexity based on organizational knowledge alone.
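A hedged sketch of such a query follows; as before, the \texttt{ulo:}
prefix URI is a stand-in, and check times will only be available for
a subset of objects.

\begin{lstlisting}
# List proofs ordered by file size, with check times where known.
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?proof ?size ?checktime WHERE {
  ?proof a ulo:proof ;
         ulo:external-size ?size .
  OPTIONAL { ?proof ulo:check-time ?checktime . }
}
ORDER BY ASC(?size)
\end{lstlisting}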
\textbf{Other Aspects} A tetrapodal search system should probably
also take symbolic knowledge into account. Based on some kind of
@@ -165,18 +167,18 @@ implementations.
recommends extending ULO to natively support the concept of
algorithms. Both approaches have their distinct trade-offs.
As a first approach, we can try to represent algorithms and
problems in terms of existing ULO predicates. As ULO does have a
concept of \texttt{theorem} and \texttt{proof}, it might be
tempting to exploit the Curry-Howard correspondence and represent
algorithms understood as programs in terms of proofs. But that
does not really capture the canonical understanding of algorithms;
algorithms are not actually programs, rather there are programs
that implement algorithms. Even if we do work around this problem,
it is not clear how to represent problems (e.g.\ the traveling
salesman problem or the sorting problem) in terms of theorems
(propositions, types) that get implemented by a proof (algorithm,
program).
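To illustrate what this first approach would amount to in practice,
the following purely hypothetical sketch queries for algorithms
encoded as proofs that justify a problem statement. The \texttt{ulo:}
prefix URI and the exact predicate names are assumptions, not
something any existing export provides.

\begin{lstlisting}
# Hypothetical encoding: problems as ulo:theorem objects,
# algorithms as ulo:proof objects that justify them.
PREFIX ulo: <https://mathhub.info/ulo#>
SELECT ?algorithm WHERE {
  ?algorithm a ulo:proof ;
             ulo:justifies ?problem .
  ?problem a ulo:theorem .
  # Some way of identifying ?problem, e.g. as the sorting
  # problem, would still be needed here.
}
\end{lstlisting}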
As algorithms make up an important part of certain areas of
research, it might be reasonable to introduce native level support
@@ -245,7 +247,7 @@ implementations.
\end{lstlisting}
We can formulate~$\mathcal{Q}_3$ with just one SPARQL
query. Because everything is handled by the database, access
should be about as quick as we can hope it to be.
\end{itemize}
Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with
......
@@ -2,17 +2,18 @@
One of the two contributions of \emph{ulo-storage} is that we
implemented components for making organizational mathematical
knowledge (formulated as RDF~triplets) queryable. This section first
identifies the individual components required for this task and then
describes some details of the actual implementation for this project.
\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
With RDF files exported and available for download as Git
repositories~\cite{uloisabelle, ulocoq}, we have the goal of making
the underlying data available for use in applications. Let us first
look at a high-level overview of all involved
components. Figure~\ref{fig:components} illustrates each component and
the flow of data.
\begin{figure}[]\begin{center}
\includegraphics{figs/components}
@@ -28,27 +29,21 @@ component and the flow of data.
\item With streams of ULO files assembled by the Collecter, this
data then gets passed to an \emph{Importer}. An Importer uploads
RDF~streams into some kind of permanent storage. As we will see,
the GraphDB~\cite{graphdb} triplet store was a natural fit.
\item Finally, with all triplets stored in a database, an
\emph{Endpoint} is where applications access the underlying
knowledge base. This does not necessarily need to be any custom
software; rather, the programming interface of the underlying
database itself can be understood as an endpoint of its own.
\end{itemize}
Collecter, Importer and Endpoint provide us with an easy and automated
way of making RDF files available for use within applications. We will
now take a look at the actual implementation created for
\emph{ulo-storage}, beginning with the implementation of Collecter and
Importer.
\subsection{Collecter and Importer}\label{sec:collecter}
@@ -59,11 +54,13 @@ takes such a stream of RDF data and then dumps it to some sort of
persistent storage. However, in the implementation for
\emph{ulo-storage}, both Collecter and Importer ended up being one
piece of monolithic software. This does not need to be the case but
proved convenient because (1)~combining Collecter and Importer forgoes
the need for an additional IPC~mechanism and (2)~neither Collecter nor
Importer is a terribly large piece of software.
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collecter
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml} while the Git Collecter first clones a Git
repository and then passes the checked out working copy to the file
system Collecter. Because it is not uncommon for RDF files to be
@@ -72,19 +69,20 @@ Gzip~\cite{gzip} and XZ~\cite{xz} formats which can greatly reduce the
required disk space in the collection step.
During development of the Collecter, we found that existing exports
from third party mathematical libraries contain RDF syntax errors
which were not discovered previously. In particular, both the Isabelle
and Coq exports contained URIs which do not fit the official syntax
specification~\cite{rfc3986} as they contained illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open Source which does
not properly check URIs according to the specification; in
consequence, these faults were only discovered now. To tackle these
problems, we introduced on-the-fly correction steps during collection
that take the broken RDF files, fix the mentioned problems related to
URIs (by escaping illegal characters) and then continue processing. Of
course this is only a work-around; related bug reports were filed in
the respective export projects to ensure that in the future this extra
step is not necessary.
Our Collecter takes existing RDF files, applies some on-the-fly
transformations (extraction of compressed files, fixing of errors),
@@ -110,10 +108,11 @@ allows the user to automate import jobs. For example, it is possible
to schedule an import of a given Git repository every seven days to a
given GraphDB instance.
Automated job control that regularly imports data from the same
sources leads us to the problem of versioning. ULO
exports~$\mathcal{E}$ depend on an original third party
library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of
Collecter and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
\mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D}
@@ -130,10 +129,10 @@ suggestion to solve the problem of changing third party libraries is
to regularly re-create the full data set~$\mathcal{D}$ from scratch,
say every seven days. This circumvents all problems related to
updating existing data sets, but it does mean additional computation
requirements. It also means that changes in~$\mathcal{L}$ take some
time to propagate to~$\mathcal{D}$. If the number of triplets rises by
orders of magnitude, this approach will eventually not be scalable
anymore.
\subsection{Endpoints}\label{sec:endpoints}
@@ -152,7 +151,7 @@ on the RDF4J Java library. Both approaches have unique advantages.
\item SPARQL is a standardized query language for RDF triplet
data~\cite{sparql}. The specification includes not just syntax
and semantics of the language itself, but also a standardized
REST interface~\cite{rest} for querying database servers.
\textbf{Syntax} SPARQL is inspired by SQL and as such the
\texttt{SELECT} \texttt{WHERE} syntax should be familiar to many
@@ -168,9 +167,9 @@ on the RDF4J Java library. Both approaches have unique advantages.
subject~\texttt{?s}, predicate~\texttt{?p} and
object~\texttt{?o}.
\textbf{Advantage} Probably the biggest advantage is that SPARQL
is ubiquitous. As it is the de facto standard for querying
triplet stores, lots of implementations and documentation are
available~\cite{sparqlbook, sparqlimpls, gosparql}.
\item RDF4J is a Java API for interacting with triplet stores,
@@ -199,8 +198,9 @@ on the RDF4J Java library. Both approaches have unique advantages.
than mixing programming language with awkward query strings.
We also found it quite helpful to generate Java classes from
OWL~ontologies that contain all definitions of the
ontology~\cite{rdf4jgen}. This provides us with powerful IDE
auto-completion features during development of ULO applications.
\end{itemize}
We see that both SPARQL and RDF4J have unique advantages. While SPARQL
@@ -223,8 +223,9 @@ that web interface such that it can persist import jobs and finally
(3)~a GraphDB instance which provides us with the required
Endpoint. All code for Collecter and Importer is available in the
\texttt{ulo-storage-collect} Git repository\footnote{\url{\gorepo}}.
Additional deployment files, that is Docker Compose files and
Dockerfiles, are stored in a separate
repository\footnote{\url{\composerepo}}.
This concludes our discussion of the implementation developed for the
\emph{ulo-storage} project. We designed a system based around (1)~a
......
@@ -24,37 +24,35 @@ structure, tetrapodal search proposes that each kind of mathematical
knowledge should be made available in a storage backend that fits the
kind of data it is providing. With all four areas available for
querying, tetrapodal search intends to then combine the four indexes
into a single query interface. Currently, research is focused on
providing schemas, storage backends and indexes for the four different
kinds of mathematical knowledge. The focus of \emph{ulo-storage} is
the area of organizational knowledge.
A previously proposed way to structure such organizational data is the
\emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an
OWL~ontology~\cite{uloonto} and as such all organizational information
is stored as RDF~triplets with a unified schema of
ULO~predicates~\cite{owl}. Some effort has been made to export
existing databases of formal mathematical knowledge to {ULO}. In
particular, exports from Isabelle and Coq-based libraries are
available~\cite{uloisabelle, ulocoq}. The resulting data set is already
quite large, the Isabelle export alone containing more than
200~million triplets.
Existing exports from Isabelle and Coq result in single or multiple
XML~files that contain RDF~triplets. This is a convenient format for
exchange and easily versioned using Git. However, considering the vast
number of triplets, it is impossible to query easily and efficiently
in this state. This is what \emph{ulo-storage} is focused on: Making
ULO data sets accessible for querying and analysis. We collected RDF
files spread over different Git repositories, imported them into a
database and then experimented with APIs for accessing that data set.
The main contribution of \emph{ulo-storage} is twofold. First, (1)~we
built up various infrastructure components for making organizational
knowledge queryable. These components can make up building blocks of
a larger tetrapodal search system. Their design and implementation are
discussed in Section~\ref{sec:implementation}. Second, (2)~we ran
sample prototype applications and queries on top of this
interface. While the applications themselves are admittedly not very
......
@@ -314,3 +314,10 @@
url = {http://graphdb.ontotext.com/documentation/free/graphdb-feature-comparison.html},
}
@book{rest,
  title = {Architectural styles and the design of network-based software architectures},
  author = {Fielding, Roy T. and Taylor, Richard N.},
  volume = {7},
  year = {2000},
  publisher = {University of California, Irvine}
}