Commit a6fb8743 authored by Andreas Schärtl

report: final paper-based review

parent 73da559d
......@@ -6,8 +6,8 @@ necessary for this information to be available for querying.
\emph{ulo-storage} aims to lay out the groundwork for this task.
We collected various pieces of exported organizational knowledge into
a centralized and efficient store, made that store available on the
network and then evaluated different ways of querying that store. In
addition, implementations of some exemplary queries on top of our
created infrastructure resulted in insights on how unified schemas
for organizational mathematical knowledge could be extended.
a centralized and efficient store, made the resulting knowledge graph
available on the network and then evaluated different ways of querying
that store. In addition, implementing some exemplary queries on
top of this infrastructure resulted in insights into how unified
schemas for organizational mathematical knowledge could be extended.
......@@ -36,7 +36,7 @@
\caption{An adapted SPARQL~query based on~\ref{fig:q2a}. It
lists all works authored by ``John Smith'', ranked by number
of references. The idea is works that were referenced more
of references. The idea is that works that were referenced more
often are more important.}\label{fig:q2b}
\end{subfigure}
\vspace{8mm}
......
......@@ -5,14 +5,14 @@
\begin{subfigure}[]{0.475\textwidth}
\centering
\shadowbox{\includegraphics[width=\textwidth]{figs/explore1.png}}
\caption{Listing of all contributors in the data set. Clicking
\caption{Listing all contributors in the data set. Clicking
on a given author shows their individual contributions.}
\end{subfigure}
\hfill
\begin{subfigure}[]{0.475\textwidth}
\centering
\shadowbox{\includegraphics[width=\textwidth]{figs/explore2.png}}
\caption{Listing works a given contributor worked on.\\}
\caption{Listing works a given contributor has worked on.\\}
\end{subfigure}
\vskip\baselineskip{}
\begin{subfigure}[]{0.475\textwidth}
......
......@@ -17,13 +17,12 @@ Endpoint, our second contribution.
\item We investigated queries that could be used to extend the
system into a larger tetrapodal search system. While some
organizational queries have obvious canonical solutions others
organizational queries have obvious canonical solutions, others
raise questions about how organizational knowledge should be
structured (Section~\ref{sec:tetraq}).
\end{itemize}
\noindent These applications will now be discussed in the following
sections.
\noindent We will now discuss each application in more detail.
\subsection{Exploring Existing Data Sets}\label{sec:expl}
......@@ -47,7 +46,7 @@ predicates. For example, the Isabelle export contains organizational
meta information such as details about paragraphs and sections in
the source document, while the Coq export only tells us the filename of
the original Coq source. Both exports have their particularities and
with more and more third party libraries exported to ULO one has to
with more and more third party libraries exported to ULO, one has to
assume that this heterogeneity will only grow. In particular, we want
to point out the large number of predicates which remain unused in both
Isabelle and Coq exports. A user formulating queries for ULO might be
......@@ -77,41 +76,52 @@ increases.
\subsection{Interactive Exploration}\label{sec:exp}
Our second application wants to illustrate how to browse an ULO data
set interactively. For this purpose, we developed a web front end that
allows users to browse contributions by author. For this application,
we used the RDF4J Endpoint, the application itself is implemented in
Java. Figure~\ref{fig:appss} shows four screenshots of the current
version. Once the user has selected a given contribution, we list
some basic information about the selected object, such as type
(e.g.\ lemma or corollary) and name as well as references to other
objects in the knowledge base.
The second application we want to discuss illustrates how to browse an
ULO data set interactively. Here, we developed a web front end that
allows users to browse contributions to the knowledge base sorted by
author. In this particular case, we used the RDF4J Endpoint; the
application itself is implemented in Java. Figure~\ref{fig:appss}
shows four screenshots of the current version. Once the user has
selected a given contribution, we list some basic information about
the selected object, such as its type (e.g.\ lemma or corollary) and name,
as well as references to other objects in the knowledge base.
\input{applications-screenshots.tex}
This application is interesting because similar approaches could be
This experiment is interesting because similar approaches could be
applied when designing IDE~integration for working with organizational
knowledge. Given some object in a document identified by a URI, we
can look at how that object is connected to other objects in the
knowledge graph.
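A minimal sketch of such a lookup is the following SPARQL query, which
lists every triplet in which a given object appears;
\texttt{<object-uri>} is a placeholder for the URI in question.
\begin{lstlisting}
# <object-uri> is a placeholder; prefix declarations omitted.
SELECT ?s ?p ?o WHERE {
  { <object-uri> ?p ?o }   # connections where the object is the subject
  UNION
  { ?s ?p <object-uri> }   # connections where the object is the object
}
\end{lstlisting}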
We can also connect the dots back to the original source. In
particular, we took a look at the \texttt{ulo:sourceref} property
defined for many objects in the knowledge base.
\texttt{ulo:sourceref} is ``the URI of the physical location (e.g.,
file/URI, line, column) of the source code that introduced the
subject''~\cite{uloonto} and in use in the Isabelle
export~\cite{uloisabelle}. Making the connection from ULO~object to
original Isabelle document is convenient for the user, but required
some extra work from us as application implementors. The reason for
this is that the format of the \texttt{ulo:sourceref} property is not
well-defined, rather it is up to implementors of library exporters how
to format source references. The consequence is that for each export,
developers of front ends will need to write custom code for finding
the original source. Maybe it would be a reasonable requirement for
Our front end can also connect the dots back from ULO~object to
original source code. In particular, we took a look at the
\texttt{ulo:sourceref} predicate defined for many objects in the
knowledge base. \texttt{ulo:sourceref} is supposed to contain ``the
URI of the physical location (e.g., file/URI, line, column) of the
source code that introduced the subject''~\cite{uloonto} and is in use
in the Isabelle export~\cite{uloisabelle}. But while making the
connection from exported ULO~object to
original Isabelle document is convenient for the user, this feature
required some extra work from us as application implementors. The
reason for this is that the format of the \texttt{ulo:sourceref}
property does not appear to be well-defined; rather, it is up to
implementors of library exporters how to format these source
references. For example, in the Isabelle export, we have to translate
source references such as
\begin{verbatim}
https://isabelle.in.tum.de/source/HOL/HOL-Algebra/
HOL-Algebra.Group.theory#17829.576.8:17836.576.15
\end{verbatim}
to the original source
\begin{verbatim}
https://isabelle.in.tum.de/dist/library/HOL/HOL-Algebra/Group.html
\end{verbatim}
The consequence is that for each export, developers of
front ends will need to write custom code for finding the original
source. Maybe it would be a reasonable requirement for
\texttt{ulo:sourceref} to have a well-defined format, ideally an
actually reachable URI on the open web, if that is applicable for a given
library.
actually reachable URI on the open web, if that is applicable for a
given library.
While translating from \texttt{ulo:sourceref} to original URI
introduced some extra work, implementing this application was easy and
......@@ -128,11 +138,11 @@ proof of concept implementations and evaluate their applicability.
\subsubsection{Elementary Proofs and Computed Scores}
We start with query~$\mathcal{Q}_1$ which illustrates how one can
We start with query~$\mathcal{Q}_1$ which illustrates how we can
compute arithmetic scores for objects in our knowledge graph.
Query~$\mathcal{Q}_1$ asks us to ``\emph{[f]ind theorems with
non-elementary proofs}.'' Elementary proofs can be understood as
those proof that are considered easy and
non-elementary proofs}''~\cite{tetra}. Elementary proofs can be
understood as those proofs that are considered easy and
obvious~\cite{elempro,elemtest}. In consequence,~$\mathcal{Q}_1$ has
to search for all proofs which are not trivial. Of course, just like
any distinction between ``theorem'' and ``corollary'' is going to be
......@@ -149,13 +159,10 @@ prospective school teachers had problems with notation and term
rewriting and were missing required prerequisites~\cite{proofund,
proofteach}, none of which seems applicable for grading individual
formal proofs for difficulty. On the other hand, there is very
specific research on rating proofs for some given subsets of problems,
e.g.\ on the satisfiability of a given CNF formula. A particular
example is focused on heuristics for how long a SAT solver will take
to find a solution for a given~{CNF}. Here, solutions that take long
are considered harder~\cite{proofsat}.
specific research on rating proofs for concrete problems such as the
satisfiability of a given CNF formula~\cite{proofsat}.
\noindent\emph{Organizational Aspect.} A first working hypothesis
\emph{Organizational Aspect.} A first working hypothesis
might be to assume that elementary proofs are short. In that case, the
size, that is, the number of bytes needed to store the proof, is our first
indicator of proof complexity. This is by no means perfect, as even
......@@ -170,22 +177,22 @@ working hypothesis.
us to sort by file size. Maybe proof complexity is also reflected in the
check times of proof assistants and automatic theorem provers. With
this assumption in mind we could use the \texttt{ulo:check-time}
predicate. Correlating proof complexity with file size allows us to
define one indicator of proof complexity based on organizational
knowledge alone.
predicate. Correlating proof complexity with file size or check time
allows us to define one indicator of proof complexity based on
organizational knowledge alone.
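As an illustration of this idea, the following SPARQL sketch orders
proofs by their check time. It assumes that a predicate such as
\texttt{ulo:justifies} links a proof to the statement it establishes;
the exact predicate may differ between exports.
\begin{lstlisting}
# ulo:justifies is assumed as the link from proof to theorem;
# prefix declarations are omitted.
SELECT ?theorem ?proof ?time WHERE {
  ?proof ulo:justifies ?theorem .
  ?proof ulo:check-time ?time .
}
ORDER BY DESC(?time)
\end{lstlisting}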
\noindent\emph{Other Aspects.} A tetrapodal search system should
\emph{Other Aspects.} A tetrapodal search system should
probably also take symbolic knowledge into account. Based on some kind
of measure of formula complexity, different proofs could be
rated. Similarly, with narrative knowledge available to us, we could
count the number of words, citations and so on to rate the narrative
complexity of a proof. Combining symbolic knowledge, narrative
knowledge and organizational knowledge should allow us to find proofs
which are probably easier than others.
which are easier than others.
\input{applications-q1.tex}
\noindent\emph{Implementation.} Implementing a naive version of the
\emph{Implementation.} Implementing a naive version of the
organizational aspect can be as simple as querying for all theorems
justified by proofs, ordered by size (or check time).
Figure~\ref{fig:q1short} illustrates how this can be achieved with a
......@@ -231,13 +238,13 @@ something that will surely be useful for many applications.
\subsubsection{Categorizing Algorithms and Algorithmic Problems}
The second query~$\mathcal{Q}_2$ we decided to focus on wants to
``\emph{[f]ind algorithms that solve $NP$-complete graph problems.}''
Here we want the tetrapodal search system to return a listing of
algorithms that solve (graph) problems with a given property (runtime
complexity). We need to consider where each of these three components
might be stored.
``\emph{[f]ind algorithms that solve $NP$-complete graph
problems}''~\cite{tetra}. Here we want the tetrapodal search system
to return a listing of algorithms that solve (graph) problems with a
given property (runtime complexity). We need to consider where and how
each of these components might be found.
\noindent\emph{Symbolic and Concrete Aspects.} First, let us consider
\emph{Symbolic and Concrete Aspects.} First, let us consider
algorithms. Algorithms can be formulated as computer code which can be
understood as symbolic knowledge (code represented as a syntax tree)
or as concrete knowledge (code as text
......@@ -248,25 +255,20 @@ automatically in the general case~\cite{np}. Metadata of this kind
needs to be stored in a separate index for organizational knowledge,
which is the only fitting place for it.
\noindent\emph{Organizational Aspect.} If we wish to look up properties
about algorithms from organizational knowledge, we first have to
think about how to represent this information. We propose two
approaches, one using the existing ULO ontology and one that
recommends extending ULO to nativly support the concept of
algorithms. Both approaches have their distinct trade-offs.
As a first approach, we can try to represent algorithms and
problems in terms of existing ULO predicates. As ULO does have a
concept of \texttt{theorem} and \texttt{proof}, it might be
tempting to exploit the Curry-Howard correspondence and represent
algorithms understood as programs in terms of proofs. But that
does not really capture the canonical understanding of algorithms;
algorithms are not actually programs, rather there are programs
that implement algorithms. Even if we do work around this problem,
it is not clear how to represent problems (e.g.\ the traveling
salesman problem or the sorting problem) in terms of theorems
(propositions, types) that get implemented by a proof (algorithm,
program).
\emph{Organizational Aspect.} If we wish to look up properties about
algorithms from organizational knowledge, we first have to think about
how to represent this information. As a first approach, we can try to
represent algorithms and problems in terms of existing ULO
predicates. As ULO does have a concept of \texttt{ulo:theorem} and
\texttt{ulo:proof}, it might be tempting to exploit these predicates
and represent algorithms understood as programs in terms of proofs.
But that does not really capture the canonical understanding of
algorithms. Algorithms are not actually programs; rather, there are
programs that implement algorithms. Even if we do work around this
problem, it is not clear how to represent problems (e.g.\ the
traveling salesman problem or the sorting problem) in terms of
theorems (propositions, types) that get implemented by a proof
(algorithm, program).
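To make this more tangible, the following sketch shows the kind of
query one would like to write if ULO offered dedicated vocabulary for
algorithms; the names \texttt{ulo:algorithm}, \texttt{ulo:solves} and
\texttt{ulo:complexity-class} are invented for illustration and are
not part of the current ontology.
\begin{lstlisting}
# All ulo: names below are hypothetical and only sketch a
# possible extension of the ontology.
SELECT ?algo WHERE {
  ?algo a ulo:algorithm .
  ?algo ulo:solves ?problem .
  ?problem ulo:complexity-class "NP-complete" .
}
\end{lstlisting}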
As algorithms make up an important part of certain areas of research,
it might be reasonable to introduce native level support for
......@@ -291,16 +293,16 @@ solve algorithmic problems.
\subsubsection{Contributors and Number of References}
Finally, query~$\mathcal{Q}_3$ from literature wants to
know ``\emph{[a]ll areas of math that {Nicolas G.\ de Bruijn} has
worked in and his main contributions}''~\cite{tetra}. $\mathcal{Q}_3$~is asking
for works of a given author~$A$. It also asks for their main
Finally, query~$\mathcal{Q}_3$ from the literature wants to know
``\emph{[a]ll areas of math that {Nicolas G.\ de Bruijn} has worked in
and his main contributions}''~\cite{tetra}. $\mathcal{Q}_3$~is
asking for works of a given author~$A$. It also asks for their main
contributions, for example which particularly interesting paragraphs
or code~$A$ has authored. We picked this particular query as it
is asking for metadata, something that should be easily serviced by
or code~$A$ has authored. We picked this particular query as it is
asking for metadata, something that should be easily serviced by
organizational knowledge.
\noindent\emph{Organizational Aspect.} ULO has no concept of authors,
\emph{Organizational Aspect.} ULO has no concept of authors,
contributors, dates and so on. Rather, the idea is to take advantage
of the Dublin Core project which provides an ontology for such
metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides us
......@@ -316,19 +318,18 @@ objects ordered by authors.
\input{applications-q3.tex}
\noindent\emph{Implementation.} A search for contributions by a given
author can easily be formulated in {SPARQL}~(Figure~\ref{fig:q2a}).
Query $\mathcal{Q}_3$ is also asking for the main contributions
of~$A$, that is those works that~$A$ authored that are the most
important. Sorting the result by number of references might be a good
start. To get the main contributions, we rate each individual work by
its number of \texttt{ulo:uses} references. Extending the previous
{SPARQL} query, we can ask the database for an ordered list of works,
starting with the one that has the most
references~(Figure~\ref{fig:q2b}). We see that one can
formulate~$\mathcal{Q}_3$ with just one SPARQL query. Because
everything is handled by the database, access should be about as quick
as we can hope it to be.
\emph{Implementation.} A search for contributions by a given author
can easily be formulated in {SPARQL}~(Figure~\ref{fig:q2a}). Query
$\mathcal{Q}_3$ is also asking for the main contributions of~$A$, that
is those works that~$A$ authored that are the most important. Sorting
the result by number of references can be a good start. To get the
main contributions, we rate each individual work by its number of
\texttt{ulo:uses} references. Extending the previous {SPARQL} query,
we can ask the database for an ordered list of works, starting with
the one that has the most references~(Figure~\ref{fig:q2b}). We see
that one can formulate~$\mathcal{Q}_3$ with just one SPARQL
query. Because everything is handled by the database, access should be
about as quick as we can hope it to be.
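Condensed to its core, such a query could look like the following
sketch; the use of \texttt{dc:creator} and the modelling of author
names as plain literals are assumptions that depend on the export at
hand.
\begin{lstlisting}
# dc:creator and the literal author name are assumptions;
# prefix declarations are omitted.
SELECT ?work (COUNT(?user) AS ?refs) WHERE {
  ?work dc:creator "Nicolas G. de Bruijn" .
  ?user ulo:uses ?work .
}
GROUP BY ?work
ORDER BY DESC(?refs)
\end{lstlisting}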
While the sparse data set available to us only returned a handful of
results, we see that queries like~$\mathcal{Q}_3$ are easily serviced
......@@ -341,15 +342,16 @@ approaches~\cite{citcart}.
Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with
some insight into ULO and existing ULO exports. $\mathcal{Q}_1$ shows
that while there is no formal definition for ``elementary proof'', ULO
allows us to query for heuristics and calculate arbitrary arithmetic scores for
objects of organizational knowledge. Query~$\mathcal{Q}_2$ illustrates
the difficulty in finding universal schemas. It remains an open question
whether ULO should include algorithms as a first class citizen, as a
concept based around existing ULO predicates or whether it is a better
idea to design a dedicated ontology and potentially data store entirely.
Finally, while we were able to formulate a SPARQL query that should
take care of most of~$\mathcal{Q}_3$ we found that the existing data sets
contain very little information about authorship. This underlines
the observations made previously in Section~\ref{sec:expl} and should
be on the mind of anyone writing exporters that output ULO~triplets.
that while there is no universal definition for ``elementary proof'',
ULO allows us to query for heuristics and calculate arbitrary
arithmetic scores for objects of organizational
knowledge. Query~$\mathcal{Q}_2$ illustrates the difficulty in finding
universal schemas. It remains an open question whether ULO should
include algorithms as a first-class citizen, as a concept based around
existing ULO predicates, or whether it is a better idea to design an
entirely separate ontology and data store. Finally, while we were
able to formulate a SPARQL query that should take care of most
of~$\mathcal{Q}_3$, we found that the existing data sets contain very
little information about authorship. This underlines the observations
made previously in Section~\ref{sec:expl} and should be on the mind of
anyone writing exporters that output ULO~triplets.
......@@ -5,50 +5,50 @@ Section~\ref{sec:implementation} we were able to take existing RDF
exports and import them into a GraphDB database. This made it possible
to experiment with the applications and examples of
Section~\ref{sec:applications}. We showed that organizational
knowledge formulated as ULO triplets can already give some insights,
in particular it is possible to formulate queries for meta information
such as authorship and contribution and resolve the interlinks between
proofs and theorems. On the other hand, our examples also showed that existing
ULO~exports only take advantage of a subset of ULO~predicates,
something to watch as more ULO exports are being worked on.
knowledge formulated as ULO triplets can already give some insights.
In particular, it is possible to formulate queries about meta
information such as authorship and contribution. We also managed to
resolve the interlinks between proofs and theorems.
\subsection{Potential for Future Work}
To sum up this report, we want to recap three problems we encountered
when working on \emph{ulo-storage}. They have the potential to result
in future projects that further improve the availability of
organizational knowledge.
Finally, we want to recap three different problems we encountered when
working on \emph{ulo-storage}. The first problem was that of malformed
RDF~exports. Various exports contained invalid URIs and wrong
namespaces. As a work around we provided on the fly correction but of
course this does not solve the problem in itself. Perhaps a long term
solution for this problem is to fully automate the export from third
party libraries (e.g.\ Coq, Isabelle) to ULO triplets in a database,
eliminating the step of XML files on disk. During import into the
database, the imported data is thoroughly checked and mistakes are
reported right away. Bugs in exporters that produce faulty XML would
be found earlier in development.
The first problem we encountered while working on \emph{ulo-storage}
was that of malformed RDF~exports. Various exports contained invalid
URIs and incorrect namespaces. As a workaround, we provided on-the-fly
correction. But this does not solve the underlying problem. Perhaps
a long term solution for this problem is to fully automate the export
from third party libraries (e.g.\ Coq, Isabelle) to ULO triplets in a
database, eliminating the step of XML files on disk. During import
into the database, the imported data could then be thoroughly checked
and mistakes reported right away. Bugs in exporters that produce
faulty XML would be found earlier in development.
The second problem is that of versioning triplets in the GraphDB
triple store. While adding new triplets to an existing GraphDB store
is not a problem, updating existing triplets is
The second problem we want to recap is that of versioning triplets in
the GraphDB triple store. While adding new triplets to an existing
GraphDB store is not a problem, updating existing triplets is
difficult. \emph{ulo-storage} circumvents this problem by re-creating
the GraphDB data set at regular intervals. Indeed, it might be
difficult to find an efficient alternative. Tagging each triplet with
some version number doubles the number of triplets that need to be
stored. Re-creating the index and maybe splitting up the knowledge
base into smaller, easier-to-update sub-repositories looks like the
most promising approach for now.
most promising approach for now. What is missing is an efficient way
of servicing queries across multiple database instances.
The third problem is that of missing predicates in existing ULO
exports. The upper level ontology boasts a total of almost
80~predicates, yet only a third of them are actually used by Coq and
Isabelle exports. A developer writing queries that take advantage of
the full ULO~vocabulary might be surprised that not data is coming
back. This shows the difficulty of designing an ontology that is both
concise and expressive. But while it is all good and well to recommend
writers of exports to use the full set of predicates, it might simply
not make sense to use the full set for a given third party library. We
think that it is too early to argue for the removal of particular
predicates, rather it is better to look at future export projects and
then evaluate which predicates are in use and which are not.
Finally, perhaps the most interesting but also most challenging
question is how to further develop ULO, the upper level
ontology for organizational knowledge. On the one hand, ULO boasts a
total of almost 80~predicates, yet only a third of them are actually
used by existing Coq and Isabelle exports. A developer writing queries
that take advantage of the full ULO~vocabulary might be surprised that
no data is coming back. On the other hand, we found it difficult to
model algorithms in terms of ULO predicates and argued that it might
be necessary to further extend the upper level ontology. It will be
necessary to look at the properties of future exports to make a more
well-founded decision.
Despite many open questions, \emph{ulo-storage} provides the necessary
infrastructure for importing ULO triplets into an efficient storage
......
......@@ -61,10 +61,10 @@ is no pressing need to force them into separate processes.
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XMl~files~\cite{rdfxml} while the Git Collector first clones a Git
RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
repository and then passes the checked out working copy to the file
system Collector. Because we found that is not uncommon for RDF files to be
compressed, our Collector supports on the fly extraction of
system Collector. Because we found that it is not uncommon for RDF files
to be compressed, our implementation supports on-the-fly extraction of
gzip~\cite{gzip} and xz~\cite{xz} formats which can greatly reduce the
required disk space in the collection step.
......@@ -75,7 +75,7 @@ Coq exports contained URIs which does not fit the official syntax
specification~\cite{rfc3986} as they contained illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open
Source~\cite{wikivirtuoso} which do not properly check URIs according
Source~\cite{wikivirtuoso} which does not properly check URIs according
to the specification; in consequence, these faults were only discovered
now. To tackle these problems, we introduced on-the-fly correction steps
during collection that escape the URIs in question and then continue
......@@ -164,18 +164,19 @@ Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
the problems related to updating existing data sets, but also means
that changes in a given library~$\mathcal{L}_i$ take some time to propagate
to~$\mathcal{D}$. Building on this idea, an advanced version of this
approach could forgo the requirement for one single database
storage~$\mathcal{D}$ entirely. Instead of maintaining just one global
database state~$\mathcal{D}$, we suggest experimenting with dedicated
database instances~$\mathcal{D}_i$ for each given
to~$\mathcal{D}$. Continuing this train of thought, an advanced
version of this approach could forgo the requirement for one single
database storage~$\mathcal{D}$ entirely. Instead of maintaining just
one global database state~$\mathcal{D}$, we suggest experimenting with
dedicated database instances~$\mathcal{D}_i$ for each given
library~$\mathcal{L}_i$. The advantage here is that re-creating a
given database representation~$\mathcal{D}_i$ is fast as
exports~$\mathcal{E}_i$ are comparably small. The disadvantage is that
we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require the
development of some cross-database query mechanism, functionality GraphDB
currently only offers limited support for~\cite{graphdbnested}.
\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require
the development of some cross-database query mechanism, functionality
for which existing systems currently offer only limited
support~\cite{graphdbnested}.
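One candidate for such a cross-database mechanism is SPARQL~1.1
federation, where a query delegates parts of its pattern to remote
endpoints through the \texttt{SERVICE} keyword; the endpoint URLs in
the following sketch are placeholders.
\begin{lstlisting}
# Union of the results of two per-library endpoints;
# both endpoint URLs are placeholders.
SELECT ?s ?p ?o WHERE {
  { SERVICE <http://host-1/repositories/isabelle> { ?s ?p ?o } }
  UNION
  { SERVICE <http://host-2/repositories/coq> { ?s ?p ?o } }
}
\end{lstlisting}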
In summary, we see that versioning is a potential challenge for a
greater tetrapodal search system. While not a pressing issue for
......@@ -188,7 +189,7 @@ Endpoint. Recall that an Endpoint provides the programming interface
for applications that wish to query our collection of organizational
knowledge. In practice, the choice of Endpoint programming interface
is determined by the choice of database system as the Endpoint is
provided directly by the database.
provided directly by the database system.
In our project, organizational knowledge is formulated as
RDF~triplets. The canonical choice for us is to use a triple store,
......@@ -206,7 +207,7 @@ OWL~Reasoning~\cite{owlspec, graphdbreason}. In particular, this
means that GraphDB offers support for transitive queries as described
in previous work on~ULO~\cite{ulo}. A transitive query is one that,
given a relation~$R$, asks for the transitive closure~$S$
of~$R$~\cite{tc} (Figure~\ref{fig:tc}).
(Figure~\ref{fig:tc}) of~$R$.
\input{implementation-transitive-closure.tex}
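For illustration, a transitive query can also be expressed in plain
SPARQL~1.1 through property paths; the sketch below follows one or
more \texttt{ulo:uses} steps from a given object, with
\texttt{<object-uri>} as a placeholder.
\begin{lstlisting}
# The + path operator follows ulo:uses one or more times,
# i.e. it computes the transitive closure of the relation.
SELECT ?x WHERE { <object-uri> ulo:uses+ ?x }
\end{lstlisting}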
......@@ -235,14 +236,14 @@ includes not just syntax and semantics of the language itself, but
also a standardized REST interface~\cite{rest} for querying database
servers.
SPARQL was inspired by SQL and as such the \texttt{SELECT}
The SPARQL syntax was inspired by SQL and as such \texttt{SELECT}
\texttt{WHERE} queries should be familiar to many software developers.
A simple query that returns all triplets in the store looks like
\begin{lstlisting}
SELECT * WHERE { ?s ?p ?o }
\end{lstlisting}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The result of any query are valid substitutions for the
variables. The result of any query is a set of valid substitutions for all
query variables. In this particular case, the database would return a
table of all triplets in the store, listing subject~\texttt{?s},
predicate~\texttt{?p} and object~\texttt{?o}.
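Fixing parts of the triple pattern makes a query more selective; for
example, the following sketch returns only those pairs of objects
connected by the \texttt{ulo:uses} predicate.
\begin{lstlisting}
SELECT ?s ?o WHERE { ?s ulo:uses ?o }
\end{lstlisting}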
......@@ -300,7 +301,7 @@ containers, that is lightweight virtual machines with a fixed
environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of the overarching a
on~\cite[pp. 42]{dockerbook}. All configuration of the overarching
setup is stored in a Docker Compose file that describes the software
stack.
......@@ -313,11 +314,12 @@ code for Collector and Importer is available in the
deployment files, that is, the Docker Compose configuration and
additional Dockerfiles, are stored in a separate repository~\cite{dockerfilerepo}.
This concludes our discussion of the implementation developed for the
\emph{ulo-storage} project. We designed a system based around (1)~a
Collector which collects RDF triplets from third party sources, (2)~an
Importer which imports these triplets into a GraphDB database and
(3)~looked at different ways of querying a GraphDB Endpoint. All of
this is easy to deploy using a single Docker Compose file. With this
stack ready for use, we will continue with a look at some interesting
applications and queries built on top of this infrastructure.
With this, we conclude our discussion of the implementation developed
for the \emph{ulo-storage} project. We designed a system based around
(1)~a Collector which collects RDF triplets from third party sources
and (2)~an Importer which imports these triplets into a GraphDB
database; we then (3)~looked at different ways of querying a GraphDB
Endpoint. All
of this is easy to deploy using a single Docker Compose file. With
this stack ready for use, we will now continue with a look at some
interesting applications and queries built on top of this
infrastructure.
......@@ -13,14 +13,14 @@ One topic of research in this field is the idea of a \emph{tetrapodal
search} that combines four distinct areas of mathematical knowledge.
These four kinds are (1)~the actual formulae as \emph{symbolic
knowledge}, (2)~examples and concrete objects as \emph{concrete
knowledge}, (3)~names and comments as \emph{narrative knowledge} and
knowledge}, (3)~prose, names and comments as \emph{narrative knowledge} and
finally (4)~identifiers, references and their relationships, referred
to as \emph{organizational knowledge}~\cite{tetra}.
Tetrapodal search aims to provide a unified search engine that indexes
each of these four different subsets of mathematical knowledge.
Because all four kinds of knowledge are inherently unique in their
structure, tetrapodal search proposes that each kind of mathematical
structure, tetrapodal search proposes that each subset of mathematical
knowledge should be made available in a storage backend that fits the
kind of data it is providing. With all four areas available for
querying, tetrapodal search intends to then combine the four indexes
......@@ -31,8 +31,8 @@ mathematical knowledge. In this context, the focus of
A previously proposed way to structure organizational knowledge is the
\emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an
OWL~ontology~\cite{uloonto} and as such all organization information
is stored as RDF~triplets with a unified schema of
OWL~ontology~\cite{uloonto} and as such all organizational knowledge is
stored as RDF~triplets with a unified schema of
ULO~predicates~\cite{owl}. Some effort has been made to export
existing collections of formal mathematical knowledge to {ULO}. In
particular, exports from Isabelle and Coq-based libraries are
......@@ -51,8 +51,8 @@ analysis. We collected RDF files spread over different Git
repositories, imported them into a database and then experimented with
APIs for accessing that data set.
The main contribution of \emph{ulo-storage} is twofold. First, (1)~we
built up various infrastructure components that make organizational
The contribution of \emph{ulo-storage} is twofold. First, (1)~we built
up various infrastructure components that make organizational
knowledge easy to query. These components can serve as building blocks
of a larger tetrapodal search system. Their design and implementation
are discussed in Section~\ref{sec:implementation}. Second, (2)~we ran
......