Commit a6fb8743 authored by Andreas Schärtl

report: final paper-based review

parent 73da559d
@@ -6,8 +6,8 @@ necessary for this information to be available for querying.
\emph{ulo-storage} aims to lay the groundwork for this task.
We collected various pieces of exported organizational knowledge into
a centralized and efficient store, made the resulting knowledge graph
available on the network and then evaluated different ways of querying
that store. In addition, implementations of some exemplary queries on
top of our created infrastructure resulted in insights on how unified
schemas for organizational mathematical knowledge could be extended.
@@ -36,7 +36,7 @@
        \caption{An adapted SPARQL~query based on~\ref{fig:q2a}. It
          lists all works authored by ``John Smith'' rated by number
          of references. The idea is that works that were referenced more
          often are more important.}\label{fig:q2b}
    \end{subfigure}
    \vspace{8mm}
......
@@ -5,14 +5,14 @@
    \begin{subfigure}[]{0.475\textwidth}
        \centering
        \shadowbox{\includegraphics[width=\textwidth]{figs/explore1.png}}
        \caption{Listing all contributors in the data set. Clicking
          on a given author shows their individual contributions.}
    \end{subfigure}
    \hfill
    \begin{subfigure}[]{0.475\textwidth}
        \centering
        \shadowbox{\includegraphics[width=\textwidth]{figs/explore2.png}}
        \caption{Listing works a given contributor has worked on.\\}
    \end{subfigure}
    \vskip\baselineskip{}
    \begin{subfigure}[]{0.475\textwidth}
......
@@ -17,13 +17,12 @@ Endpoint, our second contribution.
\item We investigated queries that could be used to extend the
  system into a larger tetrapodal search system. While some
  organizational queries have obvious canonical solutions, others
  introduce questions on how organizational knowledge should be
  structured (Section~\ref{sec:tetraq}).
\end{itemize}

\noindent We will now discuss each application in more detail.

\subsection{Exploring Existing Data Sets}\label{sec:expl}
@@ -47,7 +46,7 @@ predicates. For example, the Isabelle export contains organizational
meta information such as details about paragraphs and sections in
the source document while the Coq export only tells us the filename of
the original Coq source. Both exports have their particularities and
with more and more third party libraries exported to ULO, one has to
assume that this heterogeneity will only grow. In particular, we want
to point to the large number of predicates which remain unused in both
Isabelle and Coq exports. A user formulating queries for ULO might be
@@ -77,41 +76,52 @@ increases.
\subsection{Interactive Exploration}\label{sec:exp}

The second application we want to discuss illustrates how to browse an
ULO data set interactively. Here, we developed a web front end that
allows users to browse contributions to the knowledge base sorted by
author. In this particular case, we used the RDF4J Endpoint; the
application itself is implemented in Java. Figure~\ref{fig:appss}
shows four screenshots of the current version. Once the user has
selected a given contribution, we list some basic information about
the selected object, such as type (e.g.\ lemma or corollary) and name
as well as references to other objects in the knowledge base.

\input{applications-screenshots.tex}

This experiment is interesting because similar approaches could be
applied when designing IDE~integration for working with organizational
knowledge. Given some object in a document identified by a URI, we
can look at how that object is connected to other objects in the
knowledge graph.
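To give a concrete impression, a lookup of this kind can be phrased as
a single SPARQL query. The following is only a sketch of ours, not
code taken from the front end: the object URI is a placeholder, and
the prefix IRI as well as the use of \texttt{rdfs:label} as a display
name are assumptions on our part.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?type ?name ?used WHERE {
  <https://example.org/some-object> a ?type .             # e.g. ulo:lemma
  OPTIONAL { <https://example.org/some-object> rdfs:label ?name }
  OPTIONAL { <https://example.org/some-object> ulo:uses ?used }
}
\end{lstlisting}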
Our front end can also connect the dots back from ULO~object to
original source code. In particular, we took a look at the
\texttt{ulo:sourceref} predicate defined for many objects in the
knowledge base. \texttt{ulo:sourceref} is supposed to contain ``the
URI of the physical location (e.g., file/URI, line, column) of the
source code that introduced the subject''~\cite{uloonto} and is in use
in the Isabelle export~\cite{uloisabelle}. But while making the
connection from exported ULO~object to original Isabelle document is
convenient for the user, this feature required some extra work from us
as application implementors. The reason for this is that the format of
the \texttt{ulo:sourceref} property appears not to be well-defined;
rather, it is up to implementors of library exporters how to format
these source references. For example, in the Isabelle export, we have
to translate source references such as
\begin{verbatim}
https://isabelle.in.tum.de/source/HOL/HOL-Algebra/
HOL-Algebra.Group.theory#17829.576.8:17836.576.15
\end{verbatim}
to the original source
\begin{verbatim}
https://isabelle.in.tum.de/dist/library/HOL/HOL-Algebra/Group.html
\end{verbatim}
The consequence is that for each export, developers of front ends will
need to write custom code for finding the original source. Maybe it
would be a reasonable requirement for \texttt{ulo:sourceref} to have a
well-defined format, ideally an actually reachable URI on the open
web, if that is applicable for a given library.

While translating from \texttt{ulo:sourceref} to original URI
introduced some extra work, implementing this application was easy and
@@ -128,11 +138,11 @@ proof of concept implementations and evaluate their applicability.
\subsubsection{Elementary Proofs and Computed Scores}

We start with query~$\mathcal{Q}_1$, which illustrates how we can
compute arithmetic scores for objects in our knowledge graph.
Query~$\mathcal{Q}_1$ asks us to ``\emph{[f]ind theorems with
non-elementary proofs}''~\cite{tetra}. Elementary proofs can be
understood as those proofs that are considered easy and
obvious~\cite{elempro,elemtest}. In consequence,~$\mathcal{Q}_1$ has
to search for all proofs which are not trivial. Of course, just like
any distinction between ``theorem'' and ``corollary'' is going to be
@@ -149,13 +159,10 @@ prospective school teachers had problems with notation and term
rewriting and were missing required prerequisites~\cite{proofund,
proofteach}, none of which seems applicable for grading individual
formal proofs for difficulty. On the other hand, there is very
specific research on rating proofs for concrete problems such as the
satisfiability of a given CNF formula~\cite{proofsat}.

\emph{Organizational Aspect.} A first working hypothesis
might be to assume that elementary proofs are short. In that case, the
size, that is, the number of bytes to store the proof, is our first
indicator of proof complexity. This is by no means perfect, as even
@@ -170,22 +177,22 @@ working hypothesis.
us to sort by file size. Maybe low proof complexity also leads to quick
check times in proof assistants and automatic theorem provers. With
this assumption in mind, we could use the \texttt{ulo:check-time}
predicate. Correlating proof complexity with file size or check time
allows us to define one indicator of proof complexity based on
organizational knowledge alone.

\emph{Other Aspects.} A tetrapodal search system should
probably also take symbolic knowledge into account. Based on some kind
of measure of formula complexity, different proofs could be
rated. Similarly, with narrative knowledge available to us, we could
count the number of words, citations and so on to rate the narrative
complexity of a proof. Combining symbolic knowledge, narrative
knowledge and organizational knowledge should allow us to find proofs
which are easier than others.

\input{applications-q1.tex}

\emph{Implementation.} Implementing a naive version of the
organizational aspect can be as simple as querying for all theorems
justified by proofs, ordered by size (or check time).
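A minimal sketch of such a query, written by us for illustration,
might look as follows; the predicate \texttt{ulo:justifies}, the
treatment of \texttt{ulo:theorem} and \texttt{ulo:proof} as classes
and the prefix IRI are assumptions about the ULO vocabulary and may
differ from the query actually used in Figure~\ref{fig:q1short}.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI

SELECT ?theorem ?time WHERE {
  ?theorem a ulo:theorem .
  ?proof a ulo:proof .
  ?proof ulo:justifies ?theorem .        # predicate name assumed
  ?proof ulo:check-time ?time .
}
ORDER BY DESC(?time)
\end{lstlisting}
Ordering by a size predicate instead of \texttt{ulo:check-time} works
the same way.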
Figure~\ref{fig:q1short} illustrates how this can be achieved with a
@@ -231,13 +238,13 @@ something that will surely be useful for many applications.
\subsubsection{Categorizing Algorithms and Algorithmic Problems}

The second query~$\mathcal{Q}_2$ we decided to focus on wants to
``\emph{[f]ind algorithms that solve $NP$-complete graph
problems}''~\cite{tetra}. Here we want the tetrapodal search system
to return a listing of algorithms that solve (graph) problems with a
given property (runtime complexity). We need to consider where and how
each of these components might be found.

\emph{Symbolic and Concrete Aspects.} First, let us consider
algorithms. Algorithms can be formulated as computer code which can be
understood as symbolic knowledge (code represented as a syntax tree)
or as concrete knowledge (code as text
@@ -248,25 +255,20 @@ automatically in the general case~\cite{np}. Metadata of this kind
needs to be stored in a separate index for organizational knowledge,
it being the only fit.

\emph{Organizational Aspect.} If we wish to look up properties about
algorithms from organizational knowledge, we first have to think about
how to represent this information. As a first approach, we can try to
represent algorithms and problems in terms of existing ULO
predicates. As ULO does have a concept of \texttt{ulo:theorem} and
\texttt{ulo:proof}, it might be tempting to exploit these predicates
and represent algorithms understood as programs in terms of proofs.
But that does not really capture the canonical understanding of
algorithms. Algorithms are not actually programs; rather, there are
programs that implement algorithms. Even if we do work around this
problem, it is not clear how to represent problems (e.g.\ the
traveling salesman problem or the sorting problem) in terms of
theorems (propositions, types) that get implemented by a proof
(algorithm, program).

As algorithms make up an important part of certain areas of research,
it might be reasonable to introduce native level support for
@@ -291,16 +293,16 @@ solve algorithmic problems.
\subsubsection{Contributors and Number of References}

Finally, query~$\mathcal{Q}_3$ from the literature wants to know
``\emph{[a]ll areas of math that {Nicolas G.\ de Bruijn} has worked in
and his main contributions}''~\cite{tetra}. $\mathcal{Q}_3$~is
asking for works of a given author~$A$. It also asks for their main
contributions, for example which particularly interesting paragraphs
or code~$A$ has authored. We picked this particular query as it is
asking for metadata, something that should be easily serviced by
organizational knowledge.

\emph{Organizational Aspect.} ULO has no concept of authors,
contributors, dates and so on. Rather, the idea is to take advantage
of the Dublin Core project which provides an ontology for such
metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides us
@@ -316,19 +318,18 @@ objects ordered by authors.
\input{applications-q3.tex}

\emph{Implementation.} A search for contributions by a given author
can easily be formulated in {SPARQL}~(Figure~\ref{fig:q2a}). Query
$\mathcal{Q}_3$ is also asking for the main contributions of~$A$, that
is, those works that~$A$ authored that are the most important. Sorting
the result by number of references can be a good start. To get the
main contributions, we rate each individual work by its number of
\texttt{ulo:uses} references. Extending the previous {SPARQL} query,
we can ask the database for an ordered list of works, starting with
the one that has the most references~(Figure~\ref{fig:q2b}). We see
that one can formulate~$\mathcal{Q}_3$ with just one SPARQL
query. Because everything is handled by the database, access should be
about as quick as we can hope it to be.
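For illustration, a hedged sketch in the spirit of
Figure~\ref{fig:q2b} is given below. It is our own rendering, not the
exact query from the figure; in particular, the prefix IRI and the
assumption that author names are stored as plain literals under
\texttt{dcterms:creator} are ours.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?work (COUNT(?other) AS ?refs) WHERE {
  ?work dcterms:creator "John Smith" .   # name as plain literal (assumed)
  ?other ulo:uses ?work .                # works referencing ?work
}
GROUP BY ?work
ORDER BY DESC(?refs)
\end{lstlisting}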
While the sparse data set available to us only returned a handful of
results, we see that queries like~$\mathcal{Q}_3$ are easily serviced
@@ -341,15 +342,16 @@ approaches~\cite{citcart}.
Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with
some insight into ULO and existing ULO exports. $\mathcal{Q}_1$ shows
that while there is no universal definition for ``elementary proof'',
ULO allows us to query for heuristics and calculate arbitrary
arithmetic scores for objects of organizational
knowledge. Query~$\mathcal{Q}_2$ illustrates the difficulty in finding
universal schemas. It remains an open question whether ULO should
include algorithms as a first class citizen, as a concept based around
existing ULO predicates or whether it is a better idea to design a
dedicated ontology and data store entirely. Finally, while we were
able to formulate a SPARQL query that should take care of most
of~$\mathcal{Q}_3$, we found that the existing data sets contain very
little information about authorship. This underlines the observations
made previously in Section~\ref{sec:expl} and should be on the mind of
anyone writing exporters that output ULO~triplets.
@@ -5,50 +5,50 @@ Section~\ref{sec:implementation} we were able to take existing RDF
exports and import them into a GraphDB database. This made it possible
to experiment with the applications and examples of
Section~\ref{sec:applications}. We showed that organizational
knowledge formulated as ULO triplets can already give some insights.
In particular, it is possible to formulate queries about meta
information such as authorship and contribution. We also managed to
resolve the interlinks between proofs and theorems.

To sum up this report, we want to recap three problems we encountered
when working on \emph{ulo-storage}. They have the potential to result
in future projects that further improve the availability of
organizational knowledge.

The first problem was that of malformed RDF~exports. Various exports
contained invalid URIs and incorrect namespaces. As a workaround, we
provided on the fly correction. But this does not solve the problem in
itself. Perhaps a long term solution for this problem is to fully
automate the export from third party libraries (e.g.\ Coq, Isabelle)
to ULO triplets in a database, eliminating the step of XML files on
disk. During import into the database, the imported data is thoroughly
checked and mistakes are reported right away. Bugs in exporters that
produce faulty XML would be found earlier in development.

The second problem we want to recap is that of versioning triplets in
the GraphDB triple store. While adding new triplets to an existing
GraphDB store is not a problem, updating existing triplets is
difficult. \emph{ulo-storage} circumvents this problem by re-creating
the GraphDB data set at regular intervals. Indeed, it might be
difficult to find an efficient alternative. Tagging each triplet with
some version number doubles the number of triplets that need to be
stored. Re-creating the index and maybe splitting up the knowledge
base into smaller, easier to update sub-repositories looks like the
most promising approach for now. What is missing is an efficient way
of servicing queries across multiple database instances.

Finally, perhaps the most interesting but also most challenging
question is the discussion about how to develop ULO, the upper level
ontology for organizational knowledge. On the one hand, ULO boasts a
total of almost 80~predicates, yet only a third of them are actually
used by existing Coq and Isabelle exports. A developer writing queries
that take advantage of the full ULO~vocabulary might be surprised that
no data is coming back. On the other hand, we found it difficult to
model algorithms in terms of ULO predicates and argued that it might
be necessary to further extend the upper level ontology. It will be
necessary to look at the properties of future exports to make a more
well-founded decision.

Despite many open questions, \emph{ulo-storage} provides the necessary
infrastructure for importing ULO triplets into an efficient storage
......
@@ -61,10 +61,10 @@ is no pressing need to force them into separate processes.
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
repository and then passes the checked out working copy to the file
system Collector. Because we found that it is not uncommon for RDF files
to be compressed, our implementation supports on the fly extraction of
gzip~\cite{gzip} and xz~\cite{xz} formats which can greatly reduce the
required disk space in the collection step.
@@ -75,7 +75,7 @@ Coq exports contained URIs which does not fit the official syntax
specification~\cite{rfc3986} as they contained illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open
Source~\cite{wikivirtuoso} which does not properly check URIs according
to spec; in consequence these faults were only discovered now. To
tackle these problems, we introduced on the fly correction steps
during collection that escape the URIs in question and then continue
@@ -164,18 +164,19 @@ Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
the problems related to updating existing data sets, but also means
that changes in a given library~$\mathcal{L}_i$ take some time to propagate
to~$\mathcal{D}$. Continuing this train of thought, an advanced
version of this approach could forgo the requirement for one single
database storage~$\mathcal{D}$ entirely. Instead of maintaining just
one global database state~$\mathcal{D}$, we suggest experimenting with
dedicated database instances~$\mathcal{D}_i$ for each given
library~$\mathcal{L}_i$. The advantage here is that re-creating a
given database representation~$\mathcal{D}_i$ is fast as
exports~$\mathcal{E}_i$ are comparably small. The disadvantage is that
we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require
the development of some cross-database query mechanism, functionality
for which existing systems currently offer only limited
support~\cite{graphdbnested}.
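One building block that plain SPARQL~1.1 already offers for this is
federated querying via the \texttt{SERVICE} keyword. A sketch with two
hypothetical endpoint URLs standing in for~$\mathcal{D}_1$
and~$\mathcal{D}_2$ could look like the following; whether this
approach performs acceptably on the full data set~$\mathcal{D}$ is
exactly the open question raised above.
\begin{lstlisting}
# db1/db2 are hypothetical endpoints for two library databases
SELECT ?s ?p ?o WHERE {
  { SERVICE <https://db1.example.org/sparql> { ?s ?p ?o } }
  UNION
  { SERVICE <https://db2.example.org/sparql> { ?s ?p ?o } }
}
\end{lstlisting}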
In summary, we see that versioning is a potential challenge for a
greater tetrapodal search system. While not a pressing issue for
@@ -188,7 +189,7 @@ Endpoint. Recall that an Endpoint provides the programming interface
for applications that wish to query our collection of organizational
knowledge. In practice, the choice of Endpoint programming interface
is determined by the choice of database system as the Endpoint is
provided directly by the database system.

In our project, organizational knowledge is formulated as
RDF~triplets. The canonical choice for us is to use a triple store,
@@ -206,7 +207,7 @@ OWL~Reasoning~\cite{owlspec, graphdbreason}. In particular, this
means that GraphDB offers support for transitive queries as described
in previous work on~ULO~\cite{ulo}. A transitive query is one that,
given a relation~$R$, asks for the transitive closure~$S$
of~$R$ (Figure~\ref{fig:tc}).

\input{implementation-transitive-closure.tex}
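Independent of OWL~reasoning, plain SPARQL~1.1 can express many
transitive queries through property paths. The following sketch of
ours collects everything reachable from a given object via one or more
\texttt{ulo:uses} edges; the starting URI is a placeholder and the
prefix IRI is assumed.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI

SELECT ?reached WHERE {
  <https://example.org/some-object> ulo:uses+ ?reached .
}
\end{lstlisting}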
@@ -235,14 +236,14 @@ includes not just syntax and semantics of the language itself, but
also a standardized REST interface~\cite{rest} for querying database
servers.

The SPARQL syntax was inspired by SQL and as such the \texttt{SELECT}
\texttt{WHERE} syntax should be familiar to many software developers.
A simple query that returns all triplets in the store looks like
\begin{lstlisting}
SELECT * WHERE { ?s ?p ?o }
\end{lstlisting}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The result of any query is a set of valid substitutions for
all query variables. In this particular case, the database would
return a table of all triplets in the store, with columns for
subject~\texttt{?s}, predicate~\texttt{?p} and object~\texttt{?o}.
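Restricting such a query is then a matter of replacing query variables
with constants and adding triple patterns. For example, a query
returning only objects typed as \texttt{ulo:theorem} might look like
the following sketch; the prefix IRI is an assumption on our part.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI

SELECT ?s WHERE { ?s a ulo:theorem }
\end{lstlisting}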
@@ -300,7 +301,7 @@ containers, that is lightweight virtual machines with a fixed
environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of the overarching
setup is stored in a Docker Compose file that describes the software
stack.
@@ -313,11 +314,12 @@ code for Collector and Importer is available in the
deployment files, that is Docker Compose configuration and additional
Dockerfiles, are stored in a separate repository~\cite{dockerfilerepo}.

With this, we conclude our discussion of the implementation developed
for the \emph{ulo-storage} project. We designed a system based around
(1)~a Collector which collects RDF triplets from third party sources
and (2)~an Importer which imports these triplets into a GraphDB
database; we then (3)~looked at different ways of querying a GraphDB
Endpoint. All of this is easy to deploy using a single Docker Compose
file. With this stack ready for use, we will now continue with a look
at some interesting applications and queries built on top of this
infrastructure.
@@ -13,14 +13,14 @@ One topic of research in this field is the idea of a \emph{tetrapodal
search} that combines four distinct areas of mathematical knowledge.
These four kinds are (1)~the actual formulae as \emph{symbolic
knowledge}, (2)~examples and concrete objects as \emph{concrete
knowledge}, (3)~prose, names and comments as \emph{narrative knowledge} and
finally (4)~identifiers, references and their relationships, referred
to as \emph{organizational knowledge}~\cite{tetra}.

Tetrapodal search aims to provide a unified search engine that indexes
each of these four different subsets of mathematical knowledge.
Because all four kinds of knowledge are inherently unique in their
structure, tetrapodal search proposes that each subset of mathematical
knowledge should be made available in a storage backend that fits the
kind of data it is providing. With all four areas available for
querying, tetrapodal search intends to then combine the four indexes
@@ -31,8 +31,8 @@ mathematical knowledge. In this context, the focus of
A previously proposed way to structure organizational knowledge is the
\emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an
OWL~ontology~\cite{uloonto} and as such all organizational knowledge is
stored as RDF~triplets with a unified schema of
ULO~predicates~\cite{owl}. Some effort has been made to export
existing collections of formal mathematical knowledge to {ULO}. In
particular, exports from Isabelle and Coq-based libraries are
@@ -51,8 +51,8 @@ analysis. We collected RDF files spread over different Git
repositories, imported them into a database and then experimented with
APIs for accessing that data set.

The contribution of \emph{ulo-storage} is twofold. First, (1)~we built
up various infrastructure components that make organizational
knowledge easy to query. These components can make up building blocks
of a larger tetrapodal search system. Their design and implementation
are discussed in Section~\ref{sec:implementation}. Second, (2)~we ran
......