\section{Introduction}\label{sec:introduction}
To tackle the vast array of mathematical
publications, various ways of \emph{computerizing} mathematical
knowledge have been experimented with. As it is already difficult for
human mathematicians to keep even a subset of all mathematical
knowledge in mind, the hope is that computerization will yield
great improvements to mathematical (and really any formal) research by
making the results of all collected publications readily available and
easy to search~\cite{onebrain}.
One research topic in this field is the idea of a \emph{tetrapodal
search} that combines four distinct areas of mathematical knowledge.
These four kinds are (1)~the actual formulae as \emph{symbolic
knowledge}, (2)~examples and concrete objects as \emph{concrete knowledge},
(3)~names and comments as \emph{narrative knowledge} and finally
(4)~identifiers, references and their relationships, referred to as
\emph{organizational knowledge}~\cite{tetra}.
Tetrapodal search aims to provide a unified search engine that indexes
each of the four different subsets of mathematical knowledge. Because
all four kinds of knowledge are inherently different in their
structure, tetrapodal search proposes that each kind of mathematical
knowledge should be made available in a storage backend suited to the
data it holds. With all four areas available for
querying, tetrapodal search then intends to combine the four indexes
into a single query interface.
\subsection{Focus on Organizational Knowledge}
Currently, research is focused on providing schemas, storage backends
and indexes for the four different kinds of mathematical
knowledge. The focus of \emph{ulo-storage} is the area of
organizational knowledge.
A previously proposed way to structure such organizational data is the
\emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an
OWL~ontology~\cite{uloonto} and as such all organizational information
is stored as RDF~triplets with a unified schema of
ULO~predicates~\cite{owl}. Some effort has been made to export
existing databases of formal mathematical knowledge to {ULO}; in
particular, there exist exports from Isabelle and Coq
libraries~\cite{uloisabelle, ulocoq}. The resulting data set is
already quite large, the Isabelle export alone containing more than
200~million triplets.
Existing exports from Isabelle and Coq result in one or more
RDF~files. This is a convenient format for exchange that is easily
versioned using Git. However, given the vast number of triplets,
such files cannot be queried easily or efficiently as they are. This
is what \emph{ulo-storage} focuses on: making ULO data sets
accessible for querying and analysis. We collected RDF files spread
over different Git repositories, imported them into a database and
then experimented with APIs for accessing that data set.
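
To give an impression of what such access can look like, the following
is a minimal sketch that runs a SPARQL query against a GraphDB
repository using the Python \texttt{SPARQLWrapper} library. The
endpoint URL and repository name are illustrative assumptions, not the
actual deployment.

\begin{verbatim}
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed GraphDB SPARQL endpoint; URL and repository name are
# placeholders for whatever deployment holds the imported ULO data.
ENDPOINT = "http://localhost:7200/repositories/ulo"

# Count how often each predicate occurs in the imported data set,
# a simple way to get an overview of a large collection of triplets.
QUERY = """
    SELECT ?predicate (COUNT(*) AS ?count)
    WHERE { ?subject ?predicate ?object . }
    GROUP BY ?predicate
    ORDER BY DESC(?count)
    LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["predicate"]["value"], row["count"]["value"])
\end{verbatim}
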
The main contribution of \emph{ulo-storage} is twofold. First, (1)~we
implemented various infrastructure components that can serve as building
blocks in a larger tetrapodal search system. Second, (2)~we ran sample
prototype applications and queries on top of this infrastructure. While the
applications themselves are admittedly not very interesting, they can
give us insight into the future development of the upper level ontology.
\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
With RDF files exported and available for download as Git repositories
on MathHub, our goal is to make the underlying data available
for use in applications. It makes sense to first identify the various
components that might be involved in such a system.
Figure~\ref{fig:components} illustrates all components and their
relationships.
\begin{figure}\begin{center}
\includegraphics{figs/components}
\caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
\end{center}\end{figure}
\begin{itemize}
\item ULO triplets are present in various locations, be it in Git
  repositories, on web servers or on the local disk. It is the job of a
\emph{Collecter} to assemble these {RDF}~files and forward them for further
processing. This may involve cloning a Git repository or crawling
the file system.
\item With streams of ULO files assembled by the Collecter, the
  data then gets passed to an \emph{Importer}. An Importer uploads
  RDF~streams into some kind of permanent storage. For this project,
  the GraphDB~\cite{graphdb} triplet store was a natural fit.
  In our implementation, both Collecter and Importer ended up being one
  piece of monolithic software, but this does not have to be the case;
  a sketch of this pipeline is given after the list.
\item Finally, with all triplets stored in a database, an
  \emph{Endpoint} is where applications access the underlying
  knowledge base. This does not necessarily need to be custom
  software; the programming interface of the underlying
  database itself could be understood as an endpoint in its own right.
  Regardless, some thought can be put into designing an Endpoint as a
  layer between application and database that is more
  convenient to use than the interface the database provides itself.
  It comes down to the programming interface we wish to offer a
  developer using this system.
\end{itemize}
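
To make the division of labor between Collecter and Importer more
concrete, the following sketch crawls a directory for exported RDF
files and uploads each of them to GraphDB through its RDF4J-compatible
REST interface. The directory layout, server URL and repository name
are assumptions for illustration, not the interface of the actual
implementation.

\begin{verbatim}
from pathlib import Path
import requests

# Assumed GraphDB instance and repository; adjust to the deployment.
GRAPHDB = "http://localhost:7200"
REPOSITORY = "ulo"

def collect(root):
    """Collecter: find exported RDF/XML files below a directory."""
    return sorted(Path(root).rglob("*.rdf"))

def import_file(path):
    """Importer: upload one RDF file into the triplet store."""
    with open(path, "rb") as handle:
        response = requests.post(
            f"{GRAPHDB}/repositories/{REPOSITORY}/statements",
            data=handle,
            headers={"Content-Type": "application/rdf+xml"},
        )
    response.raise_for_status()

if __name__ == "__main__":
    for rdf_file in collect("exports"):
        import_file(rdf_file)
\end{verbatim}
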
\subsection{An Additional Harvester Component}
These are the three components realized for
\emph{ulo-storage}. In addition to these components, one
could think of a \emph{Harvester} component. Above we assumed that
the ULO triplets are already available in RDF~format. This is not
necessarily true. It might be desirable to automate the export from
third party formats to ULO, and we think this should be the job of a
Harvester component. It fetches mathematical knowledge from some
remote source and then provides a volatile stream of ULO data to the
Collecter, which then passes it on to the Importer and so on. The big
advantage of such an approach would be that exports from third party
libraries would always be up to date and would not have to be initiated
manually. Another advantage of this hypothetical component is that
running exports through the Harvester exercises the whole import chain
of Collecter and Importer, which includes syntax~checking of the
exported RDF data. Bugs in exporters that produce faulty XML would be
found earlier in development.
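
As an illustration of the kind of syntax checking the import chain
provides for free, the sketch below tries to parse an exported RDF/XML
file with the Python \texttt{rdflib} library and reports malformed
input. The use of \texttt{rdflib} here is an assumption for
demonstration, not necessarily the checker employed by the actual
exporters or the triplet store.

\begin{verbatim}
import sys
from rdflib import Graph

def check_rdf_syntax(path):
    """Try to parse an RDF/XML export, reporting the first error."""
    graph = Graph()
    try:
        graph.parse(path, format="xml")
    except Exception as error:
        # rdflib raises parser-specific errors on malformed input.
        print(f"{path}: malformed RDF/XML ({error})")
        return False
    print(f"{path}: parsed {len(graph)} triplets")
    return True

if __name__ == "__main__":
    check_rdf_syntax(sys.argv[1])
\end{verbatim}
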
We did not implement a Harvester for \emph{ulo-storage}, but we suggest
it as an idea to keep in mind. The components we did implement
(Collecter, Importer and Endpoint) provide us with an easy and
automated way of making RDF files ready for use with applications. In
this introduction we only wanted to give the reader a general
understanding of the infrastructure that makes up \emph{ulo-storage};
the following sections will explain each component in more detail.