Commit 3fd1ab97 authored by Andreas Schärtl

report: review intro, merge w/ components

parent 848c1987
TEX_SOURCES = report.tex abstract.tex intro.tex \
endpoints.tex applications.tex conclusion.tex \
applications-ulo-table.tex collecter.tex
\section{Endpoints}\label{sec:endpoints}
With ULO triplets imported into the GraphDB triplet store by Collecter
and Importer, we now have all data necessary for querying available.
As discussed before, querying from applications happens through an
Endpoint that exposes some kind of {API}. The interesting question
here is probably not so much the implementation of the Endpoint itself;
rather, it is the choice of API that can make or break such a project.
There are multiple approaches to querying the GraphDB triplet store,
one based around the standardized SPARQL query language and the other
on the RDF4J Java library implemented by various vendors. Both
approaches have unique advantages.
\subsection{SPARQL}
SPARQL is a standardized query language for RDF triplet
data~\cite{sparql}. The spec includes not just syntax and semantics of
the language itself, but also a standardized REST interface for
querying databases. Various implementations of this standard,
e.g.~\cite{gosparql}, are available, so using SPARQL has the advantage
of making us independent of a specific programming language or
environment.
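Because the query protocol is plain HTTP, any language with an HTTP
client can talk to such an endpoint. The following sketch sends a
query from Java using only the standard library; the endpoint URL and
the repository name \texttt{ulo} are placeholders that depend on the
local setup.
\begin{verbatim}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SparqlQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed local GraphDB instance; repositories are exposed
        // as SPARQL endpoints under /repositories/<name>.
        String endpoint = "http://localhost:7200/repositories/ulo";
        String query = "SELECT * WHERE { ?s ?p ?o } LIMIT 10";

        // SPARQL 1.1 Protocol: POST the query itself with the
        // application/sparql-query content type.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(endpoint))
            .header("Content-Type", "application/sparql-query")
            .header("Accept", "application/sparql-results+json")
            .POST(HttpRequest.BodyPublishers.ofString(query))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
\end{verbatim}
The same request could be issued from any other language or tool with
an HTTP client, which is exactly the portability argument made above.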
SPARQL is inspired by SQL and as such the \texttt{SELECT} \texttt{WHERE}
syntax should be familiar to many software developers. A simple query
that returns all triplets in the store looks like
\begin{verbatim}
SELECT * WHERE { ?s ?p ?o }
\end{verbatim}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The results of a query are valid substitutions for the
query variables. In this case, the database would return a table
of all triplets in the store sorted by subject~\texttt{?s},
predicate~\texttt{?p} and object~\texttt{?o}.
Of course, queries might return a lot of data. Importing just the
Isabelle exports into GraphDB results in more than 200 million
triplets. For practical applications it will be necessary to limit
the number of results or to use pagination
techniques~\cite{sparqlpagination}.
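One common approach, using the standard \texttt{ORDER BY},
\texttt{LIMIT} and \texttt{OFFSET} solution modifiers, is to request
one fixed-size slice of the results at a time. As a sketch, a query
for the third slice of 100 triplets could look like
\begin{verbatim}
SELECT * WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o LIMIT 100 OFFSET 200
\end{verbatim}
where the stable ordering is needed so that consecutive slices do not
overlap. Note that large \texttt{OFFSET} values can get expensive for
the database, which is what the more involved pagination techniques
cited above try to avoid.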
\subsection{RDF4J}
RDF4J is a Java API for interacting with triplet stores, implemented
based on a superset of {SPARQL}~\cite{rdf4j}. GraphDB supports RDF4J;
in fact, it is the recommended way of interacting with GraphDB
repositories~\cite{graphdbapi}. Instead of formulating textual
queries, RDF4J allows developers to query a repository by calling
Java API methods. The above query that returns all triplets in the
store looks like
\begin{verbatim}
connection.getStatements(null, null, null);
\end{verbatim}
in RDF4J. Method \texttt{getStatements(s, p, o)} returns all triplets
that have matching subject~\texttt{s}, predicate~\texttt{p} and
object~\texttt{o}. If any of these arguments is \texttt{null}, it
matches any value, i.e.\ it acts as a query variable that is to be
filled by the call to \texttt{getStatements}.
Using RDF4J does introduce a dependency on the JVM family of
languages, but also offers some conveniences. For example, we can
generate Java classes that contain all URIs in an OWL ontology as
constants~\cite{rdf4jgen}. In combination with IDE support, we found
this to be very convenient when writing applications that interface
with ULO data sets.
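As an illustration of how such generated constants are used, the
following sketch queries for all subjects declared to be theorems.
The constant is written out by hand here and its namespace IRI is a
placeholder; in practice it would come from the generated vocabulary
class.
\begin{verbatim}
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.RepositoryResult;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

public class TheoremQueryExample {
    // Stand-in for a constant from a generated vocabulary class;
    // the namespace IRI used here is a placeholder.
    static final IRI ULO_THEOREM = SimpleValueFactory.getInstance()
        .createIRI("https://example.org/ulo#theorem");

    public static void main(String[] args) {
        Repository repo =
            new HTTPRepository("http://localhost:7200/repositories/ulo");
        try (RepositoryConnection connection = repo.getConnection()) {
            // Fixing predicate and object leaves the subject as the
            // single query variable.
            try (RepositoryResult<Statement> result =
                     connection.getStatements(null, RDF.TYPE, ULO_THEOREM)) {
                while (result.hasNext()) {
                    System.out.println(result.next().getSubject());
                }
            }
        }
    }
}
\end{verbatim}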
\subsection{Comparison}
Both interfaces expose the same underlying knowledge base. Querying
with SPARQL keeps applications independent of any particular
programming language or environment, while RDF4J ties them to the JVM
but offers conveniences such as querying through plain Java method
calls and generated vocabulary constants. Which interface is
preferable depends on the application at hand.
\section{Introduction}\label{sec:introduction}
To tackle the vast array of mathematical
publications, various ways of \emph{computerizing} mathematical
knowledge have been experimented with. As it is already difficult for
human mathematicians to keep even a subset of all mathematical
knowledge in their minds, a hope is that computerization will yield
great improvements to mathematical (and really any formal) research by
making the results of all collected publications readily available and
easy to search~\cite{onebrain}.
One research topic in this field is the idea of a \emph{tetrapodal
search} that combines four distinct areas of mathematical knowledge.
These four kinds are (1)~the actual formulae as \emph{symbolic
knowledge}, (2)~examples and concrete objects as \emph{concrete knowledge},
(3)~names and comments as \emph{narrative knowledge} and finally
(4)~identifiers, references and their relationships, referred to as
@@ -25,10 +26,11 @@ exactly with the kind of data it is providing. With all four areas
available for querying, tetrapodal search intends to then combine
the four indexes into a single query interface.
\subsection{Focus on Organizational Knowledge}
Currently, research is focused on providing schemas, storage backends
and indexes for the four different kinds of mathematical
knowledge. The focus of the \emph{ulo-storage} project is the area of
knowledge. The focus of \emph{ulo-storage} is the area of
organizational knowledge.
A previously proposed way to structure such organizational data is the
@@ -41,19 +43,81 @@ from Isabelle and Coq libraries~\cite{uloisabelle, ulocoq}. The
resulting data set is already quite large, the Isabelle export alone
containing more than 200~million triplets.
Existing exports from Isabelle and Coq result in single or multiple
RDF~files. This is a convenient format for exchange and easily
versioned using Git. However, considering the vast number of triplets,
these files are impossible to query easily and efficiently as they
are. This is what \emph{ulo-storage} is focused on: making ULO data
sets accessible for querying and analysis. We collected RDF files
spread over different Git repositories, imported them into a database
and then experimented with APIs for accessing that data set.
The main contribution of this project is twofold. First, (1)~we built
various infrastructure components that can serve as building blocks
in a larger tetrapodal search system. Second, (2)~we ran sample
prototype applications and queries on top of this infrastructure.
While the applications themselves are admittedly not very interesting,
they can give us insight about future development of~{ULO}.
\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
With RDF files exported and available for download as Git repositories
on MathHub, we have the goal of making the underlying data available
for use in applications. It makes sense to first identify the various
components that might be involved in such a system. Figure~\ref{fig:components}
illustrates all components and their relationships.
\begin{figure}\begin{center}
\includegraphics{figs/components}
\caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
\end{center}\end{figure}
\begin{description}
\item[Collecter] ULO triplets are present in various locations, be it
  in Git repositories, on web servers or on local disk.
  It is the job of a Collecter to assemble these {RDF}~files and
  forward them for further processing. This may involve cloning a Git
  repository or crawling the file system.
\item[Importer] Streams of ULO files assembled by the Collecter are
  then passed to an Importer. An Importer uploads received
  RDF~streams into some kind of permanent storage. For use in this
  project, the GraphDB~\cite{graphdb} triplet store was a natural
  fit. In this project, both Collecter and Importer ended up being
  one piece of software, but this does not have to be the case; a
  minimal sketch of such a combined collect-and-import step follows
  this list.
\item[Endpoint] Finally, with all triplets stored in a database, an
  Endpoint is where applications access the underlying knowledge
  base. This does not necessarily need to be any custom software;
  rather, the programming interface of the underlying database itself
  could be understood as an endpoint of its own. Regardless, some
  thought can be put into designing an Endpoint as a layer between
  application and database that is more convenient to use than the
  interface provided by the database itself.
\end{description}
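The following is a minimal sketch of such a combined Collecter and
Importer, assuming RDF/XML files on local disk and a local GraphDB
repository accessed through {RDF4J}; the directory, repository URL
and file format are placeholders.
\begin{verbatim}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;
import org.eclipse.rdf4j.rio.RDFFormat;

public class CollectAndImport {
    public static void main(String[] args) throws Exception {
        // Collecter: crawl a directory tree for RDF files; the path
        // stands in for e.g. a cloned Git repository.
        Path root = Path.of("/data/ulo-exports");
        List<Path> rdfFiles;
        try (Stream<Path> tree = Files.walk(root)) {
            rdfFiles = tree.filter(p -> p.toString().endsWith(".rdf"))
                           .collect(Collectors.toList());
        }

        // Importer: upload each RDF stream into the triplet store.
        Repository repo =
            new HTTPRepository("http://localhost:7200/repositories/ulo");
        try (RepositoryConnection connection = repo.getConnection()) {
            for (Path file : rdfFiles) {
                try (InputStream in = Files.newInputStream(file)) {
                    connection.add(in, "", RDFFormat.RDFXML);
                }
            }
        }
    }
}
\end{verbatim}
A real implementation would add error handling, logging and support
for fetching remote sources, but the overall shape stays the same.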
\subsection{An Additional Harvester Component}
These are the components realized for \emph{ulo-storage}. However, in
addition to these components, one could think of a \emph{Harvester}
component. So far, we assumed that ULO triplets are already available
in RDF~format. This is not necessarily true. It might be desirable to
automate the export from third-party formats to ULO, and we think
this should be the job of a Harvester component. It would fetch
mathematical knowledge from some remote source and provide a volatile
stream of ULO data to the Collecter, which would then pass it on to
the Importer and so on. The big advantage of such an approach is that
exports from third-party libraries can always be kept up to date and
do not have to be initiated manually.
We did not implement a Harvester for \emph{ulo-storage} but we suggest
that it is an idea to keep in mind. The components we did implement
(Collecter, Importer and Endpoint) provide us with an easy and
automated way of making RDF files ready for use with applications. In
this introduction we only wanted to give the reader a general
understanding of the infrastructure that makes up \emph{ulo-storage};
the following sections will explain each component in more detail.
@@ -50,7 +50,6 @@
\newpage
\input{intro.tex}
\input{collecter.tex}
\input{endpoints.tex}
\input{applications.tex}