Commit a6fb8743 authored by Andreas Schärtl

report: final paper-based review

parent 73da559d
@@ -6,8 +6,8 @@ necessary for this information to be available for querying.
\emph{ulo-storage} aims to lay the groundwork for this task.
We collected various pieces of exported organizational knowledge into
a centralized and efficient store, made the resulting knowledge graph
available on the network and then evaluated different ways of querying
that store. In addition, implementations of some exemplary queries on
top of our created infrastructure resulted in insights on how unified
schemas for organizational mathematical knowledge could be extended.
@@ -36,7 +36,7 @@
        \caption{An adapted SPARQL~query based on~\ref{fig:q2a}. It
          lists all works authored by ``John Smith'' rated by number
          of references. The idea is that works that were referenced more
          often are more important.}\label{fig:q2b}
    \end{subfigure}
    \vspace{8mm}
......
@@ -5,14 +5,14 @@
    \begin{subfigure}[]{0.475\textwidth}
        \centering
        \shadowbox{\includegraphics[width=\textwidth]{figs/explore1.png}}
        \caption{Listing all contributors in the data set. Clicking
          on a given author shows their individual contributions.}
    \end{subfigure}
    \hfill
    \begin{subfigure}[]{0.475\textwidth}
        \centering
        \shadowbox{\includegraphics[width=\textwidth]{figs/explore2.png}}
        \caption{Listing works a given contributor has worked on.\\}
    \end{subfigure}
    \vskip\baselineskip{}
    \begin{subfigure}[]{0.475\textwidth}
......
@@ -17,13 +17,12 @@ Endpoint, our second contribution.
\item We investigated queries that could be used to extend the
  system into a larger tetrapodal search system. While some
  organizational queries have obvious canonical solutions, others
  introduce questions on how organizational knowledge should be
  structured (Section~\ref{sec:tetraq}).
\end{itemize}

\noindent We will now discuss each application in more detail.

\subsection{Exploring Existing Data Sets}\label{sec:expl}
@@ -47,7 +46,7 @@ predicates. For example, the Isabelle export contains organizational
meta information such as details about paragraphs and sections in
the source document while the Coq export only tells us the filename of
the original Coq source. Both exports have their particularities and
with more and more third party libraries exported to ULO, one has to
assume that this heterogeneity will only grow. In particular, we want
to point to the large number of predicates which remain unused in both
Isabelle and Coq exports. A user formulating queries for ULO might be
@@ -77,41 +76,52 @@ increases.
\subsection{Interactive Exploration}\label{sec:exp}

The second application we want to discuss illustrates how to browse an
ULO data set interactively. Here, we developed a web front end that
allows users to browse contributions to the knowledge base sorted by
author. In this particular case, we used the RDF4J Endpoint; the
application itself is implemented in Java. Figure~\ref{fig:appss}
shows four screenshots of the current version. Once the user has
selected a given contribution, we list some basic information about
the selected object, such as type (e.g.\ lemma or corollary) and name
as well as references to other objects in the knowledge base.

\input{applications-screenshots.tex}

This experiment is interesting because similar approaches could be
applied when designing IDE~integration for working with organizational
knowledge. Given some object in a document identified by a URI, we
can look at how that object is connected to other objects in the
knowledge graph.
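To give a concrete impression, a lookup of this kind can be phrased as
a single SPARQL query. The following is only a sketch of ours, not
code taken from the front end: the object URI is a placeholder, and
the prefix IRI as well as the use of \texttt{rdfs:label} as a display
name are assumptions on our part.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?type ?name ?used WHERE {
  <https://example.org/some-object> a ?type .             # e.g. ulo:lemma
  OPTIONAL { <https://example.org/some-object> rdfs:label ?name }
  OPTIONAL { <https://example.org/some-object> ulo:uses ?used }
}
\end{lstlisting}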
Our front end can also connect the dots back from ULO~object to
original source code. In particular, we took a look at the
\texttt{ulo:sourceref} predicate defined for many objects in the
knowledge base. \texttt{ulo:sourceref} is supposed to contain ``the
URI of the physical location (e.g., file/URI, line, column) of the
source code that introduced the subject''~\cite{uloonto} and is in use
in the Isabelle export~\cite{uloisabelle}. But while making the
connection from exported ULO~object to original Isabelle document is
convenient for the user, this feature required some extra work from us
as application implementors. The reason for this is that the format of
the \texttt{ulo:sourceref} property appears not to be well-defined;
rather, it is up to implementors of library exporters how to format
these source references. For example, in the Isabelle export, we have
to translate source references such as
\begin{verbatim}
https://isabelle.in.tum.de/source/HOL/HOL-Algebra/
HOL-Algebra.Group.theory#17829.576.8:17836.576.15
\end{verbatim}
to the original source
\begin{verbatim}
https://isabelle.in.tum.de/dist/library/HOL/HOL-Algebra/Group.html
\end{verbatim}
The consequence is that for each export, developers of front ends will
need to write custom code for finding the original source. Maybe it
would be a reasonable requirement for \texttt{ulo:sourceref} to have a
well-defined format, ideally an actually reachable URI on the open
web, if that is applicable for a given library.

While translating from \texttt{ulo:sourceref} to original URI
introduced some extra work, implementing this application was easy and
@@ -128,11 +138,11 @@ proof of concept implementations and evaluate their applicability.
\subsubsection{Elementary Proofs and Computed Scores}

We start with query~$\mathcal{Q}_1$, which illustrates how we can
compute arithmetic scores for objects in our knowledge graph.
Query~$\mathcal{Q}_1$ asks us to ``\emph{[f]ind theorems with
non-elementary proofs}''~\cite{tetra}. Elementary proofs can be
understood as those proofs that are considered easy and
obvious~\cite{elempro,elemtest}. In consequence,~$\mathcal{Q}_1$ has
to search for all proofs which are not trivial. Of course, just like
any distinction between ``theorem'' and ``corollary'' is going to be
@@ -149,13 +159,10 @@ prospective school teachers had problems with notation and term
rewriting and were missing required prerequisites~\cite{proofund,
proofteach}, none of which seems applicable for grading individual
formal proofs for difficulty. On the other hand, there is very
specific research on rating proofs for concrete problems such as the
satisfiability of a given CNF formula~\cite{proofsat}.

\emph{Organizational Aspect.} A first working hypothesis
might be to assume that elementary proofs are short. In that case, the
size, that is, the number of bytes to store the proof, is our first
indicator of proof complexity. This is by no means perfect, as even
@@ -170,22 +177,22 @@ working hypothesis.
us to sort by file size. Maybe low proof complexity also leads to quick
check times in proof assistants and automatic theorem provers. With
this assumption in mind, we could use the \texttt{ulo:check-time}
predicate. Correlating proof complexity with file size or check time
allows us to define one indicator of proof complexity based on
organizational knowledge alone.

\emph{Other Aspects.} A tetrapodal search system should
probably also take symbolic knowledge into account. Based on some kind
of measure of formula complexity, different proofs could be
rated. Similarly, with narrative knowledge available to us, we could
count the number of words, citations and so on to rate the narrative
complexity of a proof. Combining symbolic knowledge, narrative
knowledge and organizational knowledge should allow us to find proofs
which are easier than others.

\input{applications-q1.tex}

\emph{Implementation.} Implementing a naive version of the
organizational aspect can be as simple as querying for all theorems
justified by proofs, ordered by size (or check time).
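A minimal sketch of such a query, written by us for illustration,
might look as follows; the predicate \texttt{ulo:justifies}, the
treatment of \texttt{ulo:theorem} and \texttt{ulo:proof} as classes
and the prefix IRI are assumptions about the ULO vocabulary and may
differ from the query actually used in Figure~\ref{fig:q1short}.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI

SELECT ?theorem ?time WHERE {
  ?theorem a ulo:theorem .
  ?proof a ulo:proof .
  ?proof ulo:justifies ?theorem .        # predicate name assumed
  ?proof ulo:check-time ?time .
}
ORDER BY DESC(?time)
\end{lstlisting}
Ordering by a size predicate instead of \texttt{ulo:check-time} works
the same way.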
Figure~\ref{fig:q1short} illustrates how this can be achieved with a
@@ -231,13 +238,13 @@ something that will surely be useful for many applications.
\subsubsection{Categorizing Algorithms and Algorithmic Problems}

The second query~$\mathcal{Q}_2$ we decided to focus on wants to
``\emph{[f]ind algorithms that solve $NP$-complete graph
problems}''~\cite{tetra}. Here we want the tetrapodal search system
to return a listing of algorithms that solve (graph) problems with a
given property (runtime complexity). We need to consider where and how
each of these components might be found.

\emph{Symbolic and Concrete Aspects.} First, let us consider
algorithms. Algorithms can be formulated as computer code which can be
understood as symbolic knowledge (code represented as a syntax tree)
or as concrete knowledge (code as text
@@ -248,25 +255,20 @@ automatically in the general case~\cite{np}. Metadata of this kind
needs to be stored in a separate index for organizational knowledge,
it being the only fit.

\emph{Organizational Aspect.} If we wish to look up properties about
algorithms from organizational knowledge, we first have to think about
how to represent this information. As a first approach, we can try to
represent algorithms and problems in terms of existing ULO
predicates. As ULO does have a concept of \texttt{ulo:theorem} and
\texttt{ulo:proof}, it might be tempting to exploit these predicates
and represent algorithms understood as programs in terms of proofs.
But that does not really capture the canonical understanding of
algorithms. Algorithms are not actually programs; rather, there are
programs that implement algorithms. Even if we do work around this
problem, it is not clear how to represent problems (e.g.\ the
traveling salesman problem or the sorting problem) in terms of
theorems (propositions, types) that get implemented by a proof
(algorithm, program).

As algorithms make up an important part of certain areas of research,
it might be reasonable to introduce native level support for
@@ -291,16 +293,16 @@ solve algorithmic problems.
\subsubsection{Contributors and Number of References}

Finally, query~$\mathcal{Q}_3$ from the literature wants to know
``\emph{[a]ll areas of math that {Nicolas G.\ de Bruijn} has worked in
and his main contributions}''~\cite{tetra}. $\mathcal{Q}_3$~is
asking for works of a given author~$A$. It also asks for their main
contributions, for example which particularly interesting paragraphs
or code~$A$ has authored. We picked this particular query as it is
asking for metadata, something that should be easily serviced by
organizational knowledge.

\emph{Organizational Aspect.} ULO has no concept of authors,
contributors, dates and so on. Rather, the idea is to take advantage
of the Dublin Core project which provides an ontology for such
metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides us
@@ -316,19 +318,18 @@ objects ordered by authors.
\input{applications-q3.tex}

\emph{Implementation.} A search for contributions by a given author
can easily be formulated in {SPARQL}~(Figure~\ref{fig:q2a}). Query
$\mathcal{Q}_3$ is also asking for the main contributions of~$A$, that
is, those works that~$A$ authored that are the most important. Sorting
the result by number of references can be a good start. To get the
main contributions, we rate each individual work by its number of
\texttt{ulo:uses} references. Extending the previous {SPARQL} query,
we can ask the database for an ordered list of works, starting with
the one that has the most references~(Figure~\ref{fig:q2b}). We see
that one can formulate~$\mathcal{Q}_3$ with just one SPARQL
query. Because everything is handled by the database, access should be
about as quick as we can hope it to be.
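For illustration, a hedged sketch in the spirit of
Figure~\ref{fig:q2b} is given below. It is our own rendering, not the
exact query from the figure; in particular, the prefix IRI and the
assumption that author names are stored as plain literals under
\texttt{dcterms:creator} are ours.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?work (COUNT(?other) AS ?refs) WHERE {
  ?work dcterms:creator "John Smith" .   # name as plain literal (assumed)
  ?other ulo:uses ?work .                # works referencing ?work
}
GROUP BY ?work
ORDER BY DESC(?refs)
\end{lstlisting}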
While the sparse data set available to us only returned a handful of
results, we see that queries like~$\mathcal{Q}_3$ are easily serviced
@@ -341,15 +342,16 @@ approaches~\cite{citcart}.
Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with
some insight into ULO and existing ULO exports. $\mathcal{Q}_1$ shows
that while there is no universal definition for ``elementary proof'',
ULO allows us to query for heuristics and calculate arbitrary
arithmetic scores for objects of organizational
knowledge. Query~$\mathcal{Q}_2$ illustrates the difficulty in finding
universal schemas. It remains an open question whether ULO should
include algorithms as a first class citizen, as a concept based around
existing ULO predicates or whether it is a better idea to design a
dedicated ontology and data store entirely. Finally, while we were
able to formulate a SPARQL query that should take care of most
of~$\mathcal{Q}_3$, we found that the existing data sets contain very
little information about authorship. This underlines the observations
made previously in Section~\ref{sec:expl} and should be on the mind of
anyone writing exporters that output ULO~triplets.
@@ -5,50 +5,50 @@ Section~\ref{sec:implementation} we were able to take existing RDF
exports and import them into a GraphDB database. This made it possible
to experiment with the applications and examples of
Section~\ref{sec:applications}. We showed that organizational
knowledge formulated as ULO triplets can already give some insights.
In particular, it is possible to formulate queries about meta
information such as authorship and contribution. We also managed to
resolve the interlinks between proofs and theorems.

To sum up this report, we want to recap three problems we encountered
when working on \emph{ulo-storage}. They have the potential to result
in future projects that further improve the availability of
organizational knowledge.

The first problem was that of malformed RDF~exports. Various exports
contained invalid URIs and incorrect namespaces. As a workaround, we
provided on the fly correction. But this does not solve the problem in
itself. Perhaps a long term solution for this problem is to fully
automate the export from third party libraries (e.g.\ Coq, Isabelle)
to ULO triplets in a database, eliminating the step of XML files on
disk. During import into the database, the imported data is thoroughly
checked and mistakes are reported right away. Bugs in exporters that
produce faulty XML would be found earlier in development.

The second problem we want to recap is that of versioning triplets in
the GraphDB triple store. While adding new triplets to an existing
GraphDB store is not a problem, updating existing triplets is
difficult. \emph{ulo-storage} circumvents this problem by re-creating
the GraphDB data set at regular intervals. Indeed, it might be
difficult to find an efficient alternative. Tagging each triplet with
some version number doubles the number of triplets that need to be
stored. Re-creating the index and maybe splitting up the knowledge
base into smaller, easier to update sub-repositories looks like the
most promising approach for now. What is missing is an efficient way
of servicing queries across multiple database instances.

Finally, perhaps the most interesting but also most challenging
question is the discussion about how to develop ULO, the upper level
ontology for organizational knowledge. On the one hand, ULO boasts a
total of almost 80~predicates, yet only a third of them are actually
used by existing Coq and Isabelle exports. A developer writing queries
that take advantage of the full ULO~vocabulary might be surprised that
no data is coming back. On the other hand, we found it difficult to
model algorithms in terms of ULO predicates and argued that it might
be necessary to further extend the upper level ontology. It will be
necessary to look at the properties of future exports to make a more
well-founded decision.

Despite many open questions, \emph{ulo-storage} provides the necessary
infrastructure for importing ULO triplets into an efficient storage
......
@@ -61,10 +61,10 @@ is no pressing need to force them into separate processes.
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
repository and then passes the checked out working copy to the file
system Collector. Because we found that it is not uncommon for RDF files
to be compressed, our implementation supports on the fly extraction of
gzip~\cite{gzip} and xz~\cite{xz} formats which can greatly reduce the
required disk space in the collection step.
@@ -75,7 +75,7 @@ Coq exports contained URIs which does not fit the official syntax
specification~\cite{rfc3986} as they contained illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open
Source~\cite{wikivirtuoso} which does not properly check URIs according
to spec; in consequence these faults were only discovered now. To
tackle these problems, we introduced on the fly correction steps
during collection that escape the URIs in question and then continue
@@ -164,18 +164,19 @@ Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
the problems related to updating existing data sets, but also means
that changes in a given library~$\mathcal{L}_i$ take some time to propagate
to~$\mathcal{D}$. Continuing this train of thought, an advanced
version of this approach could forgo the requirement for one single
database storage~$\mathcal{D}$ entirely. Instead of maintaining just
one global database state~$\mathcal{D}$, we suggest experimenting with
dedicated database instances~$\mathcal{D}_i$ for each given
library~$\mathcal{L}_i$. The advantage here is that re-creating a
given database representation~$\mathcal{D}_i$ is fast as
exports~$\mathcal{E}_i$ are comparably small. The disadvantage is that
we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require
the development of some cross-database query mechanism, functionality
for which existing systems currently offer only limited
support~\cite{graphdbnested}.
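One building block that plain SPARQL~1.1 already offers for this is
federated querying via the \texttt{SERVICE} keyword. A sketch with two
hypothetical endpoint URLs standing in for~$\mathcal{D}_1$
and~$\mathcal{D}_2$ could look like the following; whether this
approach performs acceptably on the full data set~$\mathcal{D}$ is
exactly the open question raised above.
\begin{lstlisting}
# db1/db2 are hypothetical endpoints for two library databases
SELECT ?s ?p ?o WHERE {
  { SERVICE <https://db1.example.org/sparql> { ?s ?p ?o } }
  UNION
  { SERVICE <https://db2.example.org/sparql> { ?s ?p ?o } }
}
\end{lstlisting}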
In summary, we see that versioning is a potential challenge for a
greater tetrapodal search system. While not a pressing issue for
@@ -188,7 +189,7 @@ Endpoint. Recall that an Endpoint provides the programming interface
for applications that wish to query our collection of organizational
knowledge. In practice, the choice of Endpoint programming interface
is determined by the choice of database system as the Endpoint is
provided directly by the database system.

In our project, organizational knowledge is formulated as
RDF~triplets. The canonical choice for us is to use a triple store,
@@ -206,7 +207,7 @@ OWL~Reasoning~\cite{owlspec, graphdbreason}. In particular, this
means that GraphDB offers support for transitive queries as described
in previous work on~ULO~\cite{ulo}. A transitive query is one that,
given a relation~$R$, asks for the transitive closure~$S$
of~$R$ (Figure~\ref{fig:tc}).

\input{implementation-transitive-closure.tex}
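Independent of OWL~reasoning, plain SPARQL~1.1 can express many
transitive queries through property paths. The following sketch of
ours collects everything reachable from a given object via one or more
\texttt{ulo:uses} edges; the starting URI is a placeholder and the
prefix IRI is assumed.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI

SELECT ?reached WHERE {
  <https://example.org/some-object> ulo:uses+ ?reached .
}
\end{lstlisting}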
@@ -235,14 +236,14 @@ includes not just syntax and semantics of the language itself, but
also a standardized REST interface~\cite{rest} for querying database
servers.

The SPARQL syntax was inspired by SQL and as such the \texttt{SELECT}
\texttt{WHERE} syntax should be familiar to many software developers.
A simple query that returns all triplets in the store looks like
\begin{lstlisting}
SELECT * WHERE { ?s ?p ?o }
\end{lstlisting}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The result of any query is a set of valid substitutions for
all query variables. In this particular case, the database would
return a table of all triplets in the store, with columns for
subject~\texttt{?s}, predicate~\texttt{?p} and object~\texttt{?o}.
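Restricting such a query is then a matter of replacing query variables
with constants and adding triple patterns. For example, a query
returning only objects typed as \texttt{ulo:theorem} might look like
the following sketch; the prefix IRI is an assumption on our part.
\begin{lstlisting}
PREFIX ulo: <https://example.org/ulo#>   # placeholder; use the ULO ontology IRI

SELECT ?s WHERE { ?s a ulo:theorem }
\end{lstlisting}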
@@ -300,7 +301,7 @@ containers, that is lightweight virtual machines with a fixed
environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of the overarching
setup is stored in a Docker Compose file that describes the software
stack.
@@ -313,11 +314,12 @@ code for Collector and Importer is available in the
deployment files, that is Docker Compose configuration and additional
Dockerfiles, are stored in a separate repository~\cite{dockerfilerepo}.

With this, we conclude our discussion of the implementation developed
for the \emph{ulo-storage} project. We designed a system based around
(1)~a Collector which collects RDF triplets from third party sources
and (2)~an Importer which imports these triplets into a GraphDB
database; we then (3)~looked at different ways of querying a GraphDB
Endpoint. All of this is easy to deploy using a single Docker Compose
file. With this stack ready for use, we will now continue with a look
at some interesting applications and queries built on top of this
infrastructure.
@@ -13,14 +13,14 @@ One topic of research in this field is the idea of a \emph{tetrapodal
search} that combines four distinct areas of mathematical knowledge.
These four kinds are (1)~the actual formulae as \emph{symbolic
knowledge}, (2)~examples and concrete objects as \emph{concrete
knowledge}, (3)~prose, names and comments as \emph{narrative knowledge} and
finally (4)~identifiers, references and their relationships, referred
to as \emph{organizational knowledge}~\cite{tetra}.

Tetrapodal search aims to provide a unified search engine that indexes
each of these four different subsets of mathematical knowledge.
Because all four kinds of knowledge are inherently unique in their
structure, tetrapodal search proposes that each subset of mathematical
knowledge should be made available in a storage backend that fits the
kind of data it is providing. With all four areas available for
querying, tetrapodal search intends to then combine the four indexes
@@ -31,8 +31,8 @@ mathematical knowledge. In this context, the focus of
A previously proposed way to structure organizational knowledge is the
\emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an
OWL~ontology~\cite{uloonto} and as such all organizational knowledge is
stored as RDF~triplets with a unified schema of
ULO~predicates~\cite{owl}. Some effort has been made to export
existing collections of formal mathematical knowledge to {ULO}. In
particular, exports from Isabelle and Coq-based libraries are
@@ -51,8 +51,8 @@ analysis. We collected RDF files spread over different Git
repositories, imported them into a database and then experimented with
APIs for accessing that data set.

The contribution of \emph{ulo-storage} is twofold. First, (1)~we built
up various infrastructure components that make organizational
knowledge easy to query. These components can make up building blocks
of a larger tetrapodal search system. Their design and implementation
are discussed in Section~\ref{sec:implementation}. Second, (2)~we ran
......