Commit 1f5ccf24 authored by Andreas Schärtl

report: review implementation

parent fe251517
doc/report/figs/screenshot0.png

85.3 KiB

doc/report/figs/screenshot1.png

133 KiB

\begin{figure}
\centering
\begin{subfigure}[b]{0.8\textwidth}
\centering
\shadowbox{\includegraphics[width=\textwidth]{figs/screenshot0.png}}
\caption{Scheduling a new import job in the web interface.}
\end{subfigure}
\vspace{5mm}
\begin{subfigure}[b]{0.8\textwidth}
\centering
\shadowbox{\includegraphics[width=\textwidth]{figs/screenshot1.png}}
\caption{Reviewing a previous import job. This particular job
failed, which illustrates how our web interface presents errors
and logs.}
\end{subfigure}
\caption{The web interface for controlling the Collector and
    Importer components. While Collector and Importer can also be
    used as library code or with a command line utility, managing
    jobs in the web interface is probably the most convenient
    option.}\label{fig:ss}
\end{figure}
...@@ -6,14 +6,15 @@
\caption{We can think of this tree as visualizing a relation~$R$ where
  $(X, Y)~\in~R$ iff there is an edge from~$X$ to~$Y$.}
\end{subfigure}
\hspace{5mm}
\begin{subfigure}[b]{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{figs/tree-transitive.pdf}
\caption{Transitive closure~$S$ of relation~$R$. In addition to
  every tuple from~$R$ (solid edges), $S$~contains additional
  transitive edges (dotted edges).}
\end{subfigure}
\caption{Illustrating the idea behind transitive closures. The
  transitive closure~$S$ of a relation~$R$ is defined as the
  ``minimal transitive relation that contains~$R$''~\cite{tc}.}\label{fig:tc}
\end{figure*}
...@@ -12,9 +12,9 @@ describes some details of the actual implementation for this project.
With RDF files exported and available for download as Git
repositories~\cite{uloisabelle, ulocoq}, we have the goal of making
the underlying data available for use in applications. Let us first
take a high-level look at the three components we identified for this
task. Figure~\ref{fig:components} illustrates how data flows through
the different components.
\begin{figure}[]\begin{center}
\includegraphics[width=0.9\textwidth]{figs/components.png}
...@@ -37,12 +37,12 @@ the flow of data.
\emph{Endpoint} is where applications access the underlying
knowledge base. This does not necessarily have to be custom
software; rather, the programming interface of the underlying
database server itself can be understood as an endpoint of its own.
\end{itemize}
Collector, Importer and Endpoint provide us with an automated way of
making RDF files available for use within applications. We will now
take a look at the actual implementation created for
\emph{ulo-storage}, beginning with the implementation of Collector and
Importer.
...@@ -50,21 +50,20 @@ Importer.
We previously described Collector and Importer as two distinct
components. The Collector pulls RDF data from various sources as an
input and outputs a stream of standardized RDF data. The Importer
takes such a stream of RDF data and dumps it to some sort of
persistent storage. In our implementation, both Collector and
Importer ended up as one piece of monolithic software. This does not
need to be the case, but it proved convenient because combining
Collector and Importer forgoes the need for an additional
IPC~mechanism between the two components.
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
repository and then passes the checked-out working copy to the file
system Collector. Because we found that it is not uncommon for RDF
files to be compressed, our Collector supports on-the-fly extraction
of the gzip~\cite{gzip} and xz~\cite{xz} formats, which can greatly
reduce the required disk space in the collection step.
...@@ -72,7 +71,7 @@ required disk space in the collection step.
During development of the Collector, we found that existing exports
from third-party mathematical libraries contain RDF syntax errors
which had not been discovered previously. In particular, both Isabelle
and Coq exports contained URIs which do not conform to the official
syntax specification~\cite{rfc3986} as they contain illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open
...@@ -95,17 +94,24 @@ into the database. As such the import itself is straight-forward, our
software only needs to upload the RDF file stream as-is to an HTTP
endpoint provided by our GraphDB instance.
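Conceptually, this import step corresponds to a SPARQL~1.1 Update
\texttt{LOAD} operation. The listing below is only an illustrative
sketch: it assumes a file URL that the database server can reach and a
made-up graph IRI, whereas our implementation simply streams the data
to the generic HTTP endpoint.
\begin{lstlisting}
# Illustrative sketch only: file URL and graph IRI are placeholders.
LOAD <file:///exports/isabelle.rdf.xml>
  INTO GRAPH <http://example.org/graphs/isabelle>
\end{lstlisting}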
To review, our combination of Collector and Importer fetches XML~files
from Git repositories, applies on-the-fly decompression and syntax
fixes, and then imports the collected RDF~triplets into persistent
database storage.
\subsubsection{Scheduling}
Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
command line interface and a graphical web front end. While the
command line interface is only useful for manually starting single
jobs, import jobs can be started manually or scheduled automatically
in the web interface (Figure~\ref{fig:ss}), which also keeps a history
of previous jobs and their logs.
\input{implementation-screenshots.tex}
\subsubsection{Version Management}
Automated job control that regularly imports data from the same
sources leads us to the problem of versioning. In our current design,
...@@ -122,16 +128,17 @@ representation~$\mathcal{D}$. We see that data flows
from $n$~individual libraries~$\mathcal{L}_i$ into a single
database storage~$\mathcal{D}$ that is used for querying.
However, we must not ignore that mathematical knowledge is
ever-changing, not static. When a given library~$\mathcal{L}^{t}_i$ at
revision~$t$ gets updated to a new version~$\mathcal{L}^{t+1}_i$, this
change will eventually propagate to the associated export and result
in a new set of RDF triplets~$\mathcal{E}^{t+1}_i$. Our global
database state~$\mathcal{D}$ needs to be updated to match the changes
between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$.

Finding an efficient implementation for this problem is not trivial.
While it should be possible to compute the difference between two
exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and infer the
changes that need to be applied to~$\mathcal{D}$, the large number of
triplets makes this appear unfeasible. As this is a problem an
implementer of a greater tetrapodal search system will encounter, we
...@@ -142,7 +149,7 @@ versioning information about which particular
export~$\mathcal{E}^{t}_i$ it was derived from. During an import
from~$\mathcal{E}^{s}_i$ into~$\mathcal{D}$, we could (1)~first remove all
triplets in~$\mathcal{D}$ that were derived from the previous
version~$\mathcal{E}^{s-1}_i$ and (2)~then re-import all triplets from the current
version~$\mathcal{E}^{s}_i$. Annotating triplets with versioning
information is an approach that should work, but it does
introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$ where
...@@ -152,29 +159,33 @@ solution.
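One conceivable way to realize the remove-and-re-import steps
(1)~and~(2) without annotating every single triplet would be to keep
all triplets derived from a given export in a dedicated named graph.
This is only a sketch of an alternative we did not implement; graph
IRIs and file URLs are placeholders.
\begin{lstlisting}
# Sketch only: assumes one named graph per export.
DROP SILENT GRAPH <http://example.org/exports/coq> ;
LOAD <file:///exports/coq.rdf.xml>
  INTO GRAPH <http://example.org/exports/coq>
\end{lstlisting}
Grouping triplets by named graph would avoid the per-triplet
annotation overhead, but it shifts the versioning problem to graph
management in the underlying store.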
Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
the problems related to updating existing data sets, but also means
that changes in a given library~$\mathcal{L}_i$ take some time to
propagate to~$\mathcal{D}$. Building on top of this idea, an advanced
version of this approach could forgo the requirement of a single
database storage~$\mathcal{D}$ entirely. Instead of maintaining only
one global database state~$\mathcal{D}$, we suggest the use of
dedicated database instances~$\mathcal{D}_i$ for each given
library~$\mathcal{L}_i$. The advantage here is that re-creating a
given database representation~$\mathcal{D}_i$ is fast as the
exports~$\mathcal{E}_i$ are comparatively small. The disadvantage is
that we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This requires the
development of some cross-repository query mechanism, something
GraphDB currently offers only limited support
for~\cite{graphdbnested}.
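To give an idea of what such cross-repository querying could look
like, SPARQL~1.1 federation lets a single query address several
endpoints through the \texttt{SERVICE} keyword. The repository URLs
below are hypothetical, and whether such a federated query performs
acceptably on hundreds of millions of triplets remains to be
evaluated.
\begin{lstlisting}
# Hypothetical repository URLs; union over two per-library stores.
SELECT ?s ?p ?o WHERE {
  { SERVICE <http://db.example.org/repositories/isabelle> { ?s ?p ?o } }
  UNION
  { SERVICE <http://db.example.org/repositories/coq> { ?s ?p ?o } }
}
\end{lstlisting}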
In summary, we see that versioning is a potential challenge for a
greater tetrapodal search system. While it is not a pressing issue for
\emph{ulo-storage} right now, we consider it a topic for future
research.
\subsection{Endpoint}\label{sec:endpoints}
Finally, we need to discuss how \emph{ulo-storage} realizes the
Endpoint. Recall that the Endpoint provides the programming interface
for applications that wish to query our collection of organizational
knowledge. In practice, the choice of Endpoint programming interface
is determined by the choice of underlying database system.
In our project, organizational knowledge is formulated as
RDF~triplets. The canonical choice for us is to use a triple store,
...@@ -185,45 +196,45 @@ at~\cite{graphdbfree}.
\subsubsection{Transitive Queries}
A notable advantage of GraphDB compared to other systems, such as
Virtuoso Open Source~\cite{wikivirtuoso, ulo}, is that GraphDB supports
recent versions of the SPARQL query language~\cite{graphdbsparql} and
OWL~reasoning~\cite{owlspec, graphdbreason}. In particular, this
means that GraphDB offers support for transitive queries as described
in previous work~\cite{ulo}. A transitive query is one that, given a
relation~$R$, asks for the transitive closure~$S$ of~$R$~\cite{tc}
(Figure~\ref{fig:tc}).
\input{implementation-transitive-closure.tex}
In fact, GraphDB supports two approaches for realizing transitive
queries. On the one hand, GraphDB supports the
\texttt{owl:TransitiveProperty}~\cite[Section 4.4.1]{owlspec} property
that defines a given predicate~$P$ to be transitive. With $P$~marked
in this way, querying the knowledge base is equivalent to querying the
transitive closure of~$P$. This, of course, requires transitivity to
be hard-coded into the knowledge base. On the other hand, if we only
wish to query the transitive closure for a given query, we can take
advantage of property paths~\cite{paths}, which allow us to indicate
that a given predicate~$P$ is to be understood as transitive when
querying. The transitive closure is then evaluated only at query
time. Either way, GraphDB supports transitive queries without the
awkward workarounds necessary in other systems~\cite{ulo}.
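As a small sketch of the second approach, the \texttt{+}~operator of a
property path asks for one or more applications of a predicate, that
is, its transitive closure evaluated at query time. The prefix and
predicate below are illustrative placeholders rather than actual
ULO~vocabulary.
\begin{lstlisting}
PREFIX ex: <http://example.org/>
# all ?y reachable from ?x via one or more ex:dependsOn edges
SELECT ?x ?y WHERE { ?x ex:dependsOn+ ?y }
\end{lstlisting}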
\subsubsection{SPARQL Endpoint}
There are multiple approaches to querying the GraphDB triple store,
one based around the standardized SPARQL query language and the other
on the RDF4J Java library. Both approaches have unique advantages.

Let us first take a look at {SPARQL}, which is a standardized query
language for RDF triplet data~\cite{sparql}. The specification
includes not just syntax and semantics of the language itself, but
also a standardized REST interface~\cite{rest} for querying database
servers.

SPARQL was inspired by SQL and as such the \texttt{SELECT}
\texttt{WHERE} syntax should be familiar to many software developers.
A simple query that returns all triplets in the store looks like
\begin{lstlisting}
SELECT * WHERE { ?s ?p ?o }
\end{lstlisting}
...@@ -233,50 +244,49 @@ query variables. In this particular case, the database would return a
table of all triplets in the store sorted by subject~\texttt{?s},
predicate~\texttt{?p} and object~\texttt{?o}.
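Aggregation works much as one would expect from SQL. As a brief
sketch, the following query counts how often each predicate occurs in
the store, which gives a quick overview of which ULO~predicates a
given export actually uses.
\begin{lstlisting}
SELECT ?p (COUNT(*) AS ?n)
WHERE  { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?n)
\end{lstlisting}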
Probably the biggest advantage is that SPARQL is ubiquitous. As it is
the de facto standard for querying triple stores, many implementations
(client and server) as well as extensive documentation are
available~\cite{sparqlbook, sparqlimpls, gosparql}.
\subsubsection{RDF4J Endpoint}
SPARQL is one way of accessing a triple store database. Another
approach is RDF4J, a Java API for interacting with RDF graphs,
implemented based on a superset of the {SPARQL} REST
interface~\cite{rdf4j}. GraphDB is one of the database servers that
supports RDF4J; in fact, it is the recommended way of interacting with
GraphDB repositories~\cite{graphdbapi}.

Instead of formulating textual queries, RDF4J allows developers to
query a knowledge base by calling Java library methods. The previous
query that asks for all triplets in the store looks like
\begin{lstlisting}
connection.getStatements(null, null, null);
\end{lstlisting}
in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets that
have matching subject~\texttt{s}, predicate~\texttt{p} and
object~\texttt{o}. Any argument that is \texttt{null} can be replaced
by any value; that is, it is a query variable to be filled by the
call to \texttt{getStatements}.
Using RDF4J does introduce a dependency on the JVM and its
languages. But in practice, we found RDF4J to be quite convenient,
especially for simple queries, as it allows us to formulate everything
in a single programming language rather than mixing a programming
language with awkward query strings. We also found it quite helpful to
generate Java classes from OWL~ontologies that expose all definitions
of the ontology as easily accessible constants~\cite{rdf4jgen}. This
provides us with powerful IDE auto-completion features during the
development of ULO applications.
Summarizing the last two sections, we see that both SPARQL and RDF4J
have unique advantages. While SPARQL is an official W3C standard and
implemented by more database systems, RDF4J can be more convenient
when dealing with JVM-based projects. For \emph{ulo-storage}, we
experimented with both interfaces and chose whichever seemed more
convenient at the moment. We recommend that implementors do the same.
\subsection{Deployment and Availability}
...@@ -291,17 +301,16 @@ environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose, then, is a way of combining individual Docker
containers to run a full stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of such a setup is
stored in a Docker Compose file that describes the software stack.
For \emph{ulo-storage}, we provide a single Docker Compose file which
starts three containers, namely (1)~the Collector/Importer web
interface, (2)~a GraphDB instance which provides us with the required
Endpoint and (3)~some test applications that use that Endpoint. All
code for Collector and Importer is available in the
\texttt{ulo-storage-collect} Git repository~\cite{gorepo}. Additional
deployment files, that is, the Docker Compose file and additional
Dockerfiles, are stored in a separate repository~\cite{dockerfilerepo}.
This concludes our discussion of the implementation developed for the
\emph{ulo-storage} project. We designed a system based around (1)~a
......
...@@ -2,12 +2,12 @@
To tackle the vast array of mathematical publications, various ways of
\emph{computerizing} mathematical knowledge have been experimented
with. It is already difficult for human mathematicians to keep even a
subset of all mathematical knowledge in mind, a problem referred to in
the literature as the ``one brain barrier''~\cite{onebrain}. A hope is
that computerization will yield great improvements to formal research
by making the results of all collected publications readily available
and easy to search.
One research topic in this field is the idea of a \emph{tetrapodal
search} that combines four distinct areas of mathematical knowledge.
...@@ -24,30 +24,32 @@ structure, tetrapodal search proposes that each kind of mathematical
knowledge should be made available in a storage backend that fits the
kind of data it is providing. With all four areas available for
querying, tetrapodal search intends to then combine the four indexes
into a single query interface. Currently, research aims at providing
schemas, storage backends and indexes for the four different kinds of
mathematical knowledge. In this context, the focus of
\emph{ulo-storage} is the area of organizational knowledge.
A previously proposed way to structure organizational knowledge is the
\emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an
OWL~ontology~\cite{uloonto} and as such all organizational information
is stored as RDF~triplets with a unified schema of
ULO~predicates~\cite{owl}. Some effort has been made to export
existing collections of formal mathematical knowledge to {ULO}. In
particular, exports from Isabelle and Coq-based libraries are
available~\cite{uloisabelle, ulocoq}. The resulting data set is
already quite large, with the Isabelle export alone containing more
than 200~million triplets.
Existing exports from Isabelle and Coq result in a set of XML~files
that contain RDF~triplets. This is a convenient format for exchange
and easily tracked using version control systems such as
Git~\cite{gitpaper} as employed by MathHub~\cite{mh}. However,
considering the vast number of triplets, the data is impossible to
query easily and efficiently in this state. This is what
\emph{ulo-storage} is focused on: making ULO data sets accessible for
querying and analysis. We collected RDF files spread over different
Git repositories, imported them into a database and then experimented
with APIs for accessing that data set.
The main contribution of \emph{ulo-storage} is twofold. First, (1)~we
built up various infrastructure components for making organizational
......
...@@ -401,3 +401,22 @@
    urldate = {2020-09-27},
    url = {https://www.w3.org/2009/sparql/wiki/Feature:PropertyPaths},
}
@article{gitpaper,
  title = {GIT--A stupid content tracker},
  author = {Hamano, Junio C.},
  journal = {Proceedings of the Linux Symposium},
  volume = {1},
  pages = {385--394},
  year = {2006},
}

@incollection{mh,
  title = {System description: MathHub.info},
  author = {Iancu, Mihnea and Jucovschi, Constantin and Kohlhase, Michael and Wiesing, Tom},
  booktitle = {Intelligent Computer Mathematics},
  pages = {431--434},
  year = {2014},
  publisher = {Springer},
}
...@@ -3,6 +3,7 @@
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{caption}
\usepackage{fancybox, graphicx}
\usepackage{float}
\usepackage{geometry}
\usepackage{graphicx}
......