diff --git a/doc/report/collecter.tex b/doc/report/collecter.tex
deleted file mode 100644
index d5237d0e9a5c17e4db14fba9196df1b462b39df3..0000000000000000000000000000000000000000
--- a/doc/report/collecter.tex
+++ /dev/null
@@ -1,4 +0,0 @@
-\section{Collecter and Importer}\label{sec:collecter}
-
-\emph{TODO}
-
diff --git a/doc/report/conclusion.tex b/doc/report/conclusion.tex
index f01b33d8f820286f6c5c473991fbfa9783c878f2..aebbae195f8d8293a164fce3f7d4e9e23b4b4352 100644
--- a/doc/report/conclusion.tex
+++ b/doc/report/conclusion.tex
@@ -1 +1,20 @@
-\section{Conclusion and Future Work}\label{sec:conclusion}
+\section{Conclusions and Next Steps}\label{sec:conclusion}
+
+\subsection{An Additional Harvester Component} % copied from introduction.tex
+
+In addition to the three components realized for \emph{ulo-storage}
+(Collecter, Importer and Endpoint), one could think of a fourth, a
+\emph{Harvester} component. So far we have assumed that the ULO
+triplets are already available in RDF~format. This is not necessarily
+true. It might be desirable to automate the export from third-party
+formats to ULO, and we think this should be the job of a Harvester
+component. Such a component would fetch mathematical knowledge from
+some remote source and then provide a volatile stream of ULO data to
+the Collecter, which then passes it to the Importer and so on. The
+big advantage of such an approach would be that exports from
+third-party libraries would always be up to date and would not have
+to be initiated manually. Another advantage of this hypothetical
+component is that exports run through the Harvester pass through the
+whole import chain of Collecter and Importer, which includes
+syntax~checking of the exported RDF data. Bugs in exporters that
+produce faulty XML would thus be found earlier in development.
diff --git a/doc/report/endpoints.tex b/doc/report/implementation.tex
similarity index 61%
rename from doc/report/endpoints.tex
rename to doc/report/implementation.tex
index 014e9d10d90db48d9e9319ba31ffbe62cb9d3979..04476ccb36c5faac55b2de9658ca017926e45252 100644
--- a/doc/report/endpoints.tex
+++ b/doc/report/implementation.tex
@@ -1,4 +1,55 @@
-\section{Endpoints}\label{sec:endpoints}
+\section{Implementation}\label{sec:implementation}
+
+\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
+
+With RDF files exported and available for download as Git
+repositories on MathHub, our goal is to make the underlying data
+available for use in applications. It makes sense to first identify
+the various components that might be involved in such a system.
+Figure~\ref{fig:components} illustrates all components and their
+relationships.
+
+\begin{figure}\begin{center}
+  \includegraphics{figs/components}
+  \caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
+\end{center}\end{figure}
+
+\begin{itemize}
+\item ULO triplets are present in various locations, be it in Git
+  repositories, on web servers or on the local disk. It is the job of
+  a \emph{Collecter} to assemble these {RDF}~files and forward them
+  for further processing. This may involve cloning a Git repository
+  or crawling the file system.
+
+\item With streams of ULO files assembled by the Collecter, the data
+  is then passed to an \emph{Importer}. An Importer uploads
+  RDF~streams into some kind of permanent storage. For this project,
+  the GraphDB~\cite{graphdb} triplet store was a natural fit; a
+  minimal sketch of the whole pipeline follows after this list.
+
+  In our implementation, both Collecter and Importer ended up as one
+  monolithic piece of software, but this does not have to be the
+  case.
+
+\item Finally, with all triplets stored in a database, an
+  \emph{Endpoint} is where applications access the underlying
+  knowledge base. This does not necessarily have to be custom
+  software; rather, the programming interface of the underlying
+  database itself can be understood as an endpoint of its own.
+
+  Regardless, it is worth putting some thought into designing an
+  Endpoint as a layer between application and database that is more
+  convenient to use than the interface the database provides on its
+  own. It comes down to the programming interface we wish to offer to
+  developers using this system.
+\end{itemize}
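+
+To make this division of labor more concrete, the following sketch
+illustrates what a minimal Collecter, Importer and Endpoint could
+look like in Python, talking to GraphDB through its RDF4J-compatible
+REST interface. This is only an illustration, not the actual
+\emph{ulo-storage} implementation; in particular, the GraphDB
+address, the repository name \texttt{ulo} and the directory layout of
+the exports are assumptions made for this example.
+
+\begin{verbatim}
+# Illustrative sketch only: host, repository name and export layout
+# are assumptions, not the actual ulo-storage configuration.
+import glob
+import requests
+
+GRAPHDB = "http://localhost:7200"   # assumed GraphDB instance
+REPOSITORY = "ulo"                  # assumed repository name
+
+def collect(directory):
+    """Collecter: crawl a checked-out export for RDF/XML files."""
+    return glob.glob(f"{directory}/**/*.rdf", recursive=True)
+
+def import_file(path):
+    """Importer: upload one RDF/XML file into the triplet store."""
+    with open(path, "rb") as f:
+        r = requests.post(
+            f"{GRAPHDB}/repositories/{REPOSITORY}/statements",
+            data=f, headers={"Content-Type": "application/rdf+xml"})
+    r.raise_for_status()
+
+def count_triplets():
+    """Endpoint: ask the SPARQL endpoint how many triplets it holds."""
+    r = requests.get(
+        f"{GRAPHDB}/repositories/{REPOSITORY}",
+        params={"query": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"},
+        headers={"Accept": "application/sparql-results+json"})
+    r.raise_for_status()
+    return r.json()["results"]["bindings"][0]["n"]["value"]
+
+if __name__ == "__main__":
+    for rdf_file in collect("./exports"):
+        import_file(rdf_file)
+    print(count_triplets(), "triplets imported")
+\end{verbatim}
+
+Even this toy version shows the separation of concerns: collecting
+files, importing them into the store and querying the store are
+independent steps that can be swapped out individually.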
+
+Collecter, Importer and Endpoint together provide an easy and
+automated way of making RDF files ready for use in applications. This
+overview is only meant to give the reader a general understanding of
+the infrastructure that makes up \emph{ulo-storage}; the following
+subsections explain each component in more detail.
+
+\subsection{Endpoints}\label{sec:endpoints}
 
 With ULO triplets imported into the GraphDB triplet store by Collecter
 and Importer, we now have all data available necessary for querying.
diff --git a/doc/report/intro.tex b/doc/report/intro.tex
deleted file mode 100644
index 934433ddb80a63dde398547c1a34aa2e2b0742e5..0000000000000000000000000000000000000000
--- a/doc/report/intro.tex
+++ /dev/null
@@ -1,131 +0,0 @@
-\section{Introduction}\label{sec:introduction}
-
-To tackle the vast array of mathematical
-publications, various ways of \emph{computerizing} mathematical
-knowledge have been experimented with. As it is already difficult for
-human mathematicians to keep even a subset of all mathematical
-knowledge in their mind, a hope is that computerization will yield
-great improvement to mathematical (and really any formal) research by
-making the results of all collected publications readily available and
-easy to search~\cite{onebrain}.
-
-One research topic in this field is the idea of a \emph{tetrapodal
-search} that combines four distinct areas of mathematical knowledge.
-These four kinds being (1)~the actual formulae as \emph{symbolic
-knowledge}, (2)~examples and concrete objects as \emph{concrete knowledge},
-(3)~names and comments as \emph{narrative knowledge} and finally
-(4)~identifiers, references and their relationships, referred to as
-\emph{organizational knowledge}~\cite{tetra}.
-
-Tetrapodal search aims to provide a unified search engine that indexes
-each of the four different subsets of mathematical knowledge. Because
-all four kinds of knowledge are inherently different in their
-structure, tetrapodal search proposes that each kind of mathematical
-knowledge should be made available in a storage backend that fits the
-kind of data it is providing. With all four areas available for
-querying, tetrapodal search intends to then combine the four indexes
-into a single query interface.
-
-\subsection{Focus on Organizational Knowledge}
-
-Currently, research is focused on providing schemas, storage backends
-and indexes for the four different kinds of mathematical
-knowledge. The focus of \emph{ulo-storage} is the area of
-organizational knowledge.
-
-A previously proposed way to structure such organizational data is the
-\emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an
-OWL~ontology~\cite{uloonto} and as such all organization information
-is stored as RDF~triplets with a unified schema of
-ULO~predicates~\cite{owl}. Some effort has been made to export
-existing databases of formal mathematical knowledge to {ULO}, in
-particular, there exist exports from Isabelle and Coq
-libraries~\cite{uloisabelle, ulocoq}. The resulting data set is
-already quite large, the Isabelle export alone containing more than
-200~million triplets.
-
-Existing exports from Isabelle and Coq result in single or multiple
-RDF~files. This is a convenient format for exchange and easily
-versioned using Git. However, considering the vast number of triplets,
-it is impossible to query easily and efficiently in this state. This
-is what \emph{ulo-storage} is focused on: Making ULO data sets
-accessible for querying and analysis. We collected RDF files spread
-over different Git repositories, imported them into a database and
-then experimented with APIs for accessing that data set.
-
-The main contribution of \emph{ulo-storage} is twofold. First, (1)~we
-built up various infrastructure components that can make up building
-blocks in a larger tetrapodal search system. Second, (2)~we ran sample
-prototype applications and queries on top of this interface. While the
-applications themselves are admittedly not very interesting, they can
-give us insight about future development of the upper level ontology.
-
-\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
-
-With RDF files exported and available for download as Git repositories
-on MathHub, we have the goal of making the underlying data available
-for use in applications. It makes sense to first identify the various
-components that might be involved in such a system.
-Figure~\ref{fig:components} illustrates all components and their
-relationships.
-
-\begin{figure}[]\begin{center}
- \includegraphics{figs/components}
- \caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
-\end{center}\end{figure}
-
-\begin{itemize}
-\item ULO triplets are present in various locations, be it Git
- repositories, on web servers or the local disk. It is the job of a
- \emph{Collecter} to assemble these {RDF}~files and forward them for further
- processing. This may involve cloning a Git repository or crawling
- the file system.
-
- \item With streams of ULO files assembled by the Collecter, this
- data then gets passed to an \emph{Importer}. An Importer uploads
- RDF~streams into some kind of permanent storage. For
- use in this project, the GraphDB~\cite{graphdb} triplet store was
- a natural fit.
-
- For this project, both Collecter and Importer ended up being one
- piece of monolithic software, but this does not have to be the case.
-
-\item Finally, with all triplets stored in a database, an
- \emph{Endpoint} is where applications access the underlying
- knowledge base. This does not necessarily need to be any custom
- software, rather the programming interface of the underlying
- database itself could be understood as an endpoint of its own.
-
- Regardless, some thought can be put into designing an Endpoint as a
- layer that lives between application and database that is more
- convenient to use than the one provided by the database. It comes
- down to the programming interface we wish to provide to a developer
- using this system.
-\end{itemize}
-
-\subsection{An Additional Harvester Component}
-
-These are the three components realized for
-\emph{ulo-storage}. However, additionally to these components, one
-could think of a \emph{Harvester} component. Above we assumed that
-the ULO triplets are already available in RDF~format. This is not
-necessarily true. It might be desirable to automate the export from
-third party formats to ULO and we think this should be the job of a
-Harvester component. It fetches mathematical knowledge from some
-remote source and then provides a volatile stream of ULO data to the
-Collecter, which then passes it to the Importer and so on. The big
-advantage of such an approach would be that exports from third party
-libraries can always be up to date and do not have to be initiated
-manually. Another advantage of this hypothetical component is that
-running exports through the Harvester involves the whole import chain
-of Collecter and Importer which involves syntax~checking for the
-exported RDF data. Bugs in exporters that produce faulty XML would be
-found earlier in development.
-
-We did not implement a Harvester for \emph{ulo-storage} but we suggest
-that it is an idea to keep in mind. The components we did implement
-(Collecter, Importer and Endpoint) provide us with an easy and
-automated way of making RDF files ready for use with applications. In
-this introduction we only wanted to give the reader a general
-understanding in the infrastructure that makes up \emph{ulo-storage},
-the following sections will explain each component in more detail.
diff --git a/doc/report/introduction.tex b/doc/report/introduction.tex
new file mode 100644
index 0000000000000000000000000000000000000000..c070f083e71b58295295796b6273035d554a61c5
--- /dev/null
+++ b/doc/report/introduction.tex
@@ -0,0 +1,59 @@
+\section{Introduction to the \emph{ulo-storage} Project}\label{sec:introduction}
+
+To tackle the vast array of mathematical publications, various ways
+of \emph{computerizing} mathematical knowledge have been experimented
+with. As it is already difficult for human mathematicians to keep
+even a subset of all mathematical knowledge in their minds, the hope
+is that computerization will yield great improvements to mathematical
+(and really any formal) research by making the results of all
+collected publications readily available and easy to
+search~\cite{onebrain}.
+
+One research topic in this field is the idea of a \emph{tetrapodal
+search} that combines four distinct areas of mathematical knowledge.
+These four kinds are (1)~the actual formulae as \emph{symbolic
+knowledge}, (2)~examples and concrete objects as \emph{concrete
+knowledge}, (3)~names and comments as \emph{narrative knowledge} and
+finally (4)~identifiers, references and their relationships, referred
+to as \emph{organizational knowledge}~\cite{tetra}.
+
+Tetrapodal search aims to provide a unified search engine that
+indexes each of the four different subsets of mathematical knowledge.
+Because all four kinds of knowledge are inherently different in their
+structure, tetrapodal search proposes that each kind of mathematical
+knowledge should be made available in a storage backend that fits the
+kind of data it is providing. With all four areas available for
+querying, tetrapodal search then intends to combine the four indexes
+into a single query interface.
+
+Currently, research concentrates on providing schemas, storage
+backends and indexes for the four different kinds of mathematical
+knowledge. The focus of \emph{ulo-storage} is the area of
+organizational knowledge.
+
+A previously proposed way to structure such organizational data is
+the \emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form
+of an OWL~ontology~\cite{uloonto} and as such all organizational
+information is stored as RDF~triplets with a unified schema of
+ULO~predicates~\cite{owl}. Some effort has been made to export
+existing databases of formal mathematical knowledge to {ULO}; in
+particular, there exist exports from Isabelle and Coq
+libraries~\cite{uloisabelle, ulocoq}. The resulting data set is
+already quite large, with the Isabelle export alone containing more
+than 200~million triplets.
+
+Existing exports from Isabelle and Coq result in single or multiple
+RDF~files. This is a convenient format for exchange and is easily
+versioned using Git. However, considering the vast number of
+triplets, the data cannot be queried easily and efficiently in this
+state. This is what \emph{ulo-storage} focuses on: making ULO data
+sets accessible for querying and analysis. We collected RDF files
+spread over different Git repositories, imported them into a database
+and then experimented with APIs for accessing that data set.
+
+The main contribution of \emph{ulo-storage} is twofold: (1)~we built
+various infrastructure components that can serve as building blocks
+in a larger tetrapodal search system, and (2)~we ran sample prototype
+applications and queries on top of this infrastructure. While the
+applications themselves are admittedly not very interesting, they can
+give us insight into the future development of the upper level
+ontology.
diff --git a/doc/report/report.tex b/doc/report/report.tex
index db70eba12df604091b21f9282cc283898678f9e1..953f52b0dba021a1e6a6167fb1c64bf541be0701 100644
--- a/doc/report/report.tex
+++ b/doc/report/report.tex
@@ -50,11 +50,9 @@
 \tableofcontents
 \newpage
-\input{intro.tex}
+\input{introduction.tex}
 \newpage
-\input{collecter.tex}
-\newpage
-\input{endpoints.tex}
+\input{implementation.tex}
 \newpage
 \input{applications.tex}
 \newpage