diff --git a/doc/report/Makefile b/doc/report/Makefile
index 1bf998c2e217dbe4572a448e9686fd2c26935324..84a327554915943a5db7e0a3b035b738427eac3f 100644
--- a/doc/report/Makefile
+++ b/doc/report/Makefile
@@ -1,5 +1,5 @@
 TEX_SOURCES = report.tex abstract.tex intro.tex components.tex \
-	endpoints.tex applications.tex
+	endpoints.tex applications.tex conclusion.tex
 
 report.pdf: $(TEX_SOURCES) references.bib
 	chronic pdflatex $<
diff --git a/doc/report/abstract.tex b/doc/report/abstract.tex
index 98496d5dd3abe8b663df653b42b9b80b256aed78..d684a92c00f95ed0ff370d0aac6483443c3480c2 100644
--- a/doc/report/abstract.tex
+++ b/doc/report/abstract.tex
@@ -1,9 +1,9 @@
 Organizational data extracted from mathematical libraries has the
 potential to be usable in the design of a universal search engine
 for mathematical knowledge. However, it is not enough to only extract this
-data into a unified format, it is also necessary to make this data
-easily available. The project \emph{ulo-storage} which this report
-documents had the aim of doing just that. Collect various pieces of
+data into a unified format; it is also necessary to make this information
+easily available and queryable. The project \emph{ulo-storage}, which this report
+documents, aims to lay out the groundwork. We collect various pieces of
 exported data sets into a centralized and efficient store, make that
 storage engine available as a publicly available endpoint and then
 evaluate different ways of querying that store.
diff --git a/doc/report/applications.tex b/doc/report/applications.tex
index b5816de80bd59eb7580263902eb60376972da61c..48fc3b5ae5ce726104c00961facd48d872172c14 100644
--- a/doc/report/applications.tex
+++ b/doc/report/applications.tex
@@ -1,4 +1,4 @@
-\section{Applications}
+\section{Applications}\label{sec:applications}
 
 With endpoints in place, we can now query the ULO/RDF data
 set. Depending on the kind of application, different interfaces
diff --git a/doc/report/collecter.tex b/doc/report/collecter.tex
new file mode 100644
index 0000000000000000000000000000000000000000..6381f3f36d077fc67c282da4bece2c75738d3fbe
--- /dev/null
+++ b/doc/report/collecter.tex
@@ -0,0 +1,2 @@
+\section{Collecter and Importer}\label{sec:collecter}
+
diff --git a/doc/report/components.tex b/doc/report/components.tex
index b6cb50fa2e853053f561bfccf1322601432ea747..9deca8689ae1ca15640704f57cecc2caebdd45b6 100644
--- a/doc/report/components.tex
+++ b/doc/report/components.tex
@@ -1,15 +1,11 @@
-\section{Components}
-
-With various ULO/RDF files in place we have the aim of making the
-underlying data available for use with applications. For this, we
-should first make out the various components that might be involved in
-such a system. As a guide, figure~\ref{fig:components} illustrates the
-various components and their interplay. We will now give an overview
-over all involved components. Each component will later be discussed
-in more detail, this section serves only for the reader to get a
-general understanding of the developed infrastructure and its
-topology.
+\section{Components}\label{sec:components}
+With ULO/RDF files in place and available for download on
+MathHub~\cite{uloisabelle, ulocoq} as Git repositories, we have the
+aim of making the underlying data available for use in applications.
+It makes sense to first identify the various components that might
+be involved in such a system. They are illustrated in
+figure~\ref{fig:components}.
 
 \begin{figure}[]\begin{center}
   \includegraphics{figs/components}
 \begin{itemize}
 \item ULO/RDF data is present on various locations, be it Git
-  repositories, available on web servers via HTTP or on the local disk
-  of a user. Regardless where this ULO/RDF data is stored, a
-  \emph{Collecter} collects these {ULO/RDF}. In the easiest case, this
-  involves cloning a Git repository or crawling a file system for
-  matching files.
-
-\item With streams of ULO/RDF files at the Collecter, this information
-  then gets passed to the \emph{Importer}. The Importer imports
-  triplets from files into some kind of permanent storage. For use in
-  this project, the GraphDB~\cite{graphdb} triplet store was natural
-  fit. In practice, both Collecter and Importer end up being one piece
-  of software, but this does not have to be the case.
+  repositories, available on web servers or on the local disk.
+  Regardless of where this ULO/RDF data is stored, it is the job of a
+  \emph{Collecter} to assemble these {ULO/RDF}~files and forward them
+  for further processing. This may involve cloning a Git repository or
+  crawling a file system.
+
+\item With streams of ULO/RDF files assembled by the Collecter, this
+  data then gets passed to an \emph{Importer}. The Importer uploads
+  the received RDF~streams into some kind of permanent storage. For
+  use in this project, the GraphDB~\cite{graphdb} triplet store was a
+  natural fit.
+
+  In practice, both Collecter and Importer end up being one piece of
+  software, but this does not have to be the case.
 
 \item Finally, with all triplets stored in a database, an
   \emph{Endpoint} is where applications access the underlying
   knowledge base. This does not necessarily need to be any specific
   software, rather the programming API of the database could be
-  understood as an endpoint of its own. However, some thought should
-  be put into designing an Endpoint that is convenient to use.
+  understood as an endpoint of its own. However, some thought can be
+  put into designing an Endpoint that is more convenient to use than
+  the one provided by the database.
 \end{itemize}
 
-Additionally, one could think of a \emph{Harvester} component. Before
-we assumed that the ULO/RDF triplets are already available as
-such. Indeed for this project this is the case as we worked on already
+In addition to these components, one could think of a \emph{Harvester}.
+So far we assumed that the ULO/RDF triplets are already available in
+RDF~format. Indeed, for this project this is the case, as we work with already
 exported triplets from the Isabelle and Coq libraries. However, this
-does not need to be the case. It might be desirable to automate the
-export from third party formats to ULO/RDF and indeed this is what a
-Harvester would do. It fetches mathematical knowledge from some
-remote source and then provides a volatile stream of ULO/RDF data to
-the Collecter, which then passes it to the Importer and so on. The big
-advantage of such an approach would be that exports from third party
-libraries can always be up to date and do not have to be initiated
-manually.
+does not need to be the case.
+
+It might be desirable to automate the export from third party formats
+to ULO/RDF and indeed this is what a Harvester would do. It fetches
+mathematical knowledge from some remote source and then provides a
+volatile stream of ULO/RDF data to the Collecter, which then passes it
+to the Importer and so on. The big advantage of such an approach would
+be that exports from third party libraries can always be up to date
+and do not have to be initiated manually. However, due to time
+constraints this idea was not followed up on in this project.
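[Editor's sketch: the combined Collecter/Importer described in the components.tex hunk above can be made concrete with a few lines of Python. This is an illustration only, not taken from the ulo-storage sources; the directory name "exports", the GraphDB location localhost:7200, the repository name "ulo" and the ".rdf" file extension are all assumptions. It crawls a file system for RDF/XML exports and uploads each file through GraphDB's RDF4J-compatible REST interface.]

    # Minimal sketch of a combined Collecter/Importer (assumptions:
    # local GraphDB at localhost:7200, repository "ulo", *.rdf files).
    import os
    import requests

    STATEMENTS = "http://localhost:7200/repositories/ulo/statements"

    def collect(root):
        """Collecter: crawl a directory tree for RDF/XML export files."""
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                if name.endswith(".rdf"):
                    yield os.path.join(dirpath, name)

    def import_file(path):
        """Importer: upload one file's triplets to the triplet store."""
        with open(path, "rb") as f:
            response = requests.post(
                STATEMENTS,
                data=f,
                headers={"Content-Type": "application/rdf+xml"},
            )
        response.raise_for_status()

    if __name__ == "__main__":
        for path in collect("exports"):
            import_file(path)

[Keeping the Collecter a generator and the Importer a plain upload step mirrors the report's observation that the two are conceptually separate even when shipped as one program.]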
+
+Each component (Collecter, Importer, Endpoint) will later be discussed
+in more detail. This section only serves to give the reader a general
+understanding of the developed infrastructure and its topology.
diff --git a/doc/report/conclusion.tex b/doc/report/conclusion.tex
new file mode 100644
index 0000000000000000000000000000000000000000..f01b33d8f820286f6c5c473991fbfa9783c878f2
--- /dev/null
+++ b/doc/report/conclusion.tex
@@ -0,0 +1 @@
+\section{Conclusion and Future Work}\label{sec:conclusion}
diff --git a/doc/report/endpoints.tex b/doc/report/endpoints.tex
index 72a6ebb7fab7ed91511b70eb864ec09cf008d711..2206d2f58e10a03c26904e39bce85d1eba80cdbc 100644
--- a/doc/report/endpoints.tex
+++ b/doc/report/endpoints.tex
@@ -1,4 +1,4 @@
-\section{Endpoints}
+\section{Endpoints}\label{sec:endpoints}
 
 With ULO/RDF triplets imported into a database, in our case GraphDB,
 we have all data available for querying. There are multiple approaches
diff --git a/doc/report/intro.tex b/doc/report/intro.tex
index a54ffb27b5b53735b6e2ba3a517c9aa2220baaaf..d6b094bfc9050d63ebe99d9b53232edd3fd7230b 100644
--- a/doc/report/intro.tex
+++ b/doc/report/intro.tex
@@ -1,4 +1,4 @@
-\section{Introduction}
+\section{Introduction}\label{sec:introduction}
 
 To tackle the vast array of mathematical publications, various ways
 of \emph{computerizing} mathematical knowledge have been researched and
@@ -6,7 +6,7 @@ developed. As it is already difficult for human mathematicians to keep
 even a subset of all mathematical knowledge in their mind, a hope is
 that computerization will yield great improvement to mathematical (and
 really any) research by making the results of all collected research
-readily available~\cite{onebrain}.
+readily available and easy to search~\cite{onebrain}.
 
 One research topic in this field is the idea of a \emph{tetrapodal
 search} that combines four distinct areas of mathematical knowledge
@@ -16,37 +16,43 @@ knowledge}, (2)~examples and concrete objects as \emph{concrete
 knowledge}, (4)~identifiers, references and their relationships,
 referred to as \emph{organizational knowledge}~\cite{tetra}.
 
-Tetrapodal search wants to provide a unified search engine that
-indexes each of those four different subsets of mathematical
-knowledge. Because all four kinds of knowledge are inherently
-different in their structure, tetrapodal search proposes that each
-kind of mathematical knowledge should be made available in a storage
-and index backend that fits exactly with the kind of data it is
-storing. With all four areas available for querying, tetrapodal search
-wants to then combine the four indexes into a single query interface.
+Tetrapodal search aims to provide a unified search engine that indexes
+each of the four different subsets of mathematical knowledge. Because
+all four kinds of knowledge are inherently different in their
+structure, tetrapodal search proposes that each kind of mathematical
+knowledge should be made available in a storage and index backend that
+fits exactly with the kind of data it is providing. With all four
+areas available for querying, tetrapodal search then intends to
+combine the four indexes into a single query interface.
 
-\subsection{Structuring Organizational Knowledge}
-
-As it stands, tetrapodal search is long before any usable prototype
-being available yet. Currently the interest is in providing storage
+As it stands, tetrapodal search is still far from a usable
+prototype. Currently, research is focused on providing storage
 backends and indexes for the four different kinds of mathematical
-knowledge. The focus of the project work this report is documenting is
-organizational knowledge, that is knowledge concerning itself with
-names, namespaces, filenames and so on.
+knowledge. The focus of \emph{ulo-storage} is to achieve this for
+organizational knowledge.
 
-A proposed way to structure such organizational data is the
+A previously proposed way to structure such organizational data is the
 \emph{upper level ontology} (ULO/RDF)~\cite{ulo}. ULO/RDF takes the
 form of an OWL~ontology and as such all organization information is
 stored as RDF~triplets with a unified schema of ULO~predicates.
 
 Some effort has been made to export existing databases of formal
-mathematical knowledge to {ULO/RDF}. In particular, there exist exports
-from Isabelle and Coq libraries~\cite{uloisabelle, ulocoq}. The
-resulting data set is already quite big, the Isabelle export containing
-more than 200~million triplets.
+mathematical knowledge to {ULO/RDF}. In particular, there exist
+exports from Isabelle and Coq libraries~\cite{uloisabelle,
+ulocoq}. The resulting data set is already quite large, with the
+Isabelle export alone containing more than 200~million triplets.
 
 Existing exports from Isabelle and Coq result in a single or multiple
 RDF~files. This is a convenient form for exchange and easily versioned
 using Git. However, considering the vast number of triplets, it is
 impossible to query easily and efficiently as it is. Ultimately this
-is what this project work was focused on: Making ULO/RDF accessible
-for querying and analysis.
+is what \emph{ulo-storage} is focused on: Making ULO/RDF
+accessible for querying and analysis.
+
+The remainder of this report is split into five~sections. First, in
+the following section~\ref{sec:components}, we take a look at the
+various components involved in this project. Each major component is
+then explained in a dedicated section; we will first collect data,
+store it in a database, make it available for querying and then try
+out some applications previously suggested in the literature. Finally,
+section~\ref{sec:conclusion} sums up the project and suggests some
+next steps to take.
diff --git a/doc/report/report.tex b/doc/report/report.tex
index a4e210a080501273ab504a0bf14cdc45c2c2ee53..c1723749fe26b36a750d0f08c17658c64882a678 100644
--- a/doc/report/report.tex
+++ b/doc/report/report.tex
@@ -48,8 +48,10 @@
 \input{intro.tex}
 \input{components.tex}
+\input{collecter.tex}
 \input{endpoints.tex}
 \input{applications.tex}
+\input{conclusion.tex}
 
 \newpage
 \printbibliography{}
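[Editor's sketch: to make the querying side of endpoints.tex concrete, the following fragment runs a plain SPARQL query over HTTP against a GraphDB repository. Everything specific here is an assumption rather than taken from the diff: the endpoint URL mirrors GraphDB's default RDF4J-style layout, and the ULO namespace URI and the ulo:theorem class are stand-ins for whatever predicates the actual Isabelle and Coq exports use.]

    # Hypothetical example: count "theorem" objects in the store.
    # Endpoint URL, namespace URI and predicate name are assumptions.
    import requests

    ENDPOINT = "http://localhost:7200/repositories/ulo"

    QUERY = """
    PREFIX ulo: <https://mathhub.info/ulo#>
    SELECT (COUNT(?x) AS ?n) WHERE { ?x a ulo:theorem . }
    """

    response = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()
    print(response.json()["results"]["bindings"][0]["n"]["value"])

[Any SPARQL-capable client would do equally well; the point is that once the Importer has filled the store, the Endpoint is just an HTTP URL.]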