diff --git a/doc/report/Makefile b/doc/report/Makefile
index 9d3686aeab1e6399ea72f146bbbb7a73dc11158f..f7218f44fe207854b6cb00f8990ec7455b328f69 100644
--- a/doc/report/Makefile
+++ b/doc/report/Makefile
@@ -1,4 +1,4 @@
-TEX_SOURCES = report.tex abstract.tex intro.tex components.tex \
+TEX_SOURCES = report.tex abstract.tex intro.tex \
 	endpoints.tex applications.tex conclusion.tex \
 	applications-ulo-table.tex collecter.tex
 
diff --git a/doc/report/components.tex b/doc/report/components.tex
deleted file mode 100644
index 087c37cad7b52bf90d73f9f88c85ac0491c9f9ea..0000000000000000000000000000000000000000
--- a/doc/report/components.tex
+++ /dev/null
@@ -1,63 +0,0 @@
-\section{Components}\label{sec:components}
-
-With ULO/RDF files in place and available for download on
-MathHub~\cite{uloisabelle, ulocoq} as Git repositories, we have the
-aim of making the underlying data available for use in applications.
-It makes sense to first identify the various components that might
-be involved in such a system. They are illustrated in
-figure~\ref{fig:components}.
-
-\begin{figure}[]\begin{center}
-  \includegraphics{figs/components}
-  \caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
-\end{center}\end{figure}
-
-\begin{itemize}
-\item ULO data is present on various locations, be it Git
-  repositories, available on web servers or on the local disk.
-  Regardless where this ULO data is stored, it is the job of a
-  \emph{Collecter} to assemble these {RDF}~files and forward them
-  for further processing. This may involve cloning a Git repository or
-  crawling a file system.
-
-\item With streams of ULO files assembled by the Collecter, this
-  data then gets passed to an \emph{Importer}. An Importer uploads
-  received RDF~streams into some kind of permanent storage. For
-  use in this project, the GraphDB~\cite{graphdb} triplet store was
-  natural fit.
-
-  In this project, both Collecter and Importer ended up being one piece
-  of software, but this does not have to be the case.
-
-\item Finally, with all triplets stored in a database, an
-  \emph{Endpoint} is where applications access the underlying
-  knowledge base. This does not necessarily need to be any specific
-  software, rather the programming interface of the underlying
-  database itself could be understood as an endpoint of its
-  own.
-
-  Regardless, some thought can be put into designing an Endpoint as a
-  layer that lives between application and database that is more
-  convenient to use than the one provided by the database.
-\end{itemize}
-
-These are the components realized for \emph{ulo-storage}. Additionally
-to these components, one could think of a \emph{Harvester} component.
-We assumed that the ULO triplets are already available in RDF~format.
-Indeed for this project this was the case as we work with already
-exported triplets from the Isabelle and Coq libraries. However, this
-is not necessarily true.
-
-It might be desirable to automate the export from third party formats
-to ULO and we think this is what a Harvester should do. It fetches
-mathematical knowledge from some remote source and then provides a
-volatile stream of ULO data to the Collecter, which then passes it
-to the Importer and so on. The big advantage of such an approach would
-be that exports from third party libraries can always be up to date
-and do not have to be initiated manually. We did not implement a Harvester
-for \emph{ulo-storage} but we suggest that it is an idea to keep in mind.
-
-Each component we did implement (Collecter, Importer, Endpoint) will
-now be discussed in a separate section. Here we only wanted to give
-the reader a general understanding in the infrastructure that makes
-up \emph{ulo-storage}.
diff --git a/doc/report/endpoints.tex b/doc/report/endpoints.tex
index d62e896f24598aa7e1c17c3cb7727e64ccb0e36f..db66855c11e44b6f1d0ccb15b5ed327a77172f3a 100644
--- a/doc/report/endpoints.tex
+++ b/doc/report/endpoints.tex
@@ -1,65 +1,74 @@
 \section{Endpoints}\label{sec:endpoints}
 
-With ULO/RDF triplets imported into the GraphDB triplet store, we have
-all data available for querying. There are multiple approaches to
-querying this database, one based around the standardized SPARQL query
-language and the other on the RDF4J Java library implemented by
-various vendors. We will quickly look at both approaches as they have
-unique advantages.
+With ULO triplets imported into the GraphDB triplet store by Collecter
+and Importer, we now have all data available for querying. As
+discussed before, querying from applications happens through an
+Endpoint that exposes some kind of {API}. The interesting question
+here is probably not so much the implementation of the Endpoint
+itself; rather, it is the choice of API that can make or break such a
+project.
 
-\subsection{SPARQL}
+There are multiple approaches to querying the GraphDB triplet store,
+one based around the standardized SPARQL query language and the other
+on the RDF4J Java library implemented by various vendors. Both
+approaches have unique advantages.
 
-SPARQL is a standardized query language for RDF triplet
-data~\cite{sparql}. The spec includes not just syntax and semantics of
-the language itself, but also a standardized REST interface for
-querying databases. Various implementations of this standard,
-e.g.~\cite{gosparql}, are available so using SPARQL has the advantage
-of making us independent of a specific programming language or
-environment.
+\subsection{Available Application Interfaces}
 
-SPARQL is inspired by SQL and as such the \texttt{SELECT} \texttt{WHERE}
-syntax should be familiar to many software developers. A simple query
-that returns all triplets in the store looks like
-\begin{verbatim}
-  SELECT * WHERE { ?s ?p ?o }
-\end{verbatim}
-where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
-variables. The result of a query are valid substitutions for the query
-variables. In this case, the database would return a table of all
-triplets in the store sorted by subject~\texttt{?o},
-predicate~\texttt{?p} and object~\texttt{?o}.
+\begin{itemize}
+  \item SPARQL is a standardized query language for RDF triplet
+    data~\cite{sparql}. The specification includes not just syntax
+    and semantics of the language itself, but also a standardized
+    REST interface for querying databases. Various implementations
+    of this standard, e.g.~\cite{gosparql}, are available, so using
+    SPARQL has the advantage of making us independent of a specific
+    programming language or environment.
 
-Of course, queries might return a lot of data. Importing just the
-Isabelle exports into GraphDB results in more than 200 million
-triplets. For practical applications it will be necessary to limit
-the number of result or use pagination
-techniques~\cite{sparqlpagination}.
+    SPARQL is inspired by SQL and as such the \texttt{SELECT}
+    \texttt{WHERE} syntax should be familiar to many software
+    developers. A simple query that returns all triplets in the
+    store looks like
+    \begin{verbatim}
+    SELECT * WHERE { ?s ?p ?o }
+    \end{verbatim}
+    where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
+    variables. The result of a query is a set of valid substitutions
+    for the query variables. In this case, the database would return
+    a table of all triplets in the store sorted by
+    subject~\texttt{?s}, predicate~\texttt{?p} and
+    object~\texttt{?o}.
 
-\subsection{RDF4J}
+    Of course, queries might return a lot of data. Importing just
+    the Isabelle exports into GraphDB results in more than 200
+    million triplets. For practical applications it will be
+    necessary to limit the number of results or use pagination
+    techniques~\cite{sparqlpagination}.
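+
+    One simple form of pagination, sketched here with the standard
+    SPARQL~1.1 solution modifiers, fetches the result set in pages;
+    the query below would return the third page of 100 triplets:
+    \begin{verbatim}
+    SELECT * WHERE { ?s ?p ?o }
+    ORDER BY ?s LIMIT 100 OFFSET 200
+    \end{verbatim}
+    The \texttt{ORDER BY} clause matters here: without it, SPARQL
+    guarantees no particular result order, so pages requested one
+    after another may overlap or skip triplets.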
 
-RDF4J is a Java API for interacting with triplet stores, implemented
-based on a superset of {SPARQL}~\cite{rdf4j}. GraphDB supports RDF4J,
-in fact it is the recommended way of interacting with GraphDB
-repositories~\cite{graphdbapi}. Instead of formulating textual
-queries, RDF4J allows developers to query a repository by calling Java
-API methods. Above query that returns all triplets in the store looks
-like
-\begin{verbatim}
-  connection.getStatements(null, null, null);
-\end{verbatim}
-in RDF4J. Method \texttt{getStatements(s, p, o)} returns all triplets
-that have matching subject~\texttt{s}, predicate~\texttt{p} and
-object~\texttt{o}. If any of these arguments is \texttt{null}, it can
-be any value, i.e.\ it is a query variable that is to be filled by the
-call to \texttt{getStatements}.
+  \item RDF4J is a Java API for interacting with triplet stores,
+    implemented based on a superset of
+    {SPARQL}~\cite{rdf4j}. GraphDB supports RDF4J; in fact, it is
+    the recommended way of interacting with GraphDB
+    repositories~\cite{graphdbapi}. Instead of formulating textual
+    queries, RDF4J allows developers to query a repository by
+    calling Java API methods. The above query that returns all
+    triplets in the store looks like
+    \begin{verbatim}
+    connection.getStatements(null, null, null);
+    \end{verbatim}
+    in RDF4J. Method \texttt{getStatements(s, p, o)} returns all
+    triplets that have matching subject~\texttt{s},
+    predicate~\texttt{p} and object~\texttt{o}. If any of these
+    arguments is \texttt{null}, it can be any value, i.e.\ it is a
+    query variable that is to be filled by the call to
+    \texttt{getStatements}.
 
-Using RDF4J does introduce a dependency on the JVM family of
-languages, but also offers some conveniences. For example, we can
-generate Java classes that contain all URIs in an OWL ontology as
-constants~\cite{rdf4jgen}. In combination with IDE support, we found
-this to be very convenient when writing applications that interface
-with ULO data sets.
+    Using RDF4J does introduce a dependency on the JVM family of
+    languages, but it also offers some conveniences. For example, we
+    can generate Java classes that contain all URIs in an OWL
+    ontology as constants~\cite{rdf4jgen}. In combination with IDE
+    support, we found this to be very convenient when writing
+    applications that interface with ULO data sets.
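+
+    As a minimal sketch of this interface, assume a local GraphDB
+    instance on its default port~7200 and a repository named
+    \texttt{ulo}; both names are placeholders for the concrete
+    setup. The following program prints every triplet in the store:
+    \begin{verbatim}
+    import org.eclipse.rdf4j.model.Statement;
+    import org.eclipse.rdf4j.repository.RepositoryConnection;
+    import org.eclipse.rdf4j.repository.RepositoryResult;
+    import org.eclipse.rdf4j.repository.http.HTTPRepository;
+
+    public class PrintAll {
+        public static void main(String[] args) {
+            // Hypothetical repository URL; adjust to your setup.
+            HTTPRepository repo = new HTTPRepository(
+                "http://localhost:7200/repositories/ulo");
+            try (RepositoryConnection conn = repo.getConnection();
+                 RepositoryResult<Statement> result =
+                     conn.getStatements(null, null, null)) {
+                while (result.hasNext()) {
+                    System.out.println(result.next());
+                }
+            }
+        }
+    }
+    \end{verbatim}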
 
-\subsection{Comparision}
+\end{itemize}
+
+\subsection{Comparison}
 
 \emph{TODO}
diff --git a/doc/report/intro.tex b/doc/report/intro.tex
index 7db185f40d04018e094a4a87f74d07fdee332009..3b36c5506b868780b36933c474bfb8ae367d66d1 100644
--- a/doc/report/intro.tex
+++ b/doc/report/intro.tex
@@ -1,16 +1,17 @@
 \section{Introduction}\label{sec:introduction}
 
-To tackle the vast array of mathematical publications, various ways of
-\emph{computerizing} mathematical knowledge have been researched and
-developed. As it is already difficult for human mathematicians to keep
-even a subset of all mathematical knowledge in their mind, a hope is
-that computerization will yield great improvement to mathematical (and
-really any) research by making the results of all collected research
-readily available and easy to search~\cite{onebrain}.
+To tackle the vast array of mathematical
+publications, various ways of \emph{computerizing} mathematical
+knowledge have been explored. As it is already difficult for
+human mathematicians to keep even a subset of all mathematical
+knowledge in their mind, a hope is that computerization will yield
+great improvement to mathematical (and really any formal) research by
+making the results of all collected publications readily available and
+easy to search~\cite{onebrain}.
 
 One research topic in this field is the idea of a \emph{tetrapodal
-search} that combines four distinct areas of mathematical knowledge
-and data. These four kinds being (1)~the actual formulae as \emph{symbolic
+search} that combines four distinct areas of mathematical knowledge.
+These four kinds are (1)~the actual formulae as \emph{symbolic
 knowledge}, (2)~examples and concrete objects as \emph{concrete
 knowledge}, (3)~names and comments as \emph{narrative knowledge} and
 finally (4)~identifiers, references and their relationships, referred to as
@@ -25,10 +26,11 @@ exactly with the kind of data it is providing.
 With all four areas available for querying, tetrapodal search intends
 to then combine the four indexes into a single query interface.
 
-As it stands, tetrapodal search is not really close to usable
-prototypes. Research is focused on providing schemer's, storage backends
+\subsection{Focus on Organizational Knowledge}
+
+Currently, research is focused on providing schemas, storage backends
 and indexes for the four different kinds of mathematical
-knowledge. The focus of the \emph{ulo-storage} project is the area of
+knowledge. The focus of \emph{ulo-storage} is the area of
 organizational knowledge.
 
 A previously proposed way to structure such organizational data is the
@@ -41,19 +43,81 @@ from Isabelle and Coq libraries~\cite{uloisabelle, ulocoq}. The
 resulting data set is already quite large, the Isabelle export alone
 containing more than 200~million triplets.
 
-Existing exports from Isabelle and Coq result in a single or multiple
-RDF~files. This is a convenient form for exchange and easily versioned
-using Git. However, considering the vast number of triplets, it is
-impossible to query easily and efficiently as it is. Ultimately this
-is what \emph{ulo-storage} work is focused on: Making ULO data sets
-accessible for querying and analysis. We collect the RDF files spread
-from different Git repositories, import them into a database and then
-provide an API for accessing the data set.
+Existing exports from Isabelle and Coq result in one or more
+RDF~files. This is a convenient format for exchange and easily
+versioned using Git. However, considering the vast number of
+triplets, it is impossible to query the data easily and efficiently
+in this state. This is what \emph{ulo-storage} is focused on: Making
+ULO data sets accessible for querying and analysis. We collected RDF
+files spread over different Git repositories, imported them into a
+database and then experimented with APIs for accessing that data set.
 
 The main contribution of this project is twofold. First, (1) we built
 up various infrastructure components that can make up building blocks
 in a larger tetrapodal search system. Second, (2)~we ran sample
 prototype applications and queries on top of this interface. While the
-applications themselves are admittedly not very interesting, they give
-insight for future development of~{ULO}. The following section
-provides an overview of the involved components in \emph{ulo-storage}.
+applications themselves are admittedly not very interesting, they can
+give us insight into the future development of~{ULO}.
+
+\subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
+
+With RDF files exported and available for download as Git repositories
+on MathHub, we have the goal of making the underlying data available
+for use in applications. It makes sense to first identify the various
+components that might be involved in such a system. Figure~\ref{fig:components}
+illustrates all components and their relationships.
+
+\begin{figure}[]\begin{center}
+  \includegraphics{figs/components}
+  \caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
+\end{center}\end{figure}
+
+\begin{description}
+\item[Collecter] ULO triplets are present in various locations, be it
+  Git repositories, web servers or the local disk. It is the job of a
+  Collecter to assemble these {RDF}~files and forward them for
+  further processing. This may involve cloning a Git repository or
+  crawling the file system.
+
+\item[Importer] With streams of ULO files assembled by the Collecter,
+  this data then gets passed to an Importer. An Importer uploads
+  received RDF~streams into some kind of permanent storage. For use
+  in this project, the GraphDB~\cite{graphdb} triplet store was a
+  natural fit.
+
+  In this project, both Collecter and Importer ended up being one
+  piece of software, but this does not have to be the case. A sketch
+  of this pipeline follows below.
+
+\item[Endpoint] Finally, with all triplets stored in a database, an
+  Endpoint is where applications access the underlying knowledge
+  base. This does not necessarily need to be any custom software;
+  rather, the programming interface of the underlying database itself
+  could be understood as an endpoint of its own.
+
+  Regardless, some thought can be put into designing an Endpoint as a
+  layer between application and database that is more convenient to
+  use than the interface provided by the database itself.
+\end{description}
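+
+To make the Collecter and Importer roles concrete, here is a minimal
+sketch in Java using RDF4J. It assumes a hypothetical local checkout
+of ULO exports under \texttt{/data/ulo-exports} containing RDF/XML
+files and a GraphDB repository named \texttt{ulo}; error handling is
+omitted.
+\begin{verbatim}
+import java.io.IOException;
+import java.nio.file.*;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import org.eclipse.rdf4j.repository.RepositoryConnection;
+import org.eclipse.rdf4j.repository.http.HTTPRepository;
+import org.eclipse.rdf4j.rio.RDFFormat;
+
+public class CollectAndImport {
+    public static void main(String[] args) throws IOException {
+        // Collecter: crawl the file system for exported RDF files.
+        List<Path> rdfFiles;
+        try (Stream<Path> walk = Files.walk(Path.of("/data/ulo-exports"))) {
+            rdfFiles = walk.filter(p -> p.toString().endsWith(".rdf"))
+                           .collect(Collectors.toList());
+        }
+        // Importer: upload each file into the GraphDB triplet store.
+        HTTPRepository repo = new HTTPRepository(
+            "http://localhost:7200/repositories/ulo");
+        try (RepositoryConnection conn = repo.getConnection()) {
+            for (Path p : rdfFiles) {
+                conn.add(p.toFile(), null, RDFFormat.RDFXML);
+            }
+        }
+    }
+}
+\end{verbatim}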
+
+\subsection{An Additional Harvester Component}
+
+These are the components realized for \emph{ulo-storage}. However, in
+addition to these components, one could think of a \emph{Harvester}
+component. We assumed that the ULO triplets are already available in
+RDF~format. This is not necessarily true. It might be desirable to
+automate the export from third-party formats to ULO and we think this
+should be the job of a Harvester component. It fetches mathematical
+knowledge from some remote source and then provides a volatile stream
+of ULO data to the Collecter, which then passes it to the Importer
+and so on.
+The big advantage of such an approach would be that exports from
+third-party libraries can always be up to date and do not have to be
+initiated manually.
+
+We did not implement a Harvester for \emph{ulo-storage} but we suggest
+that it is an idea to keep in mind. The components we did implement
+(Collecter, Importer and Endpoint) provide us with an easy and
+automated way of making RDF files ready for use with applications. In
+this introduction we only wanted to give the reader a general
+understanding of the infrastructure that makes up \emph{ulo-storage};
+the following sections will explain each component in more detail.
diff --git a/doc/report/report.tex b/doc/report/report.tex
index 28222aa6f0471c5020b7893d55148aea27c2a891..a3c802aad323cbcd3a62a4307dc1d528ce5f9fa7 100644
--- a/doc/report/report.tex
+++ b/doc/report/report.tex
@@ -50,7 +50,6 @@
 \newpage
 
 \input{intro.tex}
-\input{components.tex}
 \input{collecter.tex}
 \input{endpoints.tex}
 \input{applications.tex}