diff --git a/doc/report/Makefile b/doc/report/Makefile
index bf2351705bfa42007f9346d1d824175ca449951a..9d3686aeab1e6399ea72f146bbbb7a73dc11158f 100644
--- a/doc/report/Makefile
+++ b/doc/report/Makefile
@@ -1,6 +1,6 @@
 TEX_SOURCES = report.tex abstract.tex intro.tex components.tex \
 	endpoints.tex applications.tex conclusion.tex \
-	applications-ulo-table.tex
+	applications-ulo-table.tex collecter.tex
 
 report.pdf: $(TEX_SOURCES) references.bib
 	pdflatex $<
diff --git a/doc/report/applications.tex b/doc/report/applications.tex
index 09f5a181946f8e338cd23d876b7c6862cc92b213..8d9be1334ef2cb7c67efb301c926088612e08c9c 100644
--- a/doc/report/applications.tex
+++ b/doc/report/applications.tex
@@ -1,4 +1,3 @@
-\newpage
 \section{Applications}\label{sec:applications}
 
 With endpoints in place, we can now query the ULO/RDF data set. This
diff --git a/doc/report/collecter.tex b/doc/report/collecter.tex
index cdbbc49e3641aebce50e128415c5bb63756d900c..d5237d0e9a5c17e4db14fba9196df1b462b39df3 100644
--- a/doc/report/collecter.tex
+++ b/doc/report/collecter.tex
@@ -1,3 +1,4 @@
-\newpage
 \section{Collecter and Importer}\label{sec:collecter}
+\emph{TODO}
+
 
diff --git a/doc/report/components.tex b/doc/report/components.tex
index ae4036140b3d52024cec2c041e2927f53e535532..087c37cad7b52bf90d73f9f88c85ac0491c9f9ea 100644
--- a/doc/report/components.tex
+++ b/doc/report/components.tex
@@ -1,10 +1,9 @@
-\newpage
 \section{Components}\label{sec:components}
 
 With ULO/RDF files in place and available for download on
 MathHub~\cite{uloisabelle, ulocoq} as Git repositories, we have the
 aim of making the underlying data available for use in applications.
-It makes sense to first identify out the various components that might
+It makes sense to first identify the various components that might
 be involved in such a system. They are illustrated in
 figure~\ref{fig:components}.
 
@@ -14,47 +13,65 @@
 \end{center}\end{figure}
 
 \begin{itemize}
-\item ULO/RDF data is present on various locations, be it Git
-  repositories, available on web servers or on the local disk.
-  Regardless where this ULO/RDF data is stored, it is the job of a
-  \emph{Collecter} to assemble these {ULO/RDF}~files and forward them
+\item ULO data is present in various locations, be it Git
+  repositories, available on web servers or on the local disk.
+  Regardless of where this ULO data is stored, it is the job of a
+  \emph{Collecter} to assemble these {RDF}~files and forward them
   for further processing. This may involve cloning a Git repository
   or crawling a file system.
 
-\item With streams of ULO/RDF files assembled bye the Collecter, this
-  data then gets passed to an \emph{Importer}. The Importer uploads
-  the received RDF~streams into some kind of permanent storage. For
-  use in this project, the GraphDB~\cite{graphdb} triplet store was
-  natural fit.
+\item With streams of ULO files assembled by the Collecter, this
+  data then gets passed to an \emph{Importer}. An Importer uploads
+  received RDF~streams into some kind of permanent storage. For
+  use in this project, the GraphDB~\cite{graphdb} triplet store was
+  a natural fit.
 
-  In practice, both Collecter and Importer end up being one piece of
-  software, but this does not have to be the case.
+  In this project, both Collecter and Importer ended up being one piece
+  of software, but this does not have to be the case.
 
 \item Finally, with all triplets stored in a database, an
   \emph{Endpoint} is where applications access the underlying
   knowledge base. This does not necessarily need to be any specific
-  software, rather the programming API of the database could be
-  understood as an endpoint of its own. However, some thought can be
-  put into designing an Endpoint that is more convenient to use than
-  the one provided by the database.
+  software; rather, the programming interface of the underlying
+  database itself could be understood as an endpoint of its
+  own.
+
+  Regardless, some thought can be put into designing an Endpoint as
+  a layer between application and database that is more convenient
+  to use than the interface the database provides.
 \end{itemize}
 
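+To make the Collecter's job described above more concrete, here is a
+minimal, hypothetical sketch of a file system crawler. The path and
+the \texttt{importer} object are illustrative assumptions, not the
+actual \emph{ulo-storage} code:
+\begin{verbatim}
+    // Hypothetical sketch: walk a directory tree, pick out all
+    // *.rdf files and hand each of them to an importer for upload.
+    // Assumes the usual java.nio.file and java.util.stream imports.
+    try (Stream<Path> files = Files.walk(Paths.get("/data/ulo"))) {
+        files.filter(path -> path.toString().endsWith(".rdf"))
+             .forEach(importer::upload);
+    }
+\end{verbatim}
+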
+These are the components realized for \emph{ulo-storage}. In addition
+to these components, one could think of a \emph{Harvester} component.
+We assumed that the ULO triplets are already available in RDF~format.
+Indeed, for this project this was the case, as we worked with already
 exported triplets from the Isabelle and Coq libraries. However, this
-does not need to be the case.
+is not necessarily true.
 
 It might be desirable to automate the export from third party formats
-to ULO/RDF and indeed this is what a Harvester would do. It fetches
+to ULO, and we think this is what a Harvester should do. It fetches
 mathematical knowledge from some remote source and then provides a
-volatile stream of ULO/RDF data to the Collecter, which then passes it
+volatile stream of ULO data to the Collecter, which then passes it
 to the Importer and so on. The big advantage of such an approach would
 be that exports from third party libraries can always be up to date
-and do not have to be initiated manually. However, due to time
-constraints this idea was not followed up upon in this project.
+and do not have to be initiated manually. We did not implement a
+Harvester for \emph{ulo-storage}, but it is an idea worth keeping in mind.
 
-Each component (Collecter, Importer, Endpoint) will later be discussed
-in more detail. This section serves only for the reader to get a
-general understanding of the developed infrastructure and its
-topology.
+Each component we did implement (Collecter, Importer, Endpoint) will
+now be discussed in a separate section. Here we only wanted to give
+the reader a general understanding of the infrastructure that makes
+up \emph{ulo-storage}.
diff --git a/doc/report/conclusion.tex b/doc/report/conclusion.tex
index be8411870ec7be43e661655e6d6bc775a1992b49..f01b33d8f820286f6c5c473991fbfa9783c878f2 100644
--- a/doc/report/conclusion.tex
+++ b/doc/report/conclusion.tex
@@ -1,2 +1 @@
-\newpage
 \section{Conclusion and Future Work}\label{sec:conclusion}
diff --git a/doc/report/endpoints.tex b/doc/report/endpoints.tex
index ebc6ed684539809272e1d9ac05d47d3e5ae6db27..d62e896f24598aa7e1c17c3cb7727e64ccb0e36f 100644
--- a/doc/report/endpoints.tex
+++ b/doc/report/endpoints.tex
@@ -1,45 +1,58 @@
-\newpage
 \section{Endpoints}\label{sec:endpoints}
 
-With ULO/RDF triplets imported into a database, in our case GraphDB,
-we have all data available for querying. There are multiple approaches
-to querying such triplet stores.
+With ULO/RDF triplets imported into the GraphDB triplet store, we have
+all data available for querying. There are multiple approaches to
+querying this database: one is based on the standardized SPARQL query
+language, another on RDF4J, a Java library supported by various
+triplet store vendors. We will take a quick look at both approaches
+as each has unique advantages.
 
 \subsection{SPARQL}
 
-SPARQL~\cite{sparql} is a standardized query language for RDF triplet
-data. The spec includes not just syntax and semantics of the language
-itself, but also a standardized REST interface for querying databases.
-Various implementations of this standard, e.g.~\cite{gosparql}, are
-available so using SPARQL has the advantage of making us independent
-of a specific programming language or environment.
+SPARQL is a standardized query language for RDF triplet
+data~\cite{sparql}. The specification includes not just syntax and
+semantics of the language itself, but also a standardized REST
+interface for querying databases. Various implementations of this
+standard, e.g.~\cite{gosparql}, are available, so using SPARQL has
+the advantage of making us independent of a specific programming
+language or environment.
 
-SPARQL is inspired by SQL, a simple query that returns all triplets
-in the store looks like
+SPARQL is inspired by SQL and as such the \texttt{SELECT} \texttt{WHERE}
+syntax should be familiar to many software developers. A simple query
+that returns all triplets in the store looks like
 \begin{verbatim}
     SELECT * WHERE { ?s ?p ?o }
 \end{verbatim}
-where \texttt{s}, \texttt{p} and \texttt{o} are query variables. The
-result of a query are valid substitutions for the query variables. In
-this case, the database would return a table of all triplets in the
-store sorted by subject~\texttt{o}, predicate~\texttt{p} and
-object~\texttt{o}.
+where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
+variables. The results of a query are valid substitutions for the
+query variables. In this case, the database would return a table of
+all triplets in the store, listed by subject~\texttt{?s},
+predicate~\texttt{?p} and object~\texttt{?o}.
 
 Of course, queries might return a lot of data. Importing just the
-Isabelle exports~\cite{uloisabelle} into GraphDB results in more than
-200M triplets. This is solved with pagination
+Isabelle exports into GraphDB results in more than 200 million
+triplets. For practical applications it will be necessary to limit
+the number of results or use pagination techniques~\cite{sparqlpagination}.
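+As an illustration, a page of at most 100 triplets can be requested
+with the standard \texttt{ORDER BY}, \texttt{LIMIT} and
+\texttt{OFFSET} modifiers:
+\begin{verbatim}
+    SELECT * WHERE { ?s ?p ?o }
+    ORDER BY ?s LIMIT 100 OFFSET 0
+\end{verbatim}
+Increasing \texttt{OFFSET} in steps of 100 then pages through the
+full result set.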
 
 \subsection{RDF4J}
 
-RDF4J~\cite{rdf4j} is a Java API for interacting with triplet stores,
-implemented based on a superset of SPARQL. GraphDB supports RDF4J, in
-fact it is the recommended way of interacting with GraphDB
-repositories [6].
-
-Instead of formulating textual queries, RDF4J allows developers to
-query a repository by calling Java API methods. Above query that
-returns all triplets in the store looks like
+RDF4J is a Java API for interacting with triplet stores, based on a
+superset of {SPARQL}~\cite{rdf4j}. GraphDB supports RDF4J; in fact,
+it is the recommended way of interacting with GraphDB
+repositories~\cite{graphdbapi}. Instead of formulating textual
+queries, RDF4J allows developers to query a repository by calling
+Java API methods. The above query that returns all triplets in the
+store looks like
 \begin{verbatim}
     connection.getStatements(null, null, null);
 \end{verbatim}
@@ -48,3 +61,28 @@
 that have matching subject~\texttt{s}, predicate~\texttt{p} and
 object~\texttt{o}. If any of these arguments is \texttt{null}, it can
 be any value, i.e.\ it is a query variable that is to be filled by the call to \texttt{getStatements}.
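+
+For example, a hypothetical query for all subjects that ULO tags as
+theorems would bind the predicate and object arguments while leaving
+the subject open. The \texttt{ulo} IRI below is an illustrative
+assumption, not necessarily the exact ULO namespace:
+\begin{verbatim}
+    // vf is an org.eclipse.rdf4j ValueFactory; RDF.TYPE comes
+    // from the RDF4J vocabulary classes.
+    ValueFactory vf = SimpleValueFactory.getInstance();
+    connection.getStatements(
+        null,
+        RDF.TYPE,
+        vf.createIRI("https://mathhub.info/ulo#theorem"));
+\end{verbatim}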
+
+Using RDF4J does introduce a dependency on the JVM family of
+languages, but also offers some conveniences. For example, we can
+generate Java classes that contain all URIs in an OWL ontology as
+constants~\cite{rdf4jgen}. In combination with IDE support, we found
+this to be very convenient when writing applications that interface
+with ULO data sets.
+
+\subsection{Comparison}
+
+\emph{TODO}
diff --git a/doc/report/intro.tex b/doc/report/intro.tex
index d6b094bfc9050d63ebe99d9b53232edd3fd7230b..7db185f40d04018e094a4a87f74d07fdee332009 100644
--- a/doc/report/intro.tex
+++ b/doc/report/intro.tex
@@ -20,39 +20,46 @@
 Tetrapodal search aims to provide a unified search engine that
 indexes each of the four different subsets of mathematical knowledge.
 Because all four kinds of knowledge are inherently different in their
 structure, tetrapodal search proposes that each kind of mathematical
-knowledge should be made available in a storage and index backend that
-fits exactly with the kind of data it is providing. With all four
-areas available for querying, tetrapodal search then intends to then
-combine the four indexes into a single query interface.
+knowledge should be made available in a storage backend that fits
+exactly with the kind of data it is providing. With all four areas
+available for querying, tetrapodal search then intends to combine
+the four indexes into a single query interface.
 
 As it stands, tetrapodal search is not really close to usable
-prototypes Currently, research is focused on providing storage
-backends and indexes for the four different kinds of mathematical
-knowledge. The focus of \emph{ulo-storage} is to achieve this for
+prototypes. Research is focused on providing schemas, storage backends
+and indexes for the four different kinds of mathematical
+knowledge. The focus of the \emph{ulo-storage} project is the area of
 organizational knowledge.
 
 A previously proposed way to structure such organizational data is the
-\emph{upper level ontology} (ULO/RDF)~\cite{ulo}. ULO/RDF takes the
-form of an OWL~ontology and as such all organization information is
-stored as RDF~triplets with a unified schema of ULO~predicates. Some
+\emph{upper level ontology} (ULO)~\cite{ulo}. ULO takes the form of an
+OWL~ontology and as such all organizational information is stored as
+RDF~triplets with a unified schema of ULO~predicates~\cite{owl}. Some
 effort has been made to export existing databases of formal
-mathematical knowledge to {ULO/RDF}. In particular, there exist
-exports from Isabelle and Coq libraries~\cite{uloisabelle,
-ulocoq}. The resulting data set is already quite large, the Isabelle
-export alone containing more than 200~million triplets.
+mathematical knowledge to {ULO}. In particular, there exist exports
+from Isabelle and Coq libraries~\cite{uloisabelle, ulocoq}. The
+resulting data set is already quite large, the Isabelle export alone
+containing more than 200~million triplets.
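+For illustration, a single such triplet could mark one library item
+as a theorem; the prefix notation and IRIs below are made up for
+this example:
+\begin{verbatim}
+    <https://example.org/lib/thm1> rdf:type ulo:theorem .
+\end{verbatim}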
 
 Existing exports from Isabelle and Coq result in a single or multiple
 RDF~files. This is a convenient form for exchange and easily versioned
 using Git. However, considering the vast number of triplets, it is
 impossible to query easily and efficiently as it is. Ultimately this
-is what \emph{ulo-storage} work is focused on: Making ULO/RDF
-accessible for querying and analysis.
+is what the \emph{ulo-storage} work is focused on: making ULO data sets
+accessible for querying and analysis. We collect the RDF files spread
+across different Git repositories, import them into a database and
+then provide an API for accessing the data set.
 
-The remainder of this report is split up in $n$~sections. First, in
-the following section~\ref{sec:components} we take a look at the
-various components involved in this project. Each major component is
-then explained in a dedicated section; we will first collect data,
-store it in a database, make it available for querying and then try
-out some applications previously suggested in literature. Finally,
-section~\ref{sec:conclusion} sums up the project and suggests some
-next steps to take.
+The main contribution of this project is twofold. First, (1)~we built
+various infrastructure components that can serve as building blocks
+in a larger tetrapodal search system. Second, (2)~we ran prototype
+applications and queries on top of this infrastructure. While the
+applications themselves are admittedly not very interesting, they
+provide insight for the future development of~{ULO}. The following
+section provides an overview of the components in \emph{ulo-storage}.
diff --git a/doc/report/references.bib b/doc/report/references.bib
index 517f834d4dfbb97835836b1ae70a3202ae49fcdb..7a5094ffcf13edc1956585e67f7eceb1c826635b 100644
--- a/doc/report/references.bib
+++ b/doc/report/references.bib
@@ -116,8 +116,16 @@
 urldate = {2020-06-30},
 url = {https://www.dublincore.org/schemas/rdfs/},
 }
+
 @online{afp,
 title={Archive of Formal Proofs},
 urldate = {2020-07-01},
 url = {https://www.isa-afp.org/},
+}
+
+@online{rdf4jgen,
+  title = {ansell/rdf4j-schema-generator},
+  author = {Peter Ansell},
+  urldate = {2020-07-02},
+  url = {https://github.com/ansell/rdf4j-schema-generator},
 }
\ No newline at end of file