diff --git a/doc/report/applications-preds.tex b/doc/report/applications-preds.tex index 878bcd12641a8a7080e1353b755442cbadb06690..f6691c0dff87ee8c89178a8e9d6827362494573a 100644 --- a/doc/report/applications-preds.tex +++ b/doc/report/applications-preds.tex @@ -12,8 +12,8 @@ ORDER BY DESC(?count) \end{lstlisting} \caption{SPARQL query that returns a list of all - \texttt{predicate}s used in the backing store. We include - the \texttt{ulo} prefix so the results are printed in a + \texttt{predicate}s used in the backing store. We include the + \texttt{ulo} prefix so that the results are printed in a concise human readable format.}\label{fig:preds-query} \end{subfigure} \vspace{0.5cm} diff --git a/doc/report/applications.tex b/doc/report/applications.tex index b95ba0f2d39b512bd1e6c1aa7f7502a4e3ea74d6..1c18b1c5d7df376ffc558e8244cea6f95b2ed43a 100644 --- a/doc/report/applications.tex +++ b/doc/report/applications.tex @@ -3,7 +3,7 @@ With programming endpoints in place, we can now query the data set containing both Isabelle and Coq exports stored in {GraphDB}. We experimented with the following applications that talk to a GraphDB -endpoint. +Endpoint, our second contribution. \begin{itemize} \item Exploring which ULO predicates are actually used in the @@ -12,8 +12,8 @@ endpoint. (Section~\ref{sec:expl}). \item Providing an interactive interface for browsing the data - set. Our implementation is limited to listing basic - information about contributors and their work. + set. Our implementation is limited to listing basic information + about contributors and their work (Section~\ref{sec:exp}). \item We investigated queries that could be used to extend the system into a larger tetrapodal search system. While some @@ -38,56 +38,54 @@ is given in verbatim in Figure~\ref{fig:preds-query}. Our query yields a list of all used predicates together with the number of occurrences (Figure~\ref{fig:preds-result}). Looking at the results, we find that both the Isabelle and the Coq data sets only use subsets -of the predicates provided by the ULO ontology. The full results are -listed in Appendix~\ref{sec:used}. In both cases, what stands out is -that both exports use less than a third of all available ULO~predicates. +of the predicates provided by~{ULO}. The full results are listed in +Appendix~\ref{sec:used}. What stands out is that either export uses +less than a third of all available ULO~predicates. We also see that the Isabelle and Coq exports use different predicates. For example, the Isabelle export contains organizational meta information such as information about paragraphs and sections in the source document while the Coq export only tells us the filename of -the original Coq source. That is not particularly problematic as long -as we can trace a given object back to the original source. -Regardless, our results do show that both exports have their own -particularities and with more and more third party libraries exported -to ULO one has to assume that this heterogeneity will only grow. In -particular we want to point to the large number of predicates which -remain unused in both Isabelle and Coq exports. A user formulating -queries for ULO might be oblivious to the fact that only subsets of -exports support given predicates. +the original Coq source. Both exports have their particularities, and +with more and more third party libraries exported to ULO, one has to +assume that this heterogeneity will only grow.
In particular, we want +to point to the large number of predicates which remain unused in both +Isabelle and Coq exports. A user formulating queries for ULO might be +oblivious to the fact that only subsets of exports support given +predicates. \input{applications-preds.tex} We expect the difference between Coq and Isabelle exports to be caused by the difference in source material. It is only natural that different third party libraries expressed in different languages with -different features will result in different ULO~predicates. However, we -want to hint at the fact that this could also be an omission in -the exporter code that originally generated the RDF~triplets we -imported. This shows the importance of writing good exporters. -Exporters taking existing libraries and outputting ULO~triplets must -lose as little information as possible to ensure good results in a -larger tetrapodal search system. +different capabilities will result in different +ULO~predicates. Regardless, we want to hint at the possibility that +this could also be caused by omissions in the exporter code that originally +generated the RDF~triplets. This shows the importance of writing good +exporters. Exporters translating existing libraries to ULO~triplets +must lose as little information as possible to ensure good results in +a larger tetrapodal search system. Our first application gave us an initial impression of the structure of currently available organizational knowledge formulated in ULO triplets. Whether caused by the difference in formal language or -because of omissions in code that produce ULO~triplets, we must not +because of omissions in code that produces ULO~triplets, we must not expect predicates to be evenly distributed in the data set. This is -something to keep in mind especially as the number of ULO exports +something to keep in mind, especially as the number of ULO exports increases. -\subsection{Interactive Exploration} +\subsection{Interactive Exploration}\label{sec:exp} Our second application wants to illustrate how to browse an ULO data set interactively. For this purpose, we developed a web front end that -allows us to browse contributions by author. For this tool, we used -the RDF4J Endpoint, the application itself is implemented in +allows users to browse contributions by author. For this application, +we used the RDF4J Endpoint; the application itself is implemented in Java. Figure~\ref{fig:appss} shows four screenshots of the current version. Once the user has selected a given contribution, we list -some basic information about that object of formal knowledge, such as -type (e.g.\ lemma or corollary) and references to other objects in the -knowledge base. +some basic information about the selected object, such as its type +(e.g.\ lemma or corollary) and name, as well as references to other +objects in the knowledge base. \input{applications-screenshots.tex} @@ -95,24 +93,29 @@ This application is interesting because similar approaches could be applied when designing IDE~integration for working with organizational knowledge. Given some object in a document identified by an URI, we can look at how that object is connected to other objects in the -knowledge graph. We can also connect the dots back to the original -source. In our case, we implemented translation from the -\texttt{ulo:sourceref} predicate as used in the Isabelle -export~\cite{uloisabelle} to the original Isabelle source -(Figure~\ref{fig:isrc}). This is convenient for the user, but required -some extra work for us as application implementors. The reason for -this is that the format of \texttt{ulo:sourceref} is not well-defined, -rather it is up to implementors of library exporters how to format -source references. The consequence is that for each export, developers -of front ends will need to write custom code for finding the original -source. Maybe it would be a reasonable requirement for -\texttt{ulo:sourceref} to have a well-defined format, ideally a -``real'' URI on the open web, if that is applicable for a given +knowledge graph. + +We can also connect the dots back to the original source. In +particular, we took a look at the \texttt{ulo:sourceref} property +defined for many objects in the knowledge base. +\texttt{ulo:sourceref} is ``the URI of the physical location (e.g., +file/URI, line, column) of the source code that introduced the +subject''~\cite{uloonto} and is in use in the Isabelle +export~\cite{uloisabelle}. Making the connection from an ULO~object to +the original Isabelle document is convenient for the user, but required +some extra work from us as application implementors. The reason for +this is that the format of the \texttt{ulo:sourceref} property is not +well-defined; rather, it is up to implementors of library exporters how +to format source references. The consequence is that for each export, +developers of front ends will need to write custom code for finding +the original source. Maybe it would be a reasonable requirement for +\texttt{ulo:sourceref} to have a well-defined format, ideally an +actually reachable URI on the open web, if that is applicable for a given library.
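+The lookup step itself is simple. The following sketch (with a
+hypothetical subject URI standing in for a concrete object, and
+prefix declarations omitted) retrieves the source reference of a
+single object; the export-specific effort lies entirely in
+interpreting the returned value.
+\begin{lstlisting}
+SELECT ?sourceref
+WHERE {
+    <https://example.org/some-object> ulo:sourceref ?sourceref .
+}
+\end{lstlisting}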
-In summary, while translating from \texttt{ulo:sourceref} to oriental -URI introduced some extra work, implementing this application was easy -and straight-forward. Implementing similar features for other environments +While translating from \texttt{ulo:sourceref} to the original URI +introduced some extra work, implementing this application was easy and +straightforward. Implementing similar features for other environments should not be very difficult. \subsection{Querying for Tetrapodal Search}\label{sec:tetraq} @@ -125,38 +128,43 @@ proof of concept implementations and evaluate their applicability. \subsubsection{Elementary Proofs and Computed Scores} -Our first query~$\mathcal{Q}_1$ illustrates how we can compute -arithmetic scores for some nodes in our knowledge graph. +We start with query~$\mathcal{Q}_1$, which illustrates how one can +compute arithmetic scores for objects in our knowledge graph. Query~$\mathcal{Q}_1$ asks us to ``\emph{[f]ind theorems with -non-elementary proofs}.'' Elementary proofs can be understood as -those proof that are considered easy and obvious~\cite{elempro}. In -consequence,~$\mathcal{Q}_1$ has to search for all proofs which are -not trivial. Of course, just like any distinction between ``theorem'' -and ``corollary'' is going to be somewhat arbitrary, so is any -judgment about whether a proof is easy or not. + non-elementary proofs}.'' Elementary proofs can be understood as +those proofs that are considered easy and +obvious~\cite{elempro,elemtest}. In consequence,~$\mathcal{Q}_1$ has +to search for all proofs which are not trivial. Of course, just like +any distinction between ``theorem'' and ``corollary'' is going to be +somewhat arbitrary, so is any judgment about whether a proof is easy +or not.
While we do understand elementary as easy here, we must not +omit that there also exist concrete definitions of ``elementary +proofs'' in certain subsets of mathematics~\cite{elempro, el0, + el1}. As we will see, our naive understanding of elementary proofs +results in interesting insights regardless. Existing research on proof difficulty is either very broad or specific to one problem. For example, some experiments showed that students and -prospective school teachers have problems with notation, term -rewriting and required prerequisites~\cite{proofund, proofteach}, none -of which seems applicable for grading individual formal proofs for -difficulty. On the other hand, there is research on rating proofs for -individual subsets of problems, e.g.\ on the satisfiability of a given -CNF formula. A particular example is focused on heuristics for how -long a SAT solver will take to find a solution for a -given~{CNF}. Here, solutions that take long are considered -harder~\cite{proofsat}. +prospective school teachers had problems with notation and term +rewriting and were missing required prerequisites~\cite{proofund, + proofteach}, none of which seems applicable for grading individual +formal proofs for difficulty. On the other hand, there is very +specific research on rating proofs for some given subsets of problems, +e.g.\ on the satisfiability of a given CNF formula. A particular +example is focused on heuristics for how long a SAT solver will take +to find a solution for a given~{CNF}. Here, solutions that take long +are considered harder~\cite{proofsat}. \noindent\emph{Organizational Aspect.} A first working hypothesis might be to assume that elementary proofs are short. In that case, the size, that is the number of bytes to store the proof, is our first indicator of proof complexity. This is by no means perfect, as even -identical proofs can be represented in different ways that might have -vastly different size in bytes. It might be tempting to imagine a -unique normal form for each proof, but finding such a normal form -might very well be impossible. As it is very difficult to find a -generalized definition of proof difficulty, we will accept proof size -as a first working hypothesis. +identical proofs can be represented in different ways that have vastly +different sizes in bytes. It might be tempting to imagine a unique +normal form for each proof, but finding such a normal form may very +well be impossible. As it is very difficult to find a generalized +definition of proof difficulty, we will accept proof size as a first +working hypothesis. {ULO} offers the \texttt{ulo:external-size} predicate which will allow us to sort by file size. Maybe proof complexity also leads to quick @@ -172,8 +180,8 @@ of measure of formula complexity, different proofs could be rated. Similarly, with narrative knowledge available to us, we could count the number of words, citations and so on to rate the narrative complexity of a proof. Combining symbolic knowledge, narrative -knowledge and organizational knowledge allows us to find proofs which -are probably straight forward. +knowledge and organizational knowledge should allow us to find proofs +which are probably easier than others. \input{applications-q1.tex} @@ -181,7 +189,7 @@ are probably straight forward. organizational aspect can be as simple as querying for all theorems justified by proofs, ordered by size (or check time). Figure~\ref{fig:q1short} illustrates how this can be achieved with a -SPARQL query. Maybe we want to go one step further and calculate a +SPARQL query. But maybe we wish to go one step further and calculate a rating that assigns each proof some numeric score of complexity based on a number of properties. We can achieve this in SPARQL as recent versions support arithmetic as part of the SPARQL specification; @@ -195,7 +203,7 @@ and associated proofs. Naturally, this list is bound to be very long. A suggested way to solve this problem is to introduce some kind of cutoff value for our complexity score. Another potential solution is to only list the first~$n$ results, something a user interface -would have to do either way (pagination~\cite{pagination}). Either +would have to do either way (i.e.\ pagination~\cite{pagination}). Either way, this is not so much an issue for the organizational storage engine and more one that a tetrapodal search aggregator has to account for.
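+To make this discussion more concrete, the following sketch shows one
+possible shape of such a rating computed directly in {SPARQL}. The
+choice of predicates (\texttt{ulo:justifies},
+\texttt{ulo:external-size}, \texttt{ulo:check-time}), the equal
+weighting of both factors and the cutoff after ten results are meant
+as an illustration only, not as a definitive score; prefix
+declarations are omitted.
+\begin{lstlisting}
+SELECT ?theorem ?score
+WHERE {
+    ?proof ulo:justifies ?theorem .
+    ?proof ulo:external-size ?size .
+    ?proof ulo:check-time ?time .
+    # Arbitrary weights; assumes both values are numeric literals.
+    BIND((0.5 * ?size + 0.5 * ?time) AS ?score)
+}
+ORDER BY DESC(?score)
+LIMIT 10
+\end{lstlisting}
+The \texttt{LIMIT} clause corresponds to listing only the first~$n$
+results as discussed above; a user interface could page through
+results by additionally using \texttt{OFFSET}.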
@@ -204,10 +212,10 @@ Another problem is that computing these scores can be quite time intensive. Even if calculating a score for one given object is fast, doing so for the whole data set might quickly turn into a problem. In particular, if we wish to show the $n$~objects with best scores, we do -need to compute scores for all relevant triplets for that score. In +need to compute scores for all relevant objects first. In \emph{ulo-storage}, all scores we experimented with were easy enough and the data sets small enough such that this did not become a -concrete problem. But in a larger tetrapodal search system, caching or +concrete problem. But in a larger tetrapodal search system, caching lazily or ahead of time computed results will probably be a necessity. Which component takes care of keeping this cache is not clear right now, but we advocate for keeping caches of previously @@ -230,15 +238,15 @@ complexity). We need to consider where each of these three components might be stored. \noindent\emph{Symbolic and Concrete Aspects.} First, let us consider -algorithms. Algorithms can be formulated as computer code which -can be understood as symbolic knowledge (code represented as a -syntax tree) or as concrete knowledge (code as text -files)~\cites[pp. 8--9]{tetra}{virt}. Either way, we will not be -able to query these indices for what problem a given algorithm is -solving, nor is it possible to infer properties as complex as -$NP$-completeness automatically. Metadata of this kind needs to be -stored in a separate index for organizational knowledge, it being -the only fit. +algorithms. Algorithms can be formulated as computer code which can be +understood as symbolic knowledge (code represented as a syntax tree) +or as concrete knowledge (code as text +files)~\cites[pp. 8--9]{tetra}{virt}. Either way, we will not be able +to query these indices for what problem a given algorithm is solving, +nor is it possible to infer properties as complex as $NP$-completeness +automatically in the general case~\cite{np}. Metadata of this kind +needs to be stored in a separate index for organizational knowledge, +it being the only fit. \noindent\emph{Organizational Aspect.} If we wish to look up properties about algorithms from organizational knowledge, we first have to @@ -260,16 +268,17 @@ salesman problem or the sorting problem) in terms of theorems (propositions, types) that get implemented by a proof (algorithm, program).
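+If ULO (or a companion ontology) were extended along these lines, a
+lookup of algorithms solving a given problem could be as simple as
+the following sketch. Both \texttt{ulo:algorithmic-problem} and
+\texttt{ulo:implemented-by} are hypothetical names introduced purely
+for illustration; neither exists in the current ontology.
+\begin{lstlisting}
+SELECT ?problem ?algorithm
+WHERE {
+    ?problem a ulo:algorithmic-problem .      # hypothetical class
+    ?problem ulo:implemented-by ?algorithm .  # hypothetical predicate
+}
+\end{lstlisting}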
-As algorithms make up an important part of certain areas of -research, it might be reasonable to introduce native level support -for algorithms in ULO or separately in another ontology. An -argument for adding support directly to ULO is that ULO aims to be -universal and as such should not be without algorithms. An -argument for a separate ontology is that what we understand as ULO -data sets (Isabelle, Coq exports) already contain triplets from -other ontologies (e.g.\ Dublin Core metadata~\cite{dcreport, -dcowl}) and keeping concepts separate is not entirely -unattractive in itself. +As algorithms make up an important part of certain areas of research, +it might be reasonable to introduce native-level support for +algorithms in ULO or separately in another ontology. An argument for +adding support directly to ULO is that ULO aims to be universal and as +such should not be without algorithms. An argument for a separate +ontology is that what we understand as ULO data sets (Isabelle, Coq +exports) already contain predicates from other ontologies +(e.g.\ Dublin Core metadata~\cite{dcreport, dcowl}). In addition, +keeping concepts separated in distinct ontologies is not entirely +unattractive in itself as it can be less overwhelming for a user working +with these ontologies. In summary, we see that it is difficult to service~$\mathcal{Q}_2$ even though the nature of this query is very much one of @@ -278,13 +287,13 @@ future ULO~exports to find a good way of encoding algorithmic problems and solutions. Perhaps a starting point on this endeavor would be to find a formal way of structuring information gathered on sites like Rosetta Code~\cite{rc}, a site that provides concrete programs that -solve algorithms problems. +solve algorithmic problems. \subsubsection{Contributors and Number of References} -Finally, query~$\mathcal{Q}_3$ from literature~\cite{tetra} wants to +Finally, query~$\mathcal{Q}_3$ from the literature wants to know ``\emph{[a]ll areas of math that {Nicolas G.\ de Bruijn} has - worked in and his main contributions.}'' $\mathcal{Q}_3$~is asking + worked in and his main contributions}''~\cite{tetra}. $\mathcal{Q}_3$~is asking for works of a given author~$A$. It also asks for their main contributions, for example which particularly interesting paragraphs or code~$A$ has authored. We picked this particular query as it @@ -292,19 +301,18 @@ is asking for metadata, something that should be easily serviced by organizational knowledge. \noindent\emph{Organizational Aspect.} ULO has no concept of authors, -contributors, dates and so on. Rather, the idea is to take -advantage of the Dublin Core project which provides an ontology -for such metadata~\cite{dcreport, dcowl}. For example, Dublin Core -provides us with the \texttt{dcterms:creator} and -\texttt{dcterms:contributor} predicates. Servicing~$\mathcal{Q}_3$ -requires us to look for creator~$A$ and then list all associated -objects that they have worked on. Of course this -requires above authorship predicates to actually be in use. With -the Isabelle and Coq exports this was hardly the case; running -some experiments we found less than 15 unique contributors and -creators, raising suspicion that metadata is missing in the -original library files. Regardless, in theory ULO allows us to -query for objects ordered by authors. +contributors, dates and so on. Rather, the idea is to take advantage +of the Dublin Core project, which provides an ontology for such +metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides us +with the \texttt{dcterms:creator} and \texttt{dcterms:contributor} +predicates. Servicing~$\mathcal{Q}_3$ requires us to look for +creator~$A$ and then list all associated objects that they have worked +on. Of course, this requires the above authorship predicates to actually +be in use. With the Isabelle and Coq exports this was hardly the case; +running some experiments, we found fewer than 15 unique contributors and +creators, raising suspicion that metadata is missing in the original +library files. Regardless, existing ULO exports allow us to query for +objects ordered by authors.
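+Counting contributors of this kind requires no more than a single
+aggregate query along the lines of the following sketch; it omits
+prefix declarations and assumes nothing beyond the two Dublin Core
+predicates named above.
+\begin{lstlisting}
+SELECT (COUNT(DISTINCT ?person) AS ?people)
+WHERE {
+    { ?work dcterms:creator ?person . }
+    UNION
+    { ?work dcterms:contributor ?person . }
+}
+\end{lstlisting}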
\input{applications-q3.tex} @@ -315,11 +323,12 @@ of~$A$, that is those works that~$A$ authored that are the most important. Sorting the result by number of references might be a good start. To get the main contributions, we rate each individual work by its number of \texttt{ulo:uses} references. Extending the previous -{SPARQL}, we can query the database for a ordered list of works, +{SPARQL} query, we can ask the database for an ordered list of works, starting with the one that has the most -references~(Figure~\ref{fig:q2b}). We can formulate~$\mathcal{Q}_3$ -with just one SPARQL query. Because everything is handled by the -database, access should be about as quick as we can hope it to be. +references~(Figure~\ref{fig:q2b}). We see that one can +formulate~$\mathcal{Q}_3$ with just one SPARQL query. Because +everything is handled by the database, access should be about as quick +as we can hope it to be. While the sparse data set available to us only returned a handful of results, we see that queries like~$\mathcal{Q}_3$ are easily serviced diff --git a/doc/report/references.bib b/doc/report/references.bib index 6d27d8ed8e08464733293f4b7ed76e2fb4a6f6e9..7c4e6b822a28a61e3c0b019bd58454075938e487 100644 --- a/doc/report/references.bib +++ b/doc/report/references.bib @@ -461,4 +461,42 @@ title = {ABOUT W3C}, urldate = {2020-10-05}, url = {https://www.w3.org/Consortium/}, -} \ No newline at end of file +} + +@article{elemtest, + title = {On mathematicians' different standards when evaluating elementary proofs}, + author = {Inglis, Matthew and Mejia-Ramos, Juan Pablo and Weber, Keith and Alcock, Lara}, + journal = {Topics in Cognitive Science}, + volume = {5}, + number = {2}, + pages = {270--282}, + year = {2013}, + publisher = {Wiley Online Library} +} + +@article{el0, + title = {Number theory and elementary arithmetic}, + author = {Avigad, Jeremy}, + journal = {Philosophia Mathematica}, + volume = {11}, + number = {3}, + pages = {257--284}, + year = {2003}, + publisher = {Oxford University Press} +} + +@incollection{el1, + title = {The elementary proof of the prime number theorem: An historical perspective}, + author = {Goldfeld, Dorian}, + booktitle = {Number Theory}, + pages = {179--192}, + year = {2004}, + publisher = {Springer} +} + +@mastersthesis{np, + title = {An attempt to automate {NP}-hardness reductions via {SO}$\exists$ logic}, + author = {Nijjar, Paul}, + year = {2004}, + school = {University of Waterloo} +}