From a282ac4fc8df01754afb32ca691a98e95f4a1f99 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andreas=20Sch=C3=A4rtl?= <andreas@schaertl.me> Date: Tue, 29 Sep 2020 14:17:46 +0200 Subject: [PATCH] report: review applications --- doc/report/applications-preds.tex | 2 +- doc/report/applications-q1.tex | 13 +-- doc/report/applications.tex | 173 +++++++++++++++++------------- doc/report/introduction.tex | 6 +- doc/report/references.bib | 35 ++++++ doc/report/report-bibstyle.tex | 26 +++++ doc/report/report.tex | 6 +- 7 files changed, 175 insertions(+), 86 deletions(-) create mode 100644 doc/report/report-bibstyle.tex diff --git a/doc/report/applications-preds.tex b/doc/report/applications-preds.tex index 5d81638..878bcd1 100644 --- a/doc/report/applications-preds.tex +++ b/doc/report/applications-preds.tex @@ -39,5 +39,5 @@ Coq export~\cite{ulocoqspec}.}\label{fig:preds-result} \end{subfigure} - \caption{Querying for the kinds of predicates used in a data set} + \caption{Querying the predicates in use in a given data set.} \end{figure} diff --git a/doc/report/applications-q1.tex b/doc/report/applications-q1.tex index cb8a634..f7a7310 100644 --- a/doc/report/applications-q1.tex +++ b/doc/report/applications-q1.tex @@ -37,12 +37,13 @@ ORDER BY DESC(?rating) \end{lstlisting} - \caption{SPARQL query that returns each pair of \texttt{theorem} - and one given \texttt{proof} ordered by the sum of size and - check time of the given proof. A naive rating such as this one - will hardly yield helpful results, but it shows that given some - kind of complexity scale, we can rate theorems and their proofs - by such a scale.}\label{fig:q1long} + \caption{SPARQL query that returns each pair of + \texttt{theorem} and one given \texttt{proof} ordered by the + sum of size and check time of the given proof. A naive + rating such as this one will hardly yield helpful results, + but it shows that given some kind of numeric complexity + scale, we can rate theorems and their proofs by such a + scale.}\label{fig:q1long} \end{subfigure} \caption{SPARQL queries for serving the organizational aspect diff --git a/doc/report/applications.tex b/doc/report/applications.tex index f695798..216003c 100644 --- a/doc/report/applications.tex +++ b/doc/report/applications.tex @@ -18,32 +18,32 @@ endpoint. structured (Section~\ref{sec:tetraq}). \end{itemize} -\noindent These applications will now be discussed in dedicated +\noindent These applications will now be discussed in the following sections. \subsection{Exploring Existing Data Sets}\label{sec:expl} -For our first application, we looked at what ULO predicates are +For our first application, we looked at which ULO predicates are actually used by the respective data sets. With more than 250~million triplets in the store, we hoped that this would give us some insight into the kind of knowledge we are dealing with. Implementing a query for this job is not very difficult. In SPARQL, this can be achieved with the \texttt{COUNT} aggregate, the full query -is given in verbatim in Figure~\ref{fig:preds-query}. This yields a -list of all used predicates with \texttt{?count} being the number of +is given in verbatim in Figure~\ref{fig:preds-query}. Our query +yields a list of all used predicates together with the number of occurrences (Figure~\ref{fig:preds-result}). Looking at the results, we find that both the Isabelle and the Coq data sets only use subsets of the predicates provided by the ULO ontology. The full results are listed in Appendix~\ref{sec:used}. 
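+
+As a sketch, such a counting query boils down to a single
+\texttt{GROUP BY} over all triplets in the store; prefix declarations
+are omitted here and the query we actually ran is the one given in
+Figure~\ref{fig:preds-query}.
+
+\begin{lstlisting}
+SELECT ?predicate (COUNT(?predicate) AS ?count)
+WHERE {
+    ?subject ?predicate ?object .
+}
+GROUP BY ?predicate
+ORDER BY DESC(?count)
+\end{lstlisting}
+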
In both cases, what stands out is
-that both exports use less than a third of the available predicates.
+that both exports use less than a third of all available ULO~predicates.
We also see that the Isabelle and Coq exports use different
predicates. For example, the Isabelle export contains organizational
meta information such as information about paragraphs and sections in
-the source document while the Coq export only tells us about the
-filename of the Coq source. That is not particularly problematic as
-long as we can trace a given object back to the original source.
+the source document while the Coq export only tells us the filename of
+the original Coq source. That is not particularly problematic as long
+as we can trace a given object back to the original source.
Regardless, our results do show that both exports have their own
particularities and with more and more third party libraries exported
to ULO one has to assume that this heterogeneity will only grow. In
@@ -52,44 +52,44 @@ remain unused in both Isabelle and Coq exports. A user formulating
queries for ULO might be oblivious to the fact that only subsets of
exports support given predicates.

+\input{applications-preds.tex}
+
We expect the difference between Coq and Isabelle exports to be
caused by the difference in source material. It is only natural that
different third party libraries expressed in different languages with
-different features will result in different ULO~predicates. However we
-want to just hint at the fact that this could also be an omission in
+different features will result in different ULO~predicates. However, we
+want to hint at the fact that this could also be an omission in
the exporter code that originally generated the RDF~triplets we
-imported. This shows the importance of writing good importers,
-exporters take existing libraries and output ULO~triplets should lose
-as little information as possible to ensure good results for a larger
-tetrapodal search system.
-
-While these holes are not a problem for \emph{ulo-storage} per se, we
-do expect this to be a challenge when building a tetrapodal search
-system. Recommended ways around this ``missing fields'' problem in
-database literature include the clever use of default values or
-inference of missing values~\cite{kdisc, aidisc}, neither of which
-feels particularly applicable to vast and heterogeneous sets of
-ULO~triplets.
-
-\input{applications-preds.tex}
+imported. This shows the importance of writing good exporters.
+Exporters taking existing libraries and outputting ULO~triplets must
+lose as little information as possible to ensure good results in a
+larger tetrapodal search system.
+
+Our first application gave us an initial impression of the structure
+of currently available organizational knowledge formulated in ULO
+triplets. Whether caused by the difference in formal language or
+because of omissions in code that produces ULO~triplets, we must not
+expect predicates to be evenly distributed in the data set. This is
+something to keep in mind, especially as the number of ULO exports
+increases.

\subsection{Querying for Tetrapodal Search}\label{sec:tetraq}

Various queries for a tetrapodal search system were previously
-suggested in literature~\cite{tetra}. We investigated how some of them
-can be realized with ULO data sets and where other data sources are
-required. Where possible, we evaluate proof of concept
-implementations.
+suggested in literature~\cite{tetra}. 
We will now investigate how
+three of them could be realized with the help of ULO data sets and
+where other data sources are required. Where possible, we construct
+proof of concept implementations and evaluate their applicability.

-\subsubsection{Elementary Proofs and Computes Scores}
+\subsubsection{Elementary Proofs and Computed Scores}

Our first query~$\mathcal{Q}_1$ illustrates how we can compute
-arithmetic scores for some nodes in our knowledge
-graph. Query~$\mathcal{Q}_1$ asks us to ``[f]ind theorems with
-non-elementary proofs.''. Elementary proofs can be understood as
+arithmetic scores for some nodes in our knowledge graph.
+Query~$\mathcal{Q}_1$ asks us to ``\emph{[f]ind theorems with
+non-elementary proofs}.'' Elementary proofs can be understood as
those proofs that are considered easy and obvious~\cite{elempro}. In
consequence,~$\mathcal{Q}_1$ has to search for all proofs which are
-not trivial. Of course, just like nay distinction between ``theorem''
+not trivial. Of course, just like any distinction between ``theorem''
and ``corollary'' is going to be somewhat arbitrary, so is any
judgment about whether a proof is easy or not.

@@ -97,7 +97,7 @@ Existing research on proof difficulty is either very broad or specific
to one problem. For example, some experiments showed that students
and prospective school teachers have problems with notation, term
rewriting and required prerequisites~\cite{proofund, proofteach}, none
-of which seems applicable for grading individual proofs for
+of which seems applicable for grading individual formal proofs for
difficulty. On the other hand, there is research on rating proofs for
individual subsets of problems, e.g.\ on the satisfiability of a given
CNF formula. A particular example is focused on heuristics for how
@@ -105,7 +105,7 @@ long a SAT solver will take to find a solution for a
given~{CNF}. Here, solutions that take long are considered
harder~\cite{proofsat}.

-\noindent\textbf{Organizational Aspect} A first working hypothesis
+\noindent\emph{Organizational Aspect.} A first working hypothesis
might be to assume that elementary proofs are short. In that case,
the size, that is the number of bytes to store the proof, is our first
indicator of proof complexity. This is by no means perfect, as even
@@ -124,18 +124,18 @@ predicate. Correlating proof complexity with file size allows us to
define one indicator of proof complexity based on organizational
knowledge alone.

-\noindent\textbf{Other Aspects} A tetrapodal search system should
+\noindent\emph{Other Aspects.} A tetrapodal search system should
probably also take symbolic knowledge into account. Based on some
kind of measure of formula complexity, different proofs could be
rated. Similarly, with narrative knowledge available to us, we could
-count the number of words, references and so on to rate the narrative
+count the number of words, citations and so on to rate the narrative
complexity of a proof. Combining symbolic knowledge, narrative
knowledge and organizational knowledge allows us to find proofs which
are probably straightforward.

\input{applications-q1.tex}

-\noindent\textbf{Implementation} Implementing a naive version of the
+\noindent\emph{Implementation.} Implementing a naive version of the
organizational aspect can be as simple as querying for all theorems
justified by proofs, ordered by size (or check time).
Figure~\ref{fig:q1short} illustrates how this can be achieved with a
@@ -149,21 +149,22 @@ on standard arithmetic, it will be possible to formulate in a SPARQL
query. 
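+
+To make the idea of an arithmetic score concrete, the sketch below
+rates each pair of theorem and proof by the sum of proof size and
+check time, listing the most expensive proofs first. Prefix
+declarations are omitted and the predicate names
+(\texttt{ulo:justifies}, \texttt{ulo:external-size},
+\texttt{ulo:check-time}) are illustrative stand-ins; the query we
+actually evaluated is the one given in Figure~\ref{fig:q1long}.
+
+\begin{lstlisting}
+# predicate names below are illustrative placeholders
+SELECT ?theorem ?proof ((?size + ?checktime) AS ?rating)
+WHERE {
+    ?proof ulo:justifies ?theorem .
+    ?proof ulo:external-size ?size .
+    ?proof ulo:check-time ?checktime .
+}
+ORDER BY DESC(?rating)
+\end{lstlisting}
+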
The queries in Figure~\ref{fig:q1full} return a list of all theorems
-and associated proofs, naturally this list is bound to be very long. A
-suggested way to solve this problem is to introduce some kind of
-cutoff value for our complexity score. Another potential solution is
-to only list the first~$n$ results, something a user interface would
-do anyways. Either way, this is not so much an issue for the
-organizational storage engine and more one that a tetrapodal search
-aggregator has to account for.
+and associated proofs. Naturally, this list is bound to be very
+long. A suggested way to solve this problem is to introduce some kind
+of cutoff value for our complexity score. Another potential solution
+is to only list the first~$n$ results, something a user interface
+would have to do either way (pagination~\cite{pagination}). Either
+way, this is not so much an issue for the organizational storage
+engine and more one that a tetrapodal search aggregator has to account
+for.

Another problem is that computing these scores can be quite time
intensive. Even if calculating a score for one given object is fast,
doing so for the whole data set might quickly turn into a problem. In
-particular, if we wish to show the $n$~object with best scores, we do
+particular, if we wish to show the $n$~objects with best scores, we do
need to compute scores for all relevant triplets for that score. In
\emph{ulo-storage}, all scores we experimented with were easy enough
-and the data sets small enough that that this did not become a
+and the data sets small enough such that this did not become a
concrete problem. But in a larger tetrapodal search system, caching
or lazily or ahead of time computed results will probably be a
necessity. Which component takes care of keeping this cache is not
@@ -171,27 +172,33 @@ clear right now, but we advocate for keeping caches of previously
computed scores separate from the core \emph{ulo-storage} Endpoint
such that the Endpoint can be easily updated.

+Understanding query~$\mathcal{Q}_1$ in the way we did makes it
+difficult to present a definite solution. However, while thinking
+about~$\mathcal{Q}_1$ we found out that the infrastructure provided by
+\emph{ulo-storage} allows us to compute arbitrary arithmetic scores,
+something that will surely be useful for many applications.
+
\subsubsection{Categorizing Algorithms and Algorithmic Problems}

The second query~$\mathcal{Q}_2$ we decided to focus on wants to
-``[f]ind algorithms that solve $NP$-complete graph problems.'' Here we
-want the tetrapodal search system to return a listing of algorithms
-that solve (graph) problems with a given property (runtime
+``\emph{[f]ind algorithms that solve $NP$-complete graph problems.}''
+Here we want the tetrapodal search system to return a listing of
+algorithms that solve (graph) problems with a given property (runtime
complexity). We need to consider where each of these three components
might be stored.

-\noindent\textbf{Symbolic and Concrete Aspects} First, let us consider
+\noindent\emph{Symbolic and Concrete Aspects.} First, let us consider
algorithms. Algorithms can be formulated as computer code which
can be understood as symbolic knowledge (code represented as a
syntax tree) or as concrete knowledge (code as text
files)~\cites[pp. 8--9]{tetra}{virt}. Either way, we will not be able
to query these indices for what problem a given algorithm is solving,
nor is it possible to infer properties as complex as
-$NP$-completeness automatically. 
Meta data of this kind needs to be
+$NP$-completeness automatically. Metadata of this kind needs to be
stored in a separate index for organizational knowledge, it being the
only fit.

-\noindent\textbf{Organizational Aspect} If we wish to look up properties
+\noindent\emph{Organizational Aspect.} If we wish to look up properties
about algorithms from organizational knowledge, we first have to
think about how to represent this information. We propose two
approaches, one using the existing ULO ontology and one that
@@ -218,57 +225,74 @@ argument for adding support directly to ULO is that ULO aims to be
universal and as such should not be without algorithms. An argument
for a separate ontology is that what we understand as ULO data
sets (Isabelle, Coq exports) already contain triplets from
-other ontologies (e.g.\ Dublin Core meta data~\cite{dcreport,
+other ontologies (e.g.\ Dublin Core metadata~\cite{dcreport,
dcowl}) and keeping concepts separate is not entirely unattractive
in itself.

+In summary, we see that it is difficult to service~$\mathcal{Q}_2$
+even though the nature of this query is very much one of
+organizational knowledge. It is probably up to the implementors of
+future ULO~exports to find a good way of encoding algorithmic problems
+and solutions. Perhaps a starting point for this endeavor would be to
+find a formal way of structuring information gathered on sites like
+Rosetta Code~\cite{rc}, which provides concrete programs that solve
+algorithmic problems.
+
\subsubsection{Contributors and Number of References}

-The final query~$\mathcal{Q}_3$ from literature~\cite{tetra} we wish
-to look at wants to know ``[a]ll areas of math that {Nicolas G.\ de
-Bruijn} has worked in and his main contributions.'' This query is
-asking by works of a given author~$A$. It also asks for their main
-contributions, e.g.\ what paragraphs or code~$A$ has authored.
+Finally, query~$\mathcal{Q}_3$ from literature~\cite{tetra} wants to
+know ``\emph{[a]ll areas of math that {Nicolas G.\ de Bruijn} has
+ worked in and his main contributions.}'' $\mathcal{Q}_3$~is asking
+for works of a given author~$A$. It also asks for their main
+contributions, for example which particularly interesting paragraphs
+or code~$A$ has authored. We picked this particular query as it
+is asking for metadata, something that should be easily serviced by
+organizational knowledge.

-\noindent\textbf{Organizational Aspect} ULO has no concept of authors,
+\noindent\emph{Organizational Aspect.} ULO has no concept of authors,
contributors, dates and so on. Rather, the idea is to take advantage
of the Dublin Core project which provides an ontology for such
metadata~\cite{dcreport, dcowl}. For example, Dublin Core provides
us with the \texttt{dcterms:creator} and \texttt{dcterms:contributor}
predicates. Servicing~$\mathcal{Q}_3$ requires us to look for
creator~$A$ and then list all associated
-\texttt{dcterms:title}s that they have worked on. Of course this
+objects that they have worked on. Of course this
requires above authorship predicates to actually be in use. With
the Isabelle and Coq exports this was hardly the case; running some
experiments we found less than 15 unique contributors and
-creators, raising suspicion that meta data is missing in the
+creators, raising suspicion that metadata is missing in the
original library files. Regardless, in theory ULO allows us to
query for objects ordered by authors. 
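+
+To illustrate, a single query that looks up all works with a given
+\texttt{dcterms:creator} and orders them by the number of incoming
+\texttt{ulo:uses} references could look roughly as follows. The
+literal name matching is a simplifying assumption and prefix
+declarations are omitted; the queries we actually ran are the ones
+given in Figures~\ref{fig:q2a} and~\ref{fig:q2b}.
+
+\begin{lstlisting}
+SELECT ?work (COUNT(?other) AS ?references)
+WHERE {
+    ?work dcterms:creator ?author .
+    FILTER(CONTAINS(STR(?author), "Bruijn"))  # simplifying assumption
+    OPTIONAL { ?other ulo:uses ?work . }
+}
+GROUP BY ?work
+ORDER BY DESC(?references)
+\end{lstlisting}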

-Query $\mathcal{Q}_3$ is also asking for the main contributions
-of~$A$, that is those works that~$A$ authored that are the most
-important. Importance is a quality measure, simply sorting the
-result by number of references might be a good start.
-
\input{applications-q3.tex}

-\noindent\textbf{Implementation} A search for contributions by a given
+\noindent\emph{Implementation.} A search for contributions by a given
author can easily be formulated in {SPARQL}~(Figure~\ref{fig:q2a}).
-To get the main contributions, we rate each individual work
-by its number of \texttt{ulo:uses} references. Extending the previous
+Query $\mathcal{Q}_3$ is also asking for the main contributions
+of~$A$, that is those works that~$A$ authored that are the most
+important. Sorting the result by number of references might be a good
+start. To get the main contributions, we rate each individual work by
+its number of \texttt{ulo:uses} references. Extending the previous
{SPARQL}, we can query the database for an ordered list of works,
starting with the one that has the most
references~(Figure~\ref{fig:q2b}). We can formulate~$\mathcal{Q}_3$
with just one SPARQL query. Because everything is handled by the
database, access should be about as quick as we can hope it to be.

+While the sparse data set available to us only returned a handful of
+results, we see that queries like~$\mathcal{Q}_3$ are easily serviced
+with organizational knowledge formulated in ULO~triplets. More
+advanced queries could look at the interlinks between authors and even
+uncover ``citation cartels'' as was done previously with similar
+approaches~\cite{citcart}.
+
\subsubsection{Summarizing $\mathcal{Q}_1$ to $\mathcal{Q}_3$}

Experimenting with $\mathcal{Q}_1$ to $\mathcal{Q}_3$ provided us with
some insight into ULO and existing ULO exports. $\mathcal{Q}_1$ shows
that while there is no formal definition for ``elementary proof'', ULO
-allows us to query for heuristics and calculate a difficulty score for
-proofs and their associated theorems. Query~$\mathcal{Q}_2$ illustrates
+allows us to query for heuristics and calculate arbitrary arithmetic scores for
+objects of organizational knowledge. Query~$\mathcal{Q}_2$ illustrates
the difficulty in finding universal schemas. It remains an open question
whether ULO should include algorithms as a first class citizen, as a
concept based around existing ULO predicates or whether it is a better
@@ -276,6 +300,5 @@ idea to design a dedicated ontology and potentially data store entirely.
Finally, while we were able to formulate a SPARQL query that should
take care of most of~$\mathcal{Q}_3$, we found that the existing data
sets contain very little information about authorship. This underlines
-the observations made in Section~\ref{sec:expl}, developers writing
-applications that query ULO storage need to be aware of the fact that
-existing exports have ``holes''.
+the observations made previously in Section~\ref{sec:expl} and should
+be on the mind of anyone writing exporters that output ULO~triplets.
diff --git a/doc/report/introduction.tex b/doc/report/introduction.tex
index d000ffb..e80811f 100644
--- a/doc/report/introduction.tex
+++ b/doc/report/introduction.tex
@@ -36,9 +36,9 @@ is stored as RDF~triplets with a unified schema of
ULO~predicates~\cite{owl}. Some effort has been made to export
existing collections of formal mathematical knowledge to {ULO}. In
particular, exports from Isabelle and Coq-based libraries are
-available~\cite{uloisabelle, ulocoq}. 
The resulting data set is -already quite large, the Isabelle export alone containing more than -200~million triplets. +available~\cite{afpexport, uloisabelle, ulocoq}. The resulting data +set is already quite large, the Isabelle export alone containing more +than 200~million triplets. Existing exports from Isabelle and Coq result in a set of XML~files that contain RDF~triplets. This is a convenient format for exchange diff --git a/doc/report/references.bib b/doc/report/references.bib index 4c05468..c12e02d 100644 --- a/doc/report/references.bib +++ b/doc/report/references.bib @@ -420,3 +420,38 @@ year={2014}, publisher={Springer} } + +@inproceedings{pagination, + title={Comparison of pagination algorithms based-on large data sets}, + author={Cao, Junkuo and Wang, Weihua and Shu, Yuanzhong}, + booktitle={International Symposium on Information and Automation}, + pages={384--389}, + year={2010}, + organization={Springer} +} + +@online{rc, + title = {Rosetta Code}, + urldate = {2020-09-30}, + url = {https://www.rosettacode.org/wiki/Rosetta_Code}, +} + +@misc{afpexport, + title={Making Isabelle Content Accessible in Knowledge Representation Formats}, + author={Michael Kohlhase and Florian Rabe and Makarius Wenzel}, + year={2020}, + eprint={2005.08884}, + archivePrefix={arXiv}, + primaryClass={cs.LO} +} + +@article{citcart, + title={Toward the discovery of citation cartels in citation networks}, + author={Fister Jr, Iztok and Fister, Iztok and Perc, Matja{\v{z}}}, + journal={Frontiers in Physics}, + volume={4}, + pages={49}, + year={2016}, + publisher={Frontiers} +} + diff --git a/doc/report/report-bibstyle.tex b/doc/report/report-bibstyle.tex new file mode 100644 index 0000000..fa8f2db --- /dev/null +++ b/doc/report/report-bibstyle.tex @@ -0,0 +1,26 @@ +% https://tex.stackexchange.com/questions/129170/formatting-dates-day-month-year-with-biblatex + +\DefineBibliographyExtras{american}{% + % d-m-y format for long dates + \protected\def\mkbibdatelong#1#2#3{% + \iffieldundef{#3} + {} + {\stripzeros{\thefield{#3}}% + \iffieldundef{#2}{}{\nobreakspace}}% + \iffieldundef{#2} + {} + {\mkbibmonth{\thefield{#2}}% + \iffieldundef{#1}{}{\space}}% + \iffieldbibstring{#1}{\bibstring{\thefield{#1}}}{\stripzeros{\thefield{#1}}}}% + % d-m-y format for short dates + \protected\def\mkbibdateshort#1#2#3{% + \iffieldundef{#3} + {} + {\mkdatezeros{\thefield{#3}}% + \iffieldundef{#2}{}{/}}% + \iffieldundef{#2} + {} + {\mkdatezeros{\thefield{#2}}% + \iffieldundef{#1}{}{/}}% + \iffieldbibstring{#1}{\bibstring{\thefield{#1}}}{\mkdatezeros{\thefield{#1}}}}% +} diff --git a/doc/report/report.tex b/doc/report/report.tex index d968805..54e57ad 100644 --- a/doc/report/report.tex +++ b/doc/report/report.tex @@ -1,6 +1,7 @@ \documentclass[a4paper]{scrartcl} \usepackage{amsmath} +\usepackage[american]{babel} \usepackage{booktabs} \usepackage{caption} \usepackage{fancybox, graphicx} @@ -24,6 +25,8 @@ sorting=none ]{biblatex} +\input{report-bibstyle.tex} + \setcounter{tocdepth}{2} \addbibresource{references.bib} @@ -35,9 +38,10 @@ \lstset{basicstyle=\ttfamily,breaklines=true} + \title{ulo-storage} \subtitle{Indexing and Querying Organizational Data in Mathematical Libraries} -\author{Andreas Schärtl (\texttt{ru64tiji})} +\author{Andreas Schärtl\\\texttt{andreas.schaertl@fau.de}} \date{\vspace{-1cm}} \begin{document} -- GitLab