\section{Implementation}\label{sec:implementation}
    
    
One of the two contributions of \emph{ulo-storage} is that we
implemented components for making organizational mathematical
knowledge queryable. This section first identifies the individual
components required for this task and then describes some details
of their actual implementation in this project.
    
    
    \subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
    
With RDF files exported and available for download as Git repositories
on MathHub, our goal is to make the underlying data available
for use in applications. Figure~\ref{fig:components} illustrates the
implemented components and their relationships.
    
    
\begin{figure}\begin{center}
        \includegraphics{figs/components}
        \caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
    \end{center}\end{figure}
    
    \begin{itemize}
\item ULO triplets are present in various locations, be it in Git
  repositories, on web servers or on the local disk.  It is the job of a
  \emph{Collecter} to assemble these {RDF}~files and forward them for further
  processing. This may involve cloning a Git repository or crawling
  the file system.
    
      \item With streams of ULO files assembled by the Collecter, this
      data then gets passed to an \emph{Importer}. An Importer uploads
      RDF~streams into some kind of permanent storage. For
      use in this project, the GraphDB~\cite{graphdb} triplet store was
      a natural fit.
    
\item Finally, with all triplets stored in a database, an
  \emph{Endpoint} is where applications access the underlying
  knowledge base. This does not necessarily need to be any custom
  software; rather, the programming interface of the underlying
  database itself can be understood as an endpoint of its own.
    
  Regardless, some thought can be put into designing an Endpoint as a
  layer between application and database that is more convenient to
  use than the interface the database itself provides. It comes down
  to the programming interface we wish to offer to developers using
  this system.
    \end{itemize}
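
To make the division of responsibilities concrete, the following
sketch expresses the three components as Java interfaces. The
interface and method names are illustrative assumptions and need not
match the actual \emph{ulo-storage} code.
\begin{lstlisting}
// Illustrative sketch only; all names are hypothetical.
import java.io.InputStream;

interface Collecter {
    // Assemble RDF files from a source (Git repository, file
    // system, ...) and hand them on, one stream per file.
    Iterable<InputStream> collect(String source) throws Exception;
}

interface Importer {
    // Upload a single RDF stream into persistent storage.
    void importStream(InputStream rdf) throws Exception;
}

interface Endpoint {
    // Evaluate a query against the imported knowledge base.
    QueryResult query(String sparql) throws Exception;
}

// Placeholder for whatever result type the Endpoint returns.
interface QueryResult {}
\end{lstlisting}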
    
Collecter, Importer and Endpoint provide us with an easy and automated
way of making RDF files ready for use with applications. We will now
take a look at the actual implementation created for
\emph{ulo-storage}.
    
    
    \subsection{Collecter and Importer}\label{sec:collecter}
    
We previously described Collecter and Importer as two distinct
components.  The Collecter pulls RDF data from various sources and
outputs a stream of standardized RDF data, while the Importer takes
such a stream and dumps it to some sort of persistent storage.
However, in the implementation for \emph{ulo-storage}, both Collecter
and Importer ended up being one piece of monolithic software. This
does not need to be the case but simply proved convenient.
    
Our implementation supports two sources of RDF files, namely Git
repositories and the local file system. The file system Collecter
simply crawls a given directory on the local machine and looks for
RDF/XML~files~\cite{rdfxml}, while the Git Collecter first clones a Git
repository and then passes the checked-out working copy to the file
system Collecter. Because it is not uncommon for RDF files to be
compressed, our Collecter supports on-the-fly extraction of the
Gzip~\cite{gzip} and XZ~\cite{xz} formats, which can greatly reduce the
required disk space in the collection step.
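
As an illustration of the file system Collecter, the following sketch
crawls a directory for RDF/XML files and transparently decompresses
Gzip archives. File extensions, the handling of XZ (which would
require an additional library such as XZ for Java) and error handling
are simplified assumptions, not a description of the actual code.
\begin{lstlisting}
// Simplified sketch of a file system Collecter; paths, extensions
// and error handling are assumptions for illustration.
import java.io.InputStream;
import java.nio.file.*;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;

class FileSystemCollecter {
    // Walk the directory tree and process every RDF/XML file,
    // decompressing .gz files on the fly.
    static void collect(Path root) throws Exception {
        try (Stream<Path> files = Files.walk(root)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                String name = p.getFileName().toString();
                if (name.endsWith(".rdf")) {
                    process(Files.newInputStream(p));
                } else if (name.endsWith(".rdf.gz")) {
                    process(new GZIPInputStream(Files.newInputStream(p)));
                }
                // .xz files would be handled analogously with an
                // XZInputStream from the XZ for Java library.
            }
        }
    }

    static void process(InputStream rdf) {
        // Forward the stream to the Importer (omitted here).
    }
}
\end{lstlisting}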
    
During development of the Collecter, we found that existing exports
from third party mathematical libraries contain RDF syntax errors which
had not been discovered previously. In particular, both the Isabelle
and the Coq exports contained URIs which do not conform to the official
specification~\cite{rfc3986}. Previous work that processed Coq and
Isabelle exports used database software such as Virtuoso Open
Source~\cite{ulo} which does not strictly check URIs against the
specification; in consequence, these faults were only discovered now. To tackle
these problems, we introduced on-the-fly correction steps during
collection that take the broken RDF files, fix the mentioned problems
related to URIs (by escaping illegal characters) and then
continue processing. Of course this is only a work-around; related
bugs were filed in the respective export projects to ensure that this
extra step will not be necessary in the future.
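
A minimal sketch of such a correction step is shown below; it
percent-encodes characters that RFC~3986 does not permit in a URI. The
rule set and the helper name are simplifications and need not match
the actual correction code.
\begin{lstlisting}
// Sketch of a URI correction step; the rule set is simplified and
// the helper name is hypothetical.
import java.nio.charset.StandardCharsets;

class UriFixer {
    // Percent-encode every character that RFC 3986 does not allow
    // in a URI, keeping reserved, unreserved and already escaped
    // characters intact.
    static String fixUri(String raw) {
        StringBuilder fixed = new StringBuilder();
        for (char c : raw.toCharArray()) {
            boolean allowed = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || "-._~:/?#[]@!$&'()*+,;=%".indexOf(c) >= 0;
            if (allowed) {
                fixed.append(c);
            } else {
                for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
                    fixed.append(String.format("%%%02X", b));
                }
            }
        }
        return fixed.toString();
    }
}
\end{lstlisting}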
    
Our Collecter takes existing RDF files and applies some on-the-fly
transformations (extraction of compressed files, fixing of errors);
the result is a stream of RDF data. This stream gets passed to the
Importer, which imports the encoded RDF triplets into some kind of
persistent storage. The canonical choice for this task is a
triple store, that is, a database optimized for storing RDF
triplets~\cite{triponto, tripw3c}. For our project, we used the
GraphDB~\cite{graphdb} triple store as it is easy to use and a free
version that fits our needs is available~\cite{graphdbfree}. The
import itself is straightforward: our software only needs to upload
the RDF file stream as-is to an HTTP endpoint provided by our GraphDB
instance.
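
For illustration, the following sketch uploads one RDF/XML file to a
GraphDB repository over HTTP. The instance URL
\texttt{http://localhost:7200} and the repository name \texttt{ulo}
are placeholders, and the request layout assumes the RDF4J-style REST
endpoint exposed by GraphDB; details such as authentication are
omitted.
\begin{lstlisting}
// Sketch of an Importer upload; server URL and repository name are
// placeholders and the RDF4J-style REST endpoint is assumed.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

class Importer {
    static void upload(Path rdfXmlFile) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:7200/repositories/ulo/statements"))
                .header("Content-Type", "application/rdf+xml")
                .POST(HttpRequest.BodyPublishers.ofFile(rdfXmlFile))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new RuntimeException("import failed: " + response.statusCode());
        }
    }
}
\end{lstlisting}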
    
    \subsubsection{Scheduling and Version Management}
    
Collecter and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
command line interface and a graphical web front end. While the
command line interface is only useful for manually starting single
jobs, the web interface allows scheduling of jobs. In particular, it
allows the user to automate import jobs. For example, it is possible
to schedule an import of a given Git repository every seven days to a
given GraphDB instance.
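
Such scheduling can be realized with standard library means; the
following sketch uses a \texttt{ScheduledExecutorService} to run an
import of a Git repository every seven days. The repository URL and
the \texttt{runImport} helper are illustrative assumptions, not the
actual front end code.
\begin{lstlisting}
// Sketch of periodic job scheduling; repository URL and runImport
// are hypothetical.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class Scheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Run the first import immediately, then repeat every 7 days.
        scheduler.scheduleAtFixedRate(
                () -> runImport("https://gl.mathhub.info/some/repo.git"),
                0, 7, TimeUnit.DAYS);
    }

    static void runImport(String gitUrl) {
        // Clone the repository, collect its RDF files and upload
        // them to the configured GraphDB instance (omitted here).
    }
}
\end{lstlisting}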
    
Automated job control alone, however, does not solve the problem of
versioning. ULO exports~$\mathcal{E}$ depend on an original third
party library~$\mathcal{L}$. Running~$\mathcal{E}$ through the
workflow of Collecter and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
    \begin{align*}
      \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D}
    \end{align*}
which means that if records in~$\mathcal{L}$ change, this will
probably result in different triplets~$\mathcal{E}$, which in turn
results in a need to update~$\mathcal{D}$. This is difficult.  As it
stands, \emph{ulo-storage} only knows about what is in~$\mathcal{E}$.
While it should be possible to compute the difference between a new
version of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and
derive the changes that need to be applied to~$\mathcal{D}$, the large
number of triplets makes this appear infeasible. So far, our only
suggestion for handling changing third party libraries is
to regularly re-create the full data set~$\mathcal{D}$ from scratch,
say every seven days. This circumvents all problems related to
updating existing data sets, but it does mean additional computational
requirements. For the currently existing exports from Coq and Isabelle
this is not a problem, as even on weak laptop hardware the imports take
less than an hour. But if the number of triplets grows by orders of
magnitude, this approach will eventually no longer scale.
    
    
    \subsection{Endpoints}\label{sec:endpoints}
    
With ULO triplets imported into the GraphDB triplet store by Collecter
and Importer, we now have all data necessary for querying available.
As discussed before, querying from applications happens through an
Endpoint that exposes some kind of {API}. The interesting question
here is probably not so much the implementation of the Endpoint itself,
but rather the choice of API, which can make or break such a project.
    
There are multiple approaches to querying the GraphDB triplet store,
one based around the standardized SPARQL query language and the other
on the RDF4J Java library. Both approaches have unique advantages.
    
    \begin{itemize}
      \item SPARQL is a standardized query language for RDF triplet
      data~\cite{sparql}. The specification includes not just syntax
      and semantics of the language itself, but also a standardized
      REST interface for querying database servers.
    
          \textbf{Syntax} SPARQL is inspired by SQL and as such the
          \texttt{SELECT} \texttt{WHERE} syntax should be familiar to many
          software developers.  A simple query that returns all triplets
          in the store looks like
          \begin{lstlisting}
          SELECT * WHERE { ?s ?p ?o }
          \end{lstlisting}
    
      where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
      variables. The result of any query is a set of valid
      substitutions for the query variables. In this particular case,
      the database would return a table of all triplets in the store,
      with one column each for subject~\texttt{?s},
      predicate~\texttt{?p} and object~\texttt{?o}.
    
          \textbf{Advantage} Probably the biggest advantage is that
          SPARQL is ubiquitous. As it is the de facto standard for
          querying triplet stores, lots of literature and documentation is
          available~\cite{sparqlbook, sparqlimpls, gosparql}.
    
      \item RDF4J is a Java API for interacting with triplet stores,
      implemented based on a superset of the {SPARQL} REST
      interface~\cite{rdf4j}.  GraphDB is one of the database
      servers that support RDF4J; in fact, it is the recommended way
      of interacting with GraphDB repositories~\cite{graphdbapi}.
    
      \textbf{Syntax} Instead of formulating textual queries, RDF4J
      allows developers to query a repository by calling Java API
      methods. The previous query that requests all triplets in the
      store looks like
          \begin{lstlisting}
          connection.getStatements(null, null, null);
          \end{lstlisting}
      in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets
      that have a matching subject~\texttt{s}, predicate~\texttt{p} and
      object~\texttt{o}. Any argument that is \texttt{null} can be
      replaced by any value, i.e.\ it is a query variable to be
      filled by the call to \texttt{getStatements}.
    
      \textbf{Advantage} Using RDF4J does introduce a dependency on
      the JVM and its languages. But in practice, we found RDF4J to be
      quite convenient, especially for simple queries, as it allows us
      to formulate everything in a single programming language rather
      than mixing a programming language with awkward query strings.

      We also found it quite helpful to generate Java classes from
      OWL ontologies; such classes contain all definitions of the
      ontology and make them browsable from any IDE~\cite{rdf4jgen}.
    
    \end{itemize}
    
    
We see that both SPARQL and RDF4J have unique advantages. While SPARQL
is an official W3C standard and implemented by more database systems,
RDF4J can be more convenient when dealing with JVM-based code bases.
For \emph{ulo-storage}, we experimented with both interfaces and
chose whichever seemed more convenient at the moment. We recommend
that implementors do the same.
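
As a small end-to-end illustration of such an Endpoint, the following
sketch connects to a GraphDB repository through RDF4J and evaluates a
SPARQL query. The server URL and the repository name \texttt{ulo} are
placeholders.
\begin{lstlisting}
// Sketch of an RDF4J-based Endpoint; server URL and repository
// name are placeholders.
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

class UloEndpoint {
    public static void main(String[] args) {
        HTTPRepository repository =
                new HTTPRepository("http://localhost:7200/repositories/ulo");
        try (RepositoryConnection connection = repository.getConnection()) {
            // Evaluate a SPARQL query and print each solution.
            TupleQueryResult result = connection
                    .prepareTupleQuery("SELECT * WHERE { ?s ?p ?o } LIMIT 10")
                    .evaluate();
            while (result.hasNext()) {
                System.out.println(result.next());
            }
        }
    }
}
\end{lstlisting}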
    
    
    \subsection{Deployment}
    
    \emph{here be dragons}