\section{Implementation}\label{sec:implementation}
    
    
One of the two contributions of \emph{ulo-storage} is a set of
components for making organizational mathematical knowledge
(formulated as RDF~triplets) queryable. This section first lays out
the individual components required for this task and then describes
details of the actual implementation for this project.
    
    \subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
    
With RDF files exported and available for download as Git
repositories~\cite{uloisabelle, ulocoq}, our goal is to make the
underlying data available for use in applications. Let us first look
at a high-level overview of all involved
components. Figure~\ref{fig:components} illustrates each component and
the flow of data.
    
    
    \begin{figure}[]\begin{center}
    
        \includegraphics[width=0.9\textwidth]{figs/components.png}
    
        \caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
    \end{center}\end{figure}
    
\begin{itemize}
\item ULO triplets are present in various locations, be it Git
  repositories, web servers or the local disk. It is the job of a
  \emph{Collector} to assemble these {RDF}~files and forward them for
  further processing. This may involve cloning a Git repository or
  crawling the file system.

\item With streams of ULO files assembled by the Collector, the data
  then gets passed to an \emph{Importer}. An Importer uploads
  RDF~streams into some kind of permanent storage. As we will see, the
  GraphDB~\cite{graphdb} triple store was a natural fit.

\item Finally, with all triplets stored in a database, an
  \emph{Endpoint} is where applications access the underlying
  knowledge base. This does not necessarily have to be custom
  software; the programming interface of the underlying database
  itself can be understood as an Endpoint of its own.
\end{itemize}
    
    
Collector, Importer and Endpoint provide us with an easy and automated
way of making RDF files available for use within applications. We will
now take a look at the actual implementation created for
\emph{ulo-storage}, beginning with the combined implementation of
Collector and Importer.
    
    \subsection{Collector and Importer}\label{sec:collector}
    
We previously described Collector and Importer as two distinct
components. The Collector pulls RDF data from various sources and
outputs a stream of standardized RDF data. The Importer takes such a
stream and writes it to some sort of persistent storage. In the
implementation for \emph{ulo-storage}, both Collector and Importer
ended up being one piece of monolithic software. This does not need to
be the case but proved convenient because (1)~combining Collector and
Importer forgoes the need for an additional IPC~mechanism and
(2)~neither Collector nor Importer is a terribly large piece of
software in itself.
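
Because Collector and Importer share one process, the interface
between them can be as simple as handing input streams from one
component to the other. The following sketch illustrates this split;
it is written in Java purely for illustration and all names are
ours, not taken from the actual code base.
\begin{lstlisting}
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Consumer;

// A Collector produces streams of RDF data from some source ...
interface Collector {
    void collect(Consumer<InputStream> sink) throws IOException;
}

// ... which an Importer writes to persistent storage.
interface Importer {
    void importRdf(InputStream rdf) throws IOException;
}
\end{lstlisting}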
    
    
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
repository and then passes the checked out working copy to the file
system Collector. Because it is not uncommon for RDF files to be
compressed, our Collector supports on the fly extraction of the
gzip~\cite{gzip} and xz~\cite{xz} formats, which can greatly reduce
the required disk space in the collection step.
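
A minimal sketch of such a file system Collector is shown below. It is
written in Java for illustration only; the actual implementation may
differ, the file name extensions are assumptions, and xz decompression
relies on the separate ``XZ for Java'' library
(\texttt{org.tukaani.xz}).
\begin{lstlisting}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;
import org.tukaani.xz.XZInputStream; // assumes the "XZ for Java" library

public class FileSystemCollector {
    // Crawl root for RDF/XML files (plain or compressed) and hand an
    // uncompressed stream of each file to the given sink.
    public static void collect(Path root, Consumer<InputStream> sink)
            throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                String name = file.getFileName().toString();
                if (name.endsWith(".rdf") || name.endsWith(".rdf.gz")
                        || name.endsWith(".rdf.xz")) {
                    try (InputStream in = open(file, name)) {
                        sink.accept(in);
                    }
                }
            }
        }
    }

    // Transparently decompress gzip and xz archives.
    private static InputStream open(Path file, String name) throws IOException {
        InputStream in = Files.newInputStream(file);
        if (name.endsWith(".gz")) {
            return new GZIPInputStream(in);
        } else if (name.endsWith(".xz")) {
            return new XZInputStream(in);
        }
        return in;
    }
}
\end{lstlisting}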
    
    
During development of the Collector, we found that existing exports
from third party mathematical libraries contain RDF syntax errors
which were not discovered previously. In particular, both the Isabelle
and Coq exports contained URIs which do not fit the official syntax
specification~\cite{rfc3986} as they contained illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open Source which does
not strictly validate URIs against the specification; in consequence,
these faults were only discovered now. To tackle these problems, we
introduced on the fly correction steps during collection that escape
the URIs in question and then continue processing. Of course this is
only a work-around. Related bug reports were filed in the respective
export projects to ensure that this extra step will not be necessary
in the future.
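
One way to realize such an escaping step is to percent-encode every
character that RFC~3986 does not permit in a URI. The following is a
minimal sketch of this idea; the class and method names are ours, and
the exact corrections applied by the actual Collector may differ.
\begin{lstlisting}
import java.nio.charset.StandardCharsets;

public class UriEscaper {
    // Characters permitted by the "unreserved" and "reserved"
    // productions of RFC 3986; '%' is kept to avoid double-encoding
    // already escaped sequences.
    private static final String ALLOWED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        + "-._~:/?#[]@!$&'()*+,;=%";

    // Percent-encode every byte whose character is not in ALLOWED.
    public static String escape(String uri) {
        StringBuilder out = new StringBuilder();
        for (byte b : uri.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xff);
            if (ALLOWED.indexOf(c) >= 0) {
                out.append(c);
            } else {
                out.append(String.format("%%%02X", b & 0xff));
            }
        }
        return out.toString();
    }
}
\end{lstlisting}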
    
The output of the Collector is a stream of RDF data.  This stream gets
passed to the Importer which imports the encoded RDF triplets into
some kind of persistent storage. The canonical choice for this task is
to use a triple store, that is, a database optimized for storing RDF
triplets~\cite{triponto, tripw3c}. For our project, we used the
GraphDB~\cite{graphdb} triple store. A free version that fits our
needs is available at~\cite{graphdbfree}.  The import itself is
straightforward: our software only needs to upload the RDF file
stream as-is to an HTTP endpoint provided by our GraphDB instance.
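
To give an impression of how little code this upload step requires,
the sketch below sends one RDF/XML file over HTTP. The host name
\texttt{localhost:7200} and the repository name \texttt{ulo} are
assumptions about the local setup; the
\texttt{/repositories/ulo/statements} path follows the RDF4J REST
interface that GraphDB implements.
\begin{lstlisting}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class GraphDbImporter {
    // Upload a single RDF/XML file to the assumed "ulo" repository of
    // a GraphDB instance running on localhost.
    public static void importFile(Path rdfFile) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:7200/repositories/ulo/statements"))
            .header("Content-Type", "application/rdf+xml")
            .POST(HttpRequest.BodyPublishers.ofFile(rdfFile))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException("import failed: " + response.body());
        }
    }
}
\end{lstlisting}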
    
\emph{({TODO}: Write down a small comparison of different database
  types, triple stores and implementations. Honestly the main
  advantage of GraphDB is that it's easy to set up and import to;
  maybe I'll also write an Importer for another DB to show that the
  choice of database is not that important.)}
    
    
    \subsection{Scheduling and Version Management}
    
Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
command line interface and a graphical web front end. While the
command line interface is only useful for manually starting single
jobs, the web interface allows scheduling of jobs. In particular, it
allows the user to automate import jobs. For example, it is possible
to schedule an import of a given Git repository into a given GraphDB
instance every seven days.
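
Internally, such recurring jobs boil down to a simple scheduling
primitive. The following sketch illustrates the idea with the Java
standard library; the body of the job merely stands in for the
Collector and Importer library calls described above.
\begin{lstlisting}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ImportScheduler {
    public static void main(String[] args) {
        // The job itself would clone the repository, collect RDF files
        // and upload them to GraphDB using the library code.
        Runnable importJob = () -> System.out.println("running import job");

        // Run the job now and then again every seven days.
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(importJob, 0, 7, TimeUnit.DAYS);
    }
}
\end{lstlisting}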
    
    
Automated job control that regularly imports data from the same
sources leads us to the problem of versioning.  In our current design,
multiple ULO exports~$\mathcal{E}_i$ depend on original third party
libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the
workflow of Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
  \mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\
  \mathcal{L}_2 \rightarrow \; &\mathcal{E}_2 \rightarrow \mathcal{D} \\
  &\vdots{}                                                        \\
  \mathcal{L}_n \rightarrow \; &\mathcal{E}_n \rightarrow \mathcal{D}
\end{align*}
from $n$~individual libraries~$\mathcal{L}_i$ into a single
database storage~$\mathcal{D}$ that is used for querying.
    
However, mathematical knowledge is not static. When a given
library~$\mathcal{L}^{t}_i$ at revision~$t$ gets updated to a new
version~$\mathcal{L}^{t+1}_i$, this change will eventually propagate
to the associated export and result in a new set of RDF
triplets~$\mathcal{E}^{t+1}_i$. Our global database
state~$\mathcal{D}$ needs to be updated to match the changes
between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$.  Finding an
efficient implementation for this problem is not trivial.  While it
should be possible to compute the difference between two
exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and from it the
changes that need to be applied to~$\mathcal{D}$, the large number of
triplets makes this appear infeasible.  As this is a problem an
implementer of a greater tetrapodal search system will encounter, we
suggest two possible approaches to solving it.
    
One approach is to annotate each triplet in~$\mathcal{D}$ with
versioning information about which particular
export~$\mathcal{E}^{t}_i$ it was derived from.  During an import of a
new version~$\mathcal{E}^{t+1}_i$ into~$\mathcal{D}$, we could
(1)~first remove all triplets in~$\mathcal{D}$ that were derived from
the previous version~$\mathcal{E}^{t}_i$ and (2)~then import all
triplets of the new version~$\mathcal{E}^{t+1}_i$. Annotating triplets
with versioning information is an approach that should work, but it
does introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$,
where $n$~is the number of triplets in~$\mathcal{D}$. This effectively
doubles the required database storage space, which is not a very
satisfying solution.
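
One way to realize this kind of version bookkeeping in practice is to
place the triplets of each export version into a named graph (a
context, in RDF4J terms) rather than attaching explicit annotation
triplets. The sketch below shows the remove-and-re-import cycle with
the RDF4J API; the context IRIs and the method name are illustrative.
\begin{lstlisting}
import java.io.IOException;
import java.io.InputStream;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.RDFParseException;

public class VersionedImport {
    // Replace the triplets derived from revision t of a library with
    // those derived from revision t+1. Each revision lives in its own
    // named graph (context).
    public static void update(RepositoryConnection connection, InputStream export,
                              String library, long t)
            throws IOException, RDFParseException {
        ValueFactory vf = connection.getValueFactory();
        IRI previous = vf.createIRI("https://example.org/exports/" + library + "/" + t);
        IRI current = vf.createIRI("https://example.org/exports/" + library + "/" + (t + 1));

        connection.begin();
        connection.clear(previous);                             // (1) drop old triplets
        connection.add(export, "", RDFFormat.RDFXML, current);  // (2) import new version
        connection.commit();
    }
}
\end{lstlisting}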
    
    Another approach is to regularly re-create the full data
    set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
    all problems related to updating existing data sets, but it does have
    additional computation requirements. It also means that changes in a
given library~$\mathcal{L}_i$ take some time to propagate to~$\mathcal{D}$.
    Building on top of this idea, an advanced version of this approach
    could forgo the requirement of only one single database
    storage~$\mathcal{D}$. Instead of only maintaining one global database
    state~$\mathcal{D}$, we suggest the use of dedicated database
    instances~$\mathcal{D}_i$ for each given library~$\mathcal{L}_i$.  The
    advantage here is that re-creating a given database
    representation~$\mathcal{D}_i$ is fast as exports~$\mathcal{E}_i$ are
    comparably small. The disadvantage is that we still want to query the
    whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup
    \cdots \cup \mathcal{D}_n$. This requires the development of some
    cross-repository query mechanism, something GraphDB currently only
    offers limited support for~\cite{graphdbnested}.
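
Until better cross-repository support is available, one pragmatic
work-around is to run the same query against each per-library
repository~$\mathcal{D}_i$ and combine the partial results on the
application side. The following sketch uses the RDF4J
\texttt{HTTPRepository} class; the repository URLs are placeholders.
\begin{lstlisting}
import java.util.List;
import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

public class CrossRepositoryQuery {
    public static void main(String[] args) throws Exception {
        // One GraphDB repository per library; the URLs are placeholders.
        List<String> repositories = List.of(
            "http://localhost:7200/repositories/ulo-isabelle",
            "http://localhost:7200/repositories/ulo-coq");
        String query = "SELECT * WHERE { ?s ?p ?o }";

        for (String url : repositories) {
            Repository repository = new HTTPRepository(url);
            try (RepositoryConnection connection = repository.getConnection();
                 TupleQueryResult result = connection
                     .prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate()) {
                // Combine the partial results; here we simply print them.
                while (result.hasNext()) {
                    System.out.println(result.next());
                }
            }
            repository.shutDown();
        }
    }
}
\end{lstlisting}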
    
    
    \subsection{Endpoints}\label{sec:endpoints}
    
With ULO triplets imported into the GraphDB triple store by Collector
and Importer, we now have all data necessary for querying available.
As discussed before, querying from applications happens through an
Endpoint that exposes some kind of {API}. The interesting question
here is probably not so much the implementation of the Endpoint
itself, rather it is the choice of API that can make or break such a
project.

There are multiple approaches to querying the GraphDB triple store,
one based around the standardized SPARQL query language and the other
on the RDF4J Java library. Both approaches have unique advantages.
    
\begin{itemize}
      \item SPARQL is a standardized query language for RDF triplet
      data~\cite{sparql}. The specification includes not just syntax
      and semantics of the language itself, but also a standardized
      REST interface~\cite{rest} for querying database servers.

      \textbf{Syntax} SPARQL is inspired by SQL and as such the
      \texttt{SELECT} \texttt{WHERE} syntax should be familiar to many
      software developers.  A simple query that returns all triplets
      in the store looks like
      \begin{lstlisting}
      SELECT * WHERE { ?s ?p ?o }
      \end{lstlisting}
      where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
      variables. The result of a query is a set of valid substitutions
      for the query variables. In this particular case, the database
      would return a table of all triplets in the store, with one
      column each for subject~\texttt{?s}, predicate~\texttt{?p} and
      object~\texttt{?o}.
    
      \textbf{Advantage} Probably the biggest advantage is that SPARQL
      is ubiquitous. As it is the de facto standard for querying
      triple stores, lots of implementations and documentation are
      available~\cite{sparqlbook, sparqlimpls, gosparql}.
    
      \item RDF4J is a Java API for interacting with triple stores,
      implemented based on a superset of the {SPARQL} REST
      interface~\cite{rdf4j}.  GraphDB is one of the database
      servers that supports RDF4J; in fact, it is the recommended way
      of interacting with GraphDB repositories~\cite{graphdbapi}.
    
    
      \textbf{Syntax} Instead of formulating textual queries, RDF4J
      allows developers to query a repository by calling Java API
      methods. The previous query that requests all triplets in the
      store looks like
      \begin{lstlisting}
      connection.getStatements(null, null, null);
      \end{lstlisting}
      in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets
      that have matching subject~\texttt{s}, predicate~\texttt{p} and
      object~\texttt{o}. Any argument that is \texttt{null} matches
      any value, i.e.\ it is a query variable to be filled by the
      call to \texttt{getStatements}.
    
    
      \textbf{Advantage} Using RDF4J does introduce a dependency on
      the JVM and its languages. But in practice, we found RDF4J to be
      quite convenient, especially for simple queries, as it allows us
      to formulate everything in a single programming language rather
      than mixing a programming language with awkward query strings.
      We also found it quite helpful to generate Java classes from
      OWL~ontologies that contain all definitions of the
      ontology~\cite{rdf4jgen}.  This provides us with powerful IDE
      auto-completion features during development of ULO applications.
    
    \end{itemize}
    
    
We see that both SPARQL and RDF4J have unique advantages. While SPARQL
is an official W3C standard and implemented by more database systems,
RDF4J can be more convenient when dealing with JVM-based code bases.
For \emph{ulo-storage}, we experimented with both interfaces and chose
whichever seemed more convenient at the moment. We recommend that
implementers of similar systems do the same.
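
To illustrate the SPARQL route without any client library, the sketch
below submits a query to the repository's HTTP endpoint directly. The
host name and the repository name \texttt{ulo} are again assumptions
about the local GraphDB setup.
\begin{lstlisting}
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SparqlQuery {
    public static void main(String[] args) throws Exception {
        // Query the assumed "ulo" repository over its SPARQL HTTP endpoint.
        String query = URLEncoder.encode(
            "SELECT * WHERE { ?s ?p ?o } LIMIT 10", StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:7200/repositories/ulo?query=" + query))
            .header("Accept", "application/sparql-results+json")
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON-encoded result table
    }
}
\end{lstlisting}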
    
    \subsection{Deployment and Availability}
    
    \def\gorepo{https://gitlab.cs.fau.de/kissen/ulo-storage-collect}
    \def\composerepo{https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose}
    
Software not only needs to be developed, but also deployed. To deploy
the combination of Collector, Importer and Endpoint, we use Docker
Compose. Docker itself is a technology for wrapping software into
containers, that is, lightweight virtual machines with a fixed
environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of such a setup is
stored in a Docker Compose file that describes the tech stack.
    
    For \emph{ulo-storage}, we provide a single Docker Compose file which
    starts three containers, namely (1)~the Collector/Importer web
    interface, (2)~a database server for that web interface such that it
    can persist import jobs and finally (3)~a GraphDB instance which
    provides us with the required Endpoint. All code for Collector and
    Importer is available in the \texttt{ulo-storage-collect} Git
repository~\cite{gorepo}.  Additional deployment files, that is, the
Docker Compose configuration and associated Dockerfiles, are stored in
a separate repository~\cite{dockerfilerepo}.
    
    
This concludes our discussion of the implementation developed for the
\emph{ulo-storage} project. We designed a system based around (1)~a
Collector which collects RDF triplets from third party sources, (2)~an
Importer which imports these triplets into a GraphDB database and
(3)~an Endpoint through which applications query that database, and we
looked at different ways of doing so. All of this is easy to deploy
using a single Docker Compose file. With this stack ready for use, we
will continue with a look at some interesting applications and queries
built on top of this interface.