\section{Implementation}\label{sec:implementation}
    
    
One of the two contributions of \emph{ulo-storage} is that we
implemented components for making organizational mathematical
knowledge (formulated as RDF~triplets) queryable. This section first
identifies the individual components involved in this task. We then
discuss the actual implementation created for this project.
    
    \subsection{Components Implemented for \emph{ulo-storage}}\label{sec:components}
    
Figure~\ref{fig:components} illustrates how data flows through the
different components. In total, we identified three components that make
up the infrastructure provided by \emph{ulo-storage}.
    
    
    \begin{figure}[]\begin{center}
    
        \includegraphics[width=0.9\textwidth]{figs/components.png}
    
        \caption{Components involved in the \emph{ulo-storage} system.}\label{fig:components}
    \end{center}\end{figure}
    
    \begin{itemize}
\item ULO triplets are present in various locations, be it Git
  repositories, web servers or the local disk. It is the job of a
  \emph{Collector} to assemble these RDF~files and forward them for
  further processing. This may involve cloning a Git repository or
  crawling the file system.
    
    
\item The streams of ULO files assembled by the Collector then get
  passed to an \emph{Importer}. The Importer uploads the RDF~streams
  into some kind of permanent storage. As we will see, the
  GraphDB~\cite{graphdb} triple store was a natural fit.
    
    
\item Finally, with all triplets stored in a database, an
  \emph{Endpoint} is where applications access the underlying
  knowledge base. This does not necessarily need to be custom
  software; rather, the programming interface of the underlying
  database server itself can be understood as an Endpoint of its own.
  A code-level sketch of these three roles follows this list.
    
    \end{itemize}
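To make this division of responsibilities more concrete, the following
minimal sketch expresses the three roles as Java interfaces. Names and
signatures are purely illustrative and do not correspond to the actual
\emph{ulo-storage} code.
\begin{lstlisting}
// Illustrative sketch only; not the actual ulo-storage API.
import java.io.InputStream;
import java.util.stream.Stream;

interface Collector {
    // Assemble RDF files from some source (Git, file system, ...)
    // and expose them as a stream of raw RDF data.
    Stream<InputStream> collect() throws Exception;
}

interface Importer {
    // Upload a single RDF stream into persistent storage,
    // for example a GraphDB repository.
    void importRdf(InputStream rdf) throws Exception;
}

interface Endpoint {
    // Evaluate a query against the stored knowledge base; in practice
    // this is the query interface of the database server itself.
    String query(String queryText) throws Exception;
}
\end{lstlisting}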
    
    
Collector, Importer and Endpoint provide us with an automated way of
making RDF files available for use within applications. We will now
take a look at the actual implementation created for
\emph{ulo-storage}, beginning with the implementation of Collector and
Importer.

\subsection{Collector and Importer}\label{sec:collector}
    
We previously described Collector and Importer as two distinct
components. First, a Collector pulls RDF data from various sources as
an input and outputs a stream of standardized RDF data. Second, an
Importer takes such a stream of RDF data and dumps it to some sort of
persistent storage. In our implementation, both Collector and Importer
ended up as one piece of monolithic software. This does not need to be
the case, but it proved convenient: combining Collector and Importer
forgoes the need for an additional IPC~mechanism between the two. In
addition, neither our Collector nor our Importer is a particularly
complicated piece of software, so there is no pressing need to force
them into separate processes.
    
    
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml}, while the Git Collector first clones a Git
repository and then passes the checked out working copy to the file
system Collector. Because we found that it is not uncommon for RDF
files to be compressed, our implementation supports on the fly
extraction of the gzip~\cite{gzip} and xz~\cite{xz} formats, which can
greatly reduce the required disk space in the collection step.
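As a rough illustration of this collection step, the following sketch
crawls a directory for RDF~XML~files and transparently decompresses
gzip archives on the fly. It is a simplified stand-in for the actual
Collector: all names and file extensions are illustrative and, to stay
within the standard library, only gzip is handled here (xz support
requires an additional library).
\begin{lstlisting}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;

class FileSystemCollector {
    // Recursively find candidate RDF/XML files, compressed or not.
    static Stream<Path> findRdfFiles(Path root) throws IOException {
        return Files.walk(root)
                .filter(Files::isRegularFile)
                .filter(p -> p.toString().endsWith(".rdf")
                        || p.toString().endsWith(".rdf.gz"));
    }

    // Open a file, transparently decompressing gzip archives.
    static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        if (file.toString().endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }
}
\end{lstlisting}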
    
    
During development of the Collector, we found that existing exports
from third party mathematical libraries contain RDF syntax errors
which had not been discovered previously. In particular, both the
Isabelle and the Coq exports contained URIs which do not fit the
official syntax specification~\cite{rfc3986} as they contain illegal
characters. Previous work~\cite{ulo} that processed Coq and Isabelle
exports used database software such as Virtuoso Open
Source~\cite{wikivirtuoso} which does not properly check URIs according
to the specification; in consequence, these faults were only discovered
now. To tackle these problems, we introduced on the fly correction
steps during collection that escape the URIs in question and then
continue processing. Of course this is only a work-around. Related bug
reports were filed in the respective export projects to ensure that
this extra step will not be necessary in the future.
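The correction step itself amounts to percent-encoding the offending
characters. The following minimal sketch illustrates the idea with a
small, hand-picked set of characters that may not appear verbatim in
URIs; a production implementation should follow RFC~3986 more
carefully.
\begin{lstlisting}
import java.nio.charset.StandardCharsets;

class UriFixer {
    // Percent-encode characters that may not appear verbatim in a URI.
    // The character set below is illustrative, not a full RFC 3986 check.
    static String escapeIllegalCharacters(String uri) {
        StringBuilder out = new StringBuilder();
        for (char c : uri.toCharArray()) {
            if (c == ' ' || c == '"' || c == '<' || c == '>' || c == '\\'
                    || c == '{' || c == '}' || c == '|' || c == '^' || c == '`') {
                for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
                    out.append(String.format("%%%02X", b & 0xff));
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
\end{lstlisting}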
    
The output of the Collector is a stream of RDF~data. This stream gets
passed to the Importer which imports the encoded RDF triplets into
some kind of persistent storage. In theory, multiple implementations
of this Importer are possible, namely different implementations for
different database backends. As we will see in
Section~\ref{sec:endpoints}, for our project we selected the GraphDB
triple store alone. The Importer merely needs to make the necessary
API~calls to import the RDF stream into the database. As such, the
import itself is straightforward; our software only needs to upload
the RDF file stream as-is to an HTTP endpoint provided by our GraphDB
instance.
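A minimal sketch of such an upload, using Java's standard HTTP client,
could look as follows. It assumes a GraphDB instance reachable at
\texttt{http://localhost:7200} with a repository named \texttt{ulo};
the repository \texttt{statements} endpoint and the
\texttt{application/rdf+xml} content type follow the RDF4J REST
conventions implemented by GraphDB, but the details should be checked
against the GraphDB documentation.
\begin{lstlisting}
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class GraphDbImporter {
    // Assumed endpoint of a local GraphDB repository named "ulo".
    private static final String STATEMENTS_URL =
            "http://localhost:7200/repositories/ulo/statements";

    // Upload one RDF/XML stream as-is into the repository.
    static void importRdf(InputStream rdf) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(STATEMENTS_URL))
                .header("Content-Type", "application/rdf+xml")
                .POST(HttpRequest.BodyPublishers.ofInputStream(() -> rdf))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException("import failed: " + response.body());
        }
    }
}
\end{lstlisting}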
    
To review, our combination of Collector and Importer fetches XML~files
from Git repositories, applies on the fly decompression and URI fixes,
and then imports the collected RDF~triplets into persistent database
storage.
    
    \subsubsection{Scheduling}
    
Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
command line interface as well as a graphical web front end. While the
command line interface is only useful for manually starting single
runs, the web interface (Figure~\ref{fig:ss}) allows for more
flexibility. In particular, import jobs can be started either manually
or scheduled to run at fixed intervals. The web interface also
persists error messages and logs.
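One straightforward way to realize such fixed-interval scheduling is a
standard scheduled executor, as in the following sketch. The
\texttt{runImportJob} argument and the hard-coded daily interval are
illustrative; they stand in for a full Collector/Importer run and a
user-configured schedule.
\begin{lstlisting}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ImportScheduler {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Run an import job immediately and then once every 24 hours.
    void scheduleDaily(Runnable runImportJob) {
        scheduler.scheduleAtFixedRate(runImportJob, 0, 24, TimeUnit.HOURS);
    }
}
\end{lstlisting}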
    
    
    \input{implementation-screenshots.tex}
    
    \subsubsection{Version Management}
    
Automated job control leads us to the problem of versioning. In our
current design, the given ULO exports~$\mathcal{E}_i$ depend on the
original third party libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$
through the workflow of Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
  \mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\
  \mathcal{L}_2 \rightarrow \; &\mathcal{E}_2 \rightarrow \mathcal{D} \\
  &\vdots{}                                                        \\
  \mathcal{L}_n \rightarrow \; &\mathcal{E}_n \rightarrow \mathcal{D}
\end{align*}
from $n$~individual libraries~$\mathcal{L}_i$ into a single
database storage~$\mathcal{D}$ that is used for querying.
    
    
    However, we must not ignore that mathematical knowledge is ever
    changing and not static. When a given library~$\mathcal{L}^{t}_i$ at
    revision~$t$ gets updated to a new version~$\mathcal{L}^{t+1}_i$, this
    change will eventually propagate to the associated export and result
    in a new set of RDF triplets~$\mathcal{E}^{t+1}_i$. Our global
    database state~$\mathcal{D}$ needs to get updated to match the changes
    between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$.
    
Finding an efficient implementation for this problem is not trivial.
While it should be possible to compute the difference between two
exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and infer the
changes necessary to be applied to~$\mathcal{D}$, the large number of
triplets makes this appear infeasible. As this is a problem that an
implementer of a greater tetrapodal search system will most likely
encounter, we suggest the following approaches for tackling this
challenge.
    
    
One approach is to annotate each triplet in~$\mathcal{D}$ with
versioning information about which particular
export~$\mathcal{E}^{t}_i$ it was derived from. During an import
from~$\mathcal{E}^{s}_i$ into~$\mathcal{D}$, we could (1)~first remove
all triplets in~$\mathcal{D}$ that were derived from the previous
version~$\mathcal{E}^{s-1}_i$ and (2)~then re-import all triplets from
the current version~$\mathcal{E}^{s}_i$. Annotating triplets with
versioning information is an approach that should work, but it does
introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$ where
$n$~is the number of triplets in~$\mathcal{D}$. After all, we need to
annotate each of the $n$~triplets with versioning information,
effectively doubling the required storage space. This is not a very
satisfying solution.
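Independent of how the association between triplets and exports is
represented, the update step itself boils down to (1)~dropping the
triplets derived from the previous version and (2)~re-importing the
current one. The following rough sketch illustrates this using the
RDF4J library (discussed in Section~\ref{sec:endpoints}) and represents
the per-export association as a named graph (an RDF4J \emph{context});
the repository URL, graph IRI and file format are assumptions made for
illustration.
\begin{lstlisting}
import java.io.InputStream;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;
import org.eclipse.rdf4j.rio.RDFFormat;

class VersionedImport {
    // Replace the triplets derived from a given export with its current version.
    static void reimportExport(String exportId, InputStream rdf) throws Exception {
        Repository repo =
                new HTTPRepository("http://localhost:7200/repositories/ulo");
        IRI graph = SimpleValueFactory.getInstance()
                .createIRI("https://example.org/exports/" + exportId);
        try (RepositoryConnection conn = repo.getConnection()) {
            conn.begin();
            conn.clear(graph);                          // (1) drop previous version
            conn.add(rdf, "", RDFFormat.RDFXML, graph); // (2) import current version
            conn.commit();
        }
    }
}
\end{lstlisting}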
    
Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
the problems related to updating existing data sets, but it also means
that changes in a given library~$\mathcal{L}_i$ take some time to
propagate to~$\mathcal{D}$. Continuing this train of thought, an
advanced version of this approach could forgo the requirement for one
single database storage~$\mathcal{D}$ entirely. Instead of maintaining
just one global database state~$\mathcal{D}$, we suggest experimenting
with dedicated database instances~$\mathcal{D}_i$ for each given
library~$\mathcal{L}_i$. The advantage here is that re-creating a
given database representation~$\mathcal{D}_i$ is fast as the
exports~$\mathcal{E}_i$ are comparably small. The disadvantage is that
we still want to query the whole data set~$\mathcal{D} = \mathcal{D}_1
\cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$. This does require
the development of some cross-database query mechanism, functionality
for which existing systems currently offer only limited
support~\cite{graphdbnested}.
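One candidate for such a mechanism is SPARQL~1.1 federated querying,
where a single query delegates parts of its work to remote endpoints
via the \texttt{SERVICE} keyword. The following rough sketch shows what
combining two per-library repositories in one query could look like;
the endpoint URLs, the \texttt{ulo:} namespace IRI and the
\texttt{ulo:uses} predicate are illustrative assumptions, and the
practical limitations referenced above still apply.
\begin{lstlisting}
// Illustrative SPARQL 1.1 federated query, embedded as a Java string.
String federated =
        "PREFIX ulo: <https://mathhub.info/ulo#> " +
        "SELECT ?x ?y WHERE { " +
        "  { SERVICE <http://localhost:7200/repositories/coq> " +
        "      { ?x ulo:uses ?y } } " +
        "  UNION " +
        "  { SERVICE <http://localhost:7200/repositories/isabelle> " +
        "      { ?x ulo:uses ?y } } " +
        "}";
\end{lstlisting}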
    
    
In summary, we see that versioning is a potential challenge for a
greater tetrapodal search system. While not a pressing issue for
\emph{ulo-storage} now, we consider it a topic of future research.
    
    \subsection{Endpoint}\label{sec:endpoints}
    
Finally, we need to discuss how \emph{ulo-storage} realizes the
Endpoint. Recall that an Endpoint provides the programming interface
for applications that wish to query our collection of organizational
knowledge. In practice, the choice of Endpoint programming interface
is determined by the choice of database system as the Endpoint is
provided directly by the database system.
    
    
    In our project, organizational knowledge is formulated as
    RDF~triplets.  The canonical choice for us is to use a triple store,
    that is a database optimized for storing RDF triplets~\cite{triponto,
    tripw3c}. For our project, we used the GraphDB~\cite{graphdb} triple
    store. A free version that fits our needs is available
    at~\cite{graphdbfree}.
    
    \subsubsection{Transitive Queries}
    
A notable advantage of GraphDB compared to other systems such as
Virtuoso Open Source~\cite{wikivirtuoso, ulo} is that GraphDB supports
recent versions of the SPARQL query language~\cite{graphdbsparql} and
OWL~Reasoning~\cite{owlspec, graphdbreason}. In particular, this
means that GraphDB offers support for transitive queries as described
in previous work on~ULO~\cite{ulo}. A transitive query is one that,
given a relation~$R$, asks for the transitive closure~$S$ of~$R$
(Figure~\ref{fig:tc}).
    
    \input{implementation-transitive-closure.tex}
    
    
In fact, GraphDB supports two approaches for realizing transitive
queries. On the one hand, GraphDB supports the
\texttt{owl:TransitiveProperty}~\cite[Section 4.4.1]{owlspec} property
that defines a given predicate~$P$ to be transitive. With $P$~marked
this way, querying the knowledge base is equivalent to querying the
transitive closure of~$P$. This requires transitivity to be hard-coded
into the knowledge base. On the other hand, if we only wish to query
the transitive closure for a given query, we can take advantage of
so-called ``property paths''~\cite{paths} which allow us to indicate
that a given predicate~$P$ is to be understood as transitive when
querying. Only during querying is the transitive closure then
evaluated. Either way, GraphDB supports transitive queries without the
awkward workarounds necessary in other systems~\cite{ulo}.
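As a brief illustration of the second approach, the following query
string uses the property path operator~\texttt{+} to ask for one or
more applications of a predicate, that is, its transitive closure. The
\texttt{ulo:} namespace IRI and the \texttt{ulo:uses} predicate are
illustrative assumptions; the SPARQL endpoint itself is discussed in
the next section.
\begin{lstlisting}
// Illustrative SPARQL property path query, embedded as a Java string.
String transitiveUses =
        "PREFIX ulo: <https://mathhub.info/ulo#> " +
        "SELECT ?x ?y WHERE { ?x ulo:uses+ ?y }";
\end{lstlisting}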
    
    \subsubsection{SPARQL Endpoint}
    
There are multiple approaches to querying the GraphDB triple store,
one based around the standardized SPARQL query language and the other
on the RDF4J Java library. Both approaches have unique advantages.
    
    Let us first take a look at {SPARQL}, which is a standardized query
    language for RDF triplet data~\cite{sparql}. The specification
    includes not just syntax and semantics of the language itself, but
    also a standardized REST interface~\cite{rest} for querying database
    servers.
    
    
The SPARQL syntax was inspired by SQL and as such the \texttt{SELECT}
\texttt{WHERE} syntax should be familiar to many software developers.
A simple query that returns all triplets in the store looks like
\begin{lstlisting}
    SELECT * WHERE { ?s ?p ?o }
\end{lstlisting}
where \texttt{?s}, \texttt{?p} and \texttt{?o} are query
variables. The result of any query is a set of valid substitutions for
the query variables. In this particular case, the database would
return a table of all triplets in the store with columns for
subject~\texttt{?s}, predicate~\texttt{?p} and object~\texttt{?o}.
    
    
Probably the biggest advantage is that SPARQL is ubiquitous. As it is
the de facto standard for querying triple stores, lots of
implementations (client and server) as well as documentation are
available~\cite{sparqlbook, sparqlimpls, gosparql}.
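To give an impression of what the standardized REST interface looks
like in practice, the following rough sketch sends a query like the one
above to a GraphDB repository using Java's standard HTTP client. The
endpoint URL is an assumption for a local default installation; the
\texttt{application/sparql-query} content type and the JSON results
format are part of the SPARQL protocol, but the details should be
checked against the server documentation.
\begin{lstlisting}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class SparqlRestExample {
    public static void main(String[] args) throws Exception {
        // Assumed local GraphDB repository named "ulo".
        String endpoint = "http://localhost:7200/repositories/ulo";
        String query = "SELECT * WHERE { ?s ?p ?o } LIMIT 10";

        // POST the query body directly, as described by the SPARQL protocol.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Content-Type", "application/sparql-query")
                .header("Accept", "application/sparql-results+json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON-encoded result table
    }
}
\end{lstlisting}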
    
    \subsubsection{RDF4J Endpoint}
    
    
    SPARQL is one way of accessing a triple store database. Another
    approach is RDF4J, a Java API for interacting with RDF graphs,
    implemented based on a superset of the {SPARQL} REST
    interface~\cite{rdf4j}.  GraphDB is one of the database servers that
    supports RDF4J, in fact it is the recommended way of interacting with
    GraphDB repositories~\cite{graphdbapi}.
    
Instead of formulating textual queries, RDF4J allows developers to
query a knowledge base by calling Java library methods. The previous
query that asks for all triplets in the store looks like
\begin{lstlisting}
    connection.getStatements(null, null, null);
\end{lstlisting}
in RDF4J. \texttt{getStatements(s, p, o)} returns all triplets that
have matching subject~\texttt{s}, predicate~\texttt{p} and
object~\texttt{o}. Any argument that is \texttt{null} matches any
value, that is, it is a query variable to be filled by the call to
\texttt{getStatements}.
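For completeness, the following rough sketch shows how such a
\texttt{connection} might be obtained and used against a GraphDB
repository via RDF4J; the repository URL is an assumption for a local
default installation.
\begin{lstlisting}
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.RepositoryResult;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

class Rdf4jExample {
    public static void main(String[] args) {
        // Assumed local GraphDB repository named "ulo".
        Repository repo =
                new HTTPRepository("http://localhost:7200/repositories/ulo");
        try (RepositoryConnection connection = repo.getConnection()) {
            // Ask for all triplets in the store; null acts as a wildcard.
            try (RepositoryResult<Statement> result =
                         connection.getStatements(null, null, null)) {
                while (result.hasNext()) {
                    Statement statement = result.next();
                    System.out.println(statement);
                }
            }
        }
    }
}
\end{lstlisting}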
    
    
    Using RDF4J does introduce a dependency on the JVM and its
    languages. But in practice, we found RDF4J to be quite convenient,
    especially for simple queries, as it allows us to formulate everything
in a single programming language rather than mixing a programming
    language with awkward query strings. We also found it quite helpful to
    generate Java classes from OWL~ontologies that contain all definitions
    of the ontology as easily accessible constants~\cite{rdf4jgen}.  This
    provides us with powerful IDE auto completion features during
    development of ULO applications.
    
Summarizing the last two sections, we see that both SPARQL and RDF4J
have unique advantages. While SPARQL is an official W3C~\cite{w3c}
standard and implemented by more database systems, RDF4J can be more
convenient when dealing with JVM-based projects. For
\emph{ulo-storage}, we experimented with both interfaces and chose
whichever seemed more convenient at the moment. We recommend that
implementers do the same.
    
    \subsection{Deployment and Availability}
    
Software not only needs to be developed, but also deployed. To deploy
the combination of Collector, Importer and Endpoint, we use Docker
Compose. Docker itself is a technology for wrapping software into
containers, that is, lightweight virtual machines with a fixed
environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of the overarching
setup is stored in a Docker Compose file that describes the software
stack.
    
    
For \emph{ulo-storage}, we provide a single Docker Compose file which
starts three containers, namely (1)~the Collector/Importer web
interface, (2)~a GraphDB instance which provides us with the required
Endpoint and (3)~some test applications that use that Endpoint. All
code for Collector and Importer is available in the
\texttt{ulo-storage-collect} Git repository~\cite{gorepo}. Additional
deployment files, that is, the Docker Compose configuration and
additional Dockerfiles, are stored in a separate
repository~\cite{dockerfilerepo}.
    
With this, we conclude our discussion of the implementation developed
for the \emph{ulo-storage} project. We designed a system based around
(1)~a Collector which collects RDF triplets from third party sources,
(2)~an Importer which imports these triplets into a GraphDB database
and (3)~an Endpoint through which applications can query that
database. All of this is easy to deploy using a single Docker Compose
file.
    
Our concrete implementation is useful insofar as we can use it to
experiment with ULO data sets. But development also provided insight
into (1)~which components this class of system requires and (2)~which
problems need to be solved. One topic we discussed at length is
version management. It is easy to dismiss this in these early stages
of development, but it is without question something to keep in mind.