Commit 3ecf7300 authored 4 years ago by Andreas Schärtl

report: review implementation

parent 3867d0a0
Changes: 2 changed files with 79 additions and 52 deletions

  doc/report/implementation.tex  +56 −52
  doc/report/references.bib      +23 −0
doc/report/implementation.tex  +56 −52
@@ -22,12 +22,12 @@ the flow of data.

\begin{itemize}
\item ULO triplets are present in various locations, be it Git
  repositories, web servers or the local disk. It is the job of a
  \emph{Collector} to assemble these {RDF}~files and forward them for
  further processing. This may involve cloning a Git repository or
  crawling the file system.

\item With streams of ULO files assembled by the Collector, this
  data then gets passed to an \emph{Importer}. An Importer uploads
  RDF~streams into some kind of permanent storage. As we will see,
  the GraphDB~\cite{graphdb} triplet store was a natural fit.
@@ -39,36 +39,36 @@ the flow of data.

database itself can be understood as an endpoint of its own.
\end{itemize}
Collector, Importer and Endpoint provide us with an easy and automated
way of making RDF files available for use within applications. We will
now take a look at the actual implementation created for
\emph{ulo-storage}, beginning with the implementation of Collector and
Importer.
\subsection{Collector and Importer}\label{sec:collector}
We previously described Collector and Importer as two distinct
components. The Collector pulls RDF data from various sources as input
and outputs a stream of standardized RDF data. The Importer then takes
such a stream and dumps it to some sort of persistent storage. In the
implementation for \emph{ulo-storage}, both Collector and Importer
ended up being one piece of monolithic software. This does not need to
be the case but proved convenient because (1)~combining Collector and
Importer forgoes the need for an additional IPC~mechanism and
(2)~neither Collector nor Importer are terribly large pieces of
software in themselves.
Our implementation supports two sources for RDF files, namely Git
repositories and the local file system. The file system Collector
crawls a given directory on the local machine and looks for
RDF~XML~files~\cite{rdfxml} while the Git Collector first clones a Git
repository and then passes the checked out working copy to the file
system Collector. Because it is not uncommon for RDF files to be
compressed, our Collector supports on the fly extraction of the
gzip~\cite{gzip} and xz~\cite{xz} formats which can greatly reduce the
required disk space in the collection step.
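As a rough illustration of this collection step, the following Go
sketch crawls a directory and transparently decompresses gzip files
before handing the RDF stream on. The function names are invented for
this example and do not mirror the actual \texttt{ulo-storage-collect}
interfaces; xz support would be wired in analogously with a third
party reader.

\begin{verbatim}
// Sketch of a file system Collector: walk a directory tree and hand
// every (possibly gzip-compressed) RDF/XML file to a callback as an
// uncompressed stream.
package collect

import (
    "compress/gzip"
    "io"
    "os"
    "path/filepath"
    "strings"
)

// CrawlDir calls handle once for every RDF file found below root.
func CrawlDir(root string, handle func(name string, r io.Reader) error) error {
    return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return err
        }
        if !strings.HasSuffix(path, ".rdf") && !strings.HasSuffix(path, ".rdf.gz") {
            return nil // not an RDF/XML export
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()

        var r io.Reader = f
        if strings.HasSuffix(path, ".gz") {
            gz, err := gzip.NewReader(f)
            if err != nil {
                return err
            }
            defer gz.Close()
            r = gz
        }
        return handle(path, r)
    })
}
\end{verbatim}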
During development of the Collector, we found that existing exports
from third party mathematical libraries contain RDF syntax errors
which were not discovered previously. In particular, both the Isabelle
and Coq exports contained URIs which do not fit the official syntax
@@ -77,25 +77,21 @@ characters. Previous work~\cite{ulo} that processed Coq and Isabelle

exports used database software such as Virtuoso Open Source which does
not properly check URIs according to spec; in consequence, these
faults were only discovered now. To tackle these problems, we
introduced on the fly correction steps during collection that escape
the URIs in question and then continue processing. Of course this is
only a work-around. Related bug reports were filed in the respective
export projects to ensure that in the future this extra step is not
necessary.
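The correction itself amounts to percent-encoding the offending
characters. A minimal sketch of such an escape function is given
below; it only illustrates the idea, the exact rules applied in
\texttt{ulo-storage-collect} may differ.

\begin{verbatim}
package collect

import (
    "fmt"
    "strings"
)

// escapeURI percent-encodes every byte that is not legal in an
// RFC 3986 URI while leaving valid characters (and existing
// %-escapes) untouched.
func escapeURI(uri string) string {
    const legal = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
        "abcdefghijklmnopqrstuvwxyz" +
        "0123456789-._~:/?#[]@!$&'()*+,;=%"
    var b strings.Builder
    for i := 0; i < len(uri); i++ {
        if c := uri[i]; strings.IndexByte(legal, c) >= 0 {
            b.WriteByte(c)
        } else {
            fmt.Fprintf(&b, "%%%02X", c)
        }
    }
    return b.String()
}
\end{verbatim}

For example, a stray space inside a URI would be rewritten as
\texttt{\%20} while the rest of the URI passes through unchanged.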
The output of the Collector is a stream of RDF data. This stream gets
passed to the Importer which imports the encoded RDF triplets into
some kind of persistent storage. The canonical choice for this task is
to use a triple store, that is a database optimized for storing RDF
triplets~\cite{triponto, tripw3c}. For our project, we used the
GraphDB~\cite{graphdb} triple store. A free version that fits our
needs is available at~\cite{graphdbfree}. The import itself is
straightforward: our software only needs to upload the RDF file
stream as-is to an HTTP endpoint provided by our GraphDB instance.
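To sketch how small this import step is, the following Go function
uploads one RDF/XML stream by POSTing it to the statements endpoint of
a GraphDB repository (GraphDB exposes an RDF4J-style REST interface).
The host and the repository name \texttt{ulo} are placeholders made up
for this example.

\begin{verbatim}
package collect

import (
    "fmt"
    "io"
    "net/http"
)

// ImportRDF uploads one RDF/XML stream into a GraphDB repository by
// POSTing it to the repository's statements endpoint. Host and
// repository name are placeholders.
func ImportRDF(rdf io.Reader) error {
    const endpoint = "http://localhost:7200/repositories/ulo/statements"
    resp, err := http.Post(endpoint, "application/rdf+xml", rdf)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusNoContent && resp.StatusCode != http.StatusOK {
        return fmt.Errorf("graphdb import failed: %s", resp.Status)
    }
    return nil
}
\end{verbatim}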
\emph{({TODO}: Write down a small comparison of different database
types, triplet stores and implementations. Honestly the main
@@ -105,7 +101,7 @@ instance.

\subsubsection{Scheduling and Version Management}

Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
command line interface as well as a graphical web front end. While the
command line interface is only useful for manually starting single
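A command line front end around this library code can stay very small.
The sketch below wires the CrawlDir and ImportRDF helpers from the
previous sketches into a one-shot job; the flag names and the import
path are invented for this illustration.

\begin{verbatim}
package main

import (
    "flag"
    "io"
    "log"

    // Hypothetical module path for the helpers sketched above; the
    // real package layout of ulo-storage-collect will differ.
    collect "example.org/ulo-storage/collect"
)

// One-shot command line front end: crawl a directory and import every
// RDF stream found there. The web front end would instead schedule
// such jobs repeatedly.
func main() {
    src := flag.String("src", ".", "directory to collect RDF files from")
    flag.Parse()

    err := collect.CrawlDir(*src, func(name string, r io.Reader) error {
        log.Printf("importing %s", name)
        return collect.ImportRDF(r)
    })
    if err != nil {
        log.Fatal(err)
    }
}
\end{verbatim}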
@@ -118,14 +114,14 @@ Automated job control that regularly imports data from the same

sources leads us to the problem of versioning. ULO
exports~$\mathcal{E}$ depend on an original third party
library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of
Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
    \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D}
\end{align*}
which means that if records in~$\mathcal{L}$ change, this will
probably result in different triplets~$\mathcal{E}$ which in turn
results in a need to update~$\mathcal{D}$. This is non-trivial. As it
stands, \emph{ulo-storage} only knows about what is in~$\mathcal{E}$.
While it should be possible to find out the difference between a new
version of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and
@@ -135,14 +131,14 @@ suggestion to solve the problem of changing third party libraries is

to regularly re-create the full data set~$\mathcal{D}$ from scratch,
say every seven days. This circumvents all problems related to
updating existing data sets, but it does mean additional computation
requirements. It also means that changes in~$\mathcal{L}$ take some
time to propagate to~$\mathcal{D}$. If the number of triplets rises
by orders of magnitude, this approach will eventually not be scalable
anymore.
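A rough sketch of this rebuild-from-scratch strategy is a long-running
process that re-runs the full collect and import pipeline on a fixed
interval. The rebuildDataset function below is a stand-in for whatever
job the web front end actually triggers.

\begin{verbatim}
package main

import (
    "log"
    "time"
)

// rebuildDataset stands in for dropping the current GraphDB contents
// and running Collector and Importer over all configured sources.
func rebuildDataset() {
    // ...
}

func main() {
    const interval = 7 * 24 * time.Hour

    rebuildDataset() // build the initial data set right away
    for range time.Tick(interval) {
        log.Println("re-creating data set from scratch")
        rebuildDataset()
    }
}
\end{verbatim}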
\subsection{Endpoints}\label{sec:endpoints}

With ULO triplets imported into the GraphDB triplet store by Collector
and Importer, we now have available all data necessary for querying.
As discussed before, querying from applications happens through an
Endpoint that exposes some kind of {API}. The interesting question
@@ -222,20 +218,28 @@ implementors to do the same.

\def\composerepo{https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose}
Software not only needs to get developed, but also deployed. To deploy
the combination of Collector, Importer and Endpoint, we use Docker
Compose. Docker itself is a technology for wrapping software into
containers, that is lightweight virtual machines with a fixed
environment for running a given application~\cite[pp. 22]{dockerbook}.
Docker Compose then is a way of combining individual Docker containers
to run a full tech stack of application, database server and so
on~\cite[pp. 42]{dockerbook}. All configuration of such a setup is
stored in a Docker Compose file that describes the tech stack.
For \emph{ulo-storage}, we provide a single Docker Compose file which
starts three containers, namely (1)~the Collector/Importer web
interface, (2)~a database server for that web interface such that it
can persist import jobs and finally (3)~a GraphDB instance which
provides us with the required Endpoint. All code for Collector and
Importer is available in the \texttt{ulo-storage-collect} Git
repository~\cite{gorepo}. Additional deployment files, that is Docker
Compose and additional Dockerfiles, are stored in a separate
repository~\cite{dockerfilerepo}.
This concludes our discussion of the implementation developed for the
\emph{ulo-storage} project. We designed a system based around (1)~a
Collector which collects RDF triplets from third party sources and
(2)~an Importer which imports these triplets into a GraphDB database;
we then (3)~looked at different ways of querying a GraphDB Endpoint.
All of this is easy to deploy using a single Docker Compose file. With this
doc/report/references.bib  +23 −0
@@ -327,3 +327,26 @@

  author =    {Sloane, Neil JA and others},
  year =      {2003}
}

@online{gorepo,
  title =     {ULO RDF Collector},
  date =      {2020},
  urldate =   {2020-09-14},
  url =       {https://gitlab.cs.fau.de/kissen/ulo-storage-collect},
  author =    {Andreas Schärtl},
}

@online{dockerfilerepo,
  title =     {Supervision Repository},
  date =      {2020},
  urldate =   {2020-09-14},
  url =       {https://gl.kwarc.info/supervision/schaertl_andreas/-/tree/master/experimental/compose},
  author =    {Andreas Schärtl},
}

@book{dockerbook,
  title =     {Docker Orchestration},
  author =    {Smith, Randall},
  year =      {2017},
  publisher = {Packt Publishing Ltd}
}