Commit b53eed3f authored Sep 23, 2020 by Andreas Schärtl
report: versioning: write about multiple graphdb repositories
parent e0d9b662
Showing 2 changed files with 65 additions and 22 deletions:
doc/report/implementation.tex (+57, −22)
doc/report/references.bib (+8, −0)
doc/report/implementation.tex +57 −22
@@ -100,7 +100,7 @@ stream as-is to an HTTP endpoint provided by our GraphDB instance.
   maybe I'll also write an Importer for another DB to show that the
   choice of database is not that important.)}
 
-\subsubsection{Scheduling and Version Management}
+\subsection{Scheduling and Version Management}
 
 Collector and Importer were implemented as library code that can be
 called from various front ends. For this project, we provide both a
@@ -112,30 +112,65 @@ to schedule an import of a given Git repository every seven days to a
 given GraphDB instance.
 
 Automated job control that regularly imports data from the same
-sources leads us to the problem of versioning. ULO
-exports~$\mathcal{E}$ depend on an original third party
-library~$\mathcal{L}$. Running~$\mathcal{E}$ through the
-Collector and Importer, we get some database
-representation~$\mathcal{D}$. We see that data flows
+sources leads us to the problem of versioning. In our current design,
+multiple ULO exports~$\mathcal{E}_i$ depend on original third party
+libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the
+workflow of Collector and Importer, we get some database
+representation~$\mathcal{D}$. We see that data flows
 \begin{align*}
-  \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D}
+  \mathcal{L}_1 \rightarrow \; & \mathcal{E}_1 \rightarrow \mathcal{D} \\
+  \mathcal{L}_2 \rightarrow \; & \mathcal{E}_2 \rightarrow \mathcal{D} \\
+                               & \vdots{} \\
+  \mathcal{L}_n \rightarrow \; & \mathcal{E}_n \rightarrow \mathcal{D}
 \end{align*}
-which means that if records in~$\mathcal{L}$ change, this will
-probably result in different triplets~$\mathcal{E}$ which in turn
-results in a need to update~$\mathcal{D}$. This is non-trivial. As it
-stands, \emph{ulo-storage} only knows about what is in~$\mathcal{E}$.
-
-While it should be possible to find out the difference between a new
-version of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and
-compute the changes necessary to be applied to~$\mathcal{D}$, the big
-number of triplets makes this appear unfeasible. So far, our only
-suggestion to solve the problem of changing third party libraries is
-to regularly re-create the full data set~$\mathcal{D}$ from scratch,
-say every seven days. This circumvents all problems related to
-updating existing data sets, but it does mean additional computation
-requirements. It also means that changes in~$\mathcal{L}$ take some
-to propagate to~$\mathcal{D}$. If the number of triplets raises
-by orders of magnitude, this approach will eventually not be scalable
-anymore.
+from $n$~individual libraries~$\mathcal{L}_i$ into a single
+database storage~$\mathcal{D}$ that is used for querying.
+
+However, mathematical knowledge isn't static. When a given
+library~$\mathcal{L}^{t}_i$ at revision~$t$ gets updated to a new
+version~$\mathcal{L}^{t+1}_i$, this change will eventually propagate
+to the associated export and result in a new set of RDF
+triplets~$\mathcal{E}^{t+1}_i$. Our global database
+state~$\mathcal{D}$ needs to get updated to match the changes
+between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$. Finding an
+efficient implementation for this problem is not trivial. While it
+should be possible to find out the difference between two
+exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and compute the
+changes necessary to be applied to~$\mathcal{D}$, the big number of
+triplets makes this appear unfeasible. As this is a problem an
+implementer of a greater tetrapodal search system will encounter, we
+suggest two possible approaches to solving this problem.
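
(Editorial illustration, not part of the commit: the diff-based idea dismissed above can be made concrete with a small sketch. Assuming the two versions of one export are available as Turtle files, with made-up file names, and using the rdflib package, the triples to insert and delete follow from plain graph difference; the catch is that both exports have to be loaded and compared, which is exactly what becomes questionable at large triplet counts.)

    # Sketch: compute the difference between two versions of one export with rdflib.
    # Both graphs have to fit into memory, which is what becomes questionable
    # once the number of triplets grows large.
    from rdflib import Graph

    old = Graph().parse("export-v1.ttl", format="turtle")  # previous export version
    new = Graph().parse("export-v2.ttl", format="turtle")  # current export version

    added = new - old    # triples that would have to be inserted into D
    removed = old - new  # triples that would have to be deleted from D

    print(f"{len(added)} triples to insert, {len(removed)} triples to delete")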
+
+One approach is to annotate each triplet in~$\mathcal{D}$ with
+versioning information about which particular
+export~$\mathcal{E}^{t}_i$ it was derived from. During an import
+from~$\mathcal{E}^{t}_i$ into~$\mathcal{D}$, we could (1)~first remove all
+triplets in~$\mathcal{D}$ that were derived from the previous
+version~$\mathcal{E}^{t-1}_i$ and (2)~then re-import all triplets from the
+current version~$\mathcal{E}^{t}_i$. Annotating triplets with versioning
+information is an approach that should work, but it does
+introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$ where
+$n$~is the number of triplets in~$\mathcal{D}$. This does mean
+effectively doubling the database storage space, a not very satisfying
+solution.
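
(Editorial illustration: a closely related variant keeps provenance per named graph rather than per triplet, so a re-import can drop the previous version of an export wholesale. The sketch below assumes a GraphDB instance at http://localhost:7200 exposing the standard RDF4J-style REST endpoints; the repository name, graph IRI and file name are placeholders.)

    # Sketch: keep each export E_i in a named graph of its own so that an import
    # can drop the previous version wholesale instead of tracking per-triplet
    # version annotations. Names and URLs are placeholders.
    import requests

    REPO = "http://localhost:7200/repositories/ulo"    # hypothetical repository
    GRAPH = "http://example.org/graphs/library-1"       # one graph per library

    # (1) remove all triplets derived from the previous version of the export
    requests.post(
        f"{REPO}/statements",
        data=f"CLEAR GRAPH <{GRAPH}>",
        headers={"Content-Type": "application/sparql-update"},
    ).raise_for_status()

    # (2) re-import the current version of the export into the same graph
    with open("export-v2.ttl", "rb") as fp:
        requests.post(
            f"{REPO}/statements",
            params={"context": f"<{GRAPH}>"},
            data=fp,
            headers={"Content-Type": "text/turtle"},
        ).raise_for_status()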
+
+Another approach is to regularly re-create the full data
+set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
+all problems related to updating existing data sets, but it does have
+additional computation requirements. It also means that changes in a
+given library~$\mathcal{L}_i$ take some time to propagate to~$\mathcal{D}$.
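
(Editorial illustration: the periodic rebuild can be as simple as a loop that wipes the repository and replays every export. The sketch again uses GraphDB's RDF4J-style REST API; host, repository name and export file names are made up, and a real deployment would more likely rely on a cron-style scheduler than on a sleeping process.)

    # Sketch: recreate the full data set D from scratch every seven days by
    # wiping the repository and replaying all exports. Host, repository and
    # file names are placeholders.
    import time
    import requests

    REPO = "http://localhost:7200/repositories/ulo"
    EXPORTS = ["library-1.ttl", "library-2.ttl"]   # hypothetical export dumps

    while True:
        requests.delete(f"{REPO}/statements").raise_for_status()  # drop all triplets in D
        for path in EXPORTS:
            with open(path, "rb") as fp:
                requests.post(
                    f"{REPO}/statements",
                    data=fp,
                    headers={"Content-Type": "text/turtle"},
                ).raise_for_status()
        time.sleep(7 * 24 * 60 * 60)  # changes in a library take up to a week to reach D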
+
+Building on top of this idea, an advanced version of this approach
+could forgo the requirement of a single database
+storage~$\mathcal{D}$. Instead of only maintaining one global database
+state~$\mathcal{D}$, we suggest the use of dedicated database
+instances~$\mathcal{D}_i$ for each given library~$\mathcal{L}_i$. The
+advantage here is that re-creating a given database
+representation~$\mathcal{D}_i$ is fast as exports~$\mathcal{E}_i$ are
+comparably small. The disadvantage is that we still want to query the
+whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup
+\cdots \cup \mathcal{D}_n$. This requires the development of some
+cross-repository query mechanism, something GraphDB currently only
+offers limited support for~\cite{graphdbnested}.
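
(Editorial illustration: until such a mechanism exists, one workaround is to fan the same SPARQL query out over the per-library repositories and merge the bindings on the client. Repository names are made up and the query is a placeholder; note that this simple union cannot express joins across repositories, which is precisely why proper cross-repository support such as GraphDB's nested repositories or SPARQL 1.1 federation matters.)

    # Sketch: query D = D_1 ∪ ... ∪ D_n by sending the same SPARQL query to each
    # per-library repository and concatenating the bindings client-side. This
    # cannot join data across repositories. Names and the query are placeholders.
    import requests

    BASE = "http://localhost:7200/repositories"
    REPOSITORIES = ["ulo-library-1", "ulo-library-2"]   # one D_i per library L_i
    QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

    bindings = []
    for repo in REPOSITORIES:
        response = requests.get(
            f"{BASE}/{repo}",
            params={"query": QUERY},
            headers={"Accept": "application/sparql-results+json"},
        )
        response.raise_for_status()
        bindings.extend(response.json()["results"]["bindings"])

    print(f"{len(bindings)} bindings collected from {len(REPOSITORIES)} repositories")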
 
 \subsection{Endpoints}\label{sec:endpoints}
doc/report/references.bib +8 −0
@@ -350,3 +350,11 @@
   year         = {2017},
   publisher    = {Packt Publishing Ltd}
 }
+
+@online{graphdbnested,
+  title        = {Nested Repositories},
+  organization = {Ontotext},
+  date         = {2020},
+  urldate      = {2020-09-23},
+  url          = {http://graphdb.ontotext.com/documentation/standard/nested-repositories.html},
+}
\ No newline at end of file